pith. machine review for the scientific record. sign in

arxiv: 2507.02546 · v1 · submitted 2025-07-03 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

MoGe-2: Accurate Monocular Geometry with Metric Scale and Sharp Details

Guangzhong Sun, Jianfeng Xiang, Jiaolong Yang, Ruicheng Wang, Sicheng Xu, Xin Tong, Yu Deng, Yue Dong, Zelong Lv

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular geometry estimationmetric scale3D point mapsdata refinementsynthetic labelsrelative geometryfine detail recoverysingle image reconstruction
0
0 comments X

The pith

MoGe-2 recovers metric-scale 3D point maps from single images while preserving relative accuracy and recovering fine details.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MoGe-2 as a model that takes one photo and outputs a 3D point map with real-world scale. It starts from an earlier method that gives only scale-free relative geometry and adds training steps to pin down the metric scale. The authors also refine noisy real training data by overlaying sharp labels from synthetic images, which restores small surface details that otherwise get lost. If successful, this combination would let ordinary cameras produce 3D reconstructions usable for measurement tasks without extra sensors. The work shows that both scale accuracy and detail can be obtained together, something prior single-image methods have not delivered at once.

Core claim

MoGe-2 extends the affine-invariant point-map representation of the earlier MoGe model to predict metric-scale 3D points from a single image. It does so by introducing training strategies that keep relative geometry intact while learning absolute scale, and by applying a unified data-refinement pipeline that filters and completes real-world training examples using sharp synthetic labels. The resulting model simultaneously achieves accurate relative geometry, precise metric scale, and high-granularity surface details on open-domain scenes.

What carries the argument

unified data refinement pipeline that filters and completes real data sources using sharp synthetic labels to restore fine-grained geometry while learning metric scale from affine-invariant point maps

If this is right

  • Single-image 3D reconstruction becomes usable for tasks that require both shape and absolute size, such as indoor measurement or robot navigation.
  • Detail recovery improves without loss of global accuracy when real training data is cleaned with synthetic labels.
  • The same refinement strategy can be applied to other monocular geometry models that currently suffer from noisy real-world labels.
  • Open-domain scenes can be reconstructed at metric scale without domain-specific fine-tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that data quality may be a larger bottleneck than network architecture for recovering fine geometry details.
  • If the refinement step generalizes, it could reduce reliance on expensive synchronized camera rigs for creating metric training sets.
  • The method opens a route to combining large synthetic corpora with curated real footage for other scale-sensitive vision problems.

Load-bearing premise

Filtering and completing real data with sharp synthetic labels preserves overall accuracy without introducing systematic biases or artifacts in the metric scale prediction.

What would settle it

Running the trained model on a new set of real images with independent laser-scanned metric ground truth and checking whether the predicted point scales deviate by more than a few percent on average.

read the original abstract

We propose MoGe-2, an advanced open-domain geometry estimation model that recovers a metric scale 3D point map of a scene from a single image. Our method builds upon the recent monocular geometry estimation approach, MoGe, which predicts affine-invariant point maps with unknown scales. We explore effective strategies to extend MoGe for metric geometry prediction without compromising the relative geometry accuracy provided by the affine-invariant point representation. Additionally, we discover that noise and errors in real data diminish fine-grained detail in the predicted geometry. We address this by developing a unified data refinement approach that filters and completes real data from different sources using sharp synthetic labels, significantly enhancing the granularity of the reconstructed geometry while maintaining the overall accuracy. We train our model on a large corpus of mixed datasets and conducted comprehensive evaluations, demonstrating its superior performance in achieving accurate relative geometry, precise metric scale, and fine-grained detail recovery -- capabilities that no previous methods have simultaneously achieved.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes MoGe-2, an extension of the prior MoGe model for monocular geometry estimation. It predicts metric-scale 3D point maps from single images by developing strategies to incorporate metric supervision while retaining the relative geometry accuracy of affine-invariant point representations. A unified data refinement pipeline is introduced that filters real data and completes it with sharp synthetic labels to improve fine-grained detail recovery. The model is trained on a large mixed corpus and evaluated to claim simultaneous superiority in relative geometry accuracy, precise metric scale, and detail sharpness over prior methods.

Significance. If the empirical claims hold after addressing the gaps below, the work would be significant for monocular 3D reconstruction: it targets the longstanding trade-off between relative accuracy, absolute metric scale, and high-frequency detail in a single open-domain model. The data-refinement strategy and mixed-dataset training provide a practical template that could transfer to other geometry tasks. The absence of circularity in the central claims (empirical training rather than self-referential fitting) strengthens the potential contribution.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim of 'comprehensive evaluations' and 'superior performance' in metric scale is unsupported by any reported quantitative metrics, error bars, or ablation tables on the refinement step; without these, it is impossible to verify that synthetic completions preserve the original real-data metric distribution.
  2. [§3.2] §3.2 (Data Refinement): the unified filtering-and-completion procedure is described only at a high level; no analysis quantifies whether synthetic label insertion alters scale statistics or introduces label inconsistencies relative to the original metric measurements, which directly bears on the 'precise metric scale' part of the central claim.
  3. [§4] §4 (Ablations): no ablation isolates the contribution of the refinement pipeline versus the metric-scale training strategy; the headline result that 'no previous methods have simultaneously achieved' all three capabilities therefore rests on an untested assumption that the two components do not trade off against each other.
minor comments (2)
  1. [§3] Notation: the distinction between the affine-invariant point map (from MoGe) and the final metric-scale output should be made explicit with consistent symbols in the method overview figure and equations.
  2. [§4] Figure clarity: several qualitative results lack scale bars or reference objects, making visual assessment of metric accuracy difficult.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment by expanding the manuscript with additional quantitative analyses, detailed descriptions, and ablations. These revisions strengthen the empirical support for our claims without altering the core contributions.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim of 'comprehensive evaluations' and 'superior performance' in metric scale is unsupported by any reported quantitative metrics, error bars, or ablation tables on the refinement step; without these, it is impossible to verify that synthetic completions preserve the original real-data metric distribution.

    Authors: We agree that the original submission would benefit from more explicit quantitative backing for the metric-scale claims. In the revised manuscript, we have added new tables in §4 that report absolute metric errors (e.g., scale-invariant and absolute depth errors on KITTI and NYU with ground-truth metric labels), including error bars computed over multiple random seeds. We also include a dedicated ablation on the refinement pipeline that quantifies preservation of the real-data metric distribution via scale-factor histograms and Kolmogorov-Smirnov tests, confirming negligible shift after synthetic completion. revision: yes

  2. Referee: [§3.2] §3.2 (Data Refinement): the unified filtering-and-completion procedure is described only at a high level; no analysis quantifies whether synthetic label insertion alters scale statistics or introduces label inconsistencies relative to the original metric measurements, which directly bears on the 'precise metric scale' part of the central claim.

    Authors: We acknowledge the description in §3.2 was high-level. We have substantially expanded this section with a step-by-step algorithmic description, pseudocode, and quantitative diagnostics. Specifically, we now report pre- and post-refinement statistics (mean and variance of per-image scale factors) and a consistency metric (fraction of points where synthetic labels deviate from real metric measurements by more than 5%). These additions demonstrate that the procedure preserves the original metric distribution while improving detail. revision: yes

  3. Referee: [§4] §4 (Ablations): no ablation isolates the contribution of the refinement pipeline versus the metric-scale training strategy; the headline result that 'no previous methods have simultaneously achieved' all three capabilities therefore rests on an untested assumption that the two components do not trade off against each other.

    Authors: We agree that an explicit isolation of the two components is necessary. We have added a new ablation subsection in §4 that trains and evaluates three controlled variants: (i) metric supervision without refinement, (ii) refinement without explicit metric supervision, and (iii) the full MoGe-2 pipeline. The results show complementary gains with no measurable trade-off; the combined model simultaneously improves relative geometry accuracy, metric precision, and detail sharpness, thereby supporting the central claim. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and evaluation pipeline is self-contained

full rationale

The paper presents an ML model that extends prior MoGe affine-invariant point maps to metric scale via data refinement (filtering real data and completing with synthetic labels) followed by training on mixed corpora and reporting benchmark results. No derivation chain, equation, or claim reduces to its own inputs by construction; performance assertions rest on external empirical measurements rather than self-referential fits or self-citation load-bearing steps. The approach is standard supervised learning with dataset curation and does not invoke uniqueness theorems or ansatzes that loop back to the target result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical effectiveness of the data refinement strategy and the assumption that neural networks can reliably map 2D images to metric 3D without additional constraints; no new physical entities are postulated.

axioms (1)
  • domain assumption Neural networks trained on mixed real and synthetic data can learn to predict both relative geometry and absolute metric scale from single images.
    Core premise invoked when extending the affine-invariant MoGe representation to metric output.

pith-pipeline@v0.9.0 · 5484 in / 1263 out tokens · 34353 ms · 2026-05-14T21:16:02.251020+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.

  2. Differentiable Ray Tracing with Gaussians for Unified Radio Propagation Simulation and View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    Embedding Gaussian primitives into a ray tracing structure enables unified radio propagation simulation and view synthesis from visual-only reconstructions.

  3. CARD: A Multi-Modal Automotive Dataset for Dense 3D Reconstruction in Challenging Road Topography

    cs.CV 2026-05 conditional novelty 7.0

    CARD is a new multi-modal driving dataset delivering ~500K dense depth pixels per frame from challenging road topographies using stereo cameras and fused LiDARs over 110 km.

  4. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  5. CDPR: Cross-modal Diffusion with Polarization for Reliable Monocular Depth Estimation

    cs.CV 2026-04 unverdicted novelty 7.0

    CDPR integrates polarization priors into a diffusion-based monocular depth estimator via shared latent space and adaptive gating, outperforming RGB-only methods in challenging scenes.

  6. WildDet3D: Scaling Promptable 3D Detection in the Wild

    cs.CV 2026-04 unverdicted novelty 7.0

    WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.

  7. 3D-Fixer: Coarse-to-Fine In-place Completion for 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 7.0

    3D-Fixer performs in-place 3D asset completion from single-view partial point clouds via coarse-to-fine generation with ORFA conditioning, plus a new ARSG-110K dataset, to achieve higher geometric accuracy than MIDI a...

  8. UniDAC: Universal Metric Depth Estimation for Any Camera

    cs.CV 2026-03 unverdicted novelty 7.0

    UniDAC achieves universal metric depth estimation across camera types by decoupling relative depth prediction from spatially varying scale estimation using a depth-guided module and distortion-aware positional embedding.

  9. $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    cs.CV 2025-07 conditional novelty 7.0

    π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...

  10. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  11. LA-Pose: Latent Action Pretraining Meets Pose Estimation

    cs.CV 2026-04 unverdicted novelty 6.0

    LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...

  12. Vista4D: Video Reshooting with 4D Point Clouds

    cs.CV 2026-04 unverdicted novelty 6.0

    Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

  13. GRAFT: Geometric Refinement and Fitting Transformer for Human Scene Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    GRAFT amortizes human-scene fitting into a recurrent transformer that predicts interaction gradients via body-anchored geometric probes, delivering optimization-level interaction quality at 50x lower runtime.

  14. Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    A training-free RANSAC-based fusion of depth foundation model priors with sensor data recovers accurate metric depth on glass, supported by a new GlassRecon RGB-D dataset with derived ground truth.

  15. In Depth We Trust: Reliable Monocular Depth Supervision for Gaussian Splatting

    cs.CV 2026-04 unverdicted novelty 6.0

    A selective regularization framework lets scale-ambiguous monocular depth priors improve Gaussian Splatting geometry and rendering by isolating and supervising only ill-posed regions.

  16. GESS: Multi-cue Guided Local Feature Learning via Geometric and Semantic Synergy

    cs.CV 2026-04 unverdicted novelty 6.0

    GESS introduces joint semantic-normal and depth stability prediction heads, the SDAK keypoint mechanism, and the UTCF descriptor fusion module to leverage multi-cue synergy for improved robustness and discriminability.

  17. NavCrafter: Exploring 3D Scenes from a Single Image

    cs.CV 2026-04 unverdicted novelty 6.0

    NavCrafter generates controllable novel-view videos from one image via video diffusion, geometry-aware expansion, and enhanced 3D Gaussian Splatting to achieve state-of-the-art synthesis under large viewpoint changes.

  18. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  19. WildPose: A Unified Framework for Robust Pose Estimation in the Wild

    cs.CV 2026-05 unverdicted novelty 5.0

    WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.

  20. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

  21. NTIRE 2026 3D Restoration and Reconstruction in Real-world Adverse Conditions: RealX3D Challenge Results

    cs.CV 2026-04 unverdicted novelty 2.0

    The NTIRE 2026 challenge reports measurable progress in 3D reconstruction pipelines that handle real-world low-light and smoke degradation via the RealX3D benchmark.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 21 Pith papers · 6 internal anchors

  1. [1]

    Apollo synthetic dataset, 2019

    Baidu Apollo. Apollo synthetic dataset, 2019. Accessed: 2025-03-06

  2. [2]

    Zip-nerf: Anti- aliased grid-based neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Zip-nerf: Anti- aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023

  3. [3]

    ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, and Elad Shulman. ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (...

  4. [4]

    Adabins: Depth estimation using adaptive bins

    Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition , pages 4009–4018, 2021

  5. [5]

    ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

    Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023

  6. [6]

    1–a model zoo for robust monocular relative depth estimation

    Reiner Birkl, Diana Wofk, and Matthias Müller. Midas v3. 1–a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460, 2023

  7. [7]

    Richter, and Vladlen Koltun

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv, 2024

  8. [8]

    D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), pages 611–625. Springer-Verlag, 2012

  9. [9]

    Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017

  10. [10]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142– 13153, 2023

  11. [11]

    An image is worth 16x16 words: Transformers for image recognition at scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021

  12. [12]

    McHugh, and Vincent Vanhoucke

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B. McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items, 2022

  13. [13]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014

  14. [14]

    Mid-air: A multi-modal dataset for extremely low altitude drone flights

    Michael Fonder and Marc Van Droogenbroeck. Mid-air: A multi-modal dataset for extremely low altitude drone flights. In Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2019

  15. [15]

    Deep ordinal regression network for monocular depth estimation

    Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018

  16. [16]

    Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image

    Xiao Fu, Wei Yin, Mu Hu, Kaixuan Wang, Yuexin Ma, Ping Tan, Shaojie Shen, Dahua Lin, and Xiaoxiao Long. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. arXiv preprint arXiv:2403.12013, 2024

  17. [17]

    Jakob Geyer, Yohannes Kassahun, Mentar Mahmudi, Xavier Ricou, Rupesh Durgesh, Andrew S. Chung, Lorenz Hauswald, Viet Hoang Pham, Maximilian Mühlegg, Sebastian Dorn, Tiffany Fernandez, Martin Jänicke, Sudesh Mirashi, Chiragkumar Savani, Martin Sturm, Oleksandr V orobiov, Martin Oelker, Sebastian Garreis, and Peter Schuberth. A2D2: Audi Autonomous Driving D...

  18. [18]

    Depthfm: Fast monocular depth estimation with flow matching

    Ming Gui, Johannes S Fischer, Ulrich Prestel, Pingchuan Ma, Dmytro Kotovenko, Olga Grebenkova, Stefan Andreas Baumann, Vincent Tao Hu, and Björn Ommer. Depthfm: Fast monocular depth estimation with flow matching. arXiv preprint arXiv:2403.13788, 2024. 10

  19. [19]

    3d packing for self- supervised monocular depth estimation

    Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self- supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020

  20. [20]

    Towards zero-shot scale-aware monocular depth estimation

    Vitor Guizilini, Igor Vasiljevic, Dian Chen, Rare s, Ambrus, , and Adrien Gaidon. Towards zero-shot scale-aware monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9233–9243, 2023

  21. [21]

    Gómez, Manuel Silva, Antonio Seoane, Agnès Borrás, Mario Noriega, Germán Ros, Jose A

    Jose L. Gómez, Manuel Silva, Antonio Seoane, Agnès Borrás, Mario Noriega, Germán Ros, Jose A. Iglesias-Guitian, and Antonio M. López. All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes, 2023

  22. [22]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  23. [23]

    Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024

  24. [24]

    Deepmvs: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  25. [25]

    Depth map super-resolution by deep multi-scale guidance

    Tak-Wai Hui, Chen Change Loy, and Xiaoou Tang. Depth map super-resolution by deep multi-scale guidance. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 353–369. Springer, 2016

  26. [26]

    On the importance of accurate geometry data for dense 3d vision tasks

    HyunJun Jung, Patrick Ruhkamp, Guangyao Zhai, Nikolas Brasch, Yitong Li, Yannick Verdie, Jifei Song, Yiren Zhou, Anil Armagan, Slobodan Ilic, et al. On the importance of accurate geometry data for dense 3d vision tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 780–791, 2023

  27. [27]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024

  28. [28]

    Multi-task learning using uncertainty to weigh losses for scene geometry and semantics

    Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7482–7491, 2018

  29. [29]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  30. [30]

    Evaluation of cnn-based single- image depth estimation methods

    Tobias Koch, Lukas Liebel, Friedrich Fraundorfer, and Marco Körner. Evaluation of cnn-based single- image depth estimation methods. In Proceedings of the European Conference on Computer Vision Workshops (ECCV-WS), pages 331–348. Springer International Publishing, 2019

  31. [31]

    Comparison of monocular depth estimation methods using geometrically relevant metrics on the ibims-1 dataset

    Tobias Koch, Lukas Liebel, Marco Körner, and Friedrich Fraundorfer. Comparison of monocular depth estimation methods using geometrically relevant metrics on the ibims-1 dataset. Computer Vision and Image Understanding (CVIU), 191:102877, 2020

  32. [32]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision, pages 71–91. Springer, 2024

  33. [33]

    Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond

    Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023

  34. [34]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018

  35. [35]

    Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation

    Zhenyu Li, Shariq Farooq Bhat, and Peter Wonka. Patchfusion: An end-to-end tile-based framework for high-resolution monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10016–10025, 2024

  36. [36]

    Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos

    Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. Megasam: Accurate, fast, and robust structure and motion from casual dynamic videos. arXiv preprint arXiv:2412.04463, 2024. 11

  37. [37]

    Prompting depth anything for 4k resolution accurate metric depth estimation

    Haotong Lin, Sida Peng, Jingxiao Chen, Songyou Peng, Jiaming Sun, Minghuan Liu, Hujun Bao, Jiashi Feng, Xiaowei Zhou, and Bingyi Kang. Prompting depth anything for 4k resolution accurate metric depth estimation. arXiv preprint arXiv:2412.14015, 2024

  38. [38]

    Spring: A high- resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high- resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. InProc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023

  39. [39]

    Guided depth super-resolution by deep anisotropic diffusion

    Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Guided depth super-resolution by deep anisotropic diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18237–18246, 2023

  40. [40]

    Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging

    S Mahdi H Miangoleh, Sebastian Dille, Long Mai, Sylvain Paris, and Yagiz Aksoy. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9685–9694, 2021

  41. [41]

    Indoor segmentation and support inference from rgbd images

    Pushmeet Kohli Nathan Silberman, Derek Hoiem and Rob Fergus. Indoor segmentation and support inference from rgbd images. In ECCV, 2012

  42. [42]

    3d ken burns effect from a single image

    Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 3d ken burns effect from a single image. ACM Transactions on Graphics, 38(6):184:1–184:15, 2019

  43. [43]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  44. [44]

    UniDepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024

  45. [45]

    Unidepthv2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025

  46. [46]

    SpatialVLA: Exploring Spatial Representations for Visual-Language-Action Model

    Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang, et al. Spatialvla: Exploring spatial representations for visual-language-action model. arXiv preprint arXiv:2501.15830, 2025

  47. [47]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020

  48. [48]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021

  49. [49]

    Susskind

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV) 2021, 2021

  50. [50]

    High-resolution image synthesis with latent diffusion models, 2021

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021

  51. [51]

    German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M. Lopez. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016

  52. [52]

    BAD SLAM: Bundle adjusted direct RGB-D SLAM

    Thomas Schöps, Torsten Sattler, and Marc Pollefeys. BAD SLAM: Bundle adjusted direct RGB-D SLAM. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019

  53. [53]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  54. [54]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras

    Zachary Teed and Jia Deng. Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras. Advances in neural information processing systems, 34:16558–16569, 2021. 12

  55. [55]

    Smd-nets: Stereo mixture density networks

    Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. Smd-nets: Stereo mixture density networks. In Conference on Computer Vision and Pattern Recognition (CVPR), 2021

  56. [56]

    Sparsity invariant cnns

    Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant cnns. In International Conference on 3D Vision (3DV), 2017

  57. [57]

    Dai, Andrea F

    Igor Vasiljevic, Nick Kolkin, Shanyi Zhang, Ruotian Luo, Haochen Wang, Falcon Z. Dai, Andrea F. Daniele, Mohammadreza Mostajabi, Steven Basart, Matthew R. Walter, and Gregory Shakhnarovich. DIODE: A Dense Indoor and Outdoor DEpth Dataset. CoRR, abs/1908.00463, 2019

  58. [58]

    Flow-motion and depth network for monocular stereo and beyond

    Kaixuan Wang and Shaojie Shen. Flow-motion and depth network for monocular stereo and beyond. CoRR, abs/1909.05452, 2019

  59. [59]

    Irs: A large naturalis- tic indoor robotics stereo dataset to train deep models for disparity and surface normal estimation

    Qiang Wang, Shizhen Zheng, Qingsong Yan, Fei Deng, Kaiyong Zhao, and Xiaowen Chu. IRS: A large synthetic indoor robotics stereo dataset for disparity and surface normal estimation.CoRR, abs/1912.09678, 2019

  60. [60]

    Diffusion models are geometry critics: Single image 3d editing using pre-trained diffusion priors

    Ruicheng Wang, Jianfeng Xiang, Jiaolong Yang, and Xin Tong. Diffusion models are geometry critics: Single image 3d editing using pre-trained diffusion priors. In European Conference on Computer Vision, pages 441–458. Springer, 2024

  61. [61]

    Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. 2024

  62. [62]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024

  63. [63]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. Tartanair: A dataset to push the limits of visual slam. 2020

  64. [64]

    Argoverse 2: Next generation datasets for self-driving perception and forecasting

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the Neural Information Processing Systems Track on ...

  65. [65]

    Synscapes: A Photorealistic Synthetic Dataset for Street Scene Parsing

    Magnus Wrenninge and Jonas Unger. Synscapes: A photorealistic synthetic dataset for street scene parsing. CoRR, abs/1810.08705, 2018

  66. [66]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024

  67. [67]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024

  68. [68]

    Blendedmvs: A large-scale dataset for generalized multi-view stereo networks

    Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020

  69. [69]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the International Conference on Computer Vision (ICCV), 2023

  70. [70]

    Enforcing geometric constraints of virtual normal for depth prediction

    Wei Yin, Yifan Liu, Chunhua Shen, and Youliang Yan. Enforcing geometric constraints of virtual normal for depth prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5684–5693, 2019

  71. [71]

    Learning to recover 3d scene shape from a single image

    Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Long Mai, Simon Chen, and Chunhua Shen. Learning to recover 3d scene shape from a single image. CoRR, abs/2012.09365, 2020

  72. [72]

    Towards accurate reconstruction of 3d scene shape from a single monocular image

    Wei Yin, Jianming Zhang, Oliver Wang, Simon Niklaus, Simon Chen, Yifan Liu, and Chunhua Shen. Towards accurate reconstruction of 3d scene shape from a single monocular image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):6480–6494, 2022

  73. [73]

    Metric3d: Towards zero-shot metric 3d prediction from a single image

    Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023. 13

  74. [74]

    Freeman, and Jiajun Wu

    Hong-Xing Yu, Haoyi Duan, Charles Herrmann, William T Freeman, and Jiajun Wu. Wonderworld: Interactive 3d scene generation from a single image. arXiv preprint arXiv:2406.09394, 2024

  75. [75]

    Benchmarking the robustness of lidar-camera fusion for 3d object detection

    Kaicheng Yu, Tang Tao, Hongwei Xie, Zhiwei Lin, Tingting Liang, Bing Wang, Peng Chen, Dayang Hao, Yongtao Wang, and Xiaodan Liang. Benchmarking the robustness of lidar-camera fusion for 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 3188–3198, 2023

  76. [76]

    A survey of autonomous driving: Common practices and emerging technologies

    Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE access, 8:58443–58469, 2020

  77. [77]

    Taskonomy: Disentangling task transfer learning

    Amir R Zamir, Alexander Sax, , William B Shen, Leonidas Guibas, Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling task transfer learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018

  78. [78]

    Adding conditional control to text-to-image diffusion models

    Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3836–3847, 2023

  79. [79]

    Discrete cosine transform network for guided depth map super-resolution

    Zixiang Zhao, Jiangshe Zhang, Shuang Xu, Zudi Lin, and Hanspeter Pfister. Discrete cosine transform network for guided depth map super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5697–5707, 2022

  80. [80]

    3D-VLA: A 3D Vision-Language-Action Generative World Model

    Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: A 3d vision-language-action generative world model. arXiv preprint arXiv:2403.09631, 2024

Showing first 80 references.