pith. machine review for the scientific record.

arXiv: 2509.13414 · v3 · submitted 2025-09-16 · 💻 cs.CV · cs.AI · cs.LG · cs.RO

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Arno Knapitsch, Christian Richardt, Deva Ramanan, Duncan Zauss, Ethan Weber, Johannes Schönberger, Jonathon Luiten, Lorenzo Porzi, Manuel Lopez-Antequera, Nelson Antunes, Nikhil Keetha, Norman Müller, Peter Kontschieder, Samuel Rota Bulò, Sebastian Scherer, Tobias Fischer, Yuchen Zhang

Pith reviewed 2026-05-12 11:00 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.RO
keywords feed-forward 3D reconstruction · metric scene geometry · multi-view stereo · structure from motion · depth estimation · transformer model · unified 3D vision · camera localization

The pith

A single feed-forward model reconstructs metric 3D scenes from images and optional geometric inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MapAnything as a transformer-based model that accepts one or more images plus optional inputs like camera intrinsics, poses, or partial depths and directly outputs metric 3D geometry and cameras. It factors the scene into depth maps, local ray maps, camera poses, and one overall metric scale factor that aligns everything into a single consistent frame. Standardizing training and supervision across many datasets lets this one model handle tasks from uncalibrated structure-from-motion to monocular depth estimation without task-specific heads or pipelines. If the approach holds, it would mean 3D vision systems no longer require separate specialist networks for each reconstruction problem.

Core claim

MapAnything is a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. It leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a one-

What carries the argument

A factored representation consisting of depth maps, local ray maps, camera poses, and a metric scale factor that upgrades local reconstructions into a globally consistent metric frame.
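
The composition described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes each pixel carries a unit ray direction and a depth measured along that ray, that the pose is a unit quaternion plus a translation in the same up-to-scale units, and that one scalar multiplies the whole scene. The function names `quat_to_rotation` and `to_metric_world` are hypothetical.

```python
import math

def quat_to_rotation(q):
    """Unit quaternion (w, x, y, z) -> 3x3 rotation matrix as nested lists."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def to_metric_world(ray, depth, quat, translation, metric_scale):
    """Lift one pixel to a metric world-frame point (assumed convention).

    local  = depth * ray                 (point in the camera frame, up to scale)
    world  = R @ local + t               (point in the shared frame, up to scale)
    metric = metric_scale * world        (single global scalar for the scene)
    """
    local = [depth * r for r in ray]
    rot = quat_to_rotation(quat)
    world = [
        sum(rot[i][j] * local[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    return [metric_scale * c for c in world]

# Camera rotated 90 degrees about the z axis, one pixel looking along +x.
h = math.sqrt(0.5)
point = to_metric_world((1.0, 0.0, 0.0), 3.0, (h, 0.0, 0.0, h), (0.0, 0.0, 1.0), 2.0)
```

Because the scale is applied last and uniformly, it changes the size of the scene but not the relative arrangement of points and cameras.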

If this is right

  • The same model performs uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, and depth completion, each in a single feed-forward pass.
  • It outperforms or matches existing specialist feed-forward models on these tasks while requiring only one set of weights.
  • Joint training across multiple datasets becomes more efficient because the architecture and loss functions are shared rather than duplicated.
  • Optional geometric inputs can be used to refine or complete partial reconstructions without changing the model.
  • The metric scale factor is regressed directly, eliminating separate scale-recovery post-processing steps.
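
For contrast with the last point, a common post-hoc scale-recovery step for up-to-scale predictions is median alignment against reference depths. The sketch below shows that step only as an illustration of what a directly regressed scale would make unnecessary; it is not taken from the paper, and `median_scale` is a hypothetical helper.

```python
from statistics import median

def median_scale(pred_depths, ref_depths):
    """Estimate one scalar s such that s * pred is close to ref.

    Uses the median of per-pixel ratios, which tolerates a minority of
    outliers. Pixels with non-positive depth in either input are skipped.
    """
    ratios = [r / p for p, r in zip(pred_depths, ref_depths) if p > 0 and r > 0]
    if not ratios:
        raise ValueError("no valid depth pairs to align")
    return median(ratios)

# Up-to-scale prediction and metric reference for three pixels.
pred = [1.0, 2.0, 4.0]
ref = [2.5, 5.0, 10.0]
s = median_scale(pred, ref)
aligned = [s * d for d in pred]
```

This alignment needs reference depths at inference time, which is the dependency a model that predicts its own metric scale avoids.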

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The factored representation could extend naturally to video sequences if temporal consistency terms are added to the training objective.
  • Robotics systems that need both mapping and localization might replace multiple perception modules with one call to this model.
  • The same backbone might support related tasks such as novel-view synthesis once a renderer is attached to the output depths and poses.

Load-bearing premise

Standardizing supervision and training across diverse datasets with flexible input augmentation allows one model to solve many different 3D reconstruction tasks at once.

What would settle it

A controlled head-to-head test on a held-out task or dataset combination in which MapAnything fails to match a specialist feed-forward model trained only for that task, on either metric consistency or reconstruction quality.

Original abstract

We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MapAnything, a unified transformer-based feed-forward model that ingests one or more images (optionally with intrinsics, poses, depth, or partial reconstructions) and directly regresses metric 3D scene geometry and cameras. It employs a factored representation consisting of per-view depth maps, local ray maps, camera poses, and a single metric scale factor to convert local reconstructions into a globally consistent metric frame. The model is trained jointly across diverse datasets with standardized supervision and flexible augmentations, enabling it to address tasks including uncalibrated SfM, calibrated MVS, monocular depth estimation, camera localization, and depth completion in a single pass. The authors claim it outperforms or matches specialist feed-forward models while providing more efficient joint training.

Significance. If the performance and consistency claims hold, this work could provide a practical universal backbone for metric 3D reconstruction, reducing reliance on task-specific models and enabling more efficient multi-task training and inference in computer vision pipelines. The factored representation and joint-training approach, if validated, would represent a notable engineering advance for feed-forward 3D models.

major comments (2)
  1. §3.2 (Factored Representation): The central claim that a single scalar metric scale factor upgrades independently regressed per-view depth maps and local ray maps into globally consistent metric geometry is load-bearing, yet the manuscript provides no explicit multi-view consistency term (e.g., cross-view ray-intersection loss or differentiable bundle-adjustment surrogate). Without such a term, local inconsistencies in the transformer outputs may persist after global scaling, as noted in the stress-test concern.
  2. §5 (Experiments): The abstract asserts 'extensive experimental analyses and model ablations' demonstrating outperformance, but the provided manuscript excerpt contains no quantitative tables, error bars, or per-task metrics comparing against specialist baselines. This absence prevents verification of the 'outperforms or matches' claim and the efficiency of joint training.
minor comments (2)
  1. The abstract would be strengthened by including one or two key quantitative results (e.g., relative improvement on a standard benchmark) to support the performance claims.
  2. Notation for the 'local ray maps' component could be clarified with an explicit equation or diagram showing how they differ from standard depth or normal maps.
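
On minor comment 2, one plausible reading of a local ray map can be made concrete under a pinhole assumption: each pixel stores the unit direction of its viewing ray in the camera frame, so the map depends only on the camera, whereas a depth map depends on the scene. The sketch below is an illustration under that assumption, not the paper's definition; the half-pixel offset and the names `pinhole_ray_map` and `ray_depth_to_z_depth` are assumptions.

```python
import math

def pinhole_ray_map(width, height, fx, fy, cx, cy):
    """Per-pixel unit ray directions in the camera frame (rows of (x, y, z)).

    Assumes a pinhole camera and that pixel (u, v) is sampled at its center
    (u + 0.5, v + 0.5). The ray is K^-1 [u + 0.5, v + 0.5, 1]^T, normalized.
    """
    rays = []
    for v in range(height):
        row = []
        for u in range(width):
            x = (u + 0.5 - cx) / fx
            y = (v + 0.5 - cy) / fy
            norm = math.sqrt(x * x + y * y + 1.0)
            row.append((x / norm, y / norm, 1.0 / norm))
        rays.append(row)
    return rays

def ray_depth_to_z_depth(depth_along_ray, ray):
    """Convert a distance along a unit ray into depth along the optical axis."""
    return depth_along_ray * ray[2]

rays = pinhole_ray_map(width=3, height=3, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
center_ray = rays[1][1]
```

Under this reading, a ray map replaces explicit intrinsics, and multiplying it by a per-pixel depth along the ray yields camera-frame points.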

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.

Point-by-point responses
  1. Referee: §3.2 (Factored Representation): The central claim that a single scalar metric scale factor upgrades independently regressed per-view depth maps and local ray maps into globally consistent metric geometry is load-bearing, yet the manuscript provides no explicit multi-view consistency term (e.g., cross-view ray-intersection loss or differentiable bundle-adjustment surrogate). Without such a term, local inconsistencies in the transformer outputs may persist after global scaling, as noted in the stress-test concern.

    Authors: We appreciate this observation on the factored representation. The current design relies on the transformer jointly processing all input views to regress depth maps, ray maps, and poses that are already locally consistent; the single predicted metric scale then aligns them globally. This consistency emerges from multi-task supervision across diverse datasets containing multi-view ground truth. We acknowledge that an explicit consistency regularizer could provide additional robustness. In the revised manuscript we will expand §3.2 with a discussion of implicit versus explicit consistency and add an ablation that quantifies multi-view geometric consistency (ray-intersection error) before and after scale application. Revision: partial.

  2. Referee: §5 (Experiments): The abstract asserts 'extensive experimental analyses and model ablations' demonstrating outperformance, but the provided manuscript excerpt contains no quantitative tables, error bars, or per-task metrics comparing against specialist baselines. This absence prevents verification of the 'outperforms or matches' claim and the efficiency of joint training.

    Authors: We regret that the excerpt supplied to the referee omitted the full experimental section. The complete manuscript contains §5 with multiple quantitative tables reporting per-task metrics (SfM, MVS, monocular depth, localization, depth completion), direct comparisons to specialist feed-forward baselines, ablations on joint-training efficiency, and error bars derived from multiple random seeds where appropriate. We will ensure all tables are clearly cross-referenced in the text and that any future review excerpts include the complete experimental results. No further revision is required on this point. Revision: none.
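
The disagreement in point 1 can be made concrete with a small check. The text does not define the "ray-intersection error" mentioned in the rebuttal, so the sketch below substitutes a simpler stand-in: the mean distance between world-frame points that two views predict for the same scene locations. The helper names are hypothetical and correspondences are assumed to be given.

```python
import math

def cross_view_error(points_a, points_b):
    """Mean Euclidean distance between corresponding world-frame points."""
    dists = [math.dist(a, b) for a, b in zip(points_a, points_b)]
    return sum(dists) / len(dists)

def scale_points(points, s):
    """Apply one global scalar to every point."""
    return [tuple(s * c for c in p) for p in points]

# Two views disagree slightly about the same two scene points.
view_a = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0)]
view_b = [(0.0, 0.0, 1.1), (1.0, 0.2, 2.0)]

err_before = cross_view_error(view_a, view_b)
err_after = cross_view_error(scale_points(view_a, 2.0), scale_points(view_b, 2.0))
```

A global scalar multiplies every discrepancy by the same factor, so `err_after` equals `2.0 * err_before` and the relative inconsistency is unchanged. Any consistency therefore has to come from the per-view outputs themselves, which is the referee's concern and the quantity the promised ablation would need to report.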

Circularity Check

0 steps flagged

No circularity: empirical feed-forward model with learned outputs

Full rationale

The paper describes a transformer-based neural network that regresses depth maps, ray maps, poses, and a scale factor from image inputs, trained end-to-end on standardized datasets. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces by construction to fitted inputs or self-citations. The factored representation is an architectural design choice whose consistency is enforced via data-driven supervision rather than definitional equivalence. Experimental results and ablations are presented as empirical validation, not tautological outputs. This matches the default expectation for non-circular empirical ML papers.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the factored geometry representation and the assumption that standardized multi-dataset training produces a generalist model; these are introduced by the paper rather than derived from prior literature.

free parameters (1)
  • metric scale factor
    Learned or regressed quantity that converts local reconstructions into a globally metric frame; its value is not fixed by external constants.
invented entities (1)
  • factored representation (depth maps, local ray maps, poses, metric scale) · no independent evidence
    purpose: To allow a single network to produce globally consistent metric 3D from mixed inputs
    New representational choice introduced to upgrade local outputs to metric consistency; no independent evidence outside the model itself is provided in the abstract.

pith-pipeline@v0.9.0 · 5551 in / 1216 out tokens · 64273 ms · 2026-05-12T11:00:43.083291+00:00 · methodology

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

    cs.CV 2026-05 unverdicted novelty 8.0

    TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

  2. Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

    cs.CV 2026-05 unverdicted novelty 7.0

    Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.

  3. Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.

  4. AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

    cs.CV 2026-04 conditional novelty 7.0

    AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.

  5. Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses

    cs.CV 2026-04 unverdicted novelty 7.0

    A three-stage optimization pipeline for multi-camera extrinsic self-calibration that refines camera poses, reconstructs human and stick trajectories, and resolves global scale using the known stick length constraint.

  6. GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.

  7. Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye

    cs.RO 2026-04 unverdicted novelty 7.0

    CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.

  8. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  9. EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.

  10. LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

    cs.CV 2026-04 unverdicted novelty 7.0

    A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.

  11. AnchorSplat: Feed-Forward 3D Gaussian Splatting with 3D Geometric Priors

    cs.CV 2026-04 unverdicted novelty 7.0

    AnchorSplat uses anchor-aligned 3D Gaussians guided by geometric priors for feed-forward scene reconstruction, achieving SOTA novel view synthesis on ScanNet++ with fewer primitives and better view consistency.

  12. Learning 3D Reconstruction with Priors in Test Time

    cs.CV 2026-04 unverdicted novelty 7.0

    Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.

  13. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.

  14. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.

  15. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.

  16. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  17. LA-Pose: Latent Action Pretraining Meets Pose Estimation

    cs.CV 2026-04 unverdicted novelty 6.0

    LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...

  18. SS3D: End2End Self-Supervised 3D from Web Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...

  19. SS3D: End2End Self-Supervised 3D from Web Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.

  20. Vista4D: Video Reshooting with 4D Point Clouds

    cs.CV 2026-04 unverdicted novelty 6.0

    Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.

  21. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  22. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  23. Self-Improving 4D Perception via Self-Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...

  24. ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

    cs.CV 2026-04 unverdicted novelty 6.0

    ZeD-MAP uses incremental bundle adjustment on image clusters to guide zero-shot diffusion depth estimation, delivering sub-meter accuracy (0.87 m XY, 0.12 m Z) at 1.5-5 seconds per image on high-resolution aerial data.

  25. ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging

    cs.CV 2026-04 conditional novelty 6.0

    ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.

  26. DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale

    cs.CV 2026-04 unverdicted novelty 6.0

    DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...

  27. Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas

    cs.CV 2026-03 unverdicted novelty 6.0

    Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.

  28. TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K

    cs.CV 2026-03 unverdicted novelty 6.0

    TerraSky3D is a new high-resolution multi-view dataset with 50,000 images in 150 scenes of European landmarks, supplied with poses and depth maps to support 3D reconstruction research.

  29. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  30. WildPose: A Unified Framework for Robust Pose Estimation in the Wild

    cs.CV 2026-05 unverdicted novelty 5.0

    WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.

  31. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  32. SS3D: End2End Self-Supervised 3D from Web Videos

    cs.CV 2026-04 unverdicted novelty 5.0

    SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...

  33. MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

    cs.RO 2026-04 unverdicted novelty 5.0

    MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

  34. HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds

    cs.CV 2026-04 unverdicted novelty 4.0

    HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...

  35. DINO_4D: Semantic-Aware 4D Reconstruction

    cs.CV 2026-04 unverdicted novelty 4.0

    DINO_4D uses frozen DINOv3 features to inject semantic awareness into 4D dynamic scene reconstruction, improving tracking accuracy and completeness on benchmarks while preserving O(T) complexity.

  36. VGGT-SLAM++

    cs.CV 2026-04 unverdicted novelty 4.0

    VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

  37. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 32 Pith papers · 2 internal anchors

  1. [1]

    RayFronts: Open- set semantic ray frontiers for online scene understanding and exploration

    Omar Alama, Avigyan Bhattacharya, Haoyang He, Se- ungchan Kim, Yuheng Qiu, Wenshan Wang, Cherie Ho, Nikhil Keetha, and Sebastian Scherer. RayFronts: Open- set semantic ray frontiers for online scene understanding and exploration. InIROS, 2025. 2

  2. [2]

    SceneScript: Reconstructing scenes with an autoregressive structured language model

    Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas. SceneScript: Reconstructing scenes with an autoregressive structured language model. InECCV, 2024. 5

  3. [3]

    MultiMAE: Multi-modal multi-task masked autoen- coders

    Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. MultiMAE: Multi-modal multi-task masked autoen- coders. InECCV, 2022. 3

  4. [4]

    Jonathan T. Barron. A general and adaptive robust loss func- tion. InCVPR, 2019. 5

  5. [5]

    Richter, and Vladlen Koltun

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Mar- cel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. InICLR, 2025. 8

  6. [6]

    MUSt3R: Multi-view network for stereo 3D reconstruction

    Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3D reconstruction. InCVPR, 2025. 2, 7, 8

  7. [7]

    Map-relative pose regression for visual re-localization

    Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. Map-relative pose regression for visual re-localization. InCVPR, 2024. 4

  8. [8]

    Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization

    Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. InCVPR, 2025. 1, 2

  9. [9]

    Duisterhof, Jan Oberst, Bowen Wen, Stan Birch- field, Deva Ramanan, and Jeffrey Ichnowski

    Bardienus P. Duisterhof, Jan Oberst, Bowen Wen, Stan Birch- field, Deva Ramanan, and Jeffrey Ichnowski. RaySt3R: Pre- dicting novel depth maps for zero-shot object completion. In NeurIPS, 2025. 3

  10. [10]

    MASt3R-SfM: a fully-integrated solution for unconstrained structure-from-motion

    Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinza- epfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: a fully-integrated solution for unconstrained structure-from-motion. In3DV, 2025. 2

  11. [11]

    Light3R-SfM: Towards feed-forward structure-from- motion

    Sven Elflein, Qunjie Zhou, Sérgio Agostinho, and Laura Leal- Taixé. Light3R-SfM: Towards feed-forward structure-from- motion. InCVPR, 2025. 2

  12. [12]

    Grossberg and Shree K

    Michael D. Grossberg and Shree K. Nayar. A general imaging model and a method for finding its parameters. InICCV, 2001. 4

  13. [13]

    Rotation averaging.Int

    Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging.Int. J. Comput. Vis., 103(3):267–305,

  14. [14]

    RA- DIOv2.5: Improved baselines for agglomerative vision foun- dation models

    Greg Heinrich, Mike Ranzinger, Hongxu, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RA- DIOv2.5: Improved baselines for agglomerative vision foun- dation models. InCVPR, 2025. 4

  15. [15]

    Map it anywhere: Empower- ing bev map prediction using large-scale public datasets

    Cherie Ho, Jiaye Zou, Omar Alama, Sai M Kumar, Benjamin Chiang, Taneesh Gupta, Chen Wang, Nikhil Keetha, Katia Sycara, and Sebastian Scherer. Map it anywhere: Empower- ing bev map prediction using large-scale public datasets. In NeurIPS, 2024. 2

  16. [16]

    Geometric context from a single image

    Derek Hoiem, Alexei A Efros, and Martial Hebert. Geometric context from a single image. InICCV, 2005. 1

  17. [17]

    Obtaining shape from shading information

    Berthold KP Horn. Obtaining shape from shading information. InShape from shading, pages 123–171. MIT Press, 1989. 1

  18. [18]

    Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface nor- mal estimation.IEEE Trans

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface nor- mal estimation.IEEE Trans. Pattern Anal. Mach. Intell., 46 (12):10579–10596, 2024. 8

  19. [19]

    Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Zsolt Kira, Fei Xia, and Yonatan Bisk

    Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, Shibo Zhao, Shayegan Omidshafiei, Dong-Ki Kim, Ali akbar Agha-mohammadi, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Chen Wang, Zsolt Kira, Fei Xia, and Yonatan Bisk. Toward gene...

  20. [20]

    Yeh, and Alexan- der G

    Yuan-Ting Hu, Jiahong Wang, Raymond A. Yeh, and Alexan- der G. Schwing. SAIL-VOS 3D: A synthetic dataset and baselines for object detection and 3D mesh reconstruction from video data. InCVPR, 2021. 5

  21. [21]

    DeepMVS: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. InCVPR, 2018. 5

  22. [22]

    MVSAnywhere: Zero-shot multi-view stereo

    Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow, and Jamie Watson. MVSAnywhere: Zero-shot multi-view stereo. In CVPR, 2025. 1, 7, 8

  23. [23]

    Pow3R: Empowering uncon- strained 3D reconstruction with camera and scene priors

    Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3R: Empowering uncon- strained 3D reconstruction with camera and scene priors. In CVPR, 2025. 3, 6, 7

  24. [24]

    LVSM: A large view synthesis model with minimal 3D inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. LVSM: A large view synthesis model with minimal 3D inductive bias. InICLR, 2025. 3

  25. [25]

    DynamicStereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In CVPR, 2023. 5

  26. [26]

    Any4D: Unified feed-forward metric 4D reconstruction

    Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, and Deva Ramanan. Any4D: Unified feed-forward metric 4D reconstruction. arXiv preprint arXiv:2512.10935, 2025. 8

  27. [27]

    SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM

    Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM. In CVPR, 2024. 2

  28. [28]

    Adam: A method for stochastic optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 9

  29. [29]

    Grounding image matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In ECCV, 2024. 1, 2, 7, 8

  30. [30]

    MegaDepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 5

  31. [31]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025. 4

  32. [32]

    DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In CVPR,

  33. [33]

    David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004. 1

  34. [34]

    Align3R: Aligned monocular depth estimation for dynamic videos

    Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, and Yuan Liu. Align3R: Aligned monocular depth estimation for dynamic videos. In CVPR, 2025. 3

  35. [35]

    Matrix3D: Large photogrammetry model all-in-one

    Yuanxun Lu, Jingyang Zhang, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao, and Shiwei Li. Matrix3D: Large photogrammetry model all-in-one. In CVPR, 2025. 3

  36. [36]

    Mapillary planet-scale depth dataset

    Manuel López Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulò, Yubin Kuang, and Peter Kontschieder. Mapillary planet-scale depth dataset. In ECCV, 2020. 5

  37. [37]

    Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In CVPR, 2023. 5

  38. [38]

    T2I-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

    Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024. 3, 4

  39. [39]

    Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In CVPR, 2025. 2

  40. [40]

    An efficient solution to the five-point relative pose problem

    David Nistér. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell., 26(06):756–777, 2004. 1

  41. [41]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. 4

  42. [42]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patric...

  43. [43]

    Global structure-from-motion revisited

    Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes Lutz Schönberger. Global structure-from-motion revisited. In ECCV, 2024. 1

  44. [44]

    MP-SfM: Monocular surface priors for robust structure-from-motion

    Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, and Marc Pollefeys. MP-SfM: Monocular surface priors for robust structure-from-motion. In CVPR, 2025. 2

  45. [45]

    UniDepthV2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler. IEEE Trans. Pattern Anal. Mach. Intell., 2026. 8

  46. [46]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021. 4, 10

  47. [47]

    Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

    René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell., 44(3):1623–1637, 2022. 5

  48. [48]

    AM-RADIO: Agglomerative vision foundation model – reduce all domains into one

    Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model – reduce all domains into one. In CVPR, 2024. 4

  49. [49]

    SuperGlue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020. 1

  50. [50]

    Fast image-based localization using direct 2D-to-3D matching

    Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Fast image-based localization using direct 2D-to-3D matching. In ICCV,

  51. [51]

    A benchmark and a baseline for robust multi-view depth estimation

    Philipp Schröppel, Jan Bechtold, Artemij Amiranashvili, and Thomas Brox. A benchmark and a baseline for robust multi-view depth estimation. In 3DV, 2022. 8

  52. [52]

    Structure-from-motion revisited

    Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. 1

  53. [53]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016. 1

  54. [54]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017. 5, 6

  55. [55]

    RoFormer: Enhanced transformer with rotary position embedding

    Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568(C),

  56. [56]

    MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds

    Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. In CVPR, 2025. 2

  57. [57]

    DeepV2D: Video to depth with differentiable structure from motion

    Zachary Teed and Jia Deng. DeepV2D: Video to depth with differentiable structure from motion. In ICLR, 2020. 2, 8

  58. [58]

    AnyCalib: On-manifold learning for model-agnostic single-view camera calibration

    Javier Tirado-Garín and Javier Civera. AnyCalib: On-manifold learning for model-agnostic single-view camera calibration. In ICCV, 2025. 7

  59. [59]

    SMD-nets: Stereo mixture density networks

    Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. SMD-nets: Stereo mixture density networks. In CVPR, pages 8942–8952, 2021. 5

  60. [60]

    Bundle adjustment – a modern synthesis

    Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew Fitzgibbon. Bundle adjustment – a modern synthesis. In ICCV, pages 298–372, 2000. 1

  61. [61]

    DeMoN: Depth and motion network for learning monocular stereo

    Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017. 2, 8

  62. [62]

    Generative camera dolly: Extreme monocular dynamic novel view synthesis

    Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In ECCV, 2024. 5

  63. [63]

    Neural ray surfaces for self-supervised learning of depth and ego-motion

    Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Wolfram Burgard, Greg Shakhnarovich, and Adrien Gaidon. Neural ray surfaces for self-supervised learning of depth and ego-motion. In 3DV, 2020. 4

  64. [64]

    GeoCalib: Single-image calibration with geometric optimization

    Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Single-image calibration with geometric optimization. In ECCV, 2024. 1

  65. [65]

    3D reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. In 3DV, 2025. 2

  66. [66]

    Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Visual geometry grounded deep structure from motion. In CVPR, 2024. 2

  67. [67]

    VGGT: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025. 2, 4, 6, 7, 8, 9, 12

  68. [68]

    PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction

    Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction. In ICLR, 2024. 1, 2

  69. [69]

    Continuous 3D perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In CVPR, 2025. 2

  70. [70]

    MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

    Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In CVPR, 2025. 5, 7, 8

  71. [71]

    MoGe-2: Accurate monocular geometry with metric scale and sharp details

    Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. In NeurIPS, 2025. 7, 8

  72. [72]

    DUSt3R: Geometric 3D vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024. 1, 2, 4, 5, 7

  73. [73]

    TartanAir: A dataset to push the limits of visual SLAM

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IROS, 2020. 5, 6

  74. [74]

    $\pi^3$: Permutation-Equivariant Visual Geometry Learning

    Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π3: Scalable permutation-equivariant visual geometry learning. arXiv:2507.13347, 2025. 2, 3, 8

  75. [75]

    Fillerbuster: Multi-view scene completion for casual captures

    Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, and Christian Richardt. Fillerbuster: Multi-view scene completion for casual captures. In 3DV, 2026. 3

  76. [76]

    CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow

    Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jerome Revaud. CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow. In ICCV, 2023. 4

  77. [77]

    Robert J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139–144, 1980. 1

  78. [78]

    Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In CVPR, 2025. 2, 12

  79. [79]

    Depth anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything V2. In NeurIPS, 2024. 5, 8

  80. [80]

    BlendedMVS: A large-scale dataset for generalized multi-view stereo networks

    Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020. 5

Showing first 80 references.