arxiv: 2509.13414 · v3 · submitted 2025-09-16 · cs.CV · cs.AI · cs.LG · cs.RO

MapAnything: Universal Feed-Forward Metric 3D Reconstruction

Arno Knapitsch, Christian Richardt, Deva Ramanan, Duncan Zauss, Ethan Weber, Johannes Schönberger, Jonathon Luiten, Lorenzo Porzi, Manuel Lopez-Antequera, Nelson Antunes, Nikhil Keetha, Norman Müller, Peter Kontschieder, Samuel Rota Bulò, Sebastian Scherer, Tobias Fischer, Yuchen Zhang

Pith reviewed 2026-05-12 11:00 UTC · model grok-4.3

keywords: feed-forward 3D reconstruction, metric scene geometry, multi-view stereo, structure from motion, depth estimation, transformer model, unified 3D vision, camera localization

The pith

A single feed-forward model reconstructs metric 3D scenes from images and optional geometric inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MapAnything as a transformer-based model that accepts one or more images plus optional inputs like camera intrinsics, poses, or partial depths and directly outputs metric 3D geometry and cameras. It factors the scene into depth maps, local ray maps, camera poses, and one overall metric scale factor that aligns everything into a single consistent frame. Standardizing training and supervision across many datasets lets this one model handle tasks from uncalibrated structure-from-motion to monocular depth estimation without task-specific heads or pipelines. If the approach holds, it would mean 3D vision systems no longer require separate specialist networks for each reconstruction problem.

Core claim

MapAnything is a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. It leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a one-

What carries the argument

A factored representation consisting of depth maps, local ray maps, camera poses, and a metric scale factor that upgrades local reconstructions into a globally consistent metric frame.
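A minimal sketch of how such a factored output could be composed into world-frame metric points for one view. The function name, the camera-to-world pose convention, and the reading of depth as distance along each ray are assumptions made for illustration; the abstract does not specify them, so this is not the paper's implementation.

```python
import numpy as np

def compose_metric_points(depth, rays, R, t, scale):
    """Lift one view's factored outputs to world-frame metric points.

    Assumed conventions (hypothetical, not taken from the paper):
      depth: (H, W) distance along each ray, up to a global scale
      rays:  (H, W, 3) unit ray directions in the camera frame
      R, t:  (3, 3) rotation and (3,) translation, camera-to-world, up to scale
      scale: one scalar shared by every view
    """
    local = depth[..., None] * rays   # (H, W, 3) points in the camera frame
    world = local @ R.T + t           # rigid transform into the shared frame
    return scale * world              # a single global factor makes it metric

# Toy example: a 2x2 view looking down +z, shifted one unit along x.
depth = np.full((2, 2), 2.0)
rays = np.zeros((2, 2, 3))
rays[..., 2] = 1.0
points = compose_metric_points(depth, rays, np.eye(3), np.array([1.0, 0.0, 0.0]), scale=0.5)
```

Because the scale multiplies every view's points and translations alike, it changes the size of the reconstruction but cannot repair disagreement between views.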

If this is right

The model performs uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, and depth completion in the same forward pass.
It outperforms or matches existing specialist feed-forward models on these tasks while requiring only one set of weights.
Joint training across multiple datasets becomes more efficient because the architecture and loss functions are shared rather than duplicated.
Optional geometric inputs can be used to refine or complete partial reconstructions without changing the model.
The metric scale factor is regressed directly, eliminating separate scale-recovery post-processing steps.
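The page does not say which scale-recovery post-processing the last item has in mind. One common form, used when evaluating scale-ambiguous depth predictions, is median alignment against a metric reference. The sketch below shows that generic technique under the assumption that a reference depth map is available; the function name is hypothetical.

```python
import numpy as np

def median_scale(pred_depth, ref_depth, valid=None):
    """Post-hoc scale factor: ratio of median reference depth to median predicted depth."""
    if valid is None:
        # Treat non-positive depths as missing measurements.
        valid = (ref_depth > 0) & (pred_depth > 0)
    return float(np.median(ref_depth[valid]) / np.median(pred_depth[valid]))

pred = np.array([[1.0, 2.0], [3.0, 4.0]])   # up-to-scale prediction
ref = 2.5 * pred                             # metric reference
aligned = median_scale(pred, ref) * pred     # prediction rescaled to metric units
```

A model that regresses the scale directly would skip this step, which also means it cannot lean on a reference measurement at test time unless one is supplied as an input.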

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The factored representation could extend naturally to video sequences if temporal consistency terms are added to the training objective.
Robotics systems that need both mapping and localization might replace multiple perception modules with one call to this model.
The same backbone might support related tasks such as novel-view synthesis once a renderer is attached to the output depths and poses.

Load-bearing premise

Standardizing supervision and training across diverse datasets with flexible input augmentation allows one model to solve many different 3D reconstruction tasks at once.

What would settle it

A controlled test on a held-out task or dataset combination where MapAnything is compared head-to-head with a specialist feed-forward model trained only for that task and fails to match or exceed its accuracy on metric consistency or reconstruction quality.

Original abstract

We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MapAnything proposes a single transformer with a factored geometry output (depths, rays, poses, one scale) to unify many 3D tasks, but the abstract supplies no numbers to check whether the scale actually delivers consistent metric results.

The main point is that this work tries to build one feed-forward model for a wide set of 3D problems by regressing a factored representation (per-view depth maps, local ray maps, camera poses, and a single metric scale factor), then claiming the scale turns the local reconstructions into globally consistent metric geometry. The combination of that factorization with a transformer that accepts both calibrated and uncalibrated inputs looks new compared with the specialist models referenced in the abstract. Standardizing supervision across datasets and using flexible augmentation to cover structure-from-motion, multi-view stereo, monocular depth, and localization in one pass is a practical direction if the numbers hold up. Joint training efficiency is also presented as an advantage over training separate specialists. The approach shows clear thinking about how to make a universal backbone without task-specific heads.

The soft spot is the reliance on one post-hoc scale factor. If the transformer regresses depths and rays mostly view-wise, even with cross-attention, any local geometric drift will stay after scaling unless an explicit consistency term (ray intersection loss or similar) is present; the abstract does not mention one. The performance claims of matching or beating specialists also rest on “extensive experiments” that are not quantified here (no tables, no error bars, no ablations), so the central empirical claim cannot be checked yet.

This paper is for people building feed-forward 3D systems in robotics or AR who want a single model instead of many. A reader working on universal backbones would find the factored representation and training strategy worth examining. I would send it to peer review because the concrete architecture and unification goal are substantial enough to merit referee scrutiny on the consistency mechanism and the actual results.

Referee Report

2 major / 2 minor

Summary. The paper introduces MapAnything, a unified transformer-based feed-forward model that ingests one or more images (optionally with intrinsics, poses, depth, or partial reconstructions) and directly regresses metric 3D scene geometry and cameras. It employs a factored representation consisting of per-view depth maps, local ray maps, camera poses, and a single metric scale factor to convert local reconstructions into a globally consistent metric frame. The model is trained jointly across diverse datasets with standardized supervision and flexible augmentations, enabling it to address tasks including uncalibrated SfM, calibrated MVS, monocular depth estimation, camera localization, and depth completion in a single pass. The authors claim it outperforms or matches specialist feed-forward models while providing more efficient joint training.

Significance. If the performance and consistency claims hold, this work could provide a practical universal backbone for metric 3D reconstruction, reducing reliance on task-specific models and enabling more efficient multi-task training and inference in computer vision pipelines. The factored representation and joint-training approach, if validated, would represent a notable engineering advance for feed-forward 3D models.

major comments (2)

§3.2 (Factored Representation): The central claim that a single scalar metric scale factor upgrades independently regressed per-view depth maps and local ray maps into globally consistent metric geometry is load-bearing, yet the manuscript provides no explicit multi-view consistency term (e.g., cross-view ray-intersection loss or differentiable bundle-adjustment surrogate). Without such a term, local inconsistencies in the transformer outputs may persist after global scaling, as noted in the stress-test concern.
§5 (Experiments): The abstract asserts 'extensive experimental analyses and model ablations' demonstrating outperformance, but the provided manuscript excerpt contains no quantitative tables, error bars, or per-task metrics comparing against specialist baselines. This absence prevents verification of the 'outperforms or matches' claim and the efficiency of joint training.
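The consistency concern in the first major comment can be made concrete with a simple geometric check. The sketch below measures the gap between two world-frame rays that are supposed to observe the same scene point; a gap of zero means they intersect. The report does not define a ray-intersection loss, so the function name and this particular formulation (shortest distance between two infinite lines) are assumptions for illustration.

```python
import numpy as np

def ray_gap(o1, d1, o2, d2, eps=1e-9):
    """Shortest distance between two 3D lines o + s*d, with s unrestricted."""
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    normal = np.cross(d1, d2)              # direction perpendicular to both lines
    normal_len = np.linalg.norm(normal)
    offset = o2 - o1
    if normal_len < eps:
        # Parallel lines: fall back to the point-to-line distance.
        return float(np.linalg.norm(np.cross(offset, d1)) / np.linalg.norm(d1))
    return float(abs(offset @ normal) / normal_len)

# Two cameras whose rays meet at (0, 0, 1): the gap is zero.
consistent = ray_gap([0, 0, 0], [0, 0, 1], [1, 0, 0], [-1, 0, 1])
# Shifting the second camera sideways by one unit opens a gap of one unit.
drifted = ray_gap([0, 0, 0], [0, 0, 1], [1, 1, 0], [-1, 0, 1])
```

Multiplying both ray origins by the same global scale multiplies the gap by that scale, so a nonzero gap before scaling stays nonzero afterwards. That is the substance of the referee's objection.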

minor comments (2)

The abstract would be strengthened by including one or two key quantitative results (e.g., relative improvement on a standard benchmark) to support the performance claims.
Notation for the 'local ray maps' component could be clarified with an explicit equation or diagram showing how they differ from standard depth or normal maps.
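As an illustration of the distinction the second minor comment asks for, the sketch below builds a ray map for an ideal pinhole camera: one unit direction per pixel in the camera frame, which depends only on the intrinsics and not on the scene, whereas a depth map stores one scene-dependent scalar per pixel. The pinhole model, the pixel-centre offset of 0.5, and the function name are assumptions; the paper's ray maps may follow a more general camera model.

```python
import numpy as np

def pinhole_ray_map(K, height, width):
    """Unit ray direction per pixel centre for a pinhole camera with intrinsics K."""
    # Pixel-centre coordinates; the +0.5 offset is an assumed convention.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)   # (H, W, 3) homogeneous pixels
    dirs = pixels @ np.linalg.inv(K).T                     # back-project: K^-1 [u, v, 1]^T
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

K = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
rays = pinhole_ray_map(K, height=2, width=2)   # shape (2, 2, 3), every ray has unit length
```

Multiplying such a ray map by a per-pixel distance gives camera-frame points, which is how a depth map and a ray map together stand in for a point map.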

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.

Referee: §3.2 (Factored Representation): The central claim that a single scalar metric scale factor upgrades independently regressed per-view depth maps and local ray maps into globally consistent metric geometry is load-bearing, yet the manuscript provides no explicit multi-view consistency term (e.g., cross-view ray-intersection loss or differentiable bundle-adjustment surrogate). Without such a term, local inconsistencies in the transformer outputs may persist after global scaling, as noted in the stress-test concern.

Authors: We appreciate this observation on the factored representation. The current design relies on the transformer jointly processing all input views to regress depth maps, ray maps, and poses that are already locally consistent; the single predicted metric scale then aligns them globally. This consistency emerges from multi-task supervision across diverse datasets containing multi-view ground truth. We acknowledge that an explicit consistency regularizer could provide additional robustness. In the revised manuscript we will expand §3.2 with a discussion of implicit versus explicit consistency and add an ablation that quantifies multi-view geometric consistency (ray-intersection error) before and after scale application. Revision: partial.
Referee: §5 (Experiments): The abstract asserts 'extensive experimental analyses and model ablations' demonstrating outperformance, but the provided manuscript excerpt contains no quantitative tables, error bars, or per-task metrics comparing against specialist baselines. This absence prevents verification of the 'outperforms or matches' claim and the efficiency of joint training.

Authors: We regret that the excerpt supplied to the referee omitted the full experimental section. The complete manuscript contains §5 with multiple quantitative tables reporting per-task metrics (SfM, MVS, monocular depth, localization, depth completion), direct comparisons to specialist feed-forward baselines, ablations on joint-training efficiency, and error bars derived from multiple random seeds where appropriate. We will ensure all tables are clearly cross-referenced in the text and that any future review excerpts include the complete experimental results. Revision: none.

Circularity Check

0 steps flagged

No circularity: empirical feed-forward model with learned outputs

The paper describes a transformer-based neural network that regresses depth maps, ray maps, poses, and a scale factor from image inputs, trained end-to-end on standardized datasets. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces by construction to fitted inputs or self-citations. The factored representation is an architectural design choice whose consistency is enforced via data-driven supervision rather than definitional equivalence. Experimental results and ablations are presented as empirical validation, not tautological outputs. This matches the default expectation for non-circular empirical ML papers.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 1 invented entity

The central claim rests on the effectiveness of the factored geometry representation and the assumption that standardized multi-dataset training produces a generalist model; these are introduced by the paper rather than derived from prior literature.

free parameters (1)

metric scale factor
Learned or regressed quantity that converts local reconstructions into a globally metric frame; its value is not fixed by external constants.

invented entities (1)

factored representation (depth maps, local ray maps, poses, metric scale) · no independent evidence
purpose: To allow a single network to produce globally consistent metric 3D from mixed inputs
New representational choice introduced to upgrade local outputs to metric consistency; no independent evidence outside the model itself is provided in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DimensionForcing · linking_requires_D3 · unclear
MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame.
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear
Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking
cs.CV 2026-05 unverdicted novelty 8.0

TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images
cs.CV 2026-05 unverdicted novelty 7.0

Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation
cs.CV 2026-05 unverdicted novelty 7.0

Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
cs.CV 2026-04 conditional novelty 7.0

AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses
cs.CV 2026-04 unverdicted novelty 7.0

A three-stage optimization pipeline for multi-camera extrinsic self-calibration that refines camera poses, reconstructs human and stick trajectories, and resolves global scale using the known stick length constraint.
GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens
cs.CV 2026-04 unverdicted novelty 7.0

GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye
cs.RO 2026-04 unverdicted novelty 7.0

CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
cs.LG 2026-04 unverdicted novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks
cs.CV 2026-04 unverdicted novelty 7.0

EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation
cs.CV 2026-04 unverdicted novelty 7.0

A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
AnchorSplat: Feed-Forward 3D Gaussian Splatting with 3D Geometric Priors
cs.CV 2026-04 unverdicted novelty 7.0

AnchorSplat uses anchor-aligned 3D Gaussians guided by geometric priors for feed-forward scene reconstruction, achieving SOTA novel view synthesis on ScanNet++ with fewer primitives and better view consistency.
Learning 3D Reconstruction with Priors in Test Time
cs.CV 2026-04 unverdicted novelty 7.0

Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.
GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth
cs.CV 2026-05 unverdicted novelty 6.0

GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.
3D-ReGen: A Unified 3D Geometry Regeneration Framework
cs.CV 2026-04 unverdicted novelty 6.0

3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
LA-Pose: Latent Action Pretraining Meets Pose Estimation
cs.CV 2026-04 unverdicted novelty 6.0

LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 6.0

SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
Vista4D: Video Reshooting with 4D Point Clouds
cs.CV 2026-04 unverdicted novelty 6.0

Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
Geometric Context Transformer for Streaming 3D Reconstruction
cs.CV 2026-04 unverdicted novelty 6.0

LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
cs.CV 2026-04 unverdicted novelty 6.0

A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
Self-Improving 4D Perception via Self-Distillation
cs.CV 2026-04 unverdicted novelty 6.0

SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
cs.CV 2026-04 unverdicted novelty 6.0

ZeD-MAP uses incremental bundle adjustment on image clusters to guide zero-shot diffusion depth estimation, delivering sub-meter accuracy (0.87 m XY, 0.12 m Z) at 1.5-5 seconds per image on high-resolution aerial data.
ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging
cs.CV 2026-04 conditional novelty 6.0

ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
cs.CV 2026-04 unverdicted novelty 6.0

DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas
cs.CV 2026-03 unverdicted novelty 6.0

Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K
cs.CV 2026-03 unverdicted novelty 6.0

TerraSky3D is a new high-resolution multi-view dataset with 50,000 images in 150 scenes of European landmarks, supplied with poses and depth maps to support 3D reconstruction research.
Depth Anything 3: Recovering the Visual Space from Any Views
cs.CV 2025-11 unverdicted novelty 6.0

DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
cs.CV 2026-05 unverdicted novelty 5.0

WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
Syn4D: A Multiview Synthetic 4D Dataset
cs.CV 2026-05 unverdicted novelty 5.0

Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
SS3D: End2End Self-Supervised 3D from Web Videos
cs.CV 2026-04 unverdicted novelty 5.0

SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...
MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM
cs.RO 2026-04 unverdicted novelty 5.0

MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds
cs.CV 2026-04 unverdicted novelty 4.0

HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...
DINO_4D: Semantic-Aware 4D Reconstruction
cs.CV 2026-04 unverdicted novelty 4.0

DINO_4D uses frozen DINOv3 features to inject semantic awareness into 4D dynamic scene reconstruction, improving tracking accuracy and completeness on benchmarks while preserving O(T) complexity.
VGGT-SLAM++
cs.CV 2026-04 unverdicted novelty 4.0

VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.
OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV 2026-04 unverdicted novelty 4.0

OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 32 Pith papers · 2 internal anchors

[1]

RayFronts: Open- set semantic ray frontiers for online scene understanding and exploration

Omar Alama, Avigyan Bhattacharya, Haoyang He, Se- ungchan Kim, Yuheng Qiu, Wenshan Wang, Cherie Ho, Nikhil Keetha, and Sebastian Scherer. RayFronts: Open- set semantic ray frontiers for online scene understanding and exploration. InIROS, 2025. 2

work page 2025
[2]

SceneScript: Reconstructing scenes with an autoregressive structured language model

Armen Avetisyan, Christopher Xie, Henry Howard-Jenkins, Tsun-Yi Yang, Samir Aroudj, Suvam Patra, Fuyang Zhang, Duncan Frost, Luke Holland, Campbell Orme, Jakob Engel, Edward Miller, Richard Newcombe, and Vasileios Balntas. SceneScript: Reconstructing scenes with an autoregressive structured language model. InECCV, 2024. 5

work page 2024
[3]

MultiMAE: Multi-modal multi-task masked autoen- coders

Roman Bachmann, David Mizrahi, Andrei Atanov, and Amir Zamir. MultiMAE: Multi-modal multi-task masked autoen- coders. InECCV, 2022. 3

work page 2022
[4]

Jonathan T. Barron. A general and adaptive robust loss func- tion. InCVPR, 2019. 5

work page 2019
[5]

Richter, and Vladlen Koltun

Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Mar- cel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. InICLR, 2025. 8

work page 2025
[6]

MUSt3R: Multi-view network for stereo 3D reconstruction

Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3D reconstruction. InCVPR, 2025. 2, 7, 8

work page 2025
[7]

Map-relative pose regression for visual re-localization

Shuai Chen, Tommaso Cavallari, Victor Adrian Prisacariu, and Eric Brachmann. Map-relative pose regression for visual re-localization. InCVPR, 2024. 4

work page 2024
[8]

Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization

Siyan Dong, Shuzhe Wang, Shaohui Liu, Lulu Cai, Qingnan Fan, Juho Kannala, and Yanchao Yang. Reloc3r: Large-scale training of relative camera pose regression for generalizable, fast, and accurate visual localization. InCVPR, 2025. 1, 2

work page 2025
[9]

Duisterhof, Jan Oberst, Bowen Wen, Stan Birch- field, Deva Ramanan, and Jeffrey Ichnowski

Bardienus P. Duisterhof, Jan Oberst, Bowen Wen, Stan Birch- field, Deva Ramanan, and Jeffrey Ichnowski. RaySt3R: Pre- dicting novel depth maps for zero-shot object completion. In NeurIPS, 2025. 3

work page 2025
[10]

MASt3R-SfM: a fully-integrated solution for unconstrained structure-from-motion

Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinza- epfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: a fully-integrated solution for unconstrained structure-from-motion. In3DV, 2025. 2

work page 2025
[11]

Light3R-SfM: Towards feed-forward structure-from- motion

Sven Elflein, Qunjie Zhou, Sérgio Agostinho, and Laura Leal- Taixé. Light3R-SfM: Towards feed-forward structure-from- motion. InCVPR, 2025. 2

work page 2025
[12]

Grossberg and Shree K

Michael D. Grossberg and Shree K. Nayar. A general imaging model and a method for finding its parameters. InICCV, 2001. 4

work page 2001
[13]

Rotation averaging.Int

Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hongdong Li. Rotation averaging.Int. J. Comput. Vis., 103(3):267–305,

work page
[14]

RA- DIOv2.5: Improved baselines for agglomerative vision foun- dation models

Greg Heinrich, Mike Ranzinger, Hongxu, Yao Lu, Jan Kautz, Andrew Tao, Bryan Catanzaro, and Pavlo Molchanov. RA- DIOv2.5: Improved baselines for agglomerative vision foun- dation models. InCVPR, 2025. 4

work page 2025
[15]

Map it anywhere: Empower- ing bev map prediction using large-scale public datasets

Cherie Ho, Jiaye Zou, Omar Alama, Sai M Kumar, Benjamin Chiang, Taneesh Gupta, Chen Wang, Nikhil Keetha, Katia Sycara, and Sebastian Scherer. Map it anywhere: Empower- ing bev map prediction using large-scale public datasets. In NeurIPS, 2024. 2

work page 2024
[16]

Geometric context from a single image

Derek Hoiem, Alexei A Efros, and Martial Hebert. Geometric context from a single image. InICCV, 2005. 1

work page 2005
[17]

Obtaining shape from shading information

Berthold KP Horn. Obtaining shape from shading information. InShape from shading, pages 123–171. MIT Press, 1989. 1

work page 1989
[18]

Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface nor- mal estimation.IEEE Trans

Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3D v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface nor- mal estimation.IEEE Trans. Pattern Anal. Mach. Intell., 46 (12):10579–10596, 2024. 8

work page 2024
[19]

Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Zsolt Kira, Fei Xia, and Yonatan Bisk

Yafei Hu, Quanting Xie, Vidhi Jain, Jonathan Francis, Jay Patrikar, Nikhil Keetha, Seungchan Kim, Yaqi Xie, Tianyi Zhang, Hao-Shu Fang, Shibo Zhao, Shayegan Omidshafiei, Dong-Ki Kim, Ali akbar Agha-mohammadi, Katia Sycara, Matthew Johnson-Roberson, Dhruv Batra, Xiaolong Wang, Sebastian Scherer, Chen Wang, Zsolt Kira, Fei Xia, and Yonatan Bisk. Toward gene...

work page arXiv
[20]

SAIL-VOS 3D: A synthetic dataset and baselines for object detection and 3D mesh reconstruction from video data

Yuan-Ting Hu, Jiahong Wang, Raymond A. Yeh, and Alexander G. Schwing. SAIL-VOS 3D: A synthetic dataset and baselines for object detection and 3D mesh reconstruction from video data. In CVPR, 2021. 5

work page 2021
[21]

DeepMVS: Learning multi-view stereopsis

Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In CVPR, 2018. 5

work page 2018
[22]

MVSAnywhere: Zero-shot multi-view stereo

Sergio Izquierdo, Mohamed Sayed, Michael Firman, Guillermo Garcia-Hernando, Daniyar Turmukhambetov, Javier Civera, Oisin Mac Aodha, Gabriel Brostow, and Jamie Watson. MVSAnywhere: Zero-shot multi-view stereo. In CVPR, 2025. 1, 7, 8

work page 2025
[23]

Pow3R: Empowering unconstrained 3D reconstruction with camera and scene priors

Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3R: Empowering unconstrained 3D reconstruction with camera and scene priors. In CVPR, 2025. 3, 6, 7

work page 2025
[24]

LVSM: A large view synthesis model with minimal 3D inductive bias

Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. LVSM: A large view synthesis model with minimal 3D inductive bias. In ICLR, 2025. 3

work page 2025
[25]

DynamicStereo: Consistent dynamic depth from stereo videos

Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In CVPR, 2023. 5

work page 2023
[26]

Any4D: Unified feed-forward metric 4D reconstruction

Jay Karhade, Nikhil Keetha, Yuchen Zhang, Tanisha Gupta, Akash Sharma, Sebastian Scherer, and Deva Ramanan. Any4D: Unified feed-forward metric 4D reconstruction. arXiv preprint arXiv:2512.10935, 2025. 8

work page arXiv 2025
[27]

SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM

Nikhil Keetha, Jay Karhade, Krishna Murthy Jatavallabhula, Gengshan Yang, Sebastian Scherer, Deva Ramanan, and Jonathon Luiten. SplaTAM: Splat, track & map 3D Gaussians for dense RGB-D SLAM. In CVPR, 2024. 2

work page 2024
[28]

Adam: A method for stochastic optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015. 9

work page 2015
[29]

Grounding image matching in 3D with MASt3R

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In ECCV, 2024. 1, 2, 7, 8

work page 2024
[30]

MegaDepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In CVPR, 2018. 5

work page 2018
[31]

Depth Anything 3: Recovering the Visual Space from Any Views

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025. 4

work page arXiv 2025
[32]

DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, Xuanmao Li, Xingpeng Sun, Rohan Ashok, Aniruddha Mukherjee, Hao Kang, Xiangrui Kong, Gang Hua, Tianyi Zhang, Bedrich Benes, and Aniket Bera. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In CVPR,

work page
[33]

David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis., 60(2):91–110, 2004. 1

work page 2004
[34]

Align3R: Aligned monocular depth estimation for dynamic videos

Jiahao Lu, Tianyu Huang, Peng Li, Zhiyang Dou, Cheng Lin, Zhiming Cui, Zhen Dong, Sai-Kit Yeung, Wenping Wang, and Yuan Liu. Align3R: Aligned monocular depth estimation for dynamic videos. In CVPR, 2025. 3

work page 2025
[35]

Matrix3D: Large photogrammetry model all-in-one

Yuanxun Lu, Jingyang Zhang, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao, and Shiwei Li. Matrix3D: Large photogrammetry model all-in-one. In CVPR, 2025. 3

work page 2025
[36]

Mapillary planet-scale depth dataset

Manuel López Antequera, Pau Gargallo, Markus Hofinger, Samuel Rota Bulò, Yubin Kuang, and Peter Kontschieder. Mapillary planet-scale depth dataset. In ECCV, 2020. 5

work page 2020
[37]

Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In CVPR, 2023. 5

work page 2023
[38]

T2I-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models

Chong Mou, Xintao Wang, Liangbin Xie, Yanze Wu, Jian Zhang, Zhongang Qi, Ying Shan, and Xiaohu Qie. T2I-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In AAAI, 2024. 3, 4

work page 2024
[39]

Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In CVPR, 2025. 2

work page 2025
[40]

An efficient solution to the five-point relative pose problem

David Nistér. An efficient solution to the five-point relative pose problem. IEEE Trans. Pattern Anal. Mach. Intell., 26(6):756–777, 2004. 1

work page 2004
[41]

DINOv2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. 4

work page 2024
[42]

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patric...

work page 2024
[43]

Global structure-from-motion revisited

Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes Lutz Schönberger. Global structure-from-motion revisited. In ECCV, 2024. 1

work page 2024
[44]

MP-SfM: Monocular surface priors for robust structure-from-motion

Zador Pataki, Paul-Edouard Sarlin, Johannes L. Schönberger, and Marc Pollefeys. MP-SfM: Monocular surface priors for robust structure-from-motion. In CVPR, 2025. 2

work page 2025
[45]

UniDepthV2: Universal monocular metric depth estimation made simpler

Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. UniDepthV2: Universal monocular metric depth estimation made simpler. IEEE Trans. Pattern Anal. Mach. Intell., 2026. 8

work page 2026
[46]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021. 4, 10

work page 2021
[47]

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer

René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell., 44(3):1623–1637, 2022. 5

work page 2022
[48]

AM-RADIO: Agglomerative vision foundation model – reduce all domains into one

Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. AM-RADIO: Agglomerative vision foundation model – reduce all domains into one. In CVPR, 2024. 4

work page 2024
[49]

SuperGlue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In CVPR, 2020. 1

work page 2020
[50]

Fast image-based localization using direct 2D-to-3D matching

Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Fast image-based localization using direct 2D-to-3D matching. In ICCV,

work page
[51]

A benchmark and a baseline for robust multi-view depth estimation

Philipp Schröppel, Jan Bechtold, Artemij Amiranashvili, and Thomas Brox. A benchmark and a baseline for robust multi-view depth estimation. In 3DV, 2022. 8

work page 2022
[52]

Structure-from-motion revisited

Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016. 1

work page 2016
[53]

Pixelwise view selection for unstructured multi-view stereo

Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In ECCV, 2016. 1

work page 2016
[54]

A multi-view stereo benchmark with high-resolution images and multi-camera videos

Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In CVPR, 2017. 5, 6

work page 2017
[55]

RoFormer: Enhanced transformer with rotary position embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568(C),

work page
[56]

MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds

Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. In CVPR, 2025. 2

work page 2025
[57]

DeepV2D: Video to depth with differentiable structure from motion

Zachary Teed and Jia Deng. DeepV2D: Video to depth with differentiable structure from motion. In ICLR, 2020. 2, 8

work page 2020
[58]

AnyCalib: On-manifold learning for model-agnostic single-view camera calibration

Javier Tirado-Garín and Javier Civera. AnyCalib: On-manifold learning for model-agnostic single-view camera calibration. In ICCV, 2025. 7

work page 2025
[59]

SMD-nets: Stereo mixture density networks

Fabio Tosi, Yiyi Liao, Carolin Schmitt, and Andreas Geiger. SMD-nets: Stereo mixture density networks. In CVPR, pages 8942–8952, 2021. 5

work page 2021
[60]

Bundle adjustment – a modern synthesis

Bill Triggs, Philip F. McLauchlan, Richard I. Hartley, and Andrew Fitzgibbon. Bundle adjustment – a modern synthesis. In ICCV, pages 298–372, 2000. 1

work page 2000
[61]

DeMoN: Depth and motion network for learning monocular stereo

Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. DeMoN: Depth and motion network for learning monocular stereo. In CVPR, 2017. 2, 8

work page 2017
[62]

Generative camera dolly: Extreme monocular dynamic novel view synthesis

Basile Van Hoorick, Rundi Wu, Ege Ozguroglu, Kyle Sargent, Ruoshi Liu, Pavel Tokmakov, Achal Dave, Changxi Zheng, and Carl Vondrick. Generative camera dolly: Extreme monocular dynamic novel view synthesis. In ECCV, 2024. 5

work page 2024
[63]

Neural ray surfaces for self-supervised learning of depth and ego-motion

Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Wolfram Burgard, Greg Shakhnarovich, and Adrien Gaidon. Neural ray surfaces for self-supervised learning of depth and ego-motion. In 3DV, 2020. 4

work page 2020
[64]

GeoCalib: Single-image calibration with geometric optimization

Alexander Veicht, Paul-Edouard Sarlin, Philipp Lindenberger, and Marc Pollefeys. GeoCalib: Single-image calibration with geometric optimization. In ECCV, 2024. 1

work page 2024
[65]

3D reconstruction with spatial memory

Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. In 3DV, 2025. 2

work page 2025
[66]

Visual geometry grounded deep structure from motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Visual geometry grounded deep structure from motion. In CVPR, 2024. 2

work page 2024
[67]

VGGT: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, 2025. 2, 4, 6, 7, 8, 9, 12

work page 2025
[68]

PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction

Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, and Kai Zhang. PF-LRM: Pose-free large reconstruction model for joint pose and shape prediction. In ICLR, 2024. 1, 2

work page 2024
[69]

Continuous 3D perception model with persistent state

Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In CVPR, 2025. 2

work page 2025
[70]

MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision

Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In CVPR, 2025. 5, 7, 8

work page 2025
[71]

MoGe-2: Accurate monocular geometry with metric scale and sharp details

Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. In NeurIPS, 2025. 7, 8

work page 2025
[72]

DUSt3R: Geometric 3D vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024. 1, 2, 4, 5, 7

work page 2024
[73]

TartanAir: A dataset to push the limits of visual SLAM

Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In IROS, 2020. 5, 6

work page 2020
[74]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. $\pi^3$: Scalable permutation-equivariant visual geometry learning. arXiv:2507.13347, 2025. 2, 3, 8

work page arXiv 2025
[75]

Fillerbuster: Multi-view scene completion for casual captures

Ethan Weber, Norman Müller, Yash Kant, Vasu Agrawal, Michael Zollhöfer, Angjoo Kanazawa, and Christian Richardt. Fillerbuster: Multi-view scene completion for casual captures. In 3DV, 2026. 3

work page 2026
[76]

CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow

Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Brégier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and Jerome Revaud. CroCo v2: Improved cross-view completion pre-training for stereo matching and optical flow. In ICCV, 2023. 4

work page 2023
[77]

Robert J. Woodham. Photometric method for determining surface orientation from multiple images. Optical Engineering, 19(1):139–144, 1980. 1

work page 1980
[78]

Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass

Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In CVPR, 2025. 2, 12

work page 2025
[79]

Depth anything V2

Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything V2. In NeurIPS, 2024. 5, 8

work page 2024
[80]

BlendedMVS: A large-scale dataset for generalized multi-view stereo networks

Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020. 5

work page 2020

Showing first 80 references.