Recognition: 2 Lean theorem links
MapAnything: Universal Feed-Forward Metric 3D Reconstruction
Pith reviewed 2026-05-12 11:00 UTC · model grok-4.3
The pith
A single feed-forward model reconstructs metric 3D scenes from images and optional geometric inputs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MapAnything is a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. It leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass.
What carries the argument
A factored representation consisting of depth maps, local ray maps, camera poses, and a metric scale factor that upgrades local reconstructions into a globally consistent metric frame.
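To make this concrete, here is a minimal sketch, assuming per-view unit ray maps in the camera frame, camera-to-world poses, and one shared scalar scale (the paper's exact conventions may differ), of how the four factors compose into metric world-space points:

```python
import numpy as np

def compose_metric_points(depths, ray_maps, poses, scale):
    """Sketch of the factored representation, not the authors' code.

    depths:   list of (H, W) per-view depth maps (up-to-scale units)
    ray_maps: list of (H, W, 3) unit ray directions in each camera frame
    poses:    list of (4, 4) camera-to-world transforms (up-to-scale translation)
    scale:    single regressed metric scale factor shared by all views
    """
    points_world = []
    for d, r, T in zip(depths, ray_maps, poses):
        local = d[..., None] * r            # back-project: (H, W, 3) camera-frame points
        R, t = T[:3, :3], T[:3, 3]
        world = local @ R.T + t             # rigid transform into the common frame
        points_world.append(scale * world)  # one scalar lifts everything to metric units
    return points_world
```

Since scale * (R p + t) = R (scale * p) + scale * t, multiplying the composed points by the single scalar is equivalent to jointly scaling all depths and pose translations, which is what lets one regressed number upgrade the whole up-to-scale reconstruction.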
If this is right
- The model performs uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, and depth completion in the same forward pass.
- It outperforms or matches existing specialist feed-forward models on these tasks while requiring only one set of weights.
- Joint training across multiple datasets becomes more efficient because the architecture and loss functions are shared rather than duplicated.
- Optional geometric inputs can be used to refine or complete partial reconstructions without changing the model.
- The metric scale factor is regressed directly, eliminating separate scale-recovery post-processing steps.
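The bullets above describe one set of weights serving many tasks. The sketch below illustrates that calling convention with a stand-in class (MapAnythingLike and its signature are hypothetical, not the released API), where the task is selected purely by which optional inputs accompany the images:

```python
import numpy as np

class MapAnythingLike:
    """Illustrative stand-in: returns correctly shaped placeholder outputs."""

    def __call__(self, images, intrinsics=None, poses=None, depth=None):
        n, h, w = len(images), images[0].shape[0], images[0].shape[1]
        # A real model would condition on whichever optional inputs are present;
        # here we only mimic the factored output structure.
        return {
            "depth":        np.ones((n, h, w)),
            "ray_maps":     np.zeros((n, h, w, 3)),
            "poses":        np.tile(np.eye(4), (n, 1, 1)),
            "metric_scale": 1.0,
        }

model = MapAnythingLike()
imgs = [np.zeros((480, 640, 3)) for _ in range(2)]

out_sfm  = model(imgs)                        # images only: uncalibrated SfM
out_mvs  = model(imgs, intrinsics=np.eye(3))  # add intrinsics: calibrated MVS
out_mono = model(imgs[:1])                    # single image: monocular depth
```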
Where Pith is reading between the lines
- The factored representation could extend naturally to video sequences if temporal consistency terms are added to the training objective.
- Robotics systems that need both mapping and localization might replace multiple perception modules with one call to this model.
- The same backbone might support related tasks such as novel-view synthesis once a renderer is attached to the output depths and poses.
Load-bearing premise
Standardizing supervision and training across diverse datasets with flexible input augmentation allows one model to solve many different 3D reconstruction tasks at once.
What would settle it
A controlled test on a held-out task or dataset combination where MapAnything is compared head-to-head with a specialist feed-forward model trained only for that task and fails to match or exceed its accuracy on metric consistency or reconstruction quality.
Original abstract
We introduce MapAnything, a unified transformer-based feed-forward model that ingests one or more images along with optional geometric inputs such as camera intrinsics, poses, depth, or partial reconstructions, and then directly regresses the metric 3D scene geometry and cameras. MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame. Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass, including uncalibrated structure-from-motion, calibrated multi-view stereo, monocular depth estimation, camera localization, depth completion, and more. We provide extensive experimental analyses and model ablations demonstrating that MapAnything outperforms or matches specialist feed-forward models while offering more efficient joint training behavior, thus paving the way toward a universal 3D reconstruction backbone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MapAnything, a unified transformer-based feed-forward model that ingests one or more images (optionally with intrinsics, poses, depth, or partial reconstructions) and directly regresses metric 3D scene geometry and cameras. It employs a factored representation consisting of per-view depth maps, local ray maps, camera poses, and a single metric scale factor to convert local reconstructions into a globally consistent metric frame. The model is trained jointly across diverse datasets with standardized supervision and flexible augmentations, enabling it to address tasks including uncalibrated SfM, calibrated MVS, monocular depth estimation, camera localization, and depth completion in a single pass. The authors claim it outperforms or matches specialist feed-forward models while providing more efficient joint training.
Significance. If the performance and consistency claims hold, this work could provide a practical universal backbone for metric 3D reconstruction, reducing reliance on task-specific models and enabling more efficient multi-task training and inference in computer vision pipelines. The factored representation and joint-training approach, if validated, would represent a notable engineering advance for feed-forward 3D models.
major comments (2)
- §3.2 (Factored Representation): The central claim that a single scalar metric scale factor upgrades independently regressed per-view depth maps and local ray maps into globally consistent metric geometry is load-bearing, yet the manuscript provides no explicit multi-view consistency term (e.g., cross-view ray-intersection loss or differentiable bundle-adjustment surrogate). Without such a term, local inconsistencies in the transformer outputs may persist after global scaling, as noted in the stress-test concern.
- §5 (Experiments): The abstract asserts 'extensive experimental analyses and model ablations' demonstrating outperformance, but the provided manuscript excerpt contains no quantitative tables, error bars, or per-task metrics comparing against specialist baselines. This absence prevents verification of the 'outperforms or matches' claim and the efficiency of joint training.
minor comments (2)
- The abstract would be strengthened by including one or two key quantitative results (e.g., relative improvement on a standard benchmark) to support the performance claims.
- Notation for the 'local ray maps' component could be clarified with an explicit equation or diagram showing how they differ from standard depth or normal maps.
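On the second minor point, one conventional pinhole definition, which may differ from the paper's exact generic-camera parameterization, writes the local ray map as a per-pixel unit ray direction:

```latex
% Assumed pinhole convention; generic-camera ray maps store arbitrary per-pixel rays.
\[
  \mathbf{r}(u,v) \;=\; \frac{K^{-1}\,(u,\ v,\ 1)^{\top}}
                             {\bigl\lVert K^{-1}\,(u,\ v,\ 1)^{\top} \bigr\rVert},
  \qquad
  \mathbf{X}_{\mathrm{cam}}(u,v) \;=\; d(u,v)\,\mathbf{r}(u,v).
\]
```

Unlike a depth map (one scalar per pixel) or a normal map (a surface property), the ray map encodes the viewing geometry needed to turn that scalar into a 3D point, and it remains well defined for cameras without closed-form intrinsics.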
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address each major comment below and describe the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: §3.2 (Factored Representation): The central claim that a single scalar metric scale factor upgrades independently regressed per-view depth maps and local ray maps into globally consistent metric geometry is load-bearing, yet the manuscript provides no explicit multi-view consistency term (e.g., cross-view ray-intersection loss or differentiable bundle-adjustment surrogate). Without such a term, local inconsistencies in the transformer outputs may persist after global scaling, as noted in the stress-test concern.
Authors: We appreciate this observation on the factored representation. The current design relies on the transformer jointly processing all input views to regress depth maps, ray maps, and poses that are already locally consistent; the single predicted metric scale then aligns them globally. This consistency emerges from multi-task supervision across diverse datasets containing multi-view ground truth. We acknowledge that an explicit consistency regularizer could provide additional robustness. In the revised manuscript we will expand §3.2 with a discussion of implicit versus explicit consistency and add an ablation that quantifies multi-view geometric consistency (ray-intersection error) before and after scale application. Revision: partial.
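The promised consistency ablation could resemble the following reprojection-based sketch (a simplified stand-in for a ray-intersection metric, not the authors' exact implementation), which transports view i's predicted geometry into view j and measures depth disagreement:

```python
import numpy as np

def cross_view_depth_error(depth_i, rays_i, T_i, depth_j, K_j, T_j):
    """Mean |transported depth - predicted depth| of view i's points seen in view j.

    depth_*: (H, W) depths; rays_i: (H, W, 3) unit rays in camera i's frame;
    T_*: (4, 4) camera-to-world poses; K_j: (3, 3) pinhole intrinsics of view j.
    """
    # Lift view i's pixels to world space via its depth, rays, and pose.
    pts_world = (depth_i[..., None] * rays_i) @ T_i[:3, :3].T + T_i[:3, 3]
    # Move into view j's camera frame (inverse of its camera-to-world pose).
    R_j, t_j = T_j[:3, :3], T_j[:3, 3]
    pts_cam_j = (pts_world - t_j) @ R_j          # row-wise R_j^T (pts - t_j)
    z = pts_cam_j[..., 2]
    # Project into view j's image and round to the nearest pixel.
    uv = (pts_cam_j @ K_j.T)[..., :2] / np.clip(z[..., None], 1e-6, None)
    u = np.round(uv[..., 0]).astype(int)
    v = np.round(uv[..., 1]).astype(int)
    H, W = depth_j.shape
    valid = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return float(np.abs(z[valid] - depth_j[v[valid], u[valid]]).mean())
```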
- Referee: §5 (Experiments): The abstract asserts 'extensive experimental analyses and model ablations' demonstrating outperformance, but the provided manuscript excerpt contains no quantitative tables, error bars, or per-task metrics comparing against specialist baselines. This absence prevents verification of the 'outperforms or matches' claim and the efficiency of joint training.
Authors: We regret that the excerpt supplied to the referee omitted the full experimental section. The complete manuscript contains §5 with multiple quantitative tables reporting per-task metrics (SfM, MVS, monocular depth, localization, depth completion), direct comparisons to specialist feed-forward baselines, ablations on joint-training efficiency, and error bars derived from multiple random seeds where appropriate. We will ensure all tables are clearly cross-referenced in the text and that any future review excerpts include the complete experimental results. No further revision is required on this point. Revision: no.
Circularity Check
No circularity: empirical feed-forward model with learned outputs
Full rationale
The paper describes a transformer-based neural network that regresses depth maps, ray maps, poses, and a scale factor from image inputs, trained end-to-end on standardized datasets. No mathematical derivation, uniqueness theorem, or first-principles prediction is claimed that reduces by construction to fitted inputs or self-citations. The factored representation is an architectural design choice whose consistency is enforced via data-driven supervision rather than definitional equivalence. Experimental results and ablations are presented as empirical validation, not tautological outputs. This matches the default expectation for non-circular empirical ML papers.
Axiom & Free-Parameter Ledger
free parameters (1)
- metric scale factor
invented entities (1)
- factored representation (depth maps, local ray maps, poses, metric scale): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith.Foundation.DimensionForcing.linking_requires_D3 (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "MapAnything leverages a factored representation of multi-view scene geometry, i.e., a collection of depth maps, local ray maps, camera poses, and a metric scale factor that effectively upgrades local reconstructions into a globally consistent metric frame."
- IndisputableMonolith.Foundation.HierarchyEmergence.hierarchy_emergence_forces_phi (tag: unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
Passage: "Standardizing the supervision and training across diverse datasets, along with flexible input augmentation, enables MapAnything to address a broad range of 3D vision tasks in a single feed-forward pass"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 37 Pith papers
- TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking. TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.
- Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images. Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.
- Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation. Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.
- AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision. AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
- Multi-Camera Self-Calibration in Sports Motion Capture: Leveraging Human and Stick Poses. A three-stage optimization pipeline for multi-camera extrinsic self-calibration that refines camera poses, reconstructs human and stick trajectories, and resolves global scale using the known stick length constraint.
- GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens. GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.
- Keep It CALM: Toward Calibration-Free Kilometer-Level SLAM with Visual Geometry Foundation Models via an Assistant Eye. CAL2M achieves calibration-free kilometer-level SLAM by using an assistant eye for scale, epipolar-guided intrinsic correction, and anchor propagation for nonlinear sub-map alignment.
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation. The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
- EgoTL: Egocentric Think-Aloud Chains for Long-Horizon Tasks. EgoTL provides a new egocentric dataset with think-aloud chains and metric labels that benchmarks VLMs on long-horizon tasks and improves their planning, reasoning, and spatial grounding after finetuning.
- LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation. A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.
- AnchorSplat: Feed-Forward 3D Gaussian Splatting with 3D Geometric Priors. AnchorSplat uses anchor-aligned 3D Gaussians guided by geometric priors for feed-forward scene reconstruction, achieving SOTA novel view synthesis on ScanNet++ with fewer primitives and better view consistency.
- Learning 3D Reconstruction with Priors in Test Time. Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth. GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth. GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.
- GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth. GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.
- 3D-ReGen: A Unified 3D Geometry Regeneration Framework. 3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.
- LA-Pose: Latent Action Pretraining Meets Pose Estimation. LA-Pose achieves over 10% higher pose accuracy than recent feed-forward methods on Waymo and PandaSet benchmarks by repurposing latent actions from self-supervised inverse-dynamics pretraining while using orders of ma...
- SS3D: End2End Self-Supervised 3D from Web Videos. SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...
- SS3D: End2End Self-Supervised 3D from Web Videos. SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.
- Vista4D: Video Reshooting with 4D Point Clouds. Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
- Geometric Context Transformer for Streaming 3D Reconstruction. LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...
- Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories. A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
- Self-Improving 4D Perception via Self-Distillation. SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
- ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging. ZeD-MAP uses incremental bundle adjustment on image clusters to guide zero-shot diffusion depth estimation, delivering sub-meter accuracy (0.87 m XY, 0.12 m Z) at 1.5-5 seconds per image on high-resolution aerial data.
- ZeD-MAP: Bundle Adjustment Guided Zero-Shot Depth Maps for Real-Time Aerial Imaging. ZeD-MAP integrates incremental cluster-based bundle adjustment with zero-shot diffusion depth estimation to deliver metrically consistent real-time depth maps from high-resolution UAV imagery.
- DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale. DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
- Stepper: Stepwise Immersive Scene Generation with Multiview Panoramas. Stepper uses stepwise panoramic expansion with a multi-view 360-degree diffusion model and geometry reconstruction to produce high-fidelity, structurally consistent immersive 3D scenes from text.
- TerraSky3D: Multi-View Reconstructions of European Landmarks in 4K. TerraSky3D is a new high-resolution multi-view dataset with 50,000 images in 150 scenes of European landmarks, supplied with poses and depth maps to support 3D reconstruction research.
- Depth Anything 3: Recovering the Visual Space from Any Views. DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.
- WildPose: A Unified Framework for Robust Pose Estimation in the Wild. WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
- Syn4D: A Multiview Synthetic 4D Dataset. Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.
- SS3D: End2End Self-Supervised 3D from Web Videos. SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...
- MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM. MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.
- HY-World 2.0: A Multi-Modal World Model for Reconstructing, Generating, and Simulating 3D Worlds. HY-World 2.0 generates and reconstructs high-fidelity navigable 3D Gaussian Splatting worlds from text, images, or videos via upgraded panorama, planning, expansion, and composition modules, with released code claimin...
- DINO_4D: Semantic-Aware 4D Reconstruction. DINO_4D uses frozen DINOv3 features to inject semantic awareness into 4D dynamic scene reconstruction, improving tracking accuracy and completeness on benchmarks while preserving O(T) complexity.
- VGGT-SLAM++. VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models. OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.