pith. machine review for the scientific record.

arxiv: 2604.22714 · v1 · submitted 2026-04-24 · 💻 cs.CV

Recognition: unknown

Long-tail Internet photo reconstruction

Hadar Averbuch-Elor, Noah Snavely, Ruojin Cai, Yuanbo Xiangli, Yuan Li

Pith reviewed 2026-05-08 12:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D reconstruction · long-tail distribution · internet photos · sparse imagery · foundation models · dense depth · photogrammetry · scene reconstruction

The pith

Simulating sparse photo subsets from dense landmarks lets 3D models handle long-tail Internet scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Internet photo collections follow a long-tailed pattern: most real-world sites appear in only sparse, noisy, and uneven images that defeat both classical and learned 3D reconstruction techniques. The paper proposes simulating these hard cases by drawing sparse image subsets from already well-reconstructed dense landmarks, and assembles the resulting data into a large dataset with clean dense depth. Finetuning 3D foundation models on this simulated long-tail data produces systems that reconstruct scenes reliably from very few inputs, handle symmetric and repetitive scenes more stably, and retain their accuracy on standard dense benchmarks. If the approach works, it would extend usable 3D modeling to the vast majority of locations that lack dense photography.
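The data recipe can be pictured in a few lines. The sketch below is a minimal illustration of the idea, not the authors' pipeline; the scene dictionary layout, the subset size, and the uniform sampling policy are all assumptions made for the example.

    # Minimal sketch (assumptions, not the authors' code): take a densely
    # reconstructed landmark and keep only a tiny image subset, while the
    # supervision stays the clean depth from the dense reconstruction.
    import random

    def simulate_long_tail_example(scene, num_images=8, seed=None):
        """scene: {"images": [image_id, ...], "depth": {image_id: depth_map}}."""
        rng = random.Random(seed)
        subset = rng.sample(list(scene["images"]), k=min(num_images, len(scene["images"])))
        return {
            "images": subset,                                 # sparse, long-tail-like input
            "depth": {i: scene["depth"][i] for i in subset},  # clean dense-depth supervision
        }

Uniform sampling is only a placeholder here; the paper's sampling strategy instead shapes the subset so its view graph mimics the clustered viewpoints and coverage gaps of real long-tail scenes (see Figure 4 and the sketch after the figure list).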

Core claim

The authors introduce MegaDepth-X, a dataset of 3D reconstructions supplied with clean dense depth, together with a sampling procedure that selects sets of training images whose camera distributions, noise levels, and coverage gaps match those found in long-tail scenes. Finetuning 3D foundation models on these components produces models that deliver robust reconstructions under extreme sparsity, improve reliability on symmetric and repetitive scenes, and preserve generalization on conventional dense 3D benchmarks.

What carries the argument

MegaDepth-X dataset paired with the sparse-subset sampling strategy that draws training images from dense Internet landmark reconstructions.

If this is right

  • Reconstruction succeeds from extremely small numbers of input photos drawn from long-tail scenes.
  • Reliability increases on scenes containing symmetry or repetitive textures that normally break correspondence.
  • Accuracy on standard dense 3D benchmark datasets remains unchanged after the finetuning step.
  • 3D foundation models become adaptable to the long-tail regime without requiring new dense ground-truth capture.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Casual tourist photos alone could suffice for 3D modeling of heritage sites or everyday locations.
  • The same simulation tactic might be tested on other sparse-data problems such as video-based reconstruction or multi-view stereo in different environments.
  • If the simulation proves faithful, it could lower the cost of acquiring training data for future sparse-scene methods.

Load-bearing premise

Sampling sparse subsets from well-reconstructed dense landmarks accurately reproduces the camera distributions, noise patterns, and coverage gaps that occur in genuine long-tail real-world scenes.

What would settle it

Run the finetuned model on a collection of actual long-tail Internet sites that have independent ground-truth 3D data and were never derived from dense reconstructions; if its accuracy does not match or exceed the untuned baseline, the simulation claim fails.
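Read literally, this is a simple dominance check. A minimal sketch, assuming a hypothetical evaluate() that returns a scalar accuracy per scene (higher is better) and model objects that are stand-ins:

    # Sketch of the decisive check: the finetuned model must match or exceed the
    # untuned baseline on every real long-tail scene with independent ground truth.
    def simulation_claim_holds(scenes, finetuned_model, baseline_model, evaluate):
        return all(
            evaluate(finetuned_model, scene) >= evaluate(baseline_model, scene)
            for scene in scenes
        )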

Figures

Figures reproduced from arXiv: 2604.22714 by Hadar Averbuch-Elor, Noah Snavely, Ruojin Cai, Yuanbo Xiangli, Yuan Li.

Figure 1. Long-tail Internet photo reconstruction. Internet photo collections follow a long-tailed distribution. In the top plot, the x-axis represents scene index (sorted by image count) and the y-axis shows images per scene (scenes are drawn from MegaScenes [36], a dataset of Internet photo collections). The light blue curve plots the total number of Internet photos per scene, while the steel blue curve shows the …
Figure 2. Unreliable reconstructions in MegaScenes. Reconstructions are unreliable when feature matches are incorrectly established on salient, non-static objects (e.g., (a) humans, (b) statues, (c) airplanes) instead of the static scene structure. This results in fragmented and geometrically inconsistent point clouds. Example (d) illustrates a doppelganger failure, where images from opposite sides of the building a…
Figure 3. Depth refinement. MVS depth maps often suffer from artifacts like noise from transient objects (top row) and depth bleeding (bottom row). As shown in the middle column, the MegaDepth refinement pipeline (modified MVS, stability filtering, and semantic filtering) fails to fully remedy these issues. Our method (right column) introduces an additional monocular depth-guided filtering step, which effectively re…
Figure 4. Sparsity-aware sampling strategy. Top: Our method follows a multi-stage process: (1) Apply the Louvain algorithm to the view graph to identify distinct viewpoint communities. (2) From each community, randomly select a terminal view and construct an approximate Steiner Tree to form a minimal, connected subgraph spanning these communities. (3) Perform a Greedy Search on this subgraph to select a sparse and d… (see the code sketch after this figure list)
Figure 5. Reconstruction results on the MegaDepth-X test set across two difficulty levels. For each level, the top row shows the full 24-image input set, and the bottom row compares reconstructions from ground truth, pretrained π³, and our finetuned model, with top-down views shown in the insets. Our model shows clearer improvements in the hard setting, where the inputs are more challenging. Note that hard was obta…
Figure 6. Reconstruction results on real long-tail Internet scenes. Each scene contains only a handful of photos with uneven viewpoints and noisy content, where COLMAP fails to register most images and produces extremely sparse geometry. Pretrained π³ makes low-confidence predictions and incomplete reconstructions, while our fine-tuned model discovers the correct large-scale layout (e.g., (1) Novo-Znamenka Manor, 6…
Figure 7. Coverage and sparsity vs. search depth. Metrics in (a) and (b) evaluate coverage with respect to the full view-graph, while (c) and (d) measure the sparsity of the sampled subset. As the search depth increases, the sampled set reaches a larger portion of the view-graph, as shown by the rise in k-hop (graph-distance) coverage in (a). The average distance from each camera to its nearest sampled view decrease…
Figure 9. Comparison of ablated models on doppelganger scenes. We show predictions from the pre-trained model and ablated models on two doppelganger scenes. Disambiguation behavior holds across fine-tuned variants with sparsity-aware sampling, while the pre-trained model and the model finetuned with densely sampled views are less robust to doppelgangers. Results indicate that pretrained models and dense-only fine…
Figure 10. Quantitative results on long-tail scenes. Our model performs better on scenes with strong ambiguities (first row) and on scenes with minimal overlap across different scene components (second row). For a more densely photographed scene that still exhibits large viewpoint variation (third row), our model not only reduces pose error but also reconstructs a more complete point cloud. translation errors across…
Figure 11. Limitations. This example contains images from two disjoint parts of the scene: indoor photos with warm lighting (producing a yellowish point cloud) and outdoor photos (producing a white point cloud). Pretrained π³ struggles to handle such mixed inputs and produces inconsistent geometry. Our finetuned model is more robust in this setting, but both models still fuse the indoor and outdoor structures into …
Figure 12. Comparison of COLMAP and our reconstruction pipeline. We replace COLMAP with MASt3R-SfM [11] combined with the Doppelgangers++ classifier [45] to obtain sparse reconstructions, allowing effective disambiguation of doppelganger scenes. (a) The bridge has two similar dragon statues, one at each end. COLMAP incorrectly treats them as the same statue and registers them together, whereas our method correctly se…
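The multi-stage flow summarized in the Figure 4 caption (Louvain communities, an approximate Steiner tree over per-community terminal views, then a greedy search) can be sketched with off-the-shelf graph tooling. The sketch below assumes a networkx view graph whose nodes are images and whose edge weights encode matching cost; it is an illustrative reading of the caption, not the authors' implementation.

    import random
    import networkx as nx
    from networkx.algorithms.community import louvain_communities
    from networkx.algorithms.approximation import steiner_tree

    def sample_backbone(view_graph: nx.Graph, seed: int = 0):
        rng = random.Random(seed)
        # (1) Viewpoint communities via the Louvain algorithm [4].
        communities = louvain_communities(view_graph, seed=seed)
        # (2) One randomly chosen terminal view per community, connected by an
        #     approximate Steiner tree [18, 22] so the subset spans all communities.
        terminals = [rng.choice(sorted(c)) for c in communities]
        backbone = steiner_tree(view_graph, terminals, weight="weight")
        # (3) The paper then grows this backbone with a greedy search favoring
        #     unvisited communities and large baselines (see the scoring sketch
        #     near the end of the reference list); that step is omitted here.
        return list(backbone.nodes)

Figure 7's coverage and sparsity metrics (k-hop coverage, distance from each camera to its nearest sampled view) then quantify how well such subsets span the full view graph.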
original abstract

Internet photo collections exhibit an extremely long-tailed distribution: a few famous landmarks are densely photographed and easily reconstructed in 3D, while most real-world sites are represented with sparse, noisy, uneven imagery beyond the capabilities of both classical and learned 3D methods. We believe that tackling this long-tail regime represents one of the next frontiers for 3D foundation models. Although reliable ground-truth 3D supervision from sparse scenes is challenging to acquire, we observe that it can be effectively simulated by sampling sparse subsets from well-reconstructed Internet landmarks. To this end, we introduce MegaDepth-X, a large dataset of 3D reconstructions with clean, dense depth, together with a strategy for sampling sets of training images that mimic camera distributions in long-tail scenes. Finetuning 3D foundation models with these components yields robust reconstructions under extreme sparsity, and also enables more reliable reconstruction in symmetric and repetitive scenes, while preserving generalization to standard, dense 3D benchmark datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MegaDepth-X, a large dataset of 3D reconstructions with clean dense depth derived from Internet landmarks, together with a sampling strategy that selects sparse image subsets to simulate the camera distributions, noise, and coverage gaps typical of long-tail Internet photo collections. Finetuning 3D foundation models on this data is claimed to produce robust reconstructions under extreme sparsity and in symmetric/repetitive scenes while preserving generalization on standard dense 3D benchmarks.

Significance. If the simulation strategy accurately reproduces the statistical properties of real long-tail scenes, the work would meaningfully advance 3D foundation models toward practical use on the vast majority of Internet sites that lack dense photography. The explicit goal of maintaining performance on conventional benchmarks while improving robustness on challenging cases is a constructive contribution.

major comments (2)
  1. [Abstract] The central effectiveness claims (robustness under extreme sparsity, improved handling of symmetry and repetition) are stated without quantitative metrics, ablation results, or baseline comparisons, leaving the magnitude and reliability of the reported gains unassessable from the provided summary.
  2. [Section 3, MegaDepth-X construction and sampling strategy] The load-bearing assumption is that sparse subsets drawn exclusively from landmarks that already reconstruct successfully reproduce the camera-pose clustering, depth noise profiles, and coverage gaps of genuine long-tail scenes that defeat classical SfM. This assumption is not directly validated; without side-by-side statistics or failure-case comparisons against real unsuccessful Internet collections, transfer of the observed improvements remains uncertain.
minor comments (1)
  1. [Abstract] Consider including one or two key quantitative highlights (e.g., percentage improvement on sparsity metrics) to give readers an immediate sense of effect size.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed review of our manuscript on long-tail Internet photo reconstruction. We address each major comment below and have revised the paper to strengthen the presentation of our results and the validation of our approach.

point-by-point responses
  1. Referee: [Abstract] The central effectiveness claims (robustness under extreme sparsity, improved handling of symmetry and repetition) are stated without quantitative metrics, ablation results, or baseline comparisons, leaving the magnitude and reliability of the reported gains unassessable from the provided summary.

    Authors: We agree that the abstract would benefit from quantitative support to make the effectiveness claims more assessable. In the revised manuscript, we have updated the abstract to include specific metrics from our experiments, such as reconstruction accuracy gains under extreme sparsity, success rates on symmetric and repetitive scenes, and comparisons to baseline methods, while preserving the abstract's conciseness. revision: yes

  2. Referee: [Section 3, MegaDepth-X construction and sampling strategy] The load-bearing assumption is that sparse subsets drawn exclusively from landmarks that already reconstruct successfully reproduce the camera-pose clustering, depth noise profiles, and coverage gaps of genuine long-tail scenes that defeat classical SfM. This assumption is not directly validated; without side-by-side statistics or failure-case comparisons against real unsuccessful Internet collections, transfer of the observed improvements remains uncertain.

    Authors: We acknowledge the value of further validating the sampling strategy. We have expanded Section 3 with side-by-side statistical comparisons of camera-pose clustering, depth noise profiles, and coverage gaps between our sampled subsets and real sparse Internet photo collections that present challenges for classical SfM. We argue that this supports transferability. However, direct failure-case comparisons to scenes that completely failed SfM remain inherently limited, as such scenes lack reliable ground-truth 3D data by definition. revision: partial

standing simulated objections (unresolved)
  • Direct failure-case comparisons against real unsuccessful Internet collections that failed classical SfM, as these lack ground-truth 3D reconstructions by nature.
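The side-by-side statistics the rebuttal points to (camera-pose clustering, coverage gaps) are the kind of view-graph summaries that can be compared directly between simulated subsets and real sparse collections. A minimal sketch, assuming a networkx view graph per collection; these are illustrative metrics, not the paper's.

    import networkx as nx

    def view_graph_summary(view_graph: nx.Graph) -> dict:
        n = view_graph.number_of_nodes()
        degrees = [d for _, d in view_graph.degree()]
        components = list(nx.connected_components(view_graph))
        return {
            "num_images": n,
            "mean_degree": sum(degrees) / max(n, 1),    # average view overlap
            "num_components": len(components),          # fragmentation of the collection
            "largest_component_frac": max((len(c) for c in components), default=0) / max(n, 1),
        }

Matching such summaries between the sampled MegaDepth-X subsets and real long-tail collections would directly probe the load-bearing premise flagged above.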

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper's chain consists of an empirical methodology: constructing MegaDepth-X via sparse subset sampling from pre-existing dense landmark reconstructions (a heuristic justified by the observation that ground-truth supervision is hard to acquire directly), followed by finetuning of foundation models and evaluation on independent standard dense 3D benchmarks plus the simulated sparse splits. No equations, fitted parameters, or predictions are presented that reduce to the inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The performance claims rest on experimental results rather than tautological re-derivation, keeping the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on one key domain assumption about simulation fidelity and introduces one new dataset entity without external validation.

axioms (1)
  • domain assumption Sparse subsets sampled from dense Internet landmark reconstructions accurately represent the camera distributions and noise characteristics of long-tail real-world scenes.
    This premise is invoked to justify using simulated data for training on extreme sparsity.
invented entities (1)
  • MegaDepth-X dataset (no independent evidence)
    purpose: Supply clean dense depth maps paired with controllable sparse image subsets for long-tail training
    Newly constructed resource whose independent validation is not described in the abstract.

pith-pipeline@v0.9.0 · 5473 in / 1251 out tokens · 47880 ms · 2026-05-08T12:19:15.941913+00:00 · methodology


Reference graph

Works this paper leans on

53 extracted references · 7 canonical work pages · 1 internal anchor

  1. [1] Sameer Agarwal, Yasutaka Furukawa, Noah Snavely, Ian Simon, Brian Curless, Steven M. Seitz, and Richard Szeliski. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
  2. [2] Dejan Azinović, Ricardo Martin-Brualla, Dan B. Goldman, Matthias Nießner, and Justus Thies. Neural RGB-D surface reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6290–6301, 2022.
  3. [3] Hana Bezalel, Dotan Ankri, Ruojin Cai, and Hadar Averbuch-Elor. Extreme rotation estimation in the wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1061–1070, 2025.
  4. [4] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.
  5. [5] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In Proceedings of the 12th European Conference on Computer Vision, Part VI, pages 611–625, Berlin, Heidelberg, 2012. Springer-Verlag.
  6. [6] Ruojin Cai, Bharath Hariharan, Noah Snavely, and Hadar Averbuch-Elor. Extreme rotation estimation using dense correlation volumes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14566–14575, 2021.
  7. [7] Ruojin Cai, Joseph Tung, Qianqian Wang, Hadar Averbuch-Elor, Bharath Hariharan, and Noah Snavely. Doppelgangers: Learning to disambiguate images of similar structures. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 34–44, 2023.
  8. [8] Filiberto Chiabrando, Loren Clark, John Driscoll, Scott McAvoy, Dominique Rissolo, Alessandra Spreafico, and Beatrice Tanduo. Salvation Mountain - photogrammetry - terrestrial, photogrammetry - aerial, lidar - terrestrial, lidar - mobile, survey data, 2023. Distributed by Open Heritage 3D.
  9. [9] CyArk. Great Mosque - Kilwa Kisiwani - lidar - terrestrial, photogrammetry - terrestrial, photogrammetry - aerial, 2020. Distributed by Open Heritage 3D.
  10. [10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.
  11. [11] Bardienus Pieter Duisterhof, Lojze Zust, Philippe Weinzaepfel, Vincent Leroy, Yohann Cabon, and Jerome Revaud. MASt3R-SfM: A fully-integrated solution for unconstrained structure-from-motion. In International Conference on 3D Vision, 2025.
  12. [12] Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, and Marc Pollefeys. Building Rome on a cloudless day. In ECCV, 2010.
  13. [13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
  14. [14] Rasmus Jensen, Anders Dahl, George Vogiatzis, Engin Tola, and Henrik Aanæs. Large scale multi-view stereopsis evaluation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 406–413. IEEE, 2014.
  15. [15] Hanwen Jiang, Arjun Karpur, Bingyi Cao, and Qixing Huang. OmniGlue: Generalizable feature matching with foundation model guidance. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19865–19875, 2024.
  16. [16] Haian Jin, Rundi Wu, Tianyuan Zhang, Ruiqi Gao, Jonathan T. Barron, Noah Snavely, and Aleksander Holynski. ZipMap: Linear-time stateful 3D reconstruction via test-time training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026.
  17. [17] Arjun Karpur, Guilherme Perrotta, Ricardo Martin-Brualla, Howard Zhou, and Andre F. de Araújo. LFM-3D: Learnable feature matching across wide baselines using 3D signals. In 2024 International Conference on 3D Vision (3DV), pages 11–20.
  18. [18] Lawrence Kou, George Markowsky, and Leonard Berman. A fast algorithm for Steiner trees. Acta Informatica, 15(2):141–145, 1981.
  19. [19] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R, 2024.
  20. [20] Zhengqi Li and Noah Snavely. MegaDepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2041–2050, 2018.
  21. [21] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local feature matching at light speed. arXiv preprint arXiv:2306.13643, 2023.
  22. [22] Kurt Mehlhorn. A faster approximation algorithm for the Steiner problem in graphs. Information Processing Letters, 27(3):125–128, 1988.
  23. [23] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  24. [24] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. arXiv, 2019.
  25. [25] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotný. Common Objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 10881–10891, 2021.
  26. [26] Ashley Richter, Michael Hess, Vid Petrovic, Falko Kuester, Cultural Heritage Engineering Initiative (CHEI), and Center of Interdisciplinary Science for Art, Architecture, and Archaeology (CISA3). Torre dei Baldovinetti - Florence - lidar - terrestrial, photogrammetry - terrestrial, 2023. Distributed by Open Heritage 3D.
  27. [27] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperGlue: Learning feature matching with graph neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4938–4947, 2020.
  28. [28] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  29. [29] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, 2016.
  30. [30] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, 2016.
  31. [31] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
  32. [32] Thomas Schöps, Johannes L. Schönberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2538–2547, 2017.
  33. [33] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
  34. [34] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3D. In ACM SIGGRAPH 2006 Papers, pages 835–846, 2006.
  35. [35] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Skeletal graphs for efficient structure from motion. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8. IEEE, 2008.
  36. [36] Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. MegaScenes: Scene-level view synthesis at scale. arXiv preprint arXiv:2406.11819, 2024.
  37. [37] Michał Tyszkiewicz, Pascal Fua, and Eduard Trulls. DISK: Learning local features with policy gradient. Advances in Neural Information Processing Systems, 33:14254–14265, 2020.
  38. [38] Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. arXiv preprint arXiv:2408.16061, 2024.
  39. [39] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.
  40. [40] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In CVPR, 2025.
  41. [41] Ruicheng Wang, Sicheng Xu, Cassie Dai, Jianfeng Xiang, Yu Deng, Xin Tong, and Jiaolong Yang. MoGe: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5261–5271, 2025.
  42. [42] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. arXiv preprint arXiv:2312.14132, 2023.
  43. [43] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4909–4916. IEEE, 2020.
  44. [44] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Scalable permutation-equivariant visual geometry learning, 2025.
  45. [45] Yuanbo Xiangli, Ruojin Cai, Hanyu Chen, Jeffrey Byrne, and Noah Snavely. Doppelgangers++: Improved visual disambiguation with geometric 3D features, 2025.
  46. [46] Tao Xie, Peishan Yang, Yudong Jin, Yingfeng Cai, Wei Yin, Weiqiang Ren, Qian Zhang, Wei Hua, Sida Peng, Xiaoyang Guo, and Xiaowei Zhou. Scal3R: Scalable test-time training for large-scale 3D reconstruction, 2026.
  47. [47] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21924–21935, 2025.
  48. [48] Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. BlendedMVS: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1790–1799, 2020.
  49. [49] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
  50. [50] Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. FLARE: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 21936–21947, 2025.
  51. [51] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo Magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.
  52. [52] Paper text fragment (greedy-search criterion, not a citation). Community novelty: prioritizing candidates whose camera community has not yet been visited by the sampled set. This encourages the trajectory to enter unexplored regions of the view graph and reduces redundancy in viewpoint selection.
  53. [53] Paper text fragment (greedy-search criterion, not a citation). Spatial distance: among candidates with equal novelty, preferring those that are farther from the current camera position. This promotes larger baselines and helps diversify the spatial coverage of the sampled views.
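Entries 52 and 53 describe the two criteria the greedy search uses when growing the sampled subset: community novelty first, spatial distance as the tie-break. A small sketch of that scoring, with data structures assumed for illustration rather than taken from the paper's implementation:

    import math

    def candidate_score(cand, sampled, community_of, position_of, current):
        """Higher is better: unvisited community first, then larger baseline."""
        visited = {community_of[v] for v in sampled}
        novelty = 0 if community_of[cand] in visited else 1              # community novelty
        baseline = math.dist(position_of[cand], position_of[current])    # spatial distance
        return (novelty, baseline)  # tuples compare lexicographically

    # usage sketch:
    # next_view = max(candidates, key=lambda c: candidate_score(c, sampled, community_of, position_of, current))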