pith. sign in

arxiv: 2605.18754 · v1 · pith:2JHM6GIVnew · submitted 2026-05-18 · 💻 cs.CV

Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate

Pith reviewed 2026-05-20 10:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords multiview 3D consistencynovel view synthesisCOLMAPMEt3R3D foundation modelshallucinationgeometric verificationhuman evaluation
0
0 comments X

The pith

COLMAP-based metrics achieve up to 4 times higher correlation with human judgments of multiview 3D consistency than MEt3R.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard assumptions in multiview 3D evaluation break down when images come from novel view synthesis or sparse reconstruction, because artifacts, repeats, or noise can still produce high consistency scores. It contrasts neural metrics that rely on learned reconstruction backbones, which can hallucinate geometry across unrelated inputs, with classical geometric verification. The authors create a robustness benchmark and a parametric decomposition of neural metrics that recovers MEt3R while producing up to 3 times more robust variants. They then introduce COLMAP-based signals that treat registration failure and lack of dense support as explicit inconsistency indicators. On real NVS outputs and human ratings, these signals align far more closely with whether people perceive the images as coming from one scene.

Core claim

We introduce a controlled robustness benchmark for multiview 3D consistency and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to 3 times more robust. Foundation models such as VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to 4 times higher correlation with human judgments.

What carries the argument

COLMAP-based metrics that treat matches, registration success, dense support, and reconstruction failure as explicit failure-aware consistency signals.

If this is right

  • Neural 3D foundation models can produce dense cross-view support even when inputs do not depict a single static scene.
  • Varying the backbone, residual, or aggregation stage in neural consistency metrics can increase their robustness by up to a factor of three.
  • COLMAP-based signals detect inconsistencies through geometric failure modes that learned priors often miss.
  • Evaluation protocols for novel view synthesis and sparse reconstruction should incorporate explicit reconstruction-failure cues to avoid over-optimistic scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Replacing or supplementing neural consistency checks with COLMAP-based ones could reduce the rate at which inconsistent generated views are accepted in downstream 3D pipelines.
  • The parametric decomposition offers a systematic way to diagnose which part of a neural metric is most vulnerable to hallucinations.
  • The benchmark could serve as a testbed for future hybrid metrics that combine classical geometry with learned components.

Load-bearing premise

The selected NVS outputs and the human raters in the study are representative of the artifacts and inconsistencies that occur across broader 3D model outputs.

What would settle it

A new collection of multiview images containing hallucinations or repeated views where MEt3R shows higher correlation with fresh human ratings than any of the COLMAP-based metrics.

Figures

Figures reproduced from arXiv: 2605.18754 by Alan Yuille, Prakhar Kaushik, Soumava Paul.

Figure 1
Figure 1. Figure 1: Can these views be one scene? Unrelated scenes, repeated views, and Gaussian noise should be scored as 3D-inconsistent because they do not define a single static scene from multiple views. However, learned reconstruction backbones such as VGGT, MASt3R, DUSt3R, and Fast3R can still produce dense geometry on these inputs. Metrics built on these backbones, such as MEt3R, can therefore report spuriously high 3… view at source ↗
Figure 2
Figure 2. Figure 2: Metric trends across SysCON3D scene types. Each panel plots metric score (lower=better) vs. view count K across SysCON3D scene types; a robust metric should separate these bands in order of increasing 3D inconsistency. Panel 1: MEt3R conflates L0 with cross-scene mixtures and scores Gaussian noise as more consistent than L0. Panel 2: PRISM-MMD – recovers the expected SysCON3D ordering best (app [PITH_FULL… view at source ↗
Figure 3
Figure 3. Figure 3: Learned 3D backbones hallucinate geometric support. VGGT, MASt3R, and DUSt3R produce dense point clouds on pure noise and cross-scene mixtures, although the inputs do not admit a coherent 3D scene. Interestingly, we notice that these models seem to produce distinct, hallucinatory patterns (e.g. a umbrella for VGGT). More in Section H.4. This is a core finding of our analysis. Learned 3D backbones, includin… view at source ↗
Figure 4
Figure 4. Figure 4: Human-evaluation interface. Participants see the K input views (top) alongside two anonymized 360◦ orbit videos (A/B) played in a synchronized loop with shared speed controls. Method names are never revealed. Each comparison requires forced A/B votes on three axes – 3D consistency, visual realism, and plausibility (no per-axis ties). Method ratings are aggregated with a weighted Elo protocol, giving 3D con… view at source ↗
Figure 5
Figure 5. Figure 5: Hallucination diagnostics across the four backbones. Left: cross-scene overlap hallucina￾tion rises sharply on L2, L3, and Gaussian noise. Right: confidence decreases on harder cross-scene mixtures, but it is not a reliable rejection signal on SysCON3D-N, most notably for DUSt3R. H.2 What each diagnostic reveals Reconstruction geometry [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian [PITH_FULL_IMAGE:figures/full_fig_p029_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random [PITH_FULL_IMAGE:figures/full_fig_p029_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random [PITH_FULL_IMAGE:figures/full_fig_p029_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Single [PITH_FULL_IMAGE:figures/full_fig_p029_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Single [PITH_FULL_IMAGE:figures/full_fig_p030_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched [PITH_FULL_IMAGE:figures/full_fig_p030_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched [PITH_FULL_IMAGE:figures/full_fig_p030_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian [PITH_FULL_IMAGE:figures/full_fig_p030_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian [PITH_FULL_IMAGE:figures/full_fig_p031_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random [PITH_FULL_IMAGE:figures/full_fig_p031_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random [PITH_FULL_IMAGE:figures/full_fig_p031_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Single [PITH_FULL_IMAGE:figures/full_fig_p031_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched [PITH_FULL_IMAGE:figures/full_fig_p032_19.png] view at source ↗
Figure 20
Figure 20. Figure 20: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched [PITH_FULL_IMAGE:figures/full_fig_p032_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian [PITH_FULL_IMAGE:figures/full_fig_p032_21.png] view at source ↗
Figure 22
Figure 22. Figure 22: Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random [PITH_FULL_IMAGE:figures/full_fig_p032_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Hallucinated correspondences between unrelated NVS frames. Pixel correspondences returned by MASt3R (top of each pair) and VGGT (bottom) on two pairs of views generated by MVSplat360. The two views in each pair come from consecutive orbit cameras, and their generated content shows no perceptible geometric or semantic overlap, yet both backbones return many low￾residual matches across unrelated patches. Un… view at source ↗
Figure 24
Figure 24. Figure 24: Examples of COLMAP failures from our evaluation on DL3DV and MipNeRF360 [PITH_FULL_IMAGE:figures/full_fig_p039_24.png] view at source ↗
Figure 25
Figure 25. Figure 25: COLMAP based metric on DL3DV for K=3 view split. [PITH_FULL_IMAGE:figures/full_fig_p041_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: COLMAP based metric on MipNeRF360 for K=3 view split. [PITH_FULL_IMAGE:figures/full_fig_p042_26.png] view at source ↗
read the original abstract

Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in NVS and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce \benchmark, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to $3\times$ more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to $4\times$ higher correlation with human judgments than MEt3R.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript examines the reliability of multiview 3D consistency metrics when inputs or outputs from neural view synthesis (NVS) contain artifacts such as hallucinations, repeated views, or noise. It introduces the benchmark, a parametric decomposition of neural metrics into backbone/residual/aggregation components that recovers MEt3R and produces more robust variants, demonstrates that models including VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry for unrelated scenes or noise, and proposes COLMAP-based metrics using matches, registration, and reconstruction failure signals. On real NVS outputs evaluated via a human study, the COLMAP metrics are reported to achieve up to 4× higher correlation with human judgments than MEt3R.

Significance. If the experimental claims hold after additional validation, the work would be significant for the field of 3D reconstruction and novel view synthesis evaluation. It provides concrete evidence of failure modes in current 3D foundation models, introduces a controlled benchmark for robustness testing, and offers a practical alternative using established classical geometry tools that better aligns with human perception of 3D consistency. The parametric decomposition is a useful contribution that could guide future metric design. Credit is due for the human study component and the explicit comparison against neural priors.

major comments (2)
  1. [Experimental Evaluation / Human Study] The central claim that COLMAP-based metrics achieve up to 4× higher correlation with human judgments than MEt3R (abstract and experimental section) is load-bearing for the paper's recommendation of these metrics as more reliable. However, the manuscript provides insufficient detail on the human study protocol, including the number of scenes, the source models and artifact types in the selected NVS outputs, the number of raters, inter-rater reliability statistics (e.g., Krippendorff’s alpha), and the statistical test used to establish significance of the correlation difference. Without these, it is unclear whether the reported gain generalizes beyond the particular test distribution or is influenced by post-hoc selection favoring classical geometry.
  2. [§4] §4 (Benchmark and Parametric Family): The claim that the parametric decomposition yields variants up to 3× more robust is central to the technical contribution, yet the manuscript does not specify the exact robustness metric (e.g., correlation under controlled artifact injection) or whether these variants were evaluated on the same human-study NVS outputs used for the 4× COLMAP comparison. This leaves open whether the robustness improvement is independent of the human correlation results.
minor comments (3)
  1. [Abstract / Introduction] The abstract and introduction use the term 'hallucinate dense geometry' without a precise operational definition tied to the benchmark; adding a short formalization (e.g., cross-view support for unrelated scenes) would improve clarity.
  2. [Results / Tables] Figure captions and tables reporting correlation values should include confidence intervals or p-values for the differences between metrics to allow readers to assess the strength of the 4× claim directly.
  3. [Related Work] Ensure the related-work section explicitly contrasts the proposed COLMAP signals against prior geometric verification methods in multi-view stereo literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and provide the requested experimental details.

read point-by-point responses
  1. Referee: [Experimental Evaluation / Human Study] The central claim that COLMAP-based metrics achieve up to 4× higher correlation with human judgments than MEt3R (abstract and experimental section) is load-bearing for the paper's recommendation of these metrics as more reliable. However, the manuscript provides insufficient detail on the human study protocol, including the number of scenes, the source models and artifact types in the selected NVS outputs, the number of raters, inter-rater reliability statistics (e.g., Krippendorff’s alpha), and the statistical test used to establish significance of the correlation difference. Without these, it is unclear whether the reported gain generalizes beyond the particular test distribution or is influenced by post-hoc selection favoring classical geometry.

    Authors: We agree that additional protocol details are necessary to substantiate the claim and address potential concerns about generalization or selection bias. In the revised manuscript we have expanded the experimental section with a dedicated description of the human study. This now specifies the number of scenes, the NVS models and artifact types sampled, the number of raters, the inter-rater reliability (Krippendorff’s alpha), and the statistical test used to compare correlations. Scene selection followed a stratified sampling strategy based on the artifact categories defined in the benchmark, ensuring coverage of common failure modes rather than post-hoc optimization for any particular metric. revision: yes

  2. Referee: [§4] §4 (Benchmark and Parametric Family): The claim that the parametric decomposition yields variants up to 3× more robust is central to the technical contribution, yet the manuscript does not specify the exact robustness metric (e.g., correlation under controlled artifact injection) or whether these variants were evaluated on the same human-study NVS outputs used for the 4× COLMAP comparison. This leaves open whether the robustness improvement is independent of the human correlation results.

    Authors: The robustness metric is explicitly the change in correlation to ground-truth 3D consistency labels when artifacts are injected under controlled conditions in the benchmark introduced in §4. This evaluation is performed entirely on the synthetic benchmark data and is independent of the separate human study on real NVS outputs. We have revised §4 to state the definition of the robustness metric and to clarify that the reported 3× improvement is measured on the benchmark, while the human-study results serve as an orthogonal validation on real data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation stands on external human judgments and classical geometry.

full rationale

The paper's central claim rests on an empirical comparison: COLMAP-based metrics show up to 4× higher correlation with human judgments than MEt3R on selected NVS outputs. This is not derived from any internal equations or self-referential fitting; the parametric family is introduced as an analysis tool that recovers MEt3R rather than predicting it, and COLMAP metrics are constructed from standard geometric primitives (matches, registration, reconstruction failure) independent of the neural backbones under test. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the described chain. The result is self-contained because validation comes from an external human study and classical SfM software rather than reducing to the paper's own fitted parameters or prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work relies on the assumption that COLMAP's match and registration outputs are stable failure signals for consistency, plus the representativeness of the chosen artifact types and human raters. No free parameters are explicitly fitted in the abstract; the benchmark itself is an invented evaluation construct.

axioms (1)
  • domain assumption COLMAP registration and match statistics provide reliable, failure-aware signals for multiview consistency even when neural backbones hallucinate.
    Invoked when the authors position COLMAP metrics as superior alternatives to neural ones.
invented entities (1)
  • Benchmark (named via LaTeX macro in abstract) no independent evidence
    purpose: Controlled testbed for measuring robustness of multiview 3D consistency metrics to artifacts and hallucinations.
    New evaluation construct introduced to expose failure modes of existing neural metrics.

pith-pipeline@v0.9.0 · 5778 in / 1339 out tokens · 26789 ms · 2026-05-20T10:38:21.775949+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

  1. [1]

    Met3r: Measuring multi-view consistency in generated images

    Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. InComputer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 4, 5

  2. [2]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022. 6, 9

  3. [3]

    Mvgenmaster: Scaling multi-view generation from any image via 3d priors enhanced diffusion model

    Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Mvgenmaster: Scaling multi-view generation from any image via 3d priors enhanced diffusion model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6045–6056, 2025. 3, 9

  4. [4]

    Mvsplat360: Feed-forward 360 scene synthesis from sparse views.Advances in Neural Information Processing Systems, 37:107064–107086, 2024

    Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. Mvsplat360: Feed-forward 360 scene synthesis from sparse views.Advances in Neural Information Processing Systems, 37:107064–107086, 2024. 3, 9, 33

  5. [5]

    Freeman, Noah A

    Shivam Duggal, Yushi Hu, Oscar Michel, Aniruddha Kembhavi, William T. Freeman, Noah A. Smith, Ranjay Krishna, Antonio Torralba, Ali Farhadi, and Wei-Chiu Ma. Eval3d: Interpretable and fine-grained evaluation for 3d generation.CVPR, 2025. 3

  6. [6]

    Fischler and Robert C

    Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commun. ACM, 24(6):381–395, 1981. 4

  7. [7]

    Brandt, Axel Feldmann, Zhoutong Zhang, and William T

    Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. Featup: A model-agnostic framework for features at any resolution. InThe Twelfth International Conference on Learning Representations, 2024. 2, 3, 4

  8. [8]

    Rating the chess rating system

    Mark E Glickman and Albyn C Jones. Rating the chess rating system. 9, 22

  9. [9]

    Borgwardt, Malte J

    Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.J. Mach. Learn. Res., 13(null):723–773, 2012. 3

  10. [10]

    Frank R. Hampel. The influence curve and its role in robust estimation.Journal of the American Statistical Association, 69(346):383–393, 1974. 3

  11. [11]

    Emergent outlier view rejection in visual geometry grounded transformers.arXiv preprint arXiv:2512.04012, 2025

    Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers.arXiv preprint arXiv:2512.04012, 2025. 2, 8, 33

  12. [12]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. InInternational Conference on Learning Representations, 2019. 3

  13. [13]

    Peter J. Huber. Robust estimation of a location parameter.The Annals of Mathematical Statistics, 35(1): 73–101, 1964. 3

  14. [14]

    Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics (ToG), 36(4):1–13, 2017

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics (ToG), 36(4):1–13, 2017. 3

  15. [15]

    Explaining human preferences via metrics for structured 3d reconstruction

    Jack Langerman, Denys Rozumnyi, Yuzhong Huang, and Dmytro Mishkin. Explaining human preferences via metrics for structured 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26944–26953, 2025. 22

  16. [16]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024. 2, 3, 4, 9

  17. [17]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 9 12

  18. [18]

    Zero- 1-to-3: Zero-shot one image to 3d object

    Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero- 1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 5

  19. [19]

    Gen3deval: Using vllms for automatic evaluation of generated 3d objects

    Shalini Maiti, Lourdes Agapito, and Filippos Kokkinos. Gen3deval: Using vllms for automatic evaluation of generated 3d objects. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18552–18562, 2025. 3

  20. [20]

    Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...

  21. [21]

    Structure-from-Motion Revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. InConference on Computer Vision and Pattern Recognition (CVPR), 2016. 2, 4, 5, 37

  22. [22]

    Gorillas in our midst: Sustained inattentional blindness for dynamic events.perception, 28(9):1059–1074, 1999

    Daniel J Simons and Christopher F Chabris. Gorillas in our midst: Sustained inattentional blindness for dynamic events.perception, 28(9):1059–1074, 1999. 22

  23. [23]

    Appreciate the view: A task-aware evaluation framework for novel view synthesis.arXiv preprint arXiv:2511.12675, 2025

    Saar Stern, Ido Sobol, and Or Litany. Appreciate the view: A task-aware evaluation framework for novel view synthesis.arXiv preprint arXiv:2511.12675, 2025. 3, 5

  24. [24]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2, 3, 4, 9

  25. [25]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024. 2, 3, 9

  26. [26]

    Difix3d+: Improving 3d reconstructions with single-step diffusion models

    Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025. 2, 3, 9

  27. [27]

    Genfusion: Closing the loop between reconstruction and generation via videos

    Sibo Wu, Congrong Xu, Binbin Huang, Geiger Andreas, and Anpei Chen. Genfusion: Closing the loop between reconstruction and generation via videos. InConference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  28. [28]

    Doppelgangers++: Improved visual disambiguation with geometric 3d features

    Yuanbo Xiangli, Ruojin Cai, Hanyu Chen, Jeffrey Byrne, and Noah Snavely. Doppelgangers++: Improved visual disambiguation with geometric 3d features. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

  29. [29]

    Depthsplat: Connecting gaussian splatting and depth

    Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16453–16463, 2025. 2, 3, 9

  30. [30]

    Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 4, 9

  31. [31]

    Mvsnet: Depth inference for unstructured multi-view stereo

    Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. InProceedings of the European conference on computer vision (ECCV), pages 767–783,

  32. [32]

    Nvs-solver: Video diffusion model as zero-shot novel view synthesizer.arXiv preprint arXiv:2405.15364, 2024

    Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver: Video diffusion model as zero-shot novel view synthesizer.arXiv preprint arXiv:2405.15364, 2024. 3, 9

  33. [33]

    Long-term photometric consistent novel view synthesis with diffusion models

    Jason J Yu, Fereshteh Forghani, Konstantinos G Derpanis, and Marcus A Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7094–7104, 2023. 3, 5

  34. [34]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024. 2, 3, 9 13

  35. [35]

    Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025

    Jensen (Jinghao) Zhou, Hang Gao, Vikram V oleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models.arXiv preprint arXiv:2503.14489, 2025. 3, 9

  36. [36]

    Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

    Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4349–4359, 2025. 3, 9 14 Appendix Contents A Notation and Common Terms 17 B FAQs 18 C Detailed Cont...

  37. [37]

    Learned 3D backbones predict geometry, cameras, and correspondences together, so their matches are already tied to the model’s own reconstruction

    Why not apply RANSAC-style geometric checks to matches predicted by VGGT, DUSt3R, or MASt3R? 19 Classical matching separates evidence from verification: SIFT proposes discrete matches, and RANSAC tests whether they support a geometry. Learned 3D backbones predict geometry, cameras, and correspondences together, so their matches are already tied to the mod...

  38. [38]

    This formu- lation clarifies the design space of multi-view consistency metrics and enables principled variants beyond pairwise mean aggregation

    We introduce aunified parametric frameworkfor ground-truth-free 3D consistency met- rics, showing that existing and new neural metrics can be decomposed into three components: areconstruction backbone, aresidual function, and anaggregation function. This formu- lation clarifies the design space of multi-view consistency metrics and enables principled vari...

  39. [39]

    This benchmark enables direct testing of whether a metric can distinguish geometrically consistent scenes from increasingly inconsistent ones

    We constructSysCON3D, a robustness benchmark for 3D consistency evaluation, which systematically injects controlled cross-scene outliers and synthetic corruptions into multi- view image sets. This benchmark enables direct testing of whether a metric can distinguish geometrically consistent scenes from increasingly inconsistent ones

  40. [40]

    Consequently, evaluation metrics built on top of these learned backbones can inherit these biases and become unreliable

    Through SysCON3D, we uncover asystematic and previously underappreciated failure modeof modern data-driven 3D reconstruction backbones: rather than rejecting impossible inputs, they hallucinate non-trivial 3D structure and spurious cross-view consistency for cross-scene mixtures, repeated images, and random noise. Consequently, evaluation metrics built on...

  41. [41]

    These metrics provide a more interpretable, failure-aware, and robust measure of scene-level multi-view consistency

    To address this limitation, we developrobust COLMAP-based 3D consistency metrics that avoid learned priors and instead rely on classical geometric verification. These metrics provide a more interpretable, failure-aware, and robust measure of scene-level multi-view consistency

  42. [42]

    This yields a more targeted human reference for assessing how well automatic metrics align with human judgments of scene-level consistency

    We design astructured human preference studyfor evaluating 3D consistency, with explicit protocols that distinguish 3D consistency from visual realism and plausibility. This yields a more targeted human reference for assessing how well automatic metrics align with human judgments of scene-level consistency

  43. [43]

    We perform acomprehensive empirical evaluationon SysCON3D, Mip-NeRF360, and DL3DV , together with our human study, and show that the proposed metrics substantially improve robustness over prior work while also aligning more closely with human judgments. In particular, neural distributional metrics improve over MEt3R, and COLMAP-based metrics achieve the s...