Can These Views Be One Scene? Evaluating Multiview 3D Consistency when 3D Foundation Models Hallucinate

Alan Yuille; Prakhar Kaushik; Soumava Paul

arxiv: 2605.18754 · v1 · pith:2JHM6GIVnew · submitted 2026-05-18 · 💻 cs.CV

Pith reviewed 2026-05-20 10:38 UTC · model grok-4.3

keywords: multiview 3D consistency, novel view synthesis, COLMAP, MEt3R, 3D foundation models, hallucination, geometric verification, human evaluation

The pith

COLMAP-based metrics achieve up to 4 times higher correlation with human judgments of multiview 3D consistency than MEt3R.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard assumptions in multiview 3D evaluation break down when images come from novel view synthesis or sparse reconstruction, because artifacts, repeats, or noise can still produce high consistency scores. It contrasts neural metrics that rely on learned reconstruction backbones, which can hallucinate geometry across unrelated inputs, with classical geometric verification. The authors create a robustness benchmark and a parametric decomposition of neural metrics that recovers MEt3R while producing up to 3 times more robust variants. They then introduce COLMAP-based signals that treat registration failure and lack of dense support as explicit inconsistency indicators. On real NVS outputs and human ratings, these signals align far more closely with whether people perceive the images as coming from one scene.

Core claim

We introduce a controlled robustness benchmark for multiview 3D consistency and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to 3 times more robust. Foundation models such as VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to 4 times higher correlation with human judgments than MEt3R.
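
The three-part decomposition can be pictured as function composition. The sketch below is only an illustration under stated assumptions: `make_metric`, `identity_backbone`, and `mean_abs_residual` are hypothetical names, the "views" are toy feature lists rather than images, and none of the components reproduce MEt3R or the paper's variants.

```python
from itertools import combinations
from statistics import mean, median


def make_metric(backbone, residual, aggregate):
    """Compose a consistency metric from three swappable parts (hypothetical API).

    backbone(a, b)   -> (warped_a, target_b): aligned features for a view pair
    residual(w, t)   -> float: disagreement between the aligned features
    aggregate(score) -> float: reduces all pairwise residuals to one score
    """
    def metric(views):
        pair_scores = []
        for a, b in combinations(views, 2):
            warped, target = backbone(a, b)
            pair_scores.append(residual(warped, target))
        return aggregate(pair_scores)
    return metric


def identity_backbone(a, b):
    # Toy stand-in: a real backbone would reconstruct geometry and warp a into b.
    return list(a), list(b)


def mean_abs_residual(warped, target):
    # Toy residual: mean absolute difference between aligned features.
    return mean(abs(x - y) for x, y in zip(warped, target))


# Two members of the family that differ only in the aggregation stage.
mean_metric = make_metric(identity_backbone, mean_abs_residual, mean)
median_metric = make_metric(identity_backbone, mean_abs_residual, median)

# Toy data: two identical views and one outlier view.
views = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
print(mean_metric(views), median_metric(views))
```

Swapping one argument of `make_metric` while holding the other two fixed is the kind of controlled comparison the decomposition makes possible.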

What carries the argument

COLMAP-based metrics that treat matches, registration success, dense support, and reconstruction failure as explicit failure-aware consistency signals.
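
One way to read "failure-aware" is that a failed reconstruction is itself the answer, not a missing value. The sketch below assumes a hypothetical `ReconstructionStats` summary that a caller would fill in from a structure-from-motion run; the field names, the `min_matches` threshold, and the scoring formula are illustrative assumptions, not the paper's definitions and not COLMAP's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReconstructionStats:
    """Hypothetical summary of one SfM/MVS run (field names are illustrative)."""
    num_images: int
    num_registered: int            # images that were registered into the model
    num_verified_matches: int      # geometrically verified feature matches
    dense_support: Optional[float] # fraction of pixels with dense depth; None if dense stage failed


def inconsistency_score(stats: Optional[ReconstructionStats], min_matches: int = 30) -> float:
    """Return a score in [0, 1], where 1.0 means maximally inconsistent.

    Any failure (no reconstruction, fewer than two registered images, too few
    verified matches) is mapped to the worst score instead of being skipped.
    """
    if stats is None or stats.num_registered < 2:
        return 1.0
    if stats.num_verified_matches < min_matches:
        return 1.0
    registered_frac = stats.num_registered / stats.num_images
    dense = stats.dense_support if stats.dense_support is not None else 0.0
    return 1.0 - registered_frac * dense


# Toy inputs, not measurements from the paper.
print(inconsistency_score(ReconstructionStats(3, 3, 500, 0.8)))
print(inconsistency_score(None))
```

The design point is the early returns: a learned backbone always emits geometry, whereas this style of metric lets the absence of geometry count as evidence.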

If this is right

Neural 3D foundation models can produce dense cross-view support even when inputs do not depict a single static scene.
Varying the backbone, residual, or aggregation stage in neural consistency metrics can increase their robustness by up to a factor of three.
COLMAP-based signals detect inconsistencies through geometric failure modes that learned priors often miss.
Evaluation protocols for novel view synthesis and sparse reconstruction should incorporate explicit reconstruction-failure cues to avoid over-optimistic scores.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Replacing or supplementing neural consistency checks with COLMAP-based ones could reduce the rate at which inconsistent generated views are accepted in downstream 3D pipelines.
The parametric decomposition offers a systematic way to diagnose which part of a neural metric is most vulnerable to hallucinations.
The benchmark could serve as a testbed for future hybrid metrics that combine classical geometry with learned components.

Load-bearing premise

The selected NVS outputs and the human raters in the study are representative of the artifacts and inconsistencies that occur across broader 3D model outputs.

What would settle it

A new collection of multiview images containing hallucinations or repeated views where MEt3R shows higher correlation with fresh human ratings than any of the COLMAP-based metrics.

Figures

Figures reproduced from arXiv: 2605.18754 by Alan Yuille, Prakhar Kaushik, Soumava Paul.

**Figure 1.** Can these views be one scene? Unrelated scenes, repeated views, and Gaussian noise should be scored as 3D-inconsistent because they do not define a single static scene from multiple views. However, learned reconstruction backbones such as VGGT, MASt3R, DUSt3R, and Fast3R can still produce dense geometry on these inputs. Metrics built on these backbones, such as MEt3R, can therefore report spuriously high 3…

**Figure 2.** Metric trends across SysCON3D scene types. Each panel plots metric score (lower = better) vs. view count K across SysCON3D scene types; a robust metric should separate these bands in order of increasing 3D inconsistency. Panel 1: MEt3R conflates L0 with cross-scene mixtures and scores Gaussian noise as more consistent than L0. Panel 2: PRISM-MMD recovers the expected SysCON3D ordering best (app…

**Figure 3.** Learned 3D backbones hallucinate geometric support. VGGT, MASt3R, and DUSt3R produce dense point clouds on pure noise and cross-scene mixtures, although the inputs do not admit a coherent 3D scene. Interestingly, these models seem to produce distinct hallucinatory patterns (e.g., an umbrella for VGGT). More in Section H.4. This is a core finding of our analysis. Learned 3D backbones, includin…

**Figure 4.** Human-evaluation interface. Participants see the K input views (top) alongside two anonymized 360° orbit videos (A/B) played in a synchronized loop with shared speed controls. Method names are never revealed. Each comparison requires forced A/B votes on three axes: 3D consistency, visual realism, and plausibility (no per-axis ties). Method ratings are aggregated with a weighted Elo protocol, giving 3D con…

**Figure 5.** Hallucination diagnostics across the four backbones. Left: cross-scene overlap hallucination rises sharply on L2, L3, and Gaussian noise. Right: confidence decreases on harder cross-scene mixtures, but it is not a reliable rejection signal on SysCON3D-N, most notably for DUSt3R.

**Figure 6.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian…

**Figure 7.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian…

**Figure 8.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random…

**Figure 9.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random…

**Figure 10.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Single…

**Figure 11.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Single…

**Figure 12.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched…

**Figure 13.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched…

**Figure 14.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian…

**Figure 15.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian…

**Figure 16.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random…

**Figure 17.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random…

**Figure 18.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Single…

**Figure 19.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched…

**Figure 20.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Patched…

**Figure 21.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Gaussian…

**Figure 22.** Analysis of 3D reconstruction backbones on SysCON3D benchmark. Input is Random…

**Figure 23.** Hallucinated correspondences between unrelated NVS frames. Pixel correspondences returned by MASt3R (top of each pair) and VGGT (bottom) on two pairs of views generated by MVSplat360. The two views in each pair come from consecutive orbit cameras, and their generated content shows no perceptible geometric or semantic overlap, yet both backbones return many low-residual matches across unrelated patches. Un…

**Figure 24.** Examples of COLMAP failures from our evaluation on DL3DV and MipNeRF360

**Figure 25.** COLMAP-based metric on DL3DV for K=3 view split.

**Figure 26.** COLMAP-based metric on MipNeRF360 for K=3 view split.

Original abstract

Multiview 3D evaluation assumes that the images being scored are observations of one static 3D scene. This assumption can fail in NVS and sparse-view reconstruction: inputs or generated outputs may contain artifacts, outlier frames, repeated views, or noise, yet still receive high 3D consistency scores. Existing reference-based metrics require ground truth, while ground-truth-free metrics such as MEt3R depend on learned reconstruction backbones whose failure modes are poorly characterized. We study this reliability problem by comparing neural reconstruction priors with classical geometric verification. We introduce SysCON3D, a controlled robustness benchmark for multiview 3D consistency, and a parametric family that decomposes neural metrics into backbone, residual, and aggregation components. This family recovers MEt3R and yields variants up to 3× more robust. Our analysis shows that VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry and cross-view support for unrelated scenes, repeated images, and random noise. We introduce COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals. On real NVS outputs and a structured human study, these metrics achieve up to 4× higher correlation with human judgments than MEt3R.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript examines the reliability of multiview 3D consistency metrics when inputs or outputs from novel view synthesis (NVS) contain artifacts such as hallucinations, repeated views, or noise. It introduces the SysCON3D benchmark and a parametric decomposition of neural metrics into backbone, residual, and aggregation components that recovers MEt3R and produces more robust variants. It demonstrates that models including VGGT, MASt3R, DUSt3R, and Fast3R can hallucinate dense geometry for unrelated scenes or noise, and it proposes COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as signals. On real NVS outputs evaluated via a human study, the COLMAP metrics are reported to achieve up to 4× higher correlation with human judgments than MEt3R.

Significance. If the experimental claims hold after additional validation, the work would be significant for the field of 3D reconstruction and novel view synthesis evaluation. It provides concrete evidence of failure modes in current 3D foundation models, introduces a controlled benchmark for robustness testing, and offers a practical alternative using established classical geometry tools that better aligns with human perception of 3D consistency. The parametric decomposition is a useful contribution that could guide future metric design. Credit is due for the human study component and the explicit comparison against neural priors.

major comments (2)

[Experimental Evaluation / Human Study] The central claim that COLMAP-based metrics achieve up to 4× higher correlation with human judgments than MEt3R (abstract and experimental section) is load-bearing for the paper's recommendation of these metrics as more reliable. However, the manuscript provides insufficient detail on the human study protocol, including the number of scenes, the source models and artifact types in the selected NVS outputs, the number of raters, inter-rater reliability statistics (e.g., Krippendorff’s alpha), and the statistical test used to establish significance of the correlation difference. Without these, it is unclear whether the reported gain generalizes beyond the particular test distribution or is influenced by post-hoc selection favoring classical geometry.
[§4] §4 (Benchmark and Parametric Family): The claim that the parametric decomposition yields variants up to 3× more robust is central to the technical contribution, yet the manuscript does not specify the exact robustness metric (e.g., correlation under controlled artifact injection) or whether these variants were evaluated on the same human-study NVS outputs used for the 4× COLMAP comparison. This leaves open whether the robustness improvement is independent of the human correlation results.

minor comments (3)

[Abstract / Introduction] The abstract and introduction use the term 'hallucinate dense geometry' without a precise operational definition tied to the benchmark; adding a short formalization (e.g., cross-view support for unrelated scenes) would improve clarity.
[Results / Tables] Figure captions and tables reporting correlation values should include confidence intervals or p-values for the differences between metrics to allow readers to assess the strength of the 4× claim directly.
[Related Work] Ensure the related-work section explicitly contrasts the proposed COLMAP signals against prior geometric verification methods in multi-view stereo literature.
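
The request for confidence intervals on the correlation gap could be met in several ways; a percentile bootstrap over scenes is one common option. The sketch below is a standard-library illustration, not the authors' procedure: `bootstrap_diff_ci` and its parameters are hypothetical names, and the inputs are toy numbers rather than results from the paper.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # correlation is undefined here; this sketch treats it as 0
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def bootstrap_diff_ci(human, metric_a, metric_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for spearman(human, a) - spearman(human, b), resampling items."""
    rng = random.Random(seed)
    n = len(human)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        h = [human[i] for i in idx]
        a = [metric_a[i] for i in idx]
        b = [metric_b[i] for i in idx]
        diffs.append(spearman(h, a) - spearman(h, b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Toy ratings and metric scores for eight items (illustrative only).
human = [1, 2, 3, 4, 5, 6, 7, 8]
metric_a = [1.1, 2.3, 2.9, 4.2, 4.8, 6.5, 6.9, 8.4]
metric_b = [3.0, 1.0, 4.0, 2.0, 8.0, 5.0, 6.0, 7.0]
print(bootstrap_diff_ci(human, metric_a, metric_b, n_boot=500))
```

An interval that excludes zero would support the claim that one metric tracks human ratings better than the other on the sampled items; resampling scenes rather than individual votes keeps correlated ratings together.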

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to improve clarity and provide the requested experimental details.

Referee: [Experimental Evaluation / Human Study] The central claim that COLMAP-based metrics achieve up to 4× higher correlation with human judgments than MEt3R (abstract and experimental section) is load-bearing for the paper's recommendation of these metrics as more reliable. However, the manuscript provides insufficient detail on the human study protocol, including the number of scenes, the source models and artifact types in the selected NVS outputs, the number of raters, inter-rater reliability statistics (e.g., Krippendorff’s alpha), and the statistical test used to establish significance of the correlation difference. Without these, it is unclear whether the reported gain generalizes beyond the particular test distribution or is influenced by post-hoc selection favoring classical geometry.

Authors: We agree that additional protocol details are necessary to substantiate the claim and address potential concerns about generalization or selection bias. In the revised manuscript we have expanded the experimental section with a dedicated description of the human study. This now specifies the number of scenes, the NVS models and artifact types sampled, the number of raters, the inter-rater reliability (Krippendorff’s alpha), and the statistical test used to compare correlations. Scene selection followed a stratified sampling strategy based on the artifact categories defined in the benchmark, ensuring coverage of common failure modes rather than post-hoc optimization for any particular metric.

Revision: yes

Referee: [§4] §4 (Benchmark and Parametric Family): The claim that the parametric decomposition yields variants up to 3× more robust is central to the technical contribution, yet the manuscript does not specify the exact robustness metric (e.g., correlation under controlled artifact injection) or whether these variants were evaluated on the same human-study NVS outputs used for the 4× COLMAP comparison. This leaves open whether the robustness improvement is independent of the human correlation results.

Authors: The robustness metric is explicitly the change in correlation to ground-truth 3D consistency labels when artifacts are injected under controlled conditions in the benchmark introduced in §4. This evaluation is performed entirely on the synthetic benchmark data and is independent of the separate human study on real NVS outputs. We have revised §4 to state the definition of the robustness metric and to clarify that the reported 3× improvement is measured on the benchmark, while the human-study results serve as an orthogonal validation on real data.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation stands on external human judgments and classical geometry.

The paper's central claim rests on an empirical comparison: COLMAP-based metrics show up to 4× higher correlation with human judgments than MEt3R on selected NVS outputs. This is not derived from any internal equations or self-referential fitting; the parametric family is introduced as an analysis tool that recovers MEt3R rather than predicting it, and COLMAP metrics are constructed from standard geometric primitives (matches, registration, reconstruction failure) independent of the neural backbones under test. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the described chain. The result is self-contained because validation comes from an external human study and classical SfM software rather than reducing to the paper's own fitted parameters or prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The work relies on the assumption that COLMAP's match and registration outputs are stable failure signals for consistency, plus the representativeness of the chosen artifact types and human raters. No free parameters are explicitly fitted in the abstract; the benchmark itself is an invented evaluation construct.

axioms (1)

domain assumption: COLMAP registration and match statistics provide reliable, failure-aware signals for multiview consistency even when neural backbones hallucinate.
Invoked when the authors position COLMAP metrics as superior alternatives to neural ones.

invented entities (1)

Benchmark (named via LaTeX macro in abstract) · no independent evidence
purpose: Controlled testbed for measuring robustness of multiview 3D consistency metrics to artifacts and hallucinations.
New evaluation construct introduced to expose failure modes of existing neural metrics.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "We introduce SysCON3D, a controlled robustness benchmark... COLMAP-based metrics that use matches, registration, dense support, and reconstruction failure as failure-aware consistency signals."

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "a parametric family that decomposes neural metrics into backbone, residual, and aggregation components"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

[1]

Met3r: Measuring multi-view consistency in generated images

Mohammad Asim, Christopher Wewer, Thomas Wimmer, Bernt Schiele, and Jan Eric Lenssen. Met3r: Measuring multi-view consistency in generated images. InComputer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 4, 5

work page 2025
[2]

Mip-nerf 360: Unbounded anti-aliased neural radiance fields

Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022. 6, 9

work page 2022
[3]

Mvgenmaster: Scaling multi-view generation from any image via 3d priors enhanced diffusion model

Chenjie Cao, Chaohui Yu, Shang Liu, Fan Wang, Xiangyang Xue, and Yanwei Fu. Mvgenmaster: Scaling multi-view generation from any image via 3d priors enhanced diffusion model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6045–6056, 2025. 3, 9

work page 2025
[4]

Mvsplat360: Feed-forward 360 scene synthesis from sparse views.Advances in Neural Information Processing Systems, 37:107064–107086, 2024

Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. Mvsplat360: Feed-forward 360 scene synthesis from sparse views.Advances in Neural Information Processing Systems, 37:107064–107086, 2024. 3, 9, 33

work page 2024
[5]

Freeman, Noah A

Shivam Duggal, Yushi Hu, Oscar Michel, Aniruddha Kembhavi, William T. Freeman, Noah A. Smith, Ranjay Krishna, Antonio Torralba, Ali Farhadi, and Wei-Chiu Ma. Eval3d: Interpretable and fine-grained evaluation for 3d generation.CVPR, 2025. 3

work page 2025
[6]

Fischler and Robert C

Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commun. ACM, 24(6):381–395, 1981. 4

work page 1981
[7]

Brandt, Axel Feldmann, Zhoutong Zhang, and William T

Stephanie Fu, Mark Hamilton, Laura E. Brandt, Axel Feldmann, Zhoutong Zhang, and William T. Freeman. Featup: A model-agnostic framework for features at any resolution. InThe Twelfth International Conference on Learning Representations, 2024. 2, 3, 4

work page 2024
[8]

Rating the chess rating system

Mark E Glickman and Albyn C Jones. Rating the chess rating system. 9, 22

work page
[9]

Borgwardt, Malte J

Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test.J. Mach. Learn. Res., 13(null):723–773, 2012. 3

work page 2012
[10]

Frank R. Hampel. The influence curve and its role in robust estimation.Journal of the American Statistical Association, 69(346):383–393, 1974. 3

work page 1974
[11]

Emergent outlier view rejection in visual geometry grounded transformers.arXiv preprint arXiv:2512.04012, 2025

Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers.arXiv preprint arXiv:2512.04012, 2025. 2, 8, 33

work page arXiv 2025
[12]

Benchmarking neural network robustness to common corruptions and perturbations

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. InInternational Conference on Learning Representations, 2019. 3

work page 2019
[13]

Peter J. Huber. Robust estimation of a location parameter.The Annals of Mathematical Statistics, 35(1): 73–101, 1964. 3

work page 1964
[14]

Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics (ToG), 36(4):1–13, 2017

Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction.ACM Transactions on Graphics (ToG), 36(4):1–13, 2017. 3

work page 2017
[15]

Explaining human preferences via metrics for structured 3d reconstruction

Jack Langerman, Denys Rozumnyi, Yuzhong Huang, and Dmytro Mishkin. Explaining human preferences via metrics for structured 3d reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 26944–26953, 2025. 22

work page 2025
[16]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024. 2, 3, 4, 9

work page 2024
[17]

Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024. 9 12

work page 2024
[18]

Zero- 1-to-3: Zero-shot one image to 3d object

Ruoshi Liu, Rundi Wu, Basile Van Hoorick, Pavel Tokmakov, Sergey Zakharov, and Carl V ondrick. Zero- 1-to-3: Zero-shot one image to 3d object. InProceedings of the IEEE/CVF international conference on computer vision, pages 9298–9309, 2023. 5

work page 2023
[19]

Gen3deval: Using vllms for automatic evaluation of generated 3d objects

Shalini Maiti, Lourdes Agapito, and Filippos Kokkinos. Gen3deval: Using vllms for automatic evaluation of generated 3d objects. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18552–18562, 2025. 3

work page 2025
[20]

Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba...

work page 2023
[21]

Structure-from-Motion Revisited

Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-Motion Revisited. InConference on Computer Vision and Pattern Recognition (CVPR), 2016. 2, 4, 5, 37

work page 2016
[22]

Gorillas in our midst: Sustained inattentional blindness for dynamic events.perception, 28(9):1059–1074, 1999

Daniel J Simons and Christopher F Chabris. Gorillas in our midst: Sustained inattentional blindness for dynamic events.perception, 28(9):1059–1074, 1999. 22

work page 1999
[23]

Appreciate the view: A task-aware evaluation framework for novel view synthesis.arXiv preprint arXiv:2511.12675, 2025

Saar Stern, Ido Sobol, and Or Litany. Appreciate the view: A task-aware evaluation framework for novel view synthesis.arXiv preprint arXiv:2511.12675, 2025. 3, 5

work page arXiv 2025
[24]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 2, 3, 4, 9

work page 2025
[25]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, 2024. 2, 3, 9

work page 2024
[26]

Difix3d+: Improving 3d reconstructions with single-step diffusion models

Jay Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, and Huan Ling. Difix3d+: Improving 3d reconstructions with single-step diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26024–26035, 2025. 2, 3, 9

work page 2025
[27]

Genfusion: Closing the loop between reconstruction and generation via videos

Sibo Wu, Congrong Xu, Binbin Huang, Geiger Andreas, and Anpei Chen. Genfusion: Closing the loop between reconstruction and generation via videos. InConference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

work page 2025
[28]

Doppelgangers++: Improved visual disambiguation with geometric 3d features

Yuanbo Xiangli, Ruojin Cai, Hanyu Chen, Jeffrey Byrne, and Noah Snavely. Doppelgangers++: Improved visual disambiguation with geometric 3d features. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 3

work page 2025
[29]

Depthsplat: Connecting gaussian splatting and depth

Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. Depthsplat: Connecting gaussian splatting and depth. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 16453–16463, 2025. 2, 3, 9

work page 2025
[30]

Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli

Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 4, 9

work page 2025
[31]

Mvsnet: Depth inference for unstructured multi-view stereo

Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long Quan. Mvsnet: Depth inference for unstructured multi-view stereo. InProceedings of the European conference on computer vision (ECCV), pages 767–783,

work page
[32]

Nvs-solver: Video diffusion model as zero-shot novel view synthesizer.arXiv preprint arXiv:2405.15364, 2024

Meng You, Zhiyu Zhu, Hui Liu, and Junhui Hou. Nvs-solver: Video diffusion model as zero-shot novel view synthesizer.arXiv preprint arXiv:2405.15364, 2024. 3, 9

work page arXiv 2024
[33]

Long-term photometric consistent novel view synthesis with diffusion models

Jason J Yu, Fereshteh Forghani, Konstantinos G Derpanis, and Marcus A Brubaker. Long-term photometric consistent novel view synthesis with diffusion models. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7094–7104, 2023. 3, 5

work page 2023
[34]

ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis.arXiv preprint arXiv:2409.02048, 2024. 2, 3, 9 13

[35]

Stable virtual camera: Generative view synthesis with diffusion models

Jensen (Jinghao) Zhou, Hang Gao, Vikram Voleti, Aaryaman Vasishta, Chun-Han Yao, Mark Boss, Philip Torr, Christian Rupprecht, and Varun Jampani. Stable virtual camera: Generative view synthesis with diffusion models. arXiv preprint arXiv:2503.14489, 2025. 3, 9

[36]

Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats

Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, and Zexiang Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4349–4359, 2025. 3, 9

[37]

Learned 3D backbones predict geometry, cameras, and correspondences together, so their matches are already tied to the model’s own reconstruction

Why not apply RANSAC-style geometric checks to matches predicted by VGGT, DUSt3R, or MASt3R? Classical matching separates evidence from verification: SIFT proposes discrete matches, and RANSAC tests whether they support a geometry. Learned 3D backbones predict geometry, cameras, and correspondences together, so their matches are already tied to the mod...
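
The separation between proposing matches and verifying them can be illustrated with a toy Python sketch. It is not the paper's pipeline: a 2D translation stands in for the epipolar geometry that a real system would estimate, the point sets are synthetic, and `ransac_translation` is a hypothetical helper name. The sketch only shows that a verification step reports low support when the proposed matches come from unrelated inputs.

```python
import numpy as np

def ransac_translation(src, dst, thresh=2.0, iters=200, seed=0):
    """Toy RANSAC: fit dst ~= src + t and return (t, inlier_mask)."""
    rng = np.random.default_rng(seed)
    best_t = np.zeros(2)
    best_mask = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(src))            # minimal sample: one match
        t = dst[i] - src[i]                   # hypothesis from that match
        mask = np.linalg.norm(src + t - dst, axis=1) < thresh
        if mask.sum() > best_mask.sum():      # keep the best-supported model
            best_t, best_mask = t, mask
    return best_t, best_mask

rng = np.random.default_rng(1)
src = rng.uniform(0, 500, size=(100, 2))

# Case 1: matches from one shared shift, with 30 corrupted matches.
dst_consistent = src + np.array([10.0, -5.0])
dst_consistent[:30] = rng.uniform(0, 500, size=(30, 2))
_, inliers = ransac_translation(src, dst_consistent)
consistent_ratio = inliers.mean()

# Case 2: "matches" between unrelated point sets.
dst_random = rng.uniform(0, 500, size=(100, 2))
_, inliers = ransac_translation(src, dst_random)
random_ratio = inliers.mean()

print(f"inlier ratio, consistent: {consistent_ratio:.2f}")
print(f"inlier ratio, unrelated:  {random_ratio:.2f}")
```

In this toy setting the inlier ratio is the verification signal: it stays high when most matches share one geometry and drops sharply when they do not.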

[38]

This formulation clarifies the design space of multi-view consistency metrics and enables principled variants beyond pairwise mean aggregation

We introduce a unified parametric framework for ground-truth-free 3D consistency metrics, showing that existing and new neural metrics can be decomposed into three components: a reconstruction backbone, a residual function, and an aggregation function. This formulation clarifies the design space of multi-view consistency metrics and enables principled vari...
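
The three-part decomposition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_backbone`, `l1_residual`, and `consistency_score` are hypothetical names, and the toy backbone returns a mean colour per image rather than a reconstruction. The sketch only shows how the aggregation function can be swapped independently of the other two components.

```python
from itertools import combinations
import numpy as np

def toy_backbone(image):
    # Placeholder for a learned reconstruction backbone (assumption):
    # maps an HxWx3 image to a small feature vector.
    return image.reshape(-1, image.shape[-1]).mean(axis=0)

def l1_residual(feat_a, feat_b):
    # Residual function: how strongly two views disagree.
    return float(np.abs(feat_a - feat_b).mean())

def consistency_score(images, backbone, residual, aggregate):
    # Backbone -> pairwise residuals -> aggregation; lower is more consistent.
    feats = [backbone(img) for img in images]
    residuals = [residual(a, b) for a, b in combinations(feats, 2)]
    return float(aggregate(residuals))

# Five views, one of which disagrees with the rest.
views = [np.full((4, 4, 3), 0.5) for _ in range(5)]
views[0] = np.ones((4, 4, 3))

mean_score = consistency_score(views, toy_backbone, l1_residual, np.mean)
median_score = consistency_score(views, toy_backbone, l1_residual, np.median)
print(mean_score, median_score)
```

With pairwise mean aggregation the single disagreeing view raises the score, while the median over the same residuals does not register it. Which behaviour is preferable depends on what the metric is meant to detect.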

[39]

This benchmark enables direct testing of whether a metric can distinguish geometrically consistent scenes from increasingly inconsistent ones

We construct SysCON3D, a robustness benchmark for 3D consistency evaluation, which systematically injects controlled cross-scene outliers and synthetic corruptions into multi-view image sets. This benchmark enables direct testing of whether a metric can distinguish geometrically consistent scenes from increasingly inconsistent ones
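
One way to picture controlled cross-scene outlier injection is the Python sketch below. The benchmark's actual protocol is not given here, so this is an assumption-laden illustration: `inject_cross_scene_outliers` is a hypothetical helper, and views are represented by plain string identifiers rather than images.

```python
import random

def inject_cross_scene_outliers(scene_views, other_views, num_outliers, seed=0):
    """Replace `num_outliers` views of one scene with views from another scene."""
    if num_outliers > min(len(scene_views), len(other_views)):
        raise ValueError("not enough views to inject that many outliers")
    rng = random.Random(seed)                       # seeded for repeatability
    mixed = list(scene_views)                       # leave the input untouched
    slots = rng.sample(range(len(mixed)), num_outliers)
    donors = rng.sample(list(other_views), num_outliers)
    for slot, donor in zip(slots, donors):
        mixed[slot] = donor
    return mixed, sorted(slots)

scene_a = [f"a{i}" for i in range(6)]
scene_b = [f"b{i}" for i in range(6)]
mixed, outlier_slots = inject_cross_scene_outliers(scene_a, scene_b, num_outliers=2)
print(mixed, outlier_slots)
```

Sweeping `num_outliers` from zero up to the set size gives a sequence of increasingly inconsistent view sets, against which a metric's response can be checked.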

[40]

Consequently, evaluation metrics built on top of these learned backbones can inherit these biases and become unreliable

Through SysCON3D, we uncover a systematic and previously underappreciated failure mode of modern data-driven 3D reconstruction backbones: rather than rejecting impossible inputs, they hallucinate non-trivial 3D structure and spurious cross-view consistency for cross-scene mixtures, repeated images, and random noise. Consequently, evaluation metrics built on...

[41]

These metrics provide a more interpretable, failure-aware, and robust measure of scene-level multi-view consistency

To address this limitation, we develop robust COLMAP-based 3D consistency metrics that avoid learned priors and instead rely on classical geometric verification. These metrics provide a more interpretable, failure-aware, and robust measure of scene-level multi-view consistency

[42]

This yields a more targeted human reference for assessing how well automatic metrics align with human judgments of scene-level consistency

We design a structured human preference study for evaluating 3D consistency, with explicit protocols that distinguish 3D consistency from visual realism and plausibility. This yields a more targeted human reference for assessing how well automatic metrics align with human judgments of scene-level consistency

[43]

We perform a comprehensive empirical evaluation on SysCON3D, Mip-NeRF360, and DL3DV, together with our human study, and show that the proposed metrics substantially improve robustness over prior work while also aligning more closely with human judgments. In particular, neural distributional metrics improve over MEt3R, and COLMAP-based metrics achieve the s...

