Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

Chao Tian; Nuo Chen; Wenyuan Zhao; Zhiwen Fan; Zihao Zhu

arxiv: 2605.19539 · v1 · pith:WC45LOR2new · submitted 2026-05-19 · 💻 cs.CV

Pith reviewed 2026-05-20 06:06 UTC · model grok-4.3

classification 💻 cs.CV

keywords evidential uncertainty · feed-forward 3D reconstruction · pointmap uncertainty · Normal-Inverse-Wishart · geometric uncertainty · uncertainty estimation · 3D vision · risk-coverage metrics

The pith

Trust3R replaces heuristic confidence with a Normal-Inverse-Wishart head that outputs closed-form Student-t uncertainty for each point in feed-forward 3D reconstructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that current feed-forward 3D models produce confidence scores without clear probabilistic meaning, so their values often fail to mark where geometry predictions are actually wrong. Trust3R adds a lightweight evidential head after gated residual refinement to produce parameters of a multivariate Student-t distribution for every point's position. This yields per-point uncertainty that ranks errors more reliably than built-in maps or sampling-based baselines. The result is measurable gains in risk-coverage and sparsification on indoor and outdoor scenes, plus modest accuracy lifts. A reader would care because better uncertainty lets later stages in a pipeline down-weight or ignore unreliable points without extra sampling cost.

Core claim

Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head to produce a closed-form multivariate Student-t distribution that supplies probabilistically grounded uncertainty for each point in the predicted pointmap, delivering stronger alignment between uncertainty and true geometric error on benchmarks such as ScanNet++.

What carries the argument

The Normal-Inverse-Wishart evidential head, which outputs distribution parameters so that per-point geometric uncertainty follows a closed-form multivariate Student-t distribution after gated residual refinement.
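
To make the closed-form step concrete, here is a minimal NumPy sketch of how Normal-Inverse-Wishart parameters map to a multivariate Student-t predictive. It uses the standard textbook parameterization (location `mu0`, pseudo-count `kappa`, scale matrix `psi`, degrees of freedom `nu`); the function names are hypothetical, and the paper's head may parameterize or constrain these quantities differently.

```python
import numpy as np

def niw_predictive_student_t(mu0, kappa, psi, nu):
    """Map NIW parameters to Student-t predictive parameters.

    Assumed textbook form (not taken from the paper):
      dof   = nu - d + 1
      scale = psi * (kappa + 1) / (kappa * dof)
    Shapes: mu0 (..., d), kappa (...), psi (..., d, d), nu (...).
    """
    mu0 = np.asarray(mu0, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    psi = np.asarray(psi, dtype=float)
    nu = np.asarray(nu, dtype=float)
    d = mu0.shape[-1]
    dof = nu - d + 1.0
    factor = np.asarray((kappa + 1.0) / (kappa * dof))[..., None, None]
    return mu0, psi * factor, dof

def student_t_covariance(scale, dof):
    """Covariance of a multivariate Student-t; only defined for dof > 2."""
    ratio = np.asarray(dof / (dof - 2.0))[..., None, None]
    return scale * ratio

def scalar_uncertainty(scale, dof):
    """One scalar per point: trace of the predictive covariance."""
    cov = student_t_covariance(scale, dof)
    return np.trace(cov, axis1=-2, axis2=-1)
```

The trace is only one possible way to reduce a 3×3 covariance to a per-point score; a determinant or largest eigenvalue would be equally plausible choices.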

If this is right

Yields 25 percent lower AURC and 41 percent lower AUSE on ScanNet++ relative to MASt3R confidence and common baselines.
Improves geometric accuracy alongside uncertainty quality across diverse benchmarks.
Supplies a reliability signal that can be used for uncertainty-aware weighting in downstream geometry tasks.
Maintains only moderate added inference cost while outperforming both single-pass heteroscedastic regression and sampling methods such as MC dropout and deep ensembles.
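
For readers unfamiliar with the AURC figure quoted above, the sketch below shows one common way to compute a risk-coverage curve and its area from per-pixel errors and uncertainty scores. The function names are hypothetical, and the paper's exact coverage grid and normalization are not stated on this page, so treat this as an illustration of the idea rather than the authors' evaluation code.

```python
import numpy as np

def risk_coverage_curve(errors, scores):
    """Keep the most confident points first and track the mean error.

    errors: per-point geometric error; scores: per-point uncertainty
    (lower means more confident). Returns (coverage, risk) arrays.
    """
    errors = np.asarray(errors, dtype=float).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    order = np.argsort(scores, kind="stable")      # most confident first
    sorted_err = errors[order]
    counts = np.arange(1, sorted_err.size + 1)
    coverage = counts / sorted_err.size
    risk = np.cumsum(sorted_err) / counts          # mean error of kept points
    return coverage, risk

def aurc(errors, scores):
    """Area under the risk-coverage curve, averaged over all coverages."""
    _, risk = risk_coverage_curve(errors, scores)
    return float(risk.mean())

def excess_aurc(errors, scores):
    """AURC minus the oracle AURC obtained by ranking with the true error."""
    return aurc(errors, scores) - aurc(errors, errors)
```

A perfect ranking gives `excess_aurc` of zero; larger values mean the uncertainty ordering lets more high-error points through at a given coverage.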

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same evidential head could be attached to other feed-forward pointmap predictors to add calibrated uncertainty with little redesign.
The closed-form Student-t representation may allow direct probabilistic fusion of predictions from multiple uncalibrated views without Monte-Carlo sampling.
Evaluating calibration on outdoor sequences with changing lighting or moving objects would test whether the uncertainty remains informative outside the current training distribution.
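
As a rough illustration of the fusion idea above, the sketch below combines several per-view estimates of the same 3D point by inverse-covariance weighting. This is an assumption-heavy stand-in: it treats each view's predictive as a Gaussian with the given covariance (for example a moment-matched Student-t covariance), assumes the estimates are independent and already expressed in a common frame, and does not perform exact Student-t fusion. The function name is hypothetical.

```python
import numpy as np

def fuse_gaussian_approx(means, covs):
    """Information-form fusion of V estimates of one 3D point.

    means: (V, 3) per-view point estimates in a shared frame.
    covs:  (V, 3, 3) per-view covariances (assumed Gaussian approximation).
    Returns the fused mean (3,) and fused covariance (3, 3).
    """
    means = np.asarray(means, dtype=float)
    covs = np.asarray(covs, dtype=float)
    precisions = np.linalg.inv(covs)                  # (V, 3, 3)
    fused_cov = np.linalg.inv(precisions.sum(axis=0))
    info = np.einsum("vij,vj->i", precisions, means)  # sum of precision @ mean
    fused_mean = fused_cov @ info
    return fused_mean, fused_cov
```

Views with larger predicted covariance contribute less to the fused point, which is the sense in which a calibrated per-point uncertainty could replace a heuristic confidence weight.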

Load-bearing premise

The uncertainty values produced by the evidential head rank true geometric error rates correctly across varied indoor and outdoor scenes instead of merely reflecting patterns in the training distribution.

What would settle it

A held-out dataset or scene type in which points assigned high uncertainty exhibit lower actual reconstruction error than points assigned low uncertainty would show the ranking has failed.
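
The check described above can be sketched as follows, assuming per-point uncertainty and ground-truth error arrays for a held-out scene are available. The quantile cut-offs and function names are illustrative choices, not values from the paper.

```python
import numpy as np

def _ordinal_ranks(x):
    """Ordinal ranks 0..n-1 (ties broken by position)."""
    x = np.asarray(x, dtype=float).ravel()
    order = np.argsort(x, kind="stable")
    ranks = np.empty(x.size, dtype=float)
    ranks[order] = np.arange(x.size)
    return ranks

def spearman(uncertainty, error):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    ru, re = _ordinal_ranks(uncertainty), _ordinal_ranks(error)
    return float(np.corrcoef(ru, re)[0, 1])

def ranking_failed(uncertainty, error, low_q=0.2, high_q=0.8):
    """True if high-uncertainty points have LOWER mean error than
    low-uncertainty points, i.e. the ranking is inverted."""
    u = np.asarray(uncertainty, dtype=float).ravel()
    e = np.asarray(error, dtype=float).ravel()
    lo_cut, hi_cut = np.quantile(u, [low_q, high_q])
    err_low_unc = e[u <= lo_cut].mean()
    err_high_unc = e[u >= hi_cut].mean()
    return bool(err_high_unc < err_low_unc)
```

A strongly negative Spearman coefficient or a `True` from `ranking_failed` on a new scene type would be the kind of evidence that undermines the load-bearing premise.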

Figures

Figures reproduced from arXiv: 2605.19539 by Chao Tian, Nuo Chen, Wenyuan Zhao, Zhiwen Fan, Zihao Zhu.

**Figure 1.** Reducing overconfident geometric failures with an evidential uncertainty map: 3D geometric pixels that fall into the top-q_err% of reconstruction errors while being assigned the lowest q_unc% of uncertainty are highlighted in red as overconfident failures. In contrast to the heuristic confidence in MASt3R, Trust3R reduces these incorrect yet highly confident regions, resulting in uncertainty estimates that…
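
The overconfident-failure definition in this caption can be sketched as a percentile mask. The threshold values `q_err` and `q_unc` are not given on this page, so the defaults below are placeholders, and the function name is hypothetical.

```python
import numpy as np

def overconfident_failure_mask(error, uncertainty, q_err=10.0, q_unc=10.0, valid=None):
    """Mark pixels in the top q_err% of error AND the lowest q_unc% of uncertainty.

    error, uncertainty: arrays of the same shape (e.g. H x W).
    valid: optional boolean mask; defaults to pixels where both inputs are finite.
    """
    error = np.asarray(error, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    if valid is None:
        valid = np.isfinite(error) & np.isfinite(uncertainty)
    # Thresholds are computed over valid pixels only.
    err_thresh = np.percentile(error[valid], 100.0 - q_err)
    unc_thresh = np.percentile(uncertainty[valid], q_unc)
    return valid & (error >= err_thresh) & (uncertainty <= unc_thresh)
```

The fraction of `True` pixels in this mask gives a single number for how often a method is confidently wrong, which is what the red overlay visualizes.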

**Figure 2.** Overview of the proposed Trust3R framework. Built on a feed-forward MASt3R backbone, we augment pointmap prediction with a gated residual head for geometry refinement and an evidential UQ head that predicts the uncertainty-aware pointmap. This yields a closed-form multivariate Student-t predictive distribution, enabling per-point uncertainty estimation with negligible inference overhead. The evidential unc…

**Figure 3.** Uncertainty ranking quality on three test sets. Top row: Risk–Coverage ΔR(c) = R_unc(c) − R_oracle(c), where R_oracle is obtained by sorting pixels by their true 3D error for the same method (method-specific oracle). Bottom row: Sparsification error ΔE(s) = E_unc(s) − E_oracle(s). Lower is better; Δ = 0 indicates perfect (oracle) ranking. Each subplot uses its own y-axis range for readability. selective ranking…
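
The sparsification error in the bottom row can be sketched as below: points are removed in order of decreasing uncertainty, the mean error of the remainder is tracked, and the same curve built from the true error is subtracted. The step count, the averaging used for the area, and the function names are assumptions; the paper's exact AUSE protocol is not given on this page.

```python
import numpy as np

def sparsification_curve(errors, scores, fractions):
    """Mean error of the points left after removing a fraction s of the
    highest-score points, for each s in fractions (each s must be < 1)."""
    errors = np.asarray(errors, dtype=float).ravel()
    scores = np.asarray(scores, dtype=float).ravel()
    order = np.argsort(-scores, kind="stable")   # highest score removed first
    sorted_err = errors[order]
    n = errors.size
    curve = []
    for s in fractions:
        removed = int(np.floor(s * n))
        curve.append(sorted_err[removed:].mean())
    return np.array(curve)

def ause(errors, scores, num_steps=20):
    """Mean gap between the uncertainty-based and oracle sparsification
    curves on a uniform grid of removal fractions in [0, 1)."""
    fractions = np.linspace(0.0, 1.0, num_steps, endpoint=False)
    e_unc = sparsification_curve(errors, scores, fractions)
    e_oracle = sparsification_curve(errors, errors, fractions)
    return float((e_unc - e_oracle).mean())
```

As with the risk-coverage gap, a value of zero means the uncertainty ordering matches the ordering by true error.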

**Figure 4.** Qualitative uncertainty comparison. (a) RGB input; (b) GT 3D point error; (c) mismatch map for MASt3R confidence; (d) mismatch map for Trust3R predictive uncertainty. Mismatch is defined as d(i) = |p_u(i) − p_e(i)|, where p_u and p_e are percentile ranks of uncertainty and GT error over valid pixels (lower is better). Darker pixels indicate lower mismatch (better uncertainty–error alignment), meaning unreliabl…
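
The mismatch map d(i) = |p_u(i) − p_e(i)| from this caption can be sketched as follows. Percentile ranks are computed here as ordinal ranks scaled to [0, 1], with ties broken by pixel order; the paper may handle ties or scaling differently, and the function names are hypothetical.

```python
import numpy as np

def percentile_rank(values):
    """Ordinal rank of each value, scaled to [0, 1]."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="stable")
    ranks = np.empty(values.size, dtype=float)
    ranks[order] = np.arange(values.size)
    return ranks / max(values.size - 1, 1)

def mismatch_map(uncertainty, error, valid):
    """Per-pixel |p_u - p_e| over valid pixels; invalid pixels are NaN."""
    uncertainty = np.asarray(uncertainty, dtype=float)
    error = np.asarray(error, dtype=float)
    valid = np.asarray(valid, dtype=bool)
    out = np.full(error.shape, np.nan)
    p_u = percentile_rank(uncertainty[valid])
    p_e = percentile_rank(error[valid])
    out[valid] = np.abs(p_u - p_e)
    return out
```

A map of all zeros means uncertainty and error rank the pixels identically; values near one flag pixels that are either confidently wrong or needlessly uncertain.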

**Figure 5.** Qualitative reliability on Tricky24, example A. Top: MASt3R baseline; bottom: Trust3R. Columns show RGB, oracle 3D error, and the corresponding uncertainty map.

**Figure 6.** Qualitative reliability on Tricky24, example B. Top: MASt3R baseline; bottom: Trust3R. Columns show RGB, oracle 3D error, and the corresponding uncertainty map.

Original abstract

Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images. However, in current feed-forward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To address this gap, we present Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks and compare against MASt3R's built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk-coverage and sparsification, and generally improves geometric accuracy. These gains are reflected in stronger uncertainty ranking across benchmarks, with 25% lower AURC and 41% lower AUSE on ScanNet++, providing a practical reliability signal for uncertainty-aware weighting in downstream geometry pipelines. The project page and code are available at https://trust3r-z.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction models. It augments existing architectures (e.g., MASt3R) with gated residual mean refinement and a Normal-Inverse-Wishart evidential head that yields a closed-form multivariate Student-t distribution for per-point geometric uncertainty. The method is evaluated on indoor/outdoor benchmarks against MASt3R confidence and baselines (heteroscedastic regression, MC dropout, deep ensembles), reporting consistent gains in risk-coverage and sparsification metrics, including 25% lower AURC and 41% lower AUSE on ScanNet++, along with moderate improvements in geometric accuracy. Code and project page are provided.

Significance. If the reported uncertainty rankings prove robust, this work would address a practical gap in geometric foundation models by supplying probabilistically interpretable per-point uncertainty with low overhead, enabling better downstream weighting and filtering. The closed-form Student-t output from the NIW head is a clear technical strength over sampling-based alternatives, and the public code release supports reproducibility and follow-on work. The significance is tempered by the need to confirm that gains stem from the evidential modeling rather than ancillary components or in-distribution fitting.

major comments (2)

[Experimental Results] Experimental section: The abstract and results claim 25% lower AURC and 41% lower AUSE on ScanNet++ relative to MASt3R and baselines, yet no ablation isolates the Normal-Inverse-Wishart evidential head from the gated residual refinement. Without this decomposition it is impossible to establish that the uncertainty ranking improvement is attributable to the evidential construction rather than the refinement module or implementation details.
[Methods] Methods, NIW evidential head: The central claim requires that the closed-form multivariate Student-t uncertainty ranks points according to true geometric error across diverse scenes. The manuscript provides no OOD splits, calibration curves, or explicit tests separating training-distribution correlations from generalizable failure-mode detection; this leaves the generalizability of the learned evidential parameters unverified.

minor comments (2)

[Figure 2] Figure 2 (architecture diagram): The flow from gated residual refinement into the NIW head parameters could be labeled more explicitly to clarify how the mean and covariance parameters are produced.
[Methods] Notation: The definition of the multivariate Student-t parameters (location, scale, degrees of freedom) should be restated once in the main text for readers who skip the supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing the strongest honest defense of the work while acknowledging where revisions strengthen the presentation.

Referee: [Experimental Results] Experimental section: The abstract and results claim 25% lower AURC and 41% lower AUSE on ScanNet++ relative to MASt3R and baselines, yet no ablation isolates the Normal-Inverse-Wishart evidential head from the gated residual refinement. Without this decomposition it is impossible to establish that the uncertainty ranking improvement is attributable to the evidential construction rather than the refinement module or implementation details.

Authors: We agree that an ablation isolating the NIW evidential head from the gated residual refinement is necessary to attribute the uncertainty gains. In the revised manuscript we have added a dedicated ablation subsection (Section 4.3) that reports three variants on ScanNet++: the original MASt3R baseline, MASt3R augmented only with the gated residual refinement (paired with a conventional heteroscedastic head), and the full Trust3R model. The new table shows that the refinement module alone yields moderate accuracy gains and modest uncertainty ranking improvements, while the NIW head accounts for the majority of the reported 25% AURC and 41% AUSE reductions. We have also updated the abstract and experimental summary to reference these results.

Revision: yes

Referee: [Methods] Methods, NIW evidential head: The central claim requires that the closed-form multivariate Student-t uncertainty ranks points according to true geometric error across diverse scenes. The manuscript provides no OOD splits, calibration curves, or explicit tests separating training-distribution correlations from generalizable failure-mode detection; this leaves the generalizability of the learned evidential parameters unverified.

Authors: The evaluation already includes substantial distribution shift through the use of both indoor (ScanNet++) and outdoor benchmarks whose scene statistics, lighting, and geometry differ markedly from the training distribution. These cross-benchmark results, together with the consistent ranking improvements, provide evidence that the learned NIW parameters generalize beyond training correlations. We have added a short discussion paragraph in Section 5 that explicitly addresses this point and references qualitative examples of high-uncertainty regions coinciding with reconstruction failures. While we did not include separate OOD splits or explicit calibration curves in the original submission, the risk-coverage and sparsification plots already serve as ranking-based calibration diagnostics. We have expanded the caption of Figure 4 to clarify this interpretation.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in Trust3R evidential derivation

The paper introduces a novel combination of gated residual refinement and a Normal-Inverse-Wishart evidential head to derive a closed-form multivariate Student-t per-point uncertainty. This follows standard evidential deep learning constructions and is not defined in terms of its own output quantities. Parameters of the head are learned end-to-end; the reported AURC/AUSE improvements are measured on external benchmarks (ScanNet++ and others) against independent baselines including MASt3R. No load-bearing self-citation, no fitted input renamed as prediction, and no uniqueness theorem imported from prior author work. The derivation remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

Abstract-only review provides no explicit list of hyperparameters or modeling assumptions; the evidential head likely introduces several learned parameters whose values are not stated here.

free parameters (1)

evidential head parameters
Parameters of the Normal-Inverse-Wishart distribution and gated residual refinement are learned from data but their specific count or initialization is not given in the abstract.

pith-pipeline@v0.9.0 · 5773 in / 1253 out tokens · 49471 ms · 2026-05-20T06:06:02.967146+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty"

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "We assume that every 3D point X_i is drawn from a multivariate Gaussian likelihood X_i ∼ N(μ_i, Σ_i)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 5 internal anchors

[1] Birkl, R., Wofk, D., and Müller, M. MiDaS v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460.

[2] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

[3] Fillioux, L., Silva-Rodríguez, J., Ayed, I. B., Cournede, P.-H., Vakalopoulou, M., Christodoulidis, S., and Dolz, J. Are foundation models for computer vision good conformal predictors? arXiv preprint arXiv:2412.06082.

[4] Geifman, Y., Uziel, G., and El-Yaniv, R. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206.

[5] Lin, T., Li, G., Zhong, Y., Zou, Y., Du, Y., Liu, J., Gu, E., and Zhao, B. Evo-0: Vision-language-action model with implicit spatial understanding. arXiv preprint arXiv:2507.00416.

[6] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[7] Meinert, N. and Lavin, A. Multivariate deep evidential regression. arXiv preprint arXiv:2104.06135.

[8] Qian, Z., Chi, X., Li, Y., Wang, S., Qin, Z., Ju, X., Han, S., and Zhang, S. Wristworld: Generating wrist-views via 4d world models for robotic manipulation. arXiv preprint arXiv:2510.07313.

[9] Sun, X., Wang, S., Zhang, F., Liu, L., Jia, C., Song, Z., Huang, Z., and Luo, Y. Vggt-world: Transforming vggt into an autoregressive geometry world model. arXiv preprint arXiv:2603.12655.

[10] Wang, H. and Agapito, L. 3d reconstruction with spatial memory. arXiv preprint arXiv:2408.16061.

[11] Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., and Novotny, D. Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 5294–5306, 2025a.
Wang, Q., Zhang, Y., Holynski, A., Efros, A. A., and Kanazawa, A. Continuous 3d perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.

[12] Wu, H., Wu, D., He, T., Guo, J., Ye, Y., Duan, Y., and Bian, J. Geometry forcing: Marrying video diffusion and 3d representation for consistent world modeling. arXiv preprint arXiv:2507.07982.

[13] Yu, B., Lian, S., Lin, X., Shen, Z., Wei, Y., Liu, H., Wu, C., Yuan, H., Wang, B., Huang, C., et al. 3d-mix for vla: A plug-and-play module for integrating vggt-based 3d information into vision-language-action models. arXiv preprint arXiv:2603.24393.

[14] Zama Ramirez, P., Costanzino, A., Tosi, F., Poggi, M., Di Stefano, L., Weibel, J.-B., Bauer, D., Antensteiner, D., Vincze, M., Li, J., et al. Tricky 2024 challenge on monocular depth from images of specular and transparent surfaces. In European Conference on Computer Vision, pp. 248–266. Springer.
[15] Appendix excerpt (A. Additional Method Details): Given an input image pair (I1, I2), the frozen feed-forward geometry backbone predicts dense per-pixel 3D pointmaps X̂^v_0 ∈ ℝ^{H×W×3} for view v ∈ {1, 2}, together with the original confidence map when available. Our uncertainty heads output...

[16] Appendix excerpt (MC Dropout and Deep Ensemble settings): MC Dropout uses T = 16 stochastic forward passes. Deep Ensemble uses K = 5 independently trained models. Each ensemble member uses the same training script and setup as the corresponding baseline, with independent training runs. Larger T or K may improve sampling-based estimates but directly increases inference cost. Fa...

[17] Appendix excerpt: The heteroscedastic baseline keeps the frozen MASt3R mean and only adds a variance head; therefore, its MAE/RMSE are identical to the MASt3R row in Main Table 2 and are not duplicated here. Deep Ensembles can improve reconstruction accuracy because they aggregate multiple independently trained models, but this requires substantially more training and infe...

[18] Appendix excerpt (latency and memory):

| Variant | Latency (ms) | Peak memory (MB) | Extra vs. NIG |
| XYZ-NIG | 54.37 | 6005.5 | – |
| XYZ-NIW | 55.49 | 6266.8 | +1.12 ms / +261.3 MB |

Table 14 reports a component-wise latency micro-benchmark. It isolates the added cost of the evidential head and gated residual branch. This table should be interpreted as a component-level profile rather than as a replacement for the end-t...
