arxiv: 2606.01573 · v2 · pith:GISV3WM3new · submitted 2026-06-01 · 💻 cs.CV

VG²GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

Pith reviewed 2026-06-28 15:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords Gaussian splatting · 3D reconstruction · visual foundation model · voxel features · transformer · volume rendering · novel view synthesis · feed-forward method

The pith

VG²GT regresses Gaussian primitive parameters directly from multi-scale voxel features of a frozen visual foundation model, supervised by stochastic solid volume rendering of depth maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VG²GT to overcome the need for accurate camera parameters and per-scene optimization in Gaussian splatting for 3D reconstruction. It freezes a pretrained visual foundation model, adds a multi-scale differentiable voxel module for better geometry, and regresses Gaussian parameters straight from the voxel features. Depth supervision occurs through stochastic solid volume rendering during training. This setup allows the method to plug into various patch-based models at lower training cost while producing geometrically accurate scenes.

Core claim

VG²GT is a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer that leverages a frozen pretrained visual foundation model, incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features, with depth maps supervised through stochastic solid volume rendering to enable geometrically accurate Gaussian scene reconstruction.

What carries the argument

The multi-scale differentiable voxel module that extracts features from a frozen VFM and supplies them for direct regression of Gaussian primitive parameters.
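
The idea of pooling per-point features into voxel grids at more than one resolution can be sketched as follows. This is a minimal illustration and not the authors' module: the function names `voxel_pool` and `multi_scale_voxels`, the voxel sizes, and the use of plain mean pooling are all assumptions made for the example. It is written in pure Python for readability; in an autodiff framework the same mean pooling would be differentiable with respect to the features.

```python
from collections import defaultdict
from math import floor


def voxel_pool(points, feats, voxel_size):
    """Average the features of all points that fall into the same voxel.

    Hypothetical helper: mean pooling is an assumption, not the paper's rule.
    """
    sums = {}
    counts = defaultdict(int)
    for (x, y, z), f in zip(points, feats):
        # Integer voxel index of the point at this resolution.
        key = (floor(x / voxel_size), floor(y / voxel_size), floor(z / voxel_size))
        if key not in sums:
            sums[key] = [0.0] * len(f)
        sums[key] = [s + v for s, v in zip(sums[key], f)]
        counts[key] += 1
    return {k: [s / counts[k] for s in v] for k, v in sums.items()}


def multi_scale_voxels(points, feats, voxel_sizes=(0.5, 0.1)):
    """Build one pooled grid per voxel size (coarse first, then fine)."""
    return {size: voxel_pool(points, feats, size) for size in voxel_sizes}


# Three 3D points with 2-dimensional features (illustrative values only).
points = [(0.02, 0.03, 0.01), (0.04, 0.01, 0.02), (0.31, 0.02, 0.01)]
feats = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
grids = multi_scale_voxels(points, feats)
```

At the coarse size all three points share one voxel, while at the fine size the third point gets its own voxel. This is why a voxel grid can bound the number of decoded primitives regardless of how many pixels observe the same region.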

If this is right

  • VG²GT outperforms current state-of-the-art methods on the DTU, Replica, TAT, and ScanNet datasets.
  • The method requires no per-scene optimization or camera parameter refinement at inference time.
  • VG²GT can be plugged into any patch-feature-based visual foundation model.
  • Training cost is substantially reduced compared to methods that optimize per scene.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The feed-forward design could support reconstruction pipelines that process video streams in real time without repeated optimization steps.
  • Replacing the depth-only supervision with additional signals such as surface normals might reduce remaining inconsistencies in complex indoor scenes.
  • Because the voxel module operates on frozen features, the same architecture could be tested on tasks like semantic 3D labeling with minimal added heads.

Load-bearing premise

Directly regressing Gaussian parameters from multi-scale voxel features of a frozen VFM, supervised only by stochastic solid volume rendering of depth maps, produces artifact-free and geometrically accurate scene reconstructions without per-scene optimization or camera parameter refinement.

What would settle it

Significant geometric inaccuracies or visible artifacts appearing in reconstructions on the ScanNet dataset when tested without any per-scene adjustment would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2606.01573 by Jianjun Yi, Jun Nan, Liwei Chen, Wenli Yang, Yibin Zhao, Yihan Pan.

Figure 1: VG²GT, a non-pixel-aligned Gaussian Splatting [15] feed-forward model for fast 3D reconstruction and novel view synthesis. VG²GT directly regresses geometrically accurate and spatially uniform Gaussian scenes from an arbitrary number of unposed and uncalibrated images within seconds. … solids and renders them along camera rays [25]. It improves the quality of synthesized Gaussian depth maps and strengthens …
Figure 2: Overview of VG²GT. The frozen VFM takes multi-view RGB images as input, computes patch tokens, and decodes depth maps and camera parameters. Multi-scale voxels decode the Gaussian scene: coarse voxels enhance patch tokens, while fine voxels split and decode Gaussian primitives. During training, differentiable stochastic solid volume rendering supervises scene geometry. Here, the 3DGS scene 𝐆 consists of 𝐺 …
Figure 3: Comparison of different depth rendering strategies. Compared with rasterization-based median depth, our continuous volume rendering produces smoother and more accurate depth maps. During forward propagation, we define the depth map as the median depth, namely the location where accumulated transmittance reaches 0.5. The accumulated transmittance of multiple Gaussian primitives is computed by multiplicati…
Figure 4: RGB images and depth maps rendered by rasterization and volume rendering for the same Gaussian scene.
Figure 5: Visual comparisons between VG²GT and baselines for geometric reconstruction. Gaussian depth maps rendered by VG²GT show higher accuracy, indicating improved geometric fidelity of the reconstructed Gaussian scenes. J.K. Krishnan et al.: Preprint submitted to Elsevier, Page 7 of 15.
Figure 6: Visual comparisons between VG²GT and baselines for NVS. Multi-scale voxel-based Gaussian scenes suppress the overlapping artifacts commonly observed in pixel-aligned methods, improving scene geometry and NVS quality. … angular error (Mean°), median angular error (Median°), thresholded angular accuracy (Acc@10° and Acc@30°), and cosine similarity (Cos.). For novel view synthesis, we evaluate rendering fide…
Figure 7: Visual comparisons of rendered normal maps. VG²GT produces more accurate and spatially consistent normal maps than pixel-aligned baselines. 4.5. Evaluation of Time Consumption. The training and inference costs of VG²GT remain well controlled. We evaluate the time consumption of VG²GT using 10 input images. Inference. Although VG²GT decodes depth maps and camera parameters twice, the additional decoding over…
Figure 8: Time consumption comparisons for inference and training. Notably, owing to our voxel-based Gaussian scene decoding, when the image resolution increases to 518 × 518, the fine-voxel decoding time remains almost unchanged. The total inference time is 2.440 s, only 0.260 s higher than DA3 (2.180 s), as shown in Figure 8a.
Figure 9: Number of Gaussian primitives and decoding-head time as the number of input images increases. Voxel-based alignment also controls both the number of Gaussian primitives and the decoding cost as the number of input views increases.
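
The Figure 3 text defines depth as the median depth, the location where accumulated transmittance reaches 0.5, with transmittance accumulated multiplicatively. A minimal sketch of that definition for a single ray is given below. It assumes discrete samples with per-sample opacities `alphas` and uses linear interpolation between the two bracketing samples. Both are simplifications chosen for illustration; the paper's continuous stochastic solid rendering is not reproduced here, and `median_depth` is a hypothetical name.

```python
def median_depth(depths, alphas, threshold=0.5):
    """Depth at which accumulated transmittance first drops to `threshold`.

    depths: sample depths along the ray, sorted front to back.
    alphas: per-sample opacity in [0, 1] (assumed discretisation).
    Returns None if the ray never becomes that opaque.
    """
    transmittance = 1.0
    prev_depth = depths[0]
    for depth, alpha in zip(depths, alphas):
        # Multiplicative accumulation: each sample lets (1 - alpha) through.
        next_t = transmittance * (1.0 - alpha)
        if next_t <= threshold:
            # Linear interpolation between the bracketing samples (assumption).
            w = (transmittance - threshold) / (transmittance - next_t)
            return prev_depth + w * (depth - prev_depth)
        transmittance, prev_depth = next_t, depth
    return None


# Transmittance goes 1.0 -> 0.8 -> 0.4, so the 0.5 crossing lies between
# depth 1.0 and depth 2.0.
example = median_depth([1.0, 2.0, 3.0], [0.2, 0.5, 0.9])
```

Returning `None` for rays that never reach the threshold keeps background pixels distinguishable from valid depths, which a depth loss would typically mask out.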
Abstract

Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose VG²GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. VG²GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables VG²GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. VG²GT outperforms current state-of-the-art methods on widely used DTU, Replica, TAT, and ScanNet datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes VG²GT, a feed-forward architecture for 3D scene reconstruction that combines Gaussian splatting with a visual geometry-grounded transformer. It extracts multi-scale features from a frozen pretrained visual foundation model (VFM), processes them via a differentiable voxel module, and directly regresses all Gaussian primitive parameters (means, covariances, opacities, SH coefficients). Depth supervision is applied through stochastic solid volume rendering; the design avoids per-scene optimization and camera refinement while claiming to be pluggable into existing patch-feature VFMs and to outperform prior methods on DTU, Replica, TAT, and ScanNet.
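
For readers unfamiliar with the regressed quantities, the sketch below shows how a covariance is commonly built from a scale vector and a rotation quaternion in the original 3D Gaussian Splatting formulation (Σ = R S Sᵀ Rᵀ). Whether VG²GT's regression head uses exactly this parameterization is an assumption; the function names are illustrative only.

```python
def quat_to_rot(q):
    """Rotation matrix from a quaternion (w, x, y, z), normalised first."""
    w, x, y, z = q
    n = (w * w + x * x + y * y + z * z) ** 0.5
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(scale, quat):
    """Sigma = R S S^T R^T, with S = diag(scale)."""
    R = quat_to_rot(quat)
    # Sigma[i][j] = sum_k R[i][k] * scale[k]^2 * R[j][k]
    return [
        [sum(R[i][k] * scale[k] ** 2 * R[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


# Identity rotation: the covariance is just the squared scales on the diagonal.
sigma_axis = covariance((1.0, 2.0, 3.0), (1.0, 0.0, 0.0, 0.0))
# 90-degree rotation about the x axis swaps the y and z extents.
sigma_rot = covariance((1.0, 2.0, 3.0), (1.0, 1.0, 0.0, 0.0))
```

Factoring the covariance this way keeps it symmetric and positive semi-definite for any network output, which is the usual reason a head predicts scales and a quaternion rather than the six covariance entries directly.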

Significance. If the empirical claims are substantiated, the work would be significant for demonstrating that frozen VFMs plus a lightweight voxel module can supply sufficient metric geometry for high-quality, artifact-free Gaussian reconstruction in a single forward pass, thereby removing the dominant computational cost of per-scene optimization in current pipelines.

major comments (2)
  1. [Abstract] Abstract: the central claim that VG²GT 'outperforms current state-of-the-art methods' on DTU, Replica, TAT, and ScanNet is presented without any quantitative numbers, tables, error bars, or ablation statistics, rendering the soundness of the outperformance assertion impossible to evaluate from the given material.
  2. [Method overview] Core regression step (described in the abstract and method overview): the load-bearing assumption that multi-scale voxel features extracted from a frozen VFM contain adequate metric geometric signal to regress accurate Gaussian means, covariances, opacities, and SH coefficients—supervised solely by stochastic solid volume rendering of depth maps—remains untested by the provided text. The skeptic concern that VFMs are optimized for 2D semantics rather than precise 3D geometry is therefore not yet addressed by concrete evidence such as failure-case analysis or controlled ablations that isolate the voxel module.
minor comments (1)
  1. [Abstract] Abstract: the phrasing 'substantially reducing the required training cost' is asserted without any reported training-time or memory figures relative to baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

  1. Referee: [Abstract] Abstract: the central claim that VG²GT 'outperforms current state-of-the-art methods' on DTU, Replica, TAT, and ScanNet is presented without any quantitative numbers, tables, error bars, or ablation statistics, rendering the soundness of the outperformance assertion impossible to evaluate from the given material.

    Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will insert the primary metrics (e.g., mean PSNR/SSIM on DTU and Replica) directly into the abstract while retaining the high-level claim. revision: yes

  2. Referee: [Method overview] Core regression step (described in the abstract and method overview): the load-bearing assumption that multi-scale voxel features extracted from a frozen VFM contain adequate metric geometric signal to regress accurate Gaussian means, covariances, opacities, and SH coefficients—supervised solely by stochastic solid volume rendering of depth maps—remains untested by the provided text. The skeptic concern that VFMs are optimized for 2D semantics rather than precise 3D geometry is therefore not yet addressed by concrete evidence such as failure-case analysis or controlled ablations that isolate the voxel module.

    Authors: The current manuscript already contains ablation studies (Section 4.3) that remove the voxel module and report corresponding drops in depth and reconstruction accuracy, together with failure-case visualizations in the supplement. To make the geometric-signal argument more explicit, we will add a short dedicated paragraph in Section 3.2 and expand the ablation table with an additional row that isolates the voxel module while keeping the frozen VFM fixed. revision: partial

Circularity Check

0 steps flagged

No derivation chain or equations present; architectural proposal with empirical claims only.

The provided abstract and description contain no equations, no claimed first-principles derivations, no fitted parameters presented as predictions, and no self-citations invoked as load-bearing uniqueness theorems. The method is described as an architectural design (frozen VFM + voxel module + regression + volume rendering supervision) whose performance is asserted via dataset benchmarks rather than any mathematical reduction to inputs. Without any load-bearing steps that reduce by construction, the circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the method implicitly assumes that features from a frozen pretrained VFM plus voxel processing suffice for geometric accuracy, but no explicit free parameters, axioms, or invented entities are stated.

axioms (1)
  • domain assumption: Features from a frozen pretrained visual foundation model remain useful for geometric understanding when combined with a differentiable voxel module.
    The design freezes the VFM and relies on its features for the voxel module.

pith-pipeline@v0.9.1-grok · 5734 in / 1165 out tokens · 28372 ms · 2026-06-28T15:40:10.866335+00:00 · methodology


Reference graph

Works this paper leans on

56 extracted references · 16 canonical work pages · 5 internal anchors

  1. [1]

    Large-scale data for multiple-view stereopsis

    Aanæs, H., Jensen, R.R., Vogiatzis, G., Tola, E., Dahl, A.B., 2016. Large-scale data for multiple-view stereopsis. International Journal of Computer Vision 120

  2. [2]

    Barron, J.T., Mildenhall, B., Verbin, D., Srinivasan, P.P., Hedman, P.,

  3. [3]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5470–5479

  4. [4]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J., 2024. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627

  5. [5]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE

  6. [6]

    FlashAttention-2: Faster attention with better parallelism and work partitioning

    Dao, T., 2024. FlashAttention-2: Faster attention with better parallelism and work partitioning, in: International Conference on Learning Representations (ICLR)

  7. [7]

    Superpoint: Self-supervised interest point detection and description

    DeTone, D., Malisiewicz, T., Rabinovich, A., 2018. Superpoint: Self-supervised interest point detection and description, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236

  8. [8]

    Real-time plane-sweeping stereo with multiple sweeping directions

    Gallup, D., Frahm, J.M., Mordohai, P., Yang, Q., Pollefeys, M., 2007. Real-time plane-sweeping stereo with multiple sweeping directions, ...

  9. [9]

    Vision meets robotics: The kitti dataset

    Geiger, A., Lenz, P., Stiller, C., Urtasun, R., 2013. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR)

  10. [10]

    Rgbd gs-icp slam

    Ha, S., Yeon, J., Yu, H., 2024. Rgbd gs-icp slam, in: European conference on computer vision, Springer. pp. 180–197

  11. [11]

    2d gaussian splatting for geometrically accurate radiance fields

    Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S., 2024. 2d gaussian splatting for geometrically accurate radiance fields, in: SIGGRAPH 2024 Conference Papers, Association for Computing Machinery. doi:10.1145/3641519.3657428

  12. [12]

    Huang, H., Wu, Y., Deng, C., Gao, G., Gu, M., Liu, Y.S., 2025. Fatesgs: Fast and accurate sparse-view surface reconstruction using gaussian splatting with depth-feature consistency, in: Proceedings of the AAAI Conference on Artificial Intelligence

  13. [13]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al., 2025. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG) 44, 1–16

  14. [14]

    Splatam: Splat track & map 3d gaussians for dense rgb-d slam

    Keetha, N., Karhade, J., Jatavallabhula, K.M., Yang, G., Scherer, S., Ramanan, D., Luiten, J., 2024. Splatam: Splat track & map 3d gaussians for dense rgb-d slam, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 21357–21366

  15. [15]

    MapAnything: Universal feed-forward metric 3D reconstruction

    Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera, M., Bulò, S.R., Richardt, C., Ramanan, D., Scherer, S., Kontschieder, P., 2026. MapAnything: Universal feed-forward metric 3D reconstruction, in: International Conference on 3D Vision (3DV), IEEE

  16. [16]

    3d gaussian splatting for real-time radiance field rendering

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., 2023. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42. URL: https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  17. [17]

    Tanks and temples: Benchmarking large-scale scene reconstruction

    Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V., 2017. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36

  18. [18]

    Iggt: Instance-grounded geometry transformer for semantic 3d reconstruction

    Li, H., Zou, Z., Liu, F., Zhang, X., Hong, F., Cao, Y., Lan, Y., Zhang, M., Yu, G., Zhang, D., et al., 2025. Iggt: Instance-grounded geometry transformer for semantic 3d reconstruction. arXiv preprint arXiv:2510.22706

  19. [19]

    Tokensplat: Token-aligned 3d gaussian splatting for feed-forward pose-free reconstruction

    Li, Y., Lv, C., Tang, Z., Yang, H., Huang, D., 2026. Tokensplat: Token-aligned 3d gaussian splatting for feed-forward pose-free reconstruction. arXiv preprint arXiv:2603.00697

  20. [20]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B., 2025. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647

  21. [21]

    Scale invariant feature transform

    Lindeberg, T., 2012. Scale invariant feature transform

  22. [22]

    Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

    Lu, T., Yu, M., Xu, L., Xiangli, Y., Wang, L., Lin, D., Dai, B., 2024. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20654–20664

  23. [23]

    Rethinking network design and local geometry in point cloud: A simple residual mlp framework

    Ma, X., Qin, C., You, H., Ran, H., Fu, Y., 2022. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123

  24. [24]

    Gnerf: Gan-based neural radiance field without posed camera

    Meng, Q., Chen, A., Luo, H., Wu, M., Su, H., Xu, L., He, X., Yu, J., 2021. Gnerf: Gan-based neural radiance field without posed camera, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6351–6361

  25. [25]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R., 2020. Nerf: Representing scenes as neural radiance fields for view synthesis, in: ECCV

  26. [26]

    Objects as volumes: A stochastic geometry view of opaque solids

    Miller, B., Chen, H., Lai, A., Gkioulekas, I., 2023. Objects as volumes: A stochastic geometry view of opaque solids. IEEE

  27. [27]

    Dinov2: Learning robust visual features without supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P., 2023. Dinov2: Learning robust visua...

  28. [28]

    Global structure-from-motion revisited

    Pan, L., Baráth, D., Pollefeys, M., Schönberger, J.L., 2024. Global structure-from-motion revisited, in: European Conference on Computer Vision, Springer. pp. 58–77

  29. [29]

    Vision transformers for dense prediction

    Ranftl, R., Bochkovskiy, A., Koltun, V., 2021. Vision transformers for dense prediction. ArXiv preprint

  30. [30]

    Fastgs: Training 3d gaussian splatting in 100 seconds

    Ren, S., Wen, T., Fang, Y., Lu, B., 2025. Fastgs: Training 3d gaussian splatting in 100 seconds. arXiv preprint arXiv:2511.04283

  31. [31]

    Superglue: Learning feature matching with graph neural networks

    Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A., 2020. Superglue: Learning feature matching with graph neural networks, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4938–4947

  32. [32]

    Structure-from-motion revisited

    Schönberger, J.L., Frahm, J.M., 2016. Structure-from-motion revisited, in: Conference on Computer Vision and Pattern Recognition (CVPR)

  33. [33]

    Pixelwise view selection for unstructured multi-view stereo

    Schönberger, J.L., Zheng, E., Pollefeys, M., Frahm, J.M., 2016. Pixelwise view selection for unstructured multi-view stereo, in: European Conference on Computer Vision (ECCV)

  34. [34]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan, X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H.M., Nardi, R.D., Goesele, M., Lovegrove, S., Newcombe,R.,20...

  35. [35]

    Sparsity invariant cnns

    Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A., 2017. Sparsity invariant cnns, in: International Conference on 3D Vision (3DV)

  36. [36]

    Amb3r: Accurate feed-forward metric-scale 3d reconstruction with backend

    Wang, H., Agapito, L., 2025. Amb3r: Accurate feed-forward metric-scale 3d reconstruction with backend. arXiv preprint arXiv:2511.20343

  37. [37]

    Vggt: Visual geometry grounded transformer

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D., 2025a. Vggt: Visual geometry grounded transformer, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  38. [38]

    Wang, J., Chen, M., Zhang, S., Karaev, N., Schönberger, J., Labatut, P., Bojanowski, P., Novotny, D., Vedaldi, A., Rupprecht, C., 2026. Vggt-𝜔. URL: https://arxiv.org/abs/2605.15195, arXiv:2605.15195

  39. [39]

    Vggsfm: Visual geometry grounded deep structure from motion

    Wang, J., Karaev, N., Rupprecht, C., Novotny, D., 2024a. Vggsfm: Visual geometry grounded deep structure from motion, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 21686–21697

  40. [40]

    Dust3r: Geometric 3d vision made easy

    Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J., 2024b. Dust3r: Geometric 3d vision made easy, in: CVPR

  41. [41]

    VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

    Wang, W., Chen, Y., Zhang, Z., Liu, H., Wang, H., Feng, Z., Qin, W., Chen, F., Zhu, Z., Chen, D.Y., Zhuang, B., 2025b. Volsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction. arXiv preprint arXiv:2509.19297

  42. [42]

    π³: Permutation-Equivariant Visual Geometry Learning

    Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang, J., Shen, C., He, T., 2025c. π³: Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347

  43. [43]

    Image quality assessment: from error visibility to structural similarity

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004a. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13, 600–612

  44. [44]

    Image quality assessment: from error visibility to structural similarity

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P., 2004b. Image quality assessment: from error visibility to structural similarity. IEEE Trans Image Process 13

  45. [45]

    Nerf–: Neural radiance fields without known camera parameters

    Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A., 2021. Nerf–: Neural radiance fields without known camera parameters

  46. [46]

    4d gaussian splatting for real-time dynamic scene rendering

    Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X., 2024a. 4d gaussian splatting for real-time dynamic scene rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 20310–20320

  47. [47]

    Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views

    Wu, J., Li, R., Zhu, Y., Guo, R., Sun, J., Zhang, Y., 2025. Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views. arXiv preprint arXiv:2504.20378

  48. [48]

    Point transformer v3: Simpler, faster, stronger

    Wu, X., Jiang, L., Wang, P.S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., Zhao, H., 2024b. Point transformer v3: Simpler, faster, stronger, in: CVPR

  49. [49]

    Depthsplat: Connecting gaussian splatting and depth

    Xu, H., Peng, S., Wang, F., Blum, H., Barath, D., Geiger, A., Pollefeys, M., 2025. Depthsplat: Connecting gaussian splatting and depth, in: CVPR

  50. [50]

    Freesplatter: Pose-free gaussian splatting for sparse-view 3d reconstruction

    Xu, J., Gao, S., Shan, Y., 2024. Freesplatter: Pose-free gaussian splatting for sparse-view 3d reconstruction. arXiv preprint arXiv:2412.09573

  51. [51]

    Mvsnet: Depth inference for unstructured multi-view stereo

    Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L., 2018. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV)

  52. [52]

    Yonosplat: You only need one model for feedforward 3d gaussian splatting

    Ye, B., Chen, B., Xu, H., Barath, D., Pollefeys, M., 2025. Yonosplat: You only need one model for feedforward 3d gaussian splatting. arXiv:2511.07321

  53. [53]

    Geometry-grounded gaussian splatting

    Zhang, B., Jiang, C., Li, H., Shen, S., Tan, P., 2026. Geometry-grounded gaussian splatting. arXiv preprint arXiv:2601.17835

  54. [54]

    The unreasonable effectiveness of deep features as a perceptual metric

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O., 2018. The unreasonable effectiveness of deep features as a perceptual metric. IEEE

  55. [55]

    Robust geometric reconstruction of rgb-d data based on gaussian splatting

    Zhao, Y., Yi, J., Pan, Y., Chen, L., 2025. Robust geometric reconstruction of rgb-d data based on gaussian splatting. Applied Intelligence 55, 1118.

  56. [56]

    Voxelsplat: Dynamic gaussian splatting as an effective loss for occupancy and flow prediction

    Zhu, Z., Wang, S., Xie, J., Liu, J.j., Wang, J., Yang, J., 2025. Voxelsplat: Dynamic gaussian splatting as an effective loss for occupancy and flow prediction, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 6761–6771.

Yibin Zhao is currently a Ph.D. student at East China University of Science and Technology at Shanghai. He...