pith. machine review for the scientific record. sign in

arxiv: 2605.04435 · v1 · submitted 2026-05-06 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes

Chen Min, Fanjie Kong, Fuyang Liu, Jilin Mei, Shuai Wang, Shuo Wang, Wenfei Guan, Yu Hu, Zhihua Zhao

Pith reviewed 2026-05-08 18:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstructionGaussian Splattingoff-road scenesfeedforward modeltemporal aggregationsurface normalspose-free reconstructionautonomous driving
0
0 comments X

The pith

Ground4D resolves temporal conflicts in off-road 4D reconstruction by partitioning Gaussians into spatial voxels for localized temporal attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ground4D as a feedforward method for 4D scene reconstruction that targets unstructured off-road environments where standard Gaussian Splatting fails. High-frequency terrain details, vehicle jitter, and non-rigid motion create inconsistent observations of the same Gaussian across frames, producing either blurred results or broken geometry. The approach counters this by dividing the space into voxels and restricting temporal attention to operate only inside each voxel, with a softmax that ties selection strength to occupancy. Surface normal estimates are added to keep the underlying surfaces consistent. If correct, this yields cleaner renderings on rough terrain and works on new locations without retraining or pose data.

Core claim

Ground4D is a spatially-grounded 4D feedforward framework that resolves conflicting Gaussian observations across timestamps in pose-free off-road scenes. It does so by introducing voxel-grounded temporal Gaussian aggregation, which divides the canonical space into voxels and applies query-conditioned temporal attention inside each voxel with intra-voxel softmax normalization. Surface normal cues are added as auxiliary guidance to regularize Gaussian geometry. Experiments on ORAD-3D and RELLIS-3D show consistent outperformance over prior feedforward methods and zero-shot generalization to unseen domains.

What carries the argument

Voxel-grounded temporal Gaussian aggregation, which partitions the canonical Gaussian space into spatial voxels and performs query-conditioned temporal attention within each voxel using intra-voxel softmax normalization to make temporal selectivity and spatial occupancy reinforce each other, aided by surface normal cues.

If this is right

  • Reconstruction quality improves over existing feedforward Gaussian methods on off-road datasets.
  • The model generalizes zero-shot to unseen off-road domains without retraining.
  • Temporal conflicts from ego-motion and non-rigid motion are reduced, avoiding both over-smoothing and structural breaks.
  • Pose-free operation becomes feasible in unstructured terrain where camera calibration is unreliable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same voxel-localized conditioning could be tested on other dynamic settings such as forests or construction sites to check whether the spatial partitioning transfers.
  • Because the method works without poses, it might support mapping pipelines that rely only on visual odometry in GPS-denied areas.
  • Intra-voxel normalization could be adapted to other attention-based 3D models to enforce locality without full global recomputation.

Load-bearing premise

That localizing temporal attention inside spatial voxels together with surface normal guidance can remove conflicting observations across time without introducing new inconsistencies or requiring scene-specific tuning.

What would settle it

Persistent structural artifacts or over-smoothed surfaces appearing in renderings from a high-jitter off-road sequence when the voxel aggregation and normal cues are applied would show that the method does not resolve temporal conflicts as claimed.

Figures

Figures reproduced from arXiv: 2605.04435 by Chen Min, Fanjie Kong, Fuyang Liu, Jilin Mei, Shuai Wang, Shuo Wang, Wenfei Guan, Yu Hu, Zhihua Zhao.

Figure 1
Figure 1. Figure 1: Motivation for our work. Off-road scenes impose severe view at source ↗
Figure 2
Figure 2. Figure 2: Overview of our Ground4D framework. Ground4D transforms view at source ↗
Figure 3
Figure 3. Figure 3: Reconstruction quality on input context frames of ORAD-3D dataset. We compare Ground4D against different view at source ↗
Figure 4
Figure 4. Figure 4: Novel-view synthesis consistency under off-road ego-motion. We show five consecutive synthesized frames for four view at source ↗
read the original abstract

Feedforward Gaussian Splatting has recently emerged as an efficient paradigm for 4D reconstruction in autonomous driving. However, in unstructured off-road scenes, its performance degrades due to high-frequency geometry, ego-motion jitter, and increased non-rigid dynamics. These factors introduce conflicting Gaussian observations across timestamps, leading to either over-smoothed renderings or structural artifacts. To address this issue, we propose Ground4D, a spatially-grounded 4D feedforward framework for pose-free off-road reconstruction. The key idea is to resolve temporal conflicts through spatially localized conditioning. Specifically, we introduce voxel-grounded temporal Gaussian aggregation, which partitions the canonical Gaussian space into spatial voxels and performs query-conditioned temporal attention within each voxel. Intra-voxel softmax normalization ensures that temporal selectivity and spatial occupancy become mutually reinforcing rather than conflicting. We furthermore introduce surface normal cues as auxiliary geometric guidance to regularize the geometry of Gaussian primitives. Extensive experiments on ORAD-3D and RELLIS-3D demonstrate that Ground4D consistently outperforms existing feedforward methods in reconstruction quality and generalizes zero-shot to unseen off-road domains. Project page and code:https://github.com/wsnbws/Ground4D.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript presents Ground4D, a spatially-grounded feedforward 4D Gaussian Splatting framework for pose-free reconstruction in unstructured off-road scenes. It introduces voxel-grounded temporal Gaussian aggregation that partitions canonical space into voxels and applies query-conditioned temporal attention within each voxel, combined with intra-voxel softmax normalization to resolve conflicting Gaussian observations arising from high-frequency geometry, ego-motion, and non-rigid dynamics. Surface normal cues are added as auxiliary guidance for geometry regularization. Experiments on ORAD-3D and RELLIS-3D are reported to show consistent outperformance over existing feedforward methods together with zero-shot generalization to unseen off-road domains.

Significance. If the quantitative results hold, the work would advance feedforward 4D reconstruction for challenging unstructured environments relevant to off-road autonomy and robotics. The explicit spatial localization of temporal attention and the release of code and project page are positive features that support reproducibility and allow direct testing of the proposed components.

major comments (1)
  1. [§3.2] §3.2 (voxel-grounded temporal aggregation and intra-voxel softmax): the claim that intra-voxel softmax makes temporal selectivity and spatial occupancy mutually reinforcing is load-bearing for the central contribution, yet the manuscript provides no targeted analysis or visualization demonstrating that the normalization avoids occupancy discontinuities or over-smoothing when Gaussians straddle voxel boundaries under non-rigid motion and high-frequency off-road geometry. This assumption directly affects whether the reported quality gains are attributable to the mechanism rather than trading one class of artifacts for another.
minor comments (1)
  1. The abstract states outperformance and zero-shot generalization but does not report any numerical metrics, error bars, or baseline comparisons; adding a concise quantitative highlight would improve readability without altering the technical content.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address the major comment point by point below and will revise the paper to incorporate additional analysis as suggested.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (voxel-grounded temporal aggregation and intra-voxel softmax): the claim that intra-voxel softmax makes temporal selectivity and spatial occupancy mutually reinforcing is load-bearing for the central contribution, yet the manuscript provides no targeted analysis or visualization demonstrating that the normalization avoids occupancy discontinuities or over-smoothing when Gaussians straddle voxel boundaries under non-rigid motion and high-frequency off-road geometry. This assumption directly affects whether the reported quality gains are attributable to the mechanism rather than trading one class of artifacts for another.

    Authors: We agree that the manuscript lacks targeted visualizations or ablation analysis specifically demonstrating the effect of intra-voxel softmax on occupancy discontinuities and over-smoothing at voxel boundaries, particularly under non-rigid motion and high-frequency geometry. The design rationale in §3.2 is that performing query-conditioned temporal attention and softmax normalization strictly within each voxel localizes the temporal selection, preventing global conflicts from propagating across space and thereby making selectivity and occupancy mutually reinforcing. However, without explicit boundary-focused visualizations (e.g., Gaussian occupancy heatmaps or before/after renderings in dynamic off-road sequences), it is difficult for readers to verify that the gains are not simply trading one artifact class for another. In the revised version we will add a dedicated analysis subsection (or expanded figure in §3.2 and §4) containing: (i) side-by-side occupancy maps with and without intra-voxel softmax on sequences exhibiting non-rigid dynamics, (ii) zoomed renderings highlighting voxel-boundary regions, and (iii) quantitative boundary-consistency metrics. This will directly substantiate the load-bearing claim. revision: yes

Circularity Check

0 steps flagged

No significant circularity; method is a novel architectural proposal with independent components

full rationale

The paper presents Ground4D as a new feedforward framework introducing voxel-grounded temporal Gaussian aggregation, intra-voxel softmax normalization, and surface normal cues as explicit, non-reductive mechanisms to address temporal conflicts in off-road scenes. No equations or claims reduce a 'prediction' or result to fitted inputs by construction, nor do self-citations bear the central load; the derivation chain consists of proposed architectural choices justified by problem analysis rather than tautological redefinitions or renamings. The reported outperformance on ORAD-3D/RELLIS-3D is positioned as empirical validation of the new components, not a forced outcome of prior fits. This is the common case of a self-contained technical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no details on parameters, axioms or entities; voxel partitioning presented as design choice.

pith-pipeline@v0.9.0 · 8546 in / 947 out tokens · 79041 ms · 2026-05-08T18:42:48.997760+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 14 canonical work pages

  1. [1]

    Jonathan T Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P Srinivasan. 2021. Mip-nerf: A multiscale repre- sentation for anti-aliasing neural radiance fields. InProceedings of the IEEE/CVF international conference on computer vision. 5855–5864

  2. [2]

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. 2022. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5470–5479

  3. [3]

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Pe- ter Hedman. 2023. Zip-nerf: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 19697– 19705

  4. [4]

    Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. 2025. Must3r: Multi-view network for stereo 3d reconstruction. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1050–1060

  5. [5]

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann

  6. [6]

    InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 19457–19467

  7. [7]

    Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. 2022. Tensorf: Tensorial radiance fields. InEuropean conference on computer vision. Springer, 333–350

  8. [8]

    Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. 2021. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. InProceedings of the IEEE/CVF international conference on computer vision. 14124–14133

  9. [9]

    Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, et al . 2025. DGGT: Feed- forward 4D Reconstruction of Dynamic Driving Scenes using Unposed Images. arXiv preprint arXiv:2512.03004(2025)

  10. [10]

    Yue Chen, Xingyu Chen, Anpei Chen, Gerard Pons-Moll, and Yuliang Xiu. 2025. Feat2gs: Probing visual foundation models with gaussian splatting. InProceedings of the Computer Vision and Pattern Recognition Conference. 6348–6361

  11. [11]

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. 2024. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean conference on computer vision. Springer, 370–386

  12. [12]

    Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo De Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al . 2024. Omnire: Omni urban scene reconstruction.arXiv preprint arXiv:2408.16760(2024)

  13. [13]

    Zequn Chen, Jiezhi Yang, and Heng Yang. 2024. Pref3r: Pose-free feed-forward 3d gaussian splatting from variable-length image sequence.arXiv preprint arXiv:2411.16877(2024)

  14. [14]

    Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. 2022. Plenoxels: Radiance fields without neural net- works. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5501–5510

  15. [15]

    Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao

  16. [16]

    InACM SIGGRAPH 2024 conference papers

    2d gaussian splatting for geometrically accurate radiance fields. InACM SIGGRAPH 2024 conference papers. 1–11

  17. [17]

    Nan Huang, Xiaobao Wei, Wenzhao Zheng, Pengju An, Ming Lu, Wei Zhan, Masayoshi Tomizuka, Kurt Keutzer, and Shanghang Zhang. 2024. S3Gaussian: Self-Supervised Street Gaussians for Autonomous Driving.arXiv preprint arXiv:2405.20323(2024)

  18. [18]

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. 2025. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG)44, 6 (2025), 1–16

  19. [19]

    Peng Jiang, Philip Osteen, Maggie Wigness, and Srikanth Saripalli. 2021. Rellis-3d dataset: Data, benchmarks and analysis. In2021 IEEE international conference on robotics and automation (ICRA). IEEE, 1110–1116

  20. [20]

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al

  21. [21]

    Graph.42, 4 (2023), 139–1

    3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph.42, 4 (2023), 139–1

  22. [22]

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. 2024. Grounding image matching in 3d with mast3r. InEuropean conference on computer vision. Springer, 71–91

  23. [23]

    Yang Liu, Chuanchen Luo, Zimo Tang, Junran Peng, and Zhaoxiang Zhang

  24. [24]

    VGGT-X: When VGGT Meets Dense Novel View Synthesis.arXiv preprint arXiv:2509.25191(2025)

  25. [25]

    Hao Lu, Tianshuo Xu, Wenzhao Zheng, Yunpeng Zhang, Wei Zhan, Dalong Du, Masayoshi Tomizuka, Kurt Keutzer, and Yingcong Chen. 2024. DrivingRecon: Large 4D Gaussian reconstruction model for autonomous driving.arXiv preprint arXiv:2412.09043(2024). xx, xx, xx wang et al

  26. [26]

    Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. 2024. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20654–20664

  27. [27]

    Xiaoyang Lyu, Yang-Tian Sun, Yi-Hua Huang, Xiuzhe Wu, Ziyi Yang, Yilun Chen, Jiangmiao Pang, and Xiaojuan Qi. 2024. 3dgsr: Implicit surface reconstruction with 3d gaussian splatting.ACM Transactions on Graphics (TOG)43, 6 (2024), 1–12

  28. [28]

    Ricardo Martin-Brualla, Noha Radwan, Mehdi SM Sajjadi, Jonathan T Barron, Alexey Dosovitskiy, and Daniel Duckworth. 2021. Nerf in the wild: Neural radiance fields for unconstrained photo collections. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 7210–7219

  29. [29]

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. 2021. Nerf: Representing scenes as neural radiance fields for view synthesis.Commun. ACM65, 1 (2021), 99–106

  30. [30]

    Chen Min, Jilin Mei, Heng Zhai, Shuai Wang, Tong Sun, Fanjie Kong, Haoyang Li, Fangyuan Mao, Fuyang Liu, Shuo Wang, et al . 2025. Advancing Off-Road Autonomous Driving: The Large-Scale ORAD-3D Dataset and Comprehensive Benchmarks.arXiv preprint arXiv:2510.16500(2025)

  31. [31]

    Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. 2022. In- stant neural graphics primitives with a multiresolution hash encoding.ACM transactions on graphics (TOG)41, 4 (2022), 1–15

  32. [32]

    Julian Ost, Fahim Mannan, Nils Thuerey, Julian Knodt, and Felix Heide. 2021. Neural scene graphs for dynamic scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2856–2865

  33. [33]

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. 2021. Vision transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision. 12179–12188

  34. [34]

    Konstantinos Rematas, Andrew Liu, Pratul P Srinivasan, Jonathan T Barron, Andrea Tagliasacchi, Thomas Funkhouser, and Vittorio Ferrari. 2022. Urban radiance fields. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 12932–12942

  35. [35]

    Johannes L Schonberger and Jan-Michael Frahm. 2016. Structure-from-motion revisited. InProceedings of the IEEE conference on computer vision and pattern recognition. 4104–4113

  36. [36]

    You Shen, Zhipeng Zhang, Yansong Qu, Xiawu Zheng, Jiayi Ji, Shengchuan Zhang, and Liujuan Cao. 2025. Fastvggt: Training-free acceleration of visual geometry transformer.arXiv preprint arXiv:2509.02560(2025)

  37. [37]

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. 2024. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912(2024)

  38. [38]

    Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Pradhan, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. 2022. Block- nerf: Scalable large scene neural view synthesis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 8248–8258

  39. [39]

    Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexan- der Schwing, and Zhicheng Yan. 2025. Mv-dust3r+: Single-stage scene recon- struction from sparse views in 2 seconds. InProceedings of the Computer Vision and Pattern Recognition Conference. 5283–5293

  40. [40]

    Qijian Tian, Xin Tan, Yuan Xie, and Lizhuang Ma. 2025. Drivingforward: Feed- forward 3d gaussian splatting for driving scene reconstruction from flexible surround-view input. InProceedings of the AAAI Conference on Artificial Intelli- gence, Vol. 39. 7374–7382

  41. [41]

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rup- precht, and David Novotny. 2025. Vggt: Visual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference. 5294– 5306

  42. [42]

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. 2021. Ibrnet: Learning multi-view image-based rendering. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4690–4699

  43. [43]

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. 2024. Dust3r: Geometric 3d vision made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20697–20709

  44. [44]

    Dongxu Wei, Zhiqi Li, and Peidong Liu. 2025. Omni-scene: Omni-gaussian representation for ego-centric sparse-view scene reconstruction. InProceedings of the Computer Vision and Pattern Recognition Conference. 22317–22327

  45. [45]

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 2024. 4d gaussian splatting for real-time dynamic scene rendering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20310–20320

  46. [46]

    Zirui Wu, Tianyu Liu, Liyi Luo, Zhide Zhong, Jianteng Chen, Hongmin Xiao, Chao Hou, Haozhe Lou, Yuantao Chen, Runyi Yang, et al. 2023. Mars: An instance- aware, modular and realistic simulator for autonomous driving. InCAAI Interna- tional Conference on Artificial Intelligence. Springer, 3–15

  47. [47]

    Haofei Xu, Songyou Peng, Fangjinhua Wang, Hermann Blum, Daniel Barath, Andreas Geiger, and Marc Pollefeys. 2025. Depthsplat: Connecting gaussian splatting and depth. InProceedings of the Computer Vision and Pattern Recognition Conference. 16453–16463

  48. [48]

    Yunzhi Yan, Haotong Lin, Chenxu Zhou, Weijie Wang, Haiyang Sun, Kun Zhan, Xianpeng Lang, Xiaowei Zhou, and Sida Peng. 2024. Street gaussians: Model- ing dynamic urban scenes with gaussian splatting. InEuropean Conference on Computer Vision. Springer, 156–173

  49. [49]

    Jiawei Yang, Jiahui Huang, Yuxiao Chen, Yan Wang, Boyi Li, Yurong You, Apoorva Sharma, Maximilian Igl, Peter Karkus, Danfei Xu, et al . 2024. Storm: Spatio- temporal reconstruction model for large-scale outdoor scenes.arXiv preprint arXiv:2501.00602(2024)

  50. [50]

    Jiawei Yang, Boris Ivanovic, Or Litany, Xinshuo Weng, Seung Wook Kim, Boyi Li, Tong Che, Danfei Xu, Sanja Fidler, Marco Pavone, et al . 2023. Emernerf: Emergent spatial-temporal scene decomposition via self-supervision.arXiv preprint arXiv:2311.02077(2023)

  51. [51]

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. 2024. Depth anything v2.Advances in Neural Information Processing Systems37 (2024), 21875–21911

  52. [52]

    Ze Yang, Yun Chen, Jingkang Wang, Sivabalan Manivasagam, Wei-Chiu Ma, Anqi Joyce Yang, and Raquel Urtasun. 2023. Unisim: A neural closed-loop sensor simulator. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 1389–1399

  53. [53]

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. 2024. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 20331–20341

  54. [54]

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. 2024. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images.arXiv preprint arXiv:2410.24207(2024)

  55. [55]

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. 2021. pixelnerf: Neural radiance fields from one or few images. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4578–4587

  56. [56]

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. 2024. Mip-splatting: Alias-free 3d gaussian splatting. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 19447–19456

  57. [57]

    Zehao Yu, Torsten Sattler, and Andreas Geiger. 2024. Gaussian opacity fields: Efficient adaptive surface reconstruction in unbounded scenes.ACM Transactions on Graphics (ToG)43, 6 (2024), 1–13

  58. [58]

    Bowei Zhang, Lei Ke, Adam W Harley, and Katerina Fragkiadaki. 2025. TAPIP3D: Tracking Any Point in Persistent 3D Geometry.arXiv preprint arXiv:2504.14717 (2025)

  59. [59]

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. 2024. Monst3r: A simple approach for estimating geometry in the presence of motion.arXiv preprint arXiv:2410.03825(2024)

  60. [60]

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang

  61. [61]

    InProceedings of the IEEE conference on computer vision and pattern recognition

    The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition. 586–595

  62. [62]

    Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, and Ming- Hsuan Yang. 2024. Drivinggaussian: Composite gaussian splatting for surround- ing dynamic autonomous driving scenes. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 21634–21643