
arxiv: 2605.22190 · v1 · pith:U2NANUR6new · submitted 2026-05-21 · 💻 cs.CV

No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

Pith reviewed 2026-05-22 06:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstruction · dynamic Gaussians · feed-forward · unposed multi-view · optical flow supervision · Gaussian splatting · velocity decomposition · pose-free

The pith

NoPo4D reconstructs dynamic 4D scenes from unposed multi-view videos in a single feed-forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NoPo4D as the first feed-forward system that jointly handles dynamic content, multi-view inputs, and unknown camera poses for 3D Gaussian reconstruction. It extends pretrained geometry backbones and 4D Gaussian frameworks with a velocity decomposition that splits motion into per-pixel image-plane shifts and depth changes. This split permits direct supervision of the 2D component from pseudo ground-truth optical flow, bypassing the pose accuracy demands of differentiable rendering and the 3D motion ground truth needed by earlier pose-free methods. A bidirectional motion encoder aggregates features across views and frames while view-dependent opacity reduces misalignments. The result is consistent outperformance of prior feed-forward baselines on four benchmarks and, with optional post-optimization, quality that surpasses per-scene optimization at far lower cost.

Core claim

NoPo4D is the first feed-forward system to jointly address dynamic content, multi-view input, and unknown camera poses. It does so through a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, which enables direct supervision from pseudo ground-truth optical flow. A bidirectional motion encoder handles cross-view and cross-frame aggregation, and view-dependent opacity mitigates misalignments.

What carries the argument

Velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts supervised by optical flow and separate depth changes.
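
One way to read this decomposition, as a minimal sketch rather than the paper's implementation: assume a pinhole camera with intrinsics fx, fy, cx, cy and work in the camera frame. A Gaussian seen at pixel (u, v) with depth d, a predicted image-plane shift (du, dv), and a predicted depth change ddepth then has a 3D displacement equal to the difference of two unprojections. The function names and the parametrization below are illustrative assumptions; this page does not give the paper's exact equations.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Lift a pixel with known depth to a 3D point in the camera frame (pinhole model)."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def displacement_from_decomposition(u: float, v: float, depth: float,
                                    du: float, dv: float, ddepth: float,
                                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Hypothetical helper: turn a 2D shift plus a depth change into a 3D displacement."""
    p0 = unproject(u, v, depth, fx, fy, cx, cy)                      # position at time t
    p1 = unproject(u + du, v + dv, depth + ddepth, fx, fy, cx, cy)   # position at time t+1
    return (p1[0] - p0[0], p1[1] - p0[1], p1[2] - p0[2])
```

In this reading only (du, dv) has a 2D counterpart that optical flow can supervise; ddepth moves the point along the viewing ray and is invisible to a single-view flow field.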

If this is right

  • Enables joint reconstruction of dynamics, multiple views, and unknown poses in one forward pass.
  • Outperforms existing feed-forward baselines on four multi-view dynamic benchmarks.
  • With optional post-optimization reaches or exceeds quality of per-scene optimization methods.
  • Runs orders of magnitude faster than per-scene optimization approaches.
  • Avoids any requirement for 3D motion ground truth by relying on 2D optical flow for the image-plane term.
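
The last bullet can be made concrete with a small sketch. Assuming the supervision is an average endpoint error between predicted image-plane shifts and pseudo ground-truth flow vectors (the paper's actual loss and any masking are not specified on this page), a plain-Python version could look like this. `flow_endpoint_loss` and the `valid` mask are hypothetical names.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec2 = Tuple[float, float]


def flow_endpoint_loss(pred_shifts: Sequence[Vec2],
                       pseudo_gt_flow: Sequence[Vec2],
                       valid: Optional[Sequence[bool]] = None) -> float:
    """Mean Euclidean distance between predicted 2D shifts and flow, over valid pixels."""
    if len(pred_shifts) != len(pseudo_gt_flow):
        raise ValueError("pred_shifts and pseudo_gt_flow must have the same length")
    if valid is None:
        valid = [True] * len(pred_shifts)
    errors: List[float] = [
        math.hypot(p[0] - f[0], p[1] - f[1])
        for p, f, keep in zip(pred_shifts, pseudo_gt_flow, valid)
        if keep
    ]
    # With no valid pixels there is nothing to supervise; return zero loss.
    return sum(errors) / len(errors) if errors else 0.0
```

The loss never touches camera poses or 3D motion labels, which is the property the bullet describes. The optional mask is one assumed way to skip pixels where the flow estimate is unreliable.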

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The decomposition may allow 2D optical-flow tools to bootstrap 3D dynamic modeling in other representations beyond Gaussians.
  • It could support real-time 4D capture pipelines where camera calibration is impractical or costly.
  • Future extensions might combine the same motion split with longer video sequences or additional priors for temporal consistency.
  • The approach suggests that many dynamic reconstruction tasks can be decoupled into 2D image-plane and depth components for simpler supervision.

Load-bearing premise

That pseudo ground-truth optical flow supplies sufficiently accurate supervision for the image-plane motion component without producing 3D inconsistencies that cannot be resolved later.

What would settle it

A multi-view dynamic benchmark in which estimated optical flow contains large errors, resulting in visibly broken 3D motion in the output Gaussians that post-optimization cannot repair.
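
To see why flow quality matters for such a test, note that under a pinhole model a flow error of e pixels at depth d maps to a lateral 3D error of d * e / f, where f is the focal length in pixels and the depth is held fixed. The sketch below only evaluates that relation; the function name and the numbers are illustrative and not taken from the paper.

```python
def lateral_error_from_flow_error(depth: float, flow_error_px: float, focal_px: float) -> float:
    """Lateral 3D error implied by a 2D flow error at a given depth (pinhole model, fixed depth)."""
    if focal_px <= 0:
        raise ValueError("focal_px must be positive")
    return depth * flow_error_px / focal_px


# Illustrative values only: a 4-pixel flow error seen with a 1000-pixel focal length.
for depth in (1.0, 5.0, 20.0):
    print(depth, lateral_error_from_flow_error(depth, 4.0, 1000.0))
```

The error grows linearly with depth, so distant fast-moving content is where a benchmark of this kind would be most likely to expose broken 3D motion.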

Figures

Figures reproduced from arXiv: 2605.22190 by Chenyangguang Zhang, Marc Pollefeys, Matteo Balice, Matteo Matteucci, Sungwhan Hong, Yanik Kunzi.

Figure 1: Architecture overview. Given C streams of time-synchronized video, DA3 [37] first extracts multi-view features through alternating within-view and cross-view attention layers. Pretrained, frozen depth and camera heads then recover per-frame geometry, which is unprojected into Gaussian means µ. Subsequently, two trainable heads decode the remaining attributes: a Gaussian head predicts static parameters (R,…
Figure 2: Qualitative comparison on ExoRecon [56] and Kubric [16]. 4.5 Ablation Study. We conduct ablations to validate each design choice on ExoRecon. Architectural components are isolated in Table 5a, auxiliary losses in Table 5b. Backbone fine-tuning strategies are analyzed in the supplementary material.
Figure 3: Qualitative comparison on N3DV [35] under extreme viewpoint changes. Architectural components. Table 5a isolates four architectural choices. Removing the bidirectional motion encoder M and feeding raw backbone tokens directly to the velocity DPT head (No motion branch) drops performance by 6.2 PSNR, confirming that explicit cross-frame feature aggregation is essential for predicting consistent motion. Repl…
Figure 4: Scalability Analysis. Multi-view input density vs. rendering quality (top row) and computational cost (bottom row). DGGT inference time and scaling curves indicate distinct trade-offs against baseline models.
Figure 5: Failure cases. (a) Cross-view misalignments and floating artifacts in fast-moving regions, where the motion encoder cannot fully resolve large inter-frame displacements. (b) Degradation under camera motion (ReCamMaster synthetic dataset [3]), where the static-rig assumption causes the per-camera pose averaging to collapse distinct viewpoints into an incorrect single pose. (c) Floating Gaussian artifacts in…
Original abstract

Recent feed-forward 3D gaussian splatting methods have made dramatic progress on individual aspects of 3D scene reconstruction, but no existing method jointly addresses dynamic content, multi-view input, and unknown camera poses in a single feed-forward pass. Methods that handle dynamics either require accurate camera poses or accept only monocular input; pose-free multi-view methods address only static scenes; and per-scene optimization methods bridge some of these gaps but at minutes-to-hours cost per scene. We introduce NoPo4D, the first feed-forward system that addresses this empty quadrant. Building on a pretrained geometry backbone and recent 4D Gaussian frameworks, NoPo4D introduces a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts and depth changes, allowing direct supervision from pseudo ground-truth optical flow on the 2D component. This sidesteps both the differentiable rendering that couples prior posed methods to pose accuracy and the 3D motion ground truth that prior pose-free methods require. The system is rounded out by a bidirectional motion encoder for cross-view and cross-frame feature aggregation, and view-dependent opacity that mitigates cross-view and cross-timestep Gaussian misalignments. On four multi-view dynamic benchmarks, NoPo4D consistently outperforms prior feed-forward baselines, and with an optional post-optimization stage surpasses per-scene optimization methods, while running orders of magnitude faster.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents NoPo4D, the first feed-forward method for joint dynamic 4D Gaussian reconstruction from unposed multi-view videos. It builds on a pretrained geometry backbone and 4D Gaussian frameworks by introducing a velocity decomposition that splits Gaussian motion into per-pixel image-plane shifts (directly supervised by pseudo ground-truth optical flow) and scalar depth changes. The approach is completed by a bidirectional motion encoder for cross-view and cross-frame aggregation and view-dependent opacity to mitigate misalignments. Claims include consistent outperformance of prior feed-forward baselines on four multi-view dynamic benchmarks and, with optional post-optimization, surpassing per-scene optimization methods at orders-of-magnitude lower runtime.

Significance. If the central claims hold, the work would fill an important gap by enabling fast, pose-free feed-forward reconstruction of dynamic multi-view scenes without requiring known camera poses, monocular input restrictions, or expensive per-scene optimization. The velocity decomposition and optical-flow supervision strategy is a notable technical contribution that avoids coupling to differentiable rendering or 3D motion ground truth.

major comments (1)
  1. [§3.2] §3.2 (Velocity Decomposition): The decomposition splits Gaussian velocity into 2D image-plane shifts plus scalar depth changes, with direct supervision applied only to the 2D component via optical flow. Because no differentiable rendering, multi-view geometric consistency loss, or 3D motion ground truth is used, the depth-velocity component receives no explicit 3D signal. In regimes with non-negligible out-of-plane motion or parallax, this leaves 4D trajectories under-constrained; the bidirectional encoder and view-dependent opacity must then carry all cross-view consistency, which is itself learned. This directly affects the feed-forward quality claims and the reported gains from optional post-optimization.
minor comments (2)
  1. [Abstract] The abstract states performance gains but provides no quantitative numbers, ablation details, or error analysis; these should be summarized with key metrics and dataset names for readers who stop at the abstract.
  2. [§3.2] Notation for the velocity components (e.g., the exact definition of the depth-change scalar and its integration into the 4D Gaussian trajectory) should be made fully explicit with an equation reference in §3.2.
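
The major comment's under-constraint argument can be illustrated with a toy pinhole example. This is an illustration under assumed intrinsics, not the paper's setup: two candidate motions that share the same image-plane shift but differ in depth change land on the same pixel, so a purely 2D flow loss computed in that view cannot tell them apart.

```python
from typing import Tuple


def unproject(u, v, depth, fx, fy, cx, cy) -> Tuple[float, float, float]:
    """Pixel plus depth to a 3D point in the camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def project(point, fx, fy, cx, cy) -> Tuple[float, float]:
    """3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)


# Hypothetical intrinsics and a pixel observed at depth 2.0.
FX = FY = 100.0
CX = CY = 50.0
u, v, depth = 80.0, 40.0, 2.0
du, dv = 5.0, -3.0  # the flow-supervised image-plane shift

# Two candidate end points: same 2D shift, different depth changes.
end_a = unproject(u + du, v + dv, depth + 0.0, FX, FY, CX, CY)
end_b = unproject(u + du, v + dv, depth + 1.5, FX, FY, CX, CY)

pix_a = project(end_a, FX, FY, CX, CY)
pix_b = project(end_b, FX, FY, CX, CY)
same_pixel = all(abs(a - b) < 1e-9 for a, b in zip(pix_a, pix_b))
gap_3d = sum((a - b) ** 2 for a, b in zip(end_a, end_b)) ** 0.5
print(same_pixel, gap_3d)
```

Seen from a second viewpoint, the two end points would project to different pixels, which is why cross-view information matters for the depth term.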

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the single major comment below, providing clarifications on the velocity decomposition design while noting targeted revisions to improve the discussion.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Velocity Decomposition): The decomposition splits Gaussian velocity into 2D image-plane shifts plus scalar depth changes, with direct supervision applied only to the 2D component via optical flow. Because no differentiable rendering, multi-view geometric consistency loss, or 3D motion ground truth is used, the depth-velocity component receives no explicit 3D signal. In regimes with non-negligible out-of-plane motion or parallax, this leaves 4D trajectories under-constrained; the bidirectional encoder and view-dependent opacity must then carry all cross-view consistency, which is itself learned. This directly affects the feed-forward quality claims and the reported gains from optional post-optimization.

    Authors: We thank the referee for this observation on the velocity decomposition. It is accurate that explicit supervision via optical flow is applied solely to the 2D image-plane shifts, and the scalar depth-velocity component lacks direct 3D ground truth, differentiable rendering losses, or an explicit multi-view geometric consistency term. We maintain, however, that the depth component is not left under-constrained in practice. The bidirectional motion encoder aggregates features across all input views and timesteps during both training and inference, enabling the network to learn joint representations that enforce cross-view and cross-frame consistency on the full 4D motion, including depth changes, from the multi-view video data itself. Because the model is trained end-to-end on posed multi-view dynamic sequences (even though poses are not provided at test time), the learned depth velocities are implicitly regularized by the requirement to produce coherent 4D Gaussians that can be rendered consistently from multiple viewpoints. The view-dependent opacity module further compensates for residual misalignments that may arise from out-of-plane motion. Ablation results in the manuscript show that disabling the bidirectional encoder leads to clear degradation on benchmarks containing substantial 3D dynamics and parallax, supporting its role in maintaining consistency. We have added a clarifying paragraph in the revised §3.2 that explicitly discusses how cross-view feature aggregation provides the necessary 3D signal without requiring differentiable rendering or 3D motion ground truth. This design choice directly supports the feed-forward claims by avoiding pose sensitivity; the empirical gains over baselines and the further improvement with optional post-optimization remain valid.

    Revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained with external supervision

Full rationale

The paper introduces a velocity decomposition splitting Gaussian motion into per-pixel image-plane shifts and depth changes, with direct supervision applied to the 2D component from pseudo ground-truth optical flow. This supervision is external to the model's own outputs rather than a fitted parameter renamed as a prediction. The bidirectional motion encoder and view-dependent opacity are presented as additional architectural components for cross-view aggregation and misalignment mitigation. No load-bearing self-citations, uniqueness theorems from prior author work, or ansatzes smuggled via citation are described. The central claims rest on these new components and pretrained geometry backbone, with performance evaluated on external benchmarks, keeping the derivation independent of its own fitted results by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on a pretrained geometry backbone and the effectiveness of pseudo ground-truth optical flow; no explicit free parameters or invented entities are detailed in the provided abstract.

axioms (1)
  • domain assumption: A pretrained geometry backbone supplies reliable features for initializing dynamic scene reconstruction.
    The method is described as building on a pretrained geometry backbone.

pith-pipeline@v0.9.0 · 5803 in / 1268 out tokens · 40228 ms · 2026-05-22T06:23:34.596336+00:00 · methodology


Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 10 internal anchors

  1. [1]

    Cross-View Completion Models are Zero-shot Correspondence Estimators, December 2024

    Honggyu An, Jinhyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, and Seungryong Kim. Cross-View Completion Models are Zero-shot Correspondence Estimators, December 2024. URL http://arxiv.org/abs/2412.09072. arXiv:2412.09072

  2. [2]

    C3G: Learning Compact 3D Representations with 2K Gaussians

    Honggyu An, Jaewoo Jung, Mungyeom Kim, Sunghwan Hong, Chaehyun Kim, Kazumi Fukuda, Minkyeong Jeon, Jisang Han, Takuya Narihira, Hyuna Ko, et al. C3g: Learning compact 3d representations with 2k gaussians. arXiv preprint arXiv:2512.04021, 2025

  3. [3]

    ReCamMaster: Camera-Controlled Generative Rendering from A Single Video, March 2025

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, and Di Zhang. ReCamMaster: Camera-Controlled Generative Rendering from A Single Video, March 2025. URL http://arxiv.org/abs/2503.11647. arXiv:2503.11647 version: 1

  4. [4]

    Immersive light field video with a layered mesh representation

    Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation. ACM Transactions on Graphics, 39(4), August 2020. ISSN 0730-0301, 1557-7368. doi: 10.1145/3386569.3392485. URL https://dl.acm.org/doi/10.1145/...

  5. [5]

    HexPlane: A Fast Representation for Dynamic Scenes

    Ang Cao and Justin Johnson. HexPlane: A Fast Representation for Dynamic Scenes, March. URL http://arxiv.org/abs/2301.09632. arXiv:2301.09632

  7. [7]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction, 2024

    David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction, December 2023. URL http://arxiv.org/abs/2312.12337. arXiv:2312.12337

  8. [8]

    Feedforward 4d reconstruction for dynamic driving scenes using unposed images

    Xiaoxue Chen, Ziyi Xiong, Yuantao Chen, Gen Li, Nan Wang, Hongcheng Luo, Long Chen, Haiyang Sun, Bing Wang, Guang Chen, et al. Feedforward 4d reconstruction for dynamic driving scenes using unposed images

  9. [9]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images, 2024

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. MVSplat: Efficient 3D Gaussian Splatting from Sparse Multi-View Images, March 2024. URL http://arxiv.org/abs/2403.14627. arXiv:2403.14627

  10. [10]

    Pref3r: Pose-free feed-forward 3d gaussian splatting from variable-length image sequence. arXiv preprint arXiv:2411.16877, 2024

    Zequn Chen, Jiezhi Yang, and Heng Yang. Pref3r: Pose-free feed-forward 3d gaussian splatting from variable-length image sequence. arXiv preprint arXiv:2411.16877, 2024

  11. [11]

    Cats: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems, 34:9011–9023, 2021

    Seokju Cho, Sunghwan Hong, Sangryul Jeon, Yunsung Lee, Kwanghoon Sohn, and Seungryong Kim. Cats: Cost aggregation transformers for visual correspondence. Advances in Neural Information Processing Systems, 34:9011–9023, 2021

  12. [12]

    Cats++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2022

    Seokju Cho, Sunghwan Hong, and Seungryong Kim. Cats++: Boosting cost aggregation with convolutions and transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2022

  13. [13]

    Cat-seg: Cost aggregation for open-vocabulary semantic segmentation

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat-seg: Cost aggregation for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113–4123, 2024

  14. [14]

    4D-Rotor Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes, July 2024

    Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4D-Rotor Gaussian Splatting: Towards Efficient Novel View Synthesis for Dynamic Scenes, July 2024. URL http://arxiv.org/abs/2402.03307. arXiv:2402.03307 [cs]

  15. [15]

    K-Planes: Explicit Radiance Fields in Space, Time, and Appearance

    Sara Fridovich-Keil, Giacomo Meanti, Frederik Warburg, Benjamin Recht, and Angjoo Kanazawa. K-Planes: Explicit Radiance Fields in Space, Time, and Appearance, March. URL http://arxiv.org/abs/2301.10241. arXiv:2301.10241

  17. [17]

    Kristen Grauman, Andrew Westbury, Lorenzo Torresani, Kris Kitani, Jitendra Malik, Triantafyllos Afouras, Kumar Ashutosh, Vijay Baiyya, Siddhant Bansal, Bikram Boote, Eugene Byrne, Zach Chavis, Joya Chen, Feng Cheng, Fu-Jen Chu, Sean Crane, Avijit Dasgupta, Jing Dong, Maria Escobar, Cristhian Forigua, Abrham Gebreselasie, Sanjay Haresh, Jing Huang, Md Mo...

  18. [18]

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, Thomas Kipf, Abhijit Kundu, Dmitry Lagun, Issam Laradji, Hsueh-Ti, Liu, Henning Meyer, Yishu Miao, Derek Nowrouzezahrai, Cengiz Oztireli, Etienne Pot, Noha Radwan, Daniel Rebain, Sara Sabour, Mehdi S. ...

  19. [19]

    Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

    Marcel Gröpl, Jaewoo Jung, Seungryong Kim, Marc Pollefeys, and Sunghwan Hong. Entropy-gradient grounding: Training-free evidence retrieval in vision-language models. arXiv preprint arXiv:2604.08456, 2026

  20. [20]

    ScreenParse: Moving Beyond Sparse Grounding with Complete Screen Parsing Supervision

    A Said Gurbuz, Sunghwan Hong, Ahmed Nassar, Marc Pollefeys, and Peter Staar. Moving beyond sparse grounding with complete screen parsing supervision. arXiv preprint arXiv:2602.14276, 2026

  21. [21]

    D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes, April 2025

    Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, and Seungryong Kim. D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes, April 2025. URL http://arxiv.org/abs/2504.06264. arXiv:2504.06264

  22. [22]

    Emergent outlier view rejection in visual geometry grounded transformers. arXiv preprint arXiv:2512.04012, 2025

    Jisang Han, Sunghwan Hong, Jaewoo Jung, Wooseok Jang, Honggyu An, Qianqian Wang, Seungryong Kim, and Chen Feng. Emergent outlier view rejection in visual geometry grounded transformers. arXiv preprint arXiv:2512.04012, 2025

  23. [23]

    Deep matching prior: Test-time optimization for dense correspondence

    Sunghwan Hong and Seungryong Kim. Deep matching prior: Test-time optimization for dense correspondence. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9907–9917, 2021

  24. [24]

    Cost aggregation with 4d convolutional swin transformer for few-shot segmentation

    Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, and Seungryong Kim. Cost aggregation with 4d convolutional swin transformer for few-shot segmentation. In European Conference on Computer Vision, pages 108–126. Springer, 2022

  25. [25]

    Neural matching fields: Implicit representation of matching fields for visual correspondence. Advances in Neural Information Processing Systems, 35:13512–13526, 2022

    Sunghwan Hong, Jisu Nam, Seokju Cho, Susung Hong, Sangryul Jeon, Dongbo Min, and Seungryong Kim. Neural matching fields: Implicit representation of matching fields for visual correspondence. Advances in Neural Information Processing Systems, 35:13512–13526, 2022

  26. [26]

    Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025

    Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting, October 2024. URL http://arxiv.org/abs/2410.22128. arXiv:2410.22128

  27. [27]

    Unifying correspondence pose and nerf for generalized pose-free novel view synthesis

    Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jiaolong Yang, Seungryong Kim, and Chong Luo. Unifying correspondence pose and nerf for generalized pose-free novel view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20196–20206, 2024

  28. [28]

    Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes

    Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Sc-gs: Sparse-controlled gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4220–4230, 2024

  29. [29]

    Ufo-4d: Unposed feedforward 4d reconstruction from two images. arXiv preprint arXiv:2602.24290, 2026

    Junhwa Hur, Charles Herrmann, Songyou Peng, Philipp Henzler, Zeyu Ma, Todd Zickler, and Deqing Sun. Ufo-4d: Unposed feedforward 4d reconstruction from two images. arXiv preprint arXiv:2602.24290, 2026

  30. [30]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, Dahua Lin, and Bo Dai. AnySplat: Feed-forward 3D Gaussian Splatting from Unconstrained Views, May 2025. URL http://arxiv.org/abs/2505.23716. arXiv:2505.23716

  31. [31]

    3D Gaussian Splatting for Real-Time Radiance Field Rendering, August 2023

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian Splatting for Real-Time Radiance Field Rendering, August 2023. URL http://arxiv.org/abs/2308.04079. arXiv:2308.04079

  32. [32]

    Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers. arXiv preprint arXiv:2509.18096, 2025

    Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, and Seungryong Kim. Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers. arXiv preprint arXiv:2509.18096, 2025

  33. [33]

    3d scene prompting for scene-consistent camera-controllable video generation. arXiv preprint arXiv:2510.14945, 2025

    JoungBin Lee, Jaewoo Jung, Jisang Han, Takuya Narihira, Kazumi Fukuda, Junyoung Seo, Sunghwan Hong, Yuki Mitsufuji, and Seungryong Kim. 3d scene prompting for scene-consistent camera-controllable video generation. arXiv preprint arXiv:2510.14945, 2025

  34. [34]

    TORA: Topological Representation Alignment for 3D Shape Assembly

    Nahyuk Lee, Zhiang Chen, Marc Pollefeys, and Sunghwan Hong. Tora: Topological representation alignment for 3d shape assembly. arXiv preprint arXiv:2604.04050, 2026

  35. [35]

    MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds, November 2024

    Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. MoSca: Dynamic Gaussian Fusion from Casual Videos via 4D Motion Scaffolds, November 2024. URL http://arxiv.org/abs/2405.17421. arXiv:2405.17421

  36. [36]

    Grounding image matching in 3d with mast3r, 2024

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding Image Matching in 3D with MASt3R, June 2024. URL http://arxiv.org/abs/2406.09756. arXiv:2406.09756 [cs]

  37. [37]

    Neural 3d video synthesis from multi-view video

    Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5521–5531, 2022

  38. [38]

    Movies: Motion-aware 4d dynamic view synthesis in one second

    Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Tao Hu, Honglei Yan, Katerina Fragkiadaki, and Yadong Mu. Movies: Motion-aware 4d dynamic view synthesis in one second. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2026

  39. [39]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the Visual Space from Any Views, November. URL http://arxiv.org/abs/2511.10647. arXiv:2511.10647

  41. [41]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization, November 2017. URL https://arxiv.org/abs/1711.05101v3

  42. [42]

    Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis, August 2023

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis, August 2023. URL http://arxiv.org/abs/2308.09713. arXiv:2308.09713 [cs]

  43. [43]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

  44. [44]

    Nerfies: Deformable Neural Radiance Fields

    Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable Neural Radiance Fields, September. URL http://arxiv.org/abs/2011.12948. arXiv:2011.12948

  46. [46]

    D-nerf: Neural radiance fields for dynamic scenes. arXiv preprint arXiv:2011.13961, 2020

    Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural Radiance Fields for Dynamic Scenes, November 2020. URL http://arxiv.org/abs/2011.13961. arXiv:2011.13961

  47. [47]

    Vision Transformers for Dense Prediction, March 2021

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision Transformers for Dense Prediction, March 2021. URL http://arxiv.org/abs/2103.13413. arXiv:2103.13413

  48. [48]

    L4GM: Large 4D Gaussian Reconstruction Model, June 2024

    Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, and Huan Ling. L4GM: Large 4D Gaussian Reconstruction Model, June 2024. URL http://arxiv.org/abs/2406.10324. arXiv:2406.10324

  49. [49]

    Towards open-vocabulary semantic segmentation without semantic labels. Advances in Neural Information Processing Systems, 37:9153–9177, 2024

    Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Towards open-vocabulary semantic segmentation without semantic labels. Advances in Neural Information Processing Systems, 37:9153–9177, 2024

  50. [50]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs, August 2024. URL http://arxiv.org/abs/2408.13912. arXiv:2408.13912

  51. [51]

    Dynamic gaussian marbles for novel view synthesis of casual monocular videos

    Colton Stearns, Adam Harley, Mikaela Uy, Florian Dubost, Federico Tombari, Gordon Wetzstein, and Leonidas Guibas. Dynamic gaussian marbles for novel view synthesis of casual monocular videos. In SIGGRAPH Asia 2024 Conference Papers, pages 1–11, 2024

  52. [52]

    Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video, August 2021

    Edgar Tretschk, Ayush Tewari, Vladislav Golyanik, Michael Zollhöfer, Christoph Lassner, and Christian Theobalt. Non-Rigid Neural Radiance Fields: Reconstruction and Novel View Synthesis of a Dynamic Scene From Monocular Video, August 2021. URL http://arxiv.org/abs/2012.12247. arXiv:2012.12247

  53. [53]

    3d reconstruction with spatial memory

    Hengyi Wang and Lourdes Agapito. 3d reconstruction with spatial memory. In 2025 International Conference on 3D Vision (3DV), pages 78–89. IEEE, 2025

  54. [54]

    Vggt: Visual geometry grounded transformer, 2025

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual Geometry Grounded Transformer, March 2025. URL http://arxiv.org/abs/2503.11651. arXiv:2503.11651

  55. [55]

    Shape of motion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of Motion: 4D Reconstruction from a Single Video, July 2024. URL http://arxiv.org/abs/2407.13764. arXiv:2407.13764 [cs]

  56. [56]

    Shape of motion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Weijia Zeng, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. In International Conference on Computer Vision (ICCV), 2025

  57. [57]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10510–10522, 2025

  58. [58]

    Dust3r: Geometric 3d vision made easy, 2024

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D Vision Made Easy, December 2023. URL http://arxiv.org/abs/2312.14132. arXiv:2312.14132

  59. [59]

    SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow, May 2024

    Yihan Wang, Lahav Lipson, and Jia Deng. SEA-RAFT: Simple, Efficient, Accurate RAFT for Optical Flow, May 2024. URL http://arxiv.org/abs/2405.14793. arXiv:2405.14793

  60. [60]

    MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion, July 2025

    Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, and Deva Ramanan. MonoFusion: Sparse-View 4D Reconstruction via Monocular Fusion, July 2025. URL http://arxiv.org/abs/2507.23782. arXiv:2507.23782 [cs]

  61. [61]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering, July 2024. URL http://arxiv.org/abs/2310.08528. arXiv:2310.08528 [cs]

  62. [62]

    4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos, June 2025

    Zhen Xu, Zhengqin Li, Zhao Dong, Xiaowei Zhou, Richard Newcombe, and Zhaoyang Lv. 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos, June 2025. URL http://arxiv.org/abs/2506.08015. arXiv:2506.08015 [cs]

  63. [63]

    Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

    Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass, January 2025. URL http://arxiv.org/abs/2501.13928. arXiv:2501.13928

  64. [64]

    Depth anything: Unleashing the power of large-scale unlabeled data

    Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data, April 2024. URL http://arxiv.org/abs/2401.10891. arXiv:2401.10891

  65. [65]

    Depth Anything V2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2, June 2024. URL http://arxiv.org/abs/2406.09414. arXiv:2406.09414

  66. [66]

    NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos, January 2026

    Yuxue Yang, Lue Fan, Ziqi Shi, Junran Peng, Feng Wang, and Zhaoxiang Zhang. NeoVerse: Enhancing 4D World Model with in-the-wild Monocular Videos, January 2026. URL http://arxiv.org/abs/2601.00393. arXiv:2601.00393

  67. [67]

    Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting, February 2024

    Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting, February 2024. URL http://arxiv.org/abs/2310.10642. arXiv:2310.10642 [cs]

  68. [68]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101, 2023

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction, November 2023. URL http://arxiv.org/abs/2309.13101. arXiv:2309.13101 [cs]

  69. [69]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images, October 2024. URL http://arxiv.org/abs/2410.24207. arXiv:2410.24207

  70. [70]

    Yonosplat: You only need one model for feedforward 3d gaussian splatting, 2025

    Botao Ye, Boqi Chen, Haofei Xu, Daniel Barath, and Marc Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting. arXiv preprint arXiv:2511.07321, 2025

  71. [71]

    Visual representation alignment for multimodal large language models. arXiv preprint arXiv:2509.07979, 2025

    Heeji Yoon, Jaewoo Jung, Junwan Kim, Hyungyu Choi, Heeseong Shin, Sangbeom Lim, Honggyu An, Chaehyun Kim, Jisang Han, Donghyun Kim, et al. Visual representation alignment for multimodal large language models. arXiv preprint arXiv:2509.07979, 2025

  72. [72]

    Litept: Lighter yet stronger point transformer. arXiv preprint arXiv:2512.13689, 2025

    Yuanwen Yue, Damien Robert, Jianyuan Wang, Sunghwan Hong, Jan Dirk Wegner, Christian Rupprecht, and Konrad Schindler. Litept: Lighter yet stronger point transformer. arXiv preprint arXiv:2512.13689, 2025

  73. [73]

    MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion, October 2024. URL http://arxiv.org/abs/2410.03825. arXiv:2410.03825

  74. [74]

    Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views, 2026

    Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views, February 2025. URL http://arxiv.org/abs/2502.12138. arXiv:2502.12138

Supplementary Material

This document provides additional analysis ...