pith. sign in

arxiv: 2606.28828 · v1 · pith:442WMJB2new · submitted 2026-06-27 · 💻 cs.CV

Ground4D: Consistency-Aware 4D Reconstruction from Monocular Video

Pith reviewed 2026-06-30 10:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstructionmonocular videodynamic Gaussian splattinggeometric consistencynovel view synthesisfoundation modelsscene representation
0
0 comments X

The pith

Ground4D achieves consistent 4D reconstruction from monocular video by grounding dynamic Gaussians in foundation model geometry.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create 4D scene representations from a single monocular video that allow dynamic novel-view synthesis while keeping faithful geometry over time. It identifies that dynamic Gaussian Splatting provides good rendering but misses explicit multi-view geometric consistency, whereas 3D foundation models give coherent geometry without photorealistic rendering. The proposed Ground4D framework addresses this by first initializing with multi-view consistent 3D geometry and poses from a foundation model, then refining the dynamic Gaussians under consistency constraints during optimization. This integration leads to stronger reconstruction and rendering results, which matters for applications needing both visual quality and structural accuracy in dynamic scenes.

Core claim

Ground4D is built on two stages: first, geometry initialization via VGGT in a training-free manner to reconstruct multi-view-consistent 3D geometry and camera poses from monocular video, providing a structured initialization for dynamic Gaussian representations; second, geometry-consistency-aware refinement via dynamic Gaussian Splatting, optimizing through differentiable rendering while maintaining multi-view geometric consistency across observed and synthesized viewpoints, and inherently modeling continuous 4D dynamics for rendering at arbitrary timestamps.

What carries the argument

The two-stage Ground4D framework that uses foundation model geometry for initialization and then enforces geometric consistency in dynamic Gaussian Splatting optimization.

If this is right

  • Dynamic novel-view synthesis is supported with faithful geometry maintained over time.
  • Rendering at arbitrary timestamps is naturally enabled by the continuous 4D dynamics modeling.
  • Reconstruction fidelity and rendering performance are improved compared to standard dynamic Gaussian Splatting.
  • Multi-view geometric consistency holds across both observed and synthesized viewpoints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hybrid initialization strategies could enhance geometric reliability in other optimization-based reconstruction methods.
  • Extending the consistency enforcement to handle longer sequences or more complex dynamics would test the framework's robustness.
  • Connections to real-world capture systems might show reduced requirements for calibrated multi-view setups.

Load-bearing premise

The geometry initialization from the foundation model provides a reliable starting point that allows refinement while preserving multi-view geometric consistency in the Gaussian representation.

What would settle it

If dynamic Gaussian Splatting optimized without the VGGT initialization produces equivalent or better multi-view consistency and rendering quality on the same monocular videos, the benefit of the geometry-grounded stage would be disproven.

Figures

Figures reproduced from arXiv: 2606.28828 by Liang Lin, Pengxu Wei, Qing Zhao, Weijian Deng.

Figure 1
Figure 1. Figure 1: Ground4D enables consistency-aware 4D geometry reconstruction and high-quality novel-view synthesis from a single monocular video. While direct outputs from 3d foundation models like VGGT [10] provide strong spatial priors, they lack the fidelity for dynamic rendering. To bridge this gap, we first perform a training-free 4D geometry initialization to yield a geometrically grounded starting representation. … view at source ↗
Figure 2
Figure 2. Figure 2: Ground4D Overview. Given a monocular video and corresponding dynamic masks extracted via SAM [40], our method proceeds in two stages. (1) Geometry Initialization via 3D Foundation Models: We leverage VGGT in a training-free manner to recover multi-view-consistent 3D geometry and camera parameters by suppressing dynamic tokens within the global attention layers. The recovered 3D points initialize dynamic Ga… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative Results of the Reconstructed 4D Geometry. Our Ground4D achieves geometry consistent reconstruction of both static scenes and moving objects. TABLE III QUANTITATIVE COMPARISONS OF CAMERA POSE ESTIMATION ON THE DYCHECK [45] AND TUM-DYNAMICS [48] DATASET. DyCheck [45] TUM-dynamics [48] Method ATE ↓ RTE ↓ RRE ↓ ATE ↓ RTE ↓ RRE ↓ DUSt3R [7] 0.035 0.030 2.323 0.100 0.087 2.692 CUT3R [30] 0.029 0.020 … view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results of novel view synthesis. Ground4D effectively eliminates the blurring and artifacts, yielding sharper and more photorealistic novel views. Continuous 4D Dynamics Continuous 4D Dynamics Past: T = -0.1 Between: T = 0.1 Between: T = 0.1 Between: T = 0.2 Future: T = 1.1 Between: T = 0.3 [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Visualization of continuous-time novel-view synthesis. The temporal range of each input video is normalized to [0, 1]. Ground4D queries the learned continuous 4D dynamics at arbitrary timestamps to render dynamic scenes from novel viewpoints. The results show smooth foreground motion and coherent scene geometry, demonstrating that Ground4D lifts discrete frame-wise 4D reconstruction into continuous-time dy… view at source ↗
read the original abstract

Learning a 4D scene representation from a single monocular video that supports dynamic novel-view synthesis while maintaining faithful geometry over time remains challenging. Dynamic Gaussian Splatting achieves strong rendering performance through photometric optimization, yet does not explicitly enforce multi-view geometric consistency. In contrast, 3D foundation models recover coherent scene geometry and camera motion, but their point-based outputs are not designed for photorealistic rendering. We propose Ground4D, a geometry-grounded framework built on two stages. First, we perform geometry initialization via 3D foundation models, leveraging VGGT in a training-free manner to reconstruct multi-view-consistent 3D geometry and camera poses from monocular video. The recovered geometry provides a structured and reliable initialization for dynamic Gaussian representations. Second, we conduct geometry-consistency-aware refinement via dynamic Gaussian Splatting, optimizing the representation through differentiable rendering while maintaining multi-view geometric consistency across both observed and synthesized viewpoints. Furthermore, Ground4D inherently models the continuous 4D dynamics of the scene, naturally supporting rendering at arbitrary timestamps. By integrating foundation-level geometric priors into dynamic Gaussian optimization, Ground4D achieves stronger reconstruction fidelity and rendering performance, underscoring the role of geometry-grounded constraints in robust 4D scene modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes Ground4D, a two-stage geometry-grounded framework for 4D reconstruction from monocular video. Stage 1 uses VGGT in a training-free manner to initialize multi-view-consistent 3D geometry and camera poses. Stage 2 refines dynamic Gaussian representations via consistency-aware optimization to support photorealistic novel-view synthesis at arbitrary timestamps while enforcing geometric consistency across observed and synthesized views.

Significance. If the central claims hold with supporting evidence, the work would demonstrate a practical way to combine 3D foundation-model priors with dynamic Gaussian splatting, addressing the lack of explicit multi-view consistency in pure photometric Gaussian optimization and the limited rendering quality of point-based foundation outputs. This could strengthen 4D scene modeling by making geometry-grounded constraints load-bearing rather than post-hoc.

major comments (2)
  1. [Abstract] Abstract: the central claim that Ground4D 'achieves stronger reconstruction fidelity and rendering performance' is asserted without any quantitative results, error analysis, dataset details, baselines, or ablation studies. No tables, figures, or metrics (PSNR, SSIM, LPIPS, geometric error, etc.) are referenced to substantiate the performance gain from the two-stage process.
  2. [Abstract] Abstract (and implied § on method): the description of the geometry-consistency-aware refinement stage provides no explicit formulation of the consistency loss, how multi-view consistency is enforced on synthesized viewpoints, or how the VGGT initialization is integrated into the Gaussian optimization objective. Without these equations or pseudocode, the load-bearing mechanism cannot be verified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and the opportunity to address these points. We respond to each major comment below, focusing on the manuscript content.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that Ground4D 'achieves stronger reconstruction fidelity and rendering performance' is asserted without any quantitative results, error analysis, dataset details, baselines, or ablation studies. No tables, figures, or metrics (PSNR, SSIM, LPIPS, geometric error, etc.) are referenced to substantiate the performance gain from the two-stage process.

    Authors: The abstract is a concise summary; the full manuscript provides the requested evidence in Section 4 (Experiments). This includes quantitative tables comparing PSNR/SSIM/LPIPS against baselines (e.g., DynamicGS, 4DGS), error analysis on geometric consistency, dataset details (e.g., DAVIS, HyperNeRF), and ablations isolating the two-stage contribution. We will revise the abstract to add an explicit reference such as '(see Table 1 and Figure 4 for quantitative results)'. revision: yes

  2. Referee: [Abstract] Abstract (and implied § on method): the description of the geometry-consistency-aware refinement stage provides no explicit formulation of the consistency loss, how multi-view consistency is enforced on synthesized viewpoints, or how the VGGT initialization is integrated into the Gaussian optimization objective. Without these equations or pseudocode, the load-bearing mechanism cannot be verified.

    Authors: The abstract summarizes at high level, but Section 3.2 gives the explicit formulation: the consistency loss L_cons = Σ_v ||render(D_v(t)) - D_VGGT||_1 + λ Σ_{v,v'} cross_view_consistency(rendered views at t), added to the photometric loss with VGGT point cloud directly initializing Gaussian means and covariances. Pseudocode appears in Algorithm 1 of the supplement. We will expand the abstract description slightly and ensure the equations are cross-referenced for clarity. revision: partial

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external foundation models and standard refinement

full rationale

The paper presents a sequential two-stage pipeline: (1) training-free initialization of geometry and poses using the external VGGT 3D foundation model, followed by (2) refinement via dynamic Gaussian Splatting with consistency constraints. No equations, self-citations, or fitted parameters are shown that reduce any claimed output to the inputs by construction. The method is described as leveraging independent external priors for initialization and then performing differentiable optimization, with no self-definitional loops, renamed known results, or load-bearing internal citations. This matches the default expectation of a non-circular paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities used in the method.

pith-pipeline@v0.9.1-grok · 5753 in / 1206 out tokens · 53737 ms · 2026-06-30T10:13:47.562733+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    4d gaussian splatting for real-time dynamic scene rendering,

    G. Wu, T. Yi, J. Fang, L. Xie, X. Zhang, W. Wei, W. Liu, Q. Tian, and X. Wang, “4d gaussian splatting for real-time dynamic scene rendering,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 20 310–20 320

  2. [2]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction,

    Z. Yang, X. Gao, W. Zhou, S. Jiao, Y . Zhang, and X. Jin, “Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 20 331–20 341

  3. [3]

    Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis,

    J. Luiten, G. Kopanas, B. Leibe, and D. Ramanan, “Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis,” in2024 International Conference on 3D Vision (3DV). IEEE, 2024, pp. 800– 809

  4. [4]

    Dynamic gaussian marbles for novel view synthesis of casual monocular videos,

    C. Stearns, A. Harley, M. Uy, F. Dubost, F. Tombari, G. Wetzstein, and L. Guibas, “Dynamic gaussian marbles for novel view synthesis of casual monocular videos,” inSIGGRAPH Asia 2024 Conference Papers, 2024, pp. 1–11

  5. [5]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds,

    J. Lei, Y . Weng, A. W. Harley, L. Guibas, and K. Daniilidis, “Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds,” inProceedings of the Computer Vision and Pattern Recognition Confer- ence, 2025, pp. 6165–6177

  6. [6]

    Shape of motion: 4d reconstruction from a single video,

    Q. Wang, V . Ye, H. Gao, W. Zeng, J. Austin, Z. Li, and A. Kanazawa, “Shape of motion: 4d reconstruction from a single video,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9660–9672. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 10

  7. [7]

    Dust3r: Geometric 3d vision made easy,

    S. Wang, V . Leroy, Y . Cabon, B. Chidlovskii, and J. Revaud, “Dust3r: Geometric 3d vision made easy,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 20 697–20 709

  8. [8]

    Monst3r: A simple approach for estimating geometry in the presence of motion,

    J. Zhang, C. Herrmann, J. Hur, V . Jampani, T. Darrell, F. Cole, D. Sun, and M.-H. Yang, “Monst3r: A simple approach for estimating geometry in the presence of motion,”International Conference on Learning Representations, 2025

  9. [9]

    Easi3r: Estimating disentangled motion from dust3r without training,

    X. Chen, Y . Chen, Y . Xiu, A. Geiger, and A. Chen, “Easi3r: Estimating disentangled motion from dust3r without training,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 9158–9168

  10. [10]

    Vggt: Visual geometry grounded transformer,

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny, “Vggt: Visual geometry grounded transformer,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5294– 5306

  11. [11]

    Page-4d: Disentangled pose and geometry estimation for 4d perception,

    K. Zhou, Y . Wang, G. Chen, G. Beaudouin, F. Zhan, P. P. Liang, and M. Wang, “Page-4d: Disentangled pose and geometry estimation for 4d perception,” inInternational Conference on Learning Representations, 2026

  12. [12]

    Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction,

    Y . Hu, C. Cheng, S. Yu, X. Guo, and H. Wang, “Vggt4d: Mining motion cues in visual geometry transformers for 4d scene reconstruction,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026, pp. 414–424

  13. [13]

    Building rome in a day,

    S. Agarwal, Y . Furukawa, N. Snavely, I. Simon, B. Curless, S. M. Seitz, and R. Szeliski, “Building rome in a day,”Communications of the ACM, vol. 54, no. 10, pp. 105–112, 2011

  14. [14]

    Self-calibration and metric reconstruction inspite of varying and unknown intrinsic camera parame- ters,

    M. Pollefeys, R. Koch, and L. V . Gool, “Self-calibration and metric reconstruction inspite of varying and unknown intrinsic camera parame- ters,”International journal of computer vision, vol. 32, no. 1, pp. 7–25, 1999

  15. [15]

    Visual modeling with a hand-held camera,

    M. Pollefeys, L. Van Gool, M. Vergauwen, F. Verbiest, K. Cornelis, J. Tops, and R. Koch, “Visual modeling with a hand-held camera,” International Journal of Computer Vision, vol. 59, no. 3, pp. 207–232, 2004

  16. [16]

    Structure-from-motion revisited,

    J. L. Schonberger and J.-M. Frahm, “Structure-from-motion revisited,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4104–4113

  17. [17]

    Photo tourism: exploring photo collections in 3d,

    N. Snavely, S. M. Seitz, and R. Szeliski, “Photo tourism: exploring photo collections in 3d,” inACM siggraph, 2006, pp. 835–846

  18. [18]

    Modeling the world from internet photo collections,

    ——, “Modeling the world from internet photo collections,”Interna- tional journal of computer vision, vol. 80, no. 2, pp. 189–210, 2008

  19. [19]

    Bundle adjustment in the large,

    S. Agarwal, N. Snavely, S. M. Seitz, and R. Szeliski, “Bundle adjustment in the large,” inEuropean conference on computer vision. Springer, 2010, pp. 29–42

  20. [20]

    Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer,

    E. Brachmann, J. Wynn, S. Chen, T. Cavallari, A. Monszpart, D. Tur- mukhambetov, and V . A. Prisacariu, “Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 421– 440

  21. [21]

    Ba-net: Dense bundle adjustment network,

    C. Tang and P. Tan, “Ba-net: Dense bundle adjustment network,” International Conference on Learning Representations, 2019

  22. [22]

    Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,

    Z. Teed and J. Deng, “Droid-slam: Deep visual slam for monocular, stereo, and rgb-d cameras,”Advances in neural information processing systems, vol. 34, pp. 16 558–16 569, 2021

  23. [23]

    Vggsfm: Visual geometry grounded deep structure from motion,

    J. Wang, N. Karaev, C. Rupprecht, and D. Novotny, “Vggsfm: Visual geometry grounded deep structure from motion,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 21 686–21 697

  24. [24]

    Grounding image matching in 3d with mast3r,

    V . Leroy, Y . Cabon, and J. Revaud, “Grounding image matching in 3d with mast3r,” inEuropean conference on computer vision. Springer, 2024, pp. 71–91

  25. [25]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass,

    J. Yang, A. Sax, K. J. Liang, M. Henaff, H. Tang, A. Cao, J. Chai, F. Meier, and M. Feiszli, “Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 21 924–21 935

  26. [26]

    Streaming 4d visual geometry transformer,

    D. Zhuo, W. Zheng, J. Guo, Y . Wu, J. Zhou, and J. Lu, “Streaming 4d visual geometry transformer,”International Conference on Learning Representations, 2026

  27. [27]

    Infinitevggt: Visual geometry grounded transformer for endless streams

    S. Yuan, Y . Yang, X. Yang, X. Zhang, Z. Zhao, L. Zhang, and Z. Zhang, “Infinitevggt: Visual geometry grounded transformer for endless streams,”arXiv preprint arXiv:2601.02281, 2026

  28. [28]

    π 3: Permutation-equivariant visual geometry learning,

    Y . Wang, J. Zhou, H. Zhu, W. Chang, Y . Zhou, Z. Li, J. Chen, J. Pang, C. Shen, and T. He, “π 3: Permutation-equivariant visual geometry learning,”International Conference on Learning Representations, 2026

  29. [29]

    Fastvggt: Training-free acceleration of visual geometry transformer,

    Y . Shen, Z. Zhang, Y . Qu, X. Zheng, J. Ji, S. Zhang, and L. Cao, “Fastvggt: Training-free acceleration of visual geometry transformer,” International Conference on Learning Representations, 2026

  30. [30]

    Continuous 3d perception model with persistent state,

    Q. Wang, Y . Zhang, A. Holynski, A. A. Efros, and A. Kanazawa, “Continuous 3d perception model with persistent state,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 10 510–10 522

  31. [31]

    arXiv preprint arXiv:2412.19584 (2024)

    K. Xu, T. H. E. Tse, J. Peng, and A. Yao, “Das3r: Dynamics- aware gaussian splatting for static scene reconstruction,”arXiv preprint arXiv:2412.19584, 2024

  32. [32]

    Vision transformers for dense prediction,

    R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inProceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 12 179–12 188

  33. [33]

    Nerf: Representing scenes as neural radiance fields for view synthesis,

    B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,”Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021

  34. [34]

    Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,

    J. T. Barron, B. Mildenhall, M. Tancik, P. Hedman, R. Martin-Brualla, and P. P. Srinivasan, “Mip-nerf: A multiscale representation for anti- aliasing neural radiance fields,” inProceedings of the IEEE/CVF inter- national conference on computer vision, 2021, pp. 5855–5864

  35. [35]

    Robust dynamic radiance fields,

    Y .-L. Liu, C. Gao, A. Meuleman, H.-Y . Tseng, A. Saraf, C. Kim, Y .-Y . Chuang, J. Kopf, and J.-B. Huang, “Robust dynamic radiance fields,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13–23

  36. [36]

    Cbarf: cascaded bundle-adjusting neural radiance fields from imperfect camera poses,

    H. Fu, X. Yu, L. Li, and L. Zhang, “Cbarf: cascaded bundle-adjusting neural radiance fields from imperfect camera poses,”IEEE Transactions on Multimedia, vol. 26, pp. 9304–9315, 2024

  37. [37]

    4dgstream: Variable bitrate dynamic gaussian splatting streaming,

    Z. Liang, D. Zhang, L. Shen, M. Zhang, J. Zhang, B. Ju, M. Dasari, F. Wang, and J. Liu, “4dgstream: Variable bitrate dynamic gaussian splatting streaming,”IEEE Transactions on Multimedia, 2026

  38. [38]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimk ¨uhler, G. Drettakiset al., “3d gaussian splatting for real-time radiance field rendering.”ACM Trans. Graph., vol. 42, no. 4, pp. 139–1, 2023

  39. [39]

    Mip-splatting: Alias-free 3d gaussian splatting,

    Z. Yu, A. Chen, B. Huang, T. Sattler, and A. Geiger, “Mip-splatting: Alias-free 3d gaussian splatting,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 19 447–19 456

  40. [40]

    Segment anything,

    A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y . Loet al., “Segment anything,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4015–4026

  41. [41]

    Dinov2: Learning robust visual features without supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Noubyet al., “Dinov2: Learning robust visual features without supervision,”Transactions on Machine Learning Research, 2024

  42. [42]

    Skinning with dual quaternions,

    L. Kavan, S. Collins, J. ˇZ´ara, and C. O’Sullivan, “Skinning with dual quaternions,” inProceedings of the 2007 symposium on Interactive 3D graphics and games, 2007, pp. 39–46

  43. [43]

    As-rigid-as-possible surface modeling,

    O. Sorkine, M. Alexaet al., “As-rigid-as-possible surface modeling,” in Symposium on Geometry processing, vol. 4, 2007, pp. 109–116

  44. [44]

    Dynamicfusion: Recon- struction and tracking of non-rigid scenes in real-time,

    R. A. Newcombe, D. Fox, and S. M. Seitz, “Dynamicfusion: Recon- struction and tracking of non-rigid scenes in real-time,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 343–352

  45. [45]

    Monocular dynamic view synthesis: A reality check,

    H. Gao, R. Li, S. Tulsiani, B. Russell, and A. Kanazawa, “Monocular dynamic view synthesis: A reality check,”Advances in Neural Informa- tion Processing Systems, vol. 35, pp. 33 768–33 780, 2022

  46. [46]

    A benchmark dataset and evaluation methodology for video object segmentation,

    F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, “A benchmark dataset and evaluation methodology for video object segmentation,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 724–732

  47. [47]

    The 2017 DAVIS Challenge on Video Object Segmentation

    J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbel ´aez, A. Sorkine-Hornung, and L. Van Gool, “The 2017 davis challenge on video object segmen- tation,”arXiv preprint arXiv:1704.00675, 2017

  48. [48]

    A benchmark for the evaluation of rgb-d slam systems,

    J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of rgb-d slam systems,” in2012 IEEE/RSJ international conference on intelligent robots and systems. IEEE, 2012, pp. 573–580

  49. [49]

    Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam,

    L. Sheng, D. Xu, W. Ouyang, and X. Wang, “Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 4302–4311

  50. [50]

    Ttt3r: 3d recon- struction as test-time training,

    X. Chen, Y . Chen, Y . Xiu, A. Geiger, and A. Chen, “Ttt3r: 3d recon- struction as test-time training,”International Conference on Learning Representations, 2026