
arxiv: 2606.04593 · v1 · pith:6ME63KL7new · submitted 2026-06-03 · 💻 cs.CV

4D Reconstruction from Sparse Dynamic Cameras

Pith reviewed 2026-06-28 06:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstruction · sparse dynamic cameras · 3D track initialization · depth-ordering loss · LetCamsGo dataset · spatiotemporal consistency · multi-view video · dynamic scene reconstruction

The pith

Sparse dynamic cameras with inter-camera track initialization enable consistent 4D reconstruction where monocular and fixed-camera methods fail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper targets 4D reconstruction from a handful of independently moving cameras, a setup that supplies multi-view constraints at low cost yet creates complex inconsistencies across space and time. It introduces a 3D track initialization step that merges inter-camera feature matching with intra-camera point tracking to enforce spatiotemporal consistency, plus a noise-robust depth-ordering loss and a batch-sampling strategy that favors diverse spatiotemporal coverage. These elements are evaluated on the new LetCamsGo dataset of five real-world sequences recorded by three moving cameras and one fixed camera across four environments. Benchmark results show measurable gains in reconstruction quality inside moving regions relative to direct adaptations of existing monocular or dense-fixed pipelines. The work positions the approach as a practical route to low-cost 4D capture for applications such as sports and live events.
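One way to picture the batch-sampling idea is stratified sampling over (camera, time-bin) cells. The sketch below is an illustrative assumption, not the paper's procedure: the function name `sample_diverse_batch`, the time-bin count, and the round-robin scheme are all hypothetical.

```python
import random
from collections import defaultdict


def sample_diverse_batch(frames, batch_size, num_time_bins, num_frames, rng):
    """Pick (camera, frame_index) pairs spread over cameras and time bins.

    frames: list of (camera_id, frame_index) tuples available for training.
    Hypothetical scheme: group frames into (camera, time-bin) cells, then draw
    one frame per cell in shuffled round-robin order until the batch is full.
    """
    cells = defaultdict(list)
    for cam, t in frames:
        time_bin = min(t * num_time_bins // num_frames, num_time_bins - 1)
        cells[(cam, time_bin)].append((cam, t))
    for members in cells.values():
        rng.shuffle(members)

    batch = []
    while len(batch) < batch_size and cells:
        keys = list(cells)
        rng.shuffle(keys)
        for key in keys:
            batch.append(cells[key].pop())
            if not cells[key]:
                del cells[key]  # this cell is exhausted
            if len(batch) == batch_size:
                break
    return batch


# Three cameras, twelve frames each, two time bins: six cells in total.
frames = [(cam, t) for cam in range(3) for t in range(12)]
batch = sample_diverse_batch(frames, batch_size=6, num_time_bins=2,
                             num_frames=12, rng=random.Random(0))
```

With a batch size equal to the number of cells, each camera and time-bin combination contributes exactly one frame, which is the kind of coverage a uniformly random draw does not guarantee.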

Core claim

A 3D track initialization procedure that integrates inter-camera feature matching with intra-camera point tracking, combined with a noise-robust depth-ordering regularization loss and spatiotemporally diverse batch sampling, overcomes the spatiotemporal inconsistencies that defeat naive extensions of monocular or dense-fixed-camera methods and thereby improves 4D reconstruction quality in dynamic regions on the LetCamsGo benchmark.
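The summary does not give the form of the depth-ordering loss. As a rough sketch only, a generic pairwise ordering penalty can be made tolerant to noisy reference depth by skipping pairs whose reference depths are too close to rank reliably. The function name, the softplus penalty, and the `min_gap` threshold below are assumptions for illustration and may differ from the paper's formulation.

```python
import math


def depth_ordering_loss(pred, ref, pairs, min_gap=0.05):
    """Average softplus ranking penalty over pairs with a clear reference order.

    pred, ref: sequences of rendered and reference depths (same indexing).
    pairs: iterable of (i, j) index pairs to compare.
    min_gap: hypothetical threshold; pairs whose reference depths differ by
             less than this are skipped, since their ordering may be noise.
    """
    total, count = 0.0, 0
    for i, j in pairs:
        gap = ref[i] - ref[j]
        if abs(gap) < min_gap:
            continue  # ordering is unreliable, so it contributes no penalty
        sign = 1.0 if gap > 0 else -1.0
        # Positive margin means the predicted order agrees with the reference.
        margin = sign * (pred[i] - pred[j])
        total += math.log1p(math.exp(-margin))
        count += 1
    return total / count if count else 0.0


ref = [1.0, 2.0, 2.01]
pairs = [(0, 1), (1, 2), (0, 2)]
loss_agree = depth_ordering_loss([0.9, 2.2, 2.0], ref, pairs)
loss_flip = depth_ordering_loss([2.2, 0.9, 2.0], ref, pairs)
```

Because only the sign of the reference gap is used, the penalty depends on ordering rather than on absolute depth values, so a scale error in the reference depth does not change which pairs are penalized.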

What carries the argument

3D track initialization method that integrates inter-camera feature matching with intra-camera point tracking to enforce spatiotemporal consistency.
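The step that turns a matched pair of 2D observations into one 3D point can be sketched with two-ray midpoint triangulation. This is a stand-in chosen for brevity: the paper's actual triangulation solver is not specified in this summary, and the function below assumes camera centers and viewing-ray directions have already been derived from the estimated poses.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2):
    """Return the point midway between the closest points of two viewing rays.

    c1, c2: camera centers; d1, d2: ray directions (need not be unit length).
    Returns None when the rays are near-parallel and depth is unconstrained.
    """
    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2 for u, v in zip(p1, p2)]


# Two cameras observing the same point: the rays intersect exactly there.
point = triangulate_midpoint([0, 0, 0], [1, 2, 5], [4, 0, 0], [-3, 2, 5])
```

Repeating this per timestamp along tracked points would give one 3D position per frame, and the distance between the two closest points could serve as a simple consistency check on each match.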

If this is right

  • Reconstruction quality improves specifically in regions with independent object motion.
  • The same capture hardware already used in sports and concert production can support 4D output without added fixed cameras.
  • LetCamsGo supplies a public, standardized test set for comparing future sparse-dynamic methods.
  • The pipeline remains practical for real-world video workflows that tolerate modest camera motion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The initialization step may generalize to other multi-view dynamic tasks such as non-rigid structure from motion.
  • Reducing the number of required moving cameras below three could become feasible once the consistency mechanisms are further tuned.
  • The batch-sampling strategy might transfer to other optimization problems that suffer from view-time correlations.
  • Combining the approach with existing monocular depth estimators could further lower the minimum number of cameras needed.

Load-bearing premise

That the 3D track initialization, depth-ordering loss, and diverse batch sampling are together necessary and sufficient to resolve the spatiotemporal inconsistencies that arise when cameras move independently.

What would settle it

A controlled ablation on LetCamsGo in which any one of the three proposed components is removed and the remaining system shows no improvement, or a decline, in dynamic-region reconstruction metrics relative to the strongest baseline.
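The ablation described above amounts to a leave-one-out grid over the three components. A minimal sketch of enumerating those runs follows; the component labels and configuration names are hypothetical, not taken from the paper.

```python
# Hypothetical labels for the three proposed components.
COMPONENTS = ("track_init", "depth_ordering_loss", "diverse_batch_sampling")


def leave_one_out_configs(components=COMPONENTS):
    """Yield (name, flags) for the full system and each single-removal variant."""
    yield "full", {c: True for c in components}
    for removed in components:
        yield f"no_{removed}", {c: c != removed for c in components}


configs = dict(leave_one_out_configs())
for name, flags in configs.items():
    enabled = [c for c, on in flags.items() if on]
    print(f"{name}: {', '.join(enabled)}")
```

Each variant would be trained with the same 4D representation, hyperparameters, and protocol, then compared against the full system and the strongest baseline on dynamic-region metrics.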

Figures

Figures reproduced from arXiv: 2606.04593 by Eisuke Takeuchi, Kazuki Ozeki, Kazumi Fukuda, Ryosuke Sawata, Shun Kenney, Takuya Narihira, Yoshimitsu Aoki, Yuki Mitsufuji, Yuto Shibata.

Figure 1: Overview. We introduce the task of 4D reconstruction from sparse dynamic cameras, a versatile and depth ambiguity-free configuration of multiple, independently moving cameras prevalent in real-world video production. Our framework resolves the fundamental spatiotemporal inconsistencies that typically undermine naive adaptations of methods designed for monocular or dense-fixed camera setups [20, 59, 68]. …
Figure 2: Initial dynamic points. Pose-conditioned metric multi-view depth estimation [32] and multi-view 3D point tracking [44] produce noisy and inconsistent dynamic points across times and views (a). In contrast, our multi-view consistent 3D track initialization produces more accurate and consistent dynamic points (b). monocular dynamic camera setup. On the basis of MoSca, we focus on improving its motion-scaffo…
Figure 3: Proposed multi-view consistent 3D track initialization. We combine inter-camera feature matching and intra-camera point tracking, followed by epipolar filtering and triangulation to produce spatiotemporally consistent 3D tracks for motion-scaffold initialization. 3.2. Naive Approach. Since monocular 4D reconstruction is highly ill-posed, MoSca [20] relies on 2D foundation models for initialization and supe…
Figure 4: Visualization of estimated camera trajectories. The estimated trajectories of the three dynamic cameras are shown in red, green, and blue, and the fixed evaluation camera is shown in pink. For visual clarity, the frame rate is reduced to 6 FPS. $K_c [R_{c,t} \mid t_{c,t}]$ is the projection matrix. In contrast to back-projecting noisy depth estimates, this step directly produces multi-view consistent 3D points at each…
Figure 5: Qualitative Results on LetCamsGo. Regions excluded from quantitative evaluation are masked in white.
Original abstract

Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on an alternative practical way, i.e., sparse dynamic camera setup, where a handful of independently moving cameras capture the same subjects. While keeping capture costs low, this setup introduces multi-view constraints and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense-fixed camera-based methods are insufficient since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method designed to ensure spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrated that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a framework for 4D reconstruction from sparse dynamic cameras. It introduces a 3D track initialization method integrating inter-camera feature matching with intra-camera point tracking, a noise-robust depth-ordering regularization loss, and a spatiotemporally diverse batch sampling strategy to resolve spatiotemporal inconsistencies that cause naive extensions of monocular or dense-fixed methods to fail. The authors present the LetCamsGo dataset (5 sequences across 4 environments captured by three moving cameras and one fixed camera) and claim that comprehensive benchmarking shows improved 4D reconstruction quality in dynamic regions relative to baselines.

Significance. If the claimed improvements hold under rigorous evaluation, the work could enable practical low-cost 4D capture in dynamic real-world settings such as sports and concerts by exploiting multi-view constraints without requiring dense fixed rigs or accepting monocular depth ambiguity.

major comments (2)
  1. [Abstract] Abstract: the claim that 'comprehensive benchmarking on LetCamsGo demonstrated that our proposed framework improves 4D reconstruction quality in dynamic regions' is presented without any quantitative metrics, error bars, ablation tables, or dataset statistics, rendering the central empirical claim unverifiable from the supplied evidence.
  2. [Experiments] Experiments (benchmarking on LetCamsGo): the assertion that the three components (3D track initialization, noise-robust depth-ordering loss, spatiotemporally diverse batch sampling) overcome spatiotemporal inconsistencies rests on end-to-end comparisons but supplies no ablations that isolate the contribution of each component versus hyper-parameter tuning, implementation details of the underlying 4D representation, or dataset-specific biases.
minor comments (1)
  1. [Methods] Methods section: explicit equations for the depth-ordering loss and the batch-sampling procedure would improve reproducibility and allow readers to assess how they differ from standard regularization terms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that strengthen the empirical presentation without altering the core contributions.

  1. Referee: [Abstract] Abstract: the claim that 'comprehensive benchmarking on LetCamsGo demonstrated that our proposed framework improves 4D reconstruction quality in dynamic regions' is presented without any quantitative metrics, error bars, ablation tables, or dataset statistics, rendering the central empirical claim unverifiable from the supplied evidence.

    Authors: We agree that the abstract would be strengthened by including summary quantitative evidence. The full paper contains the detailed metrics, comparisons, and dataset statistics in the Experiments section. We will revise the abstract to report key aggregate improvements (e.g., error reductions in dynamic regions) while remaining within length limits. revision: yes

  2. Referee: [Experiments] Experiments (benchmarking on LetCamsGo): the assertion that the three components (3D track initialization, noise-robust depth-ordering loss, spatiotemporally diverse batch sampling) overcome spatiotemporal inconsistencies rests on end-to-end comparisons but supplies no ablations that isolate the contribution of each component versus hyper-parameter tuning, implementation details of the underlying 4D representation, or dataset-specific biases.

    Authors: The referee correctly identifies that the current experiments emphasize end-to-end results. To isolate each component's contribution, we will add dedicated ablation studies in the revised manuscript. These will include controlled variants (with/without each module) while holding the 4D representation, hyperparameters, and training protocol fixed, plus discussion of potential dataset biases. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper describes a 4D reconstruction framework using 3D track initialization via inter-camera matching and intra-camera tracking, a noise-robust depth-ordering loss, and spatiotemporally diverse batch sampling, evaluated on the new LetCamsGo dataset. No equations, fitted parameters, or predictions are presented in the provided text that reduce any claimed result to its inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or renamings of known results appear. The central claims rest on empirical benchmarking and adaptation of standard geometric components, which remain independent of the method description itself.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; review is limited to the provided summary.

pith-pipeline@v0.9.1-grok · 5820 in / 1035 out tokens · 26634 ms · 2026-06-28T06:43:45.161912+00:00 · methodology


Reference graph

Works this paper leans on

78 extracted references · 2 linked inside Pith

  1. [1]

    UltraSync

    ATOMOS. UltraSync. https://www.atomos.com/wireless-sync/, 2025.

  2. [2]

    Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting

    Jeongmin Bae, Seoha Kim, Youngsik Yun, Hahyun Lee, Gun Bang, and Youngjung Uh. Per-Gaussian Embedding-Based Deformation for Deformable 3D Gaussian Splatting. In ECCV, 2024.

  3. [3]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, and Di Zhang. Recammaster: Camera-controlled generative rendering from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 14834–14844, 2025.

  4. [4]

    Immersive light field video with a layered mesh representation

    Michael Broxton, John Flynn, Ryan Overbeck, Daniel Erickson, Peter Hedman, Matthew Duvall, Jason Dourgarian, Jay Busch, Matt Whalen, and Paul Debevec. Immersive light field video with a layered mesh representation, 2020.

  5. [5]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.

  6. [6]

    A benchmark dataset and evaluation methodology for video object segmentation

    Perazzi Federico, Jordi Pont-Tuset, Brian McWilliams, Luc Van Gool, Markus Gross, and Alexander Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In CVPR, 2016.

  7. [7]

    Monocular dynamic view synthesis: A reality check

    Hang Gao, Ruilong Li, Shubham Tulsiani, Bryan Russell, and Angjoo Kanazawa. Monocular dynamic view synthesis: A reality check. In NeurIPS, pages 33768–33780, 2022.

  8. [8]

    QUEEN: QUantized efficient ENcoding for streaming free-viewpoint videos

    Sharath Girish, Tianye Li, Amrita Mazumdar, Abhinav Shrivastava, David Luebke, and Shalini De Mello. QUEEN: QUantized efficient ENcoding for streaming free-viewpoint videos. In NeurIPS, 2024.

  9. [9]

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, Miguel Martin, Tushar Nagarajan, Ilija Radosavovic, Santhosh Kumar Ramakrishnan, Fiona Ryan, Jayant Sharma, Michael Wray, Mengmeng Xu, Eric Zhongcong Xu, Chen Zhao, Siddhant Bansal, Dhruv Batra, Vincent Car...

  10. [10]

    Human Mesh Reconstruction of Sports Players with Multiple Dynamic Cameras

    Yamato Hokari, Ryosuke Hori, and Hideo Saito. Human Mesh Reconstruction of Sports Players with Multiple Dynamic Cameras. In CVPRW, pages 6039–6049, 2025.

  11. [11]

    4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video

    Qiang Hu, Zihan Zheng, Houqiang Zhong, Sihua Fu, Li Song, Xiaoyun Zhang, Guangtao Zhai, and Yanfeng Wang. 4DGC: Rate-Aware 4D Gaussian Compression for Efficient Streamable Free-Viewpoint Video. In CVPR, pages 875–885, 2025.

  12. [12]

    Depthcrafter: Generating consistent long depth sequences for open-world videos

    Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. Depthcrafter: Generating consistent long depth sequences for open-world videos. In CVPR, 2025.

  13. [13]

    SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes

    Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes. In CVPR, pages 4220–4230, 2024.

  14. [14]

    Wildavatar: Learning in-the-wild 3d avatars from the web

    Zihao Huang, Shoukang Hu, Guangcong Wang, Tianqi Liu, Yuhang Zang, Zhiguo Cao, Wei Li, and Ziwei Liu. Wildavatar: Learning in-the-wild 3d avatars from the web. In CVPR, 2025.

  15. [15]

    Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction

    Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction. In ICCV, pages 20658–20671, 2025.

  16. [16]

    Motion-x: A large-scale 3d expressive whole-body human motion dataset

    Lin Jing, Ailing Zeng, Shunlin Lu, Yuanhao Cai, Ruimao Zhang, Haoqian Wang, and Lei Zhang. Motion-x: A large-scale 3d expressive whole-body human motion dataset. In NeurIPS, 2023.

  17. [17]

    Panoptic studio: A massively multiview system for social motion capture

    Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. In ICCV, 2015.

  18. [18]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.

  19. [19]

    A hierarchical 3d gaussian representation for real-time rendering of very large datasets

    Bernhard Kerbl, Andreas Meuleman, Georgios Kopanas, Michael Wimmer, Alexandre Lanvin, and George Drettakis. A hierarchical 3d gaussian representation for real-time rendering of very large datasets. ACM Transactions on Graphics, 43(4), 2024.

  20. [20]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. In CVPR, pages 6165–6177, 2025.

  21. [21]

    Grounding Image Matching in 3D with MASt3R

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding Image Matching in 3D with MASt3R. In ECCV, 2024.

  22. [22]

    Gifstream: 4d gaussian-based immersive video with feature stream

    Hao Li, Sicheng Li, Xiang Gao, Abudouaihati Batuer, Lu Yu, and Yiyi Liao. Gifstream: 4d gaussian-based immersive video with feature stream. In CVPR, pages 21761–21770,

  23. [23]

    Streaming radiance fields for 3d video synthesis

    Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Ping Tan. Streaming radiance fields for 3d video synthesis. In NeurIPS, 2022.

  24. [24]

    Neural 3d video synthesis from multi-view video

    Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, and Zhaoyang Lv. Neural 3d video synthesis from multi-view video. In CVPR, 2022.

  25. [25]

    Geometry-consistent 4d gaussian splatting for sparse-input dynamic view synthesis

    Yiwei Li, Jiannong Cao, Penghui Ruan, Divya Saxena, Songye Zhu, and Yinfeng Cao. Geometry-consistent 4d gaussian splatting for sparse-input dynamic view synthesis. arXiv preprint arXiv:2511.23044, 2025.

  26. [26]

    Neural scene flow fields for space-time view synthesis of dynamic scenes

    Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In CVPR, 2021.

  27. [27]

    Spacetime gaussian feature splatting for real-time dynamic view synthesis

    Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. In CVPR, pages 8508–8520, 2024.

  28. [28]

    Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos

    Hanxue Liang, Jiawei Ren, Ashkan Mirzaei, Antonio Torralba, Ziwei Liu, Igor Gilitschenski, Sanja Fidler, Cengiz Oztireli, Huan Ling, Zan Gojcic, and Jiahui Huang. Feed-Forward Bullet-Time Reconstruction of Dynamic Scenes from Monocular Videos. arXiv preprint arXiv:2412.03526,

  29. [29]

    HiMoR: Monocular deformable gaussian reconstruction with hierarchical motion representation

    Yiming Liang, Tianhan Xu, and Yuta Kikuchi. HiMoR: Monocular deformable gaussian reconstruction with hierarchical motion representation. In CVPR, 2025.

  30. [30]

    MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second

    Chenguo Lin, Yuchen Lin, Panwang Pan, Yifan Yu, Honglei Yan, Katerina Fragkiadaki, and Yadong Mu. MoVieS: Motion-Aware 4D Dynamic View Synthesis in One Second. arXiv preprint arXiv:2507.10065, 2025.

  31. [31]

    Efficient neural radiance fields for interactive free-viewpoint video

    Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia, 2022.

  32. [32]

    Depth anything 3: recovering the visual space from any views

    Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.

  33. [33]

    D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS

    Mufan Liu, Qi Yang, Miaoran Zhao, He Huang, Le Yang, Zhu Li, and Yiling Xu. D2GV: Deformable 2D Gaussian Splatting for Video Representation in 400FPS. arXiv preprint arXiv:2503.05600, 2025.

  34. [34]

    MoDGS: Dynamic gaussian splatting from casually-captured monocular videos with depth priors

    Qingming LIU, Yuan Liu, Jiepeng Wang, Xianqiang Lyu, Peng Wang, Wenping Wang, and Junhui Hou. MoDGS: Dynamic gaussian splatting from casually-captured monocular videos with depth priors. In The Thirteenth International Conference on Learning Representations, 2025.

  35. [35]

    3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis

    Zhicheng Lu, Xiang Guo, Le Hui, Tianrui Chen, Ming Yang, Xiao Tang, Feng Zhu, and Yuchao Dai. 3D Geometry-aware Deformable Gaussian Splatting for Dynamic View Synthesis. In CVPR, 2024.

  36. [36]

    Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis

    Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by Persistent Dynamic View Synthesis. In 3DV, 2024.

  37. [37]

    4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians

    Hidenobu Matsuki, Gwangbin Bae, and Andrew Davison. 4DTAM: Non-Rigid Tracking and Mapping via Dynamic Surface Gaussians. In CVPR, 2025.

  38. [38]

    SplatFields: Neural gaussian splats for sparse 3d and 4d reconstruction

    Marko Mihajlovic, Sergey Prokudin, Siyu Tang, Robert Maier, Federica Bogo, Tony Tung, and Edmond Boyer. SplatFields: Neural gaussian splats for sparse 3d and 4d reconstruction. In ECCV. Springer, 2024.

  39. [39]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.

  40. [40]

    Temporally coherent 4d reconstruction of complex dynamic scenes

    A. Mustafa, H. Kim, J-Y. Guillemaut, and A. Hilton. Temporally coherent 4d reconstruction of complex dynamic scenes. In CVPR, 2016.

  41. [41]

    CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos

    Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker3: Simpler and Better Point Tracking by Pseudo-Labelling Real Videos. arXiv preprint arXiv:2410.11831,

  42. [42]

    UniDepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. UniDepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  43. [43]

    Shape of motion: 4d reconstruction from a single video

    Wang Qianqian, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video. In ICCV, 2025.

  44. [44]

    Multi-view 3d point tracking

    Frano Rajič, Haofei Xu, Marko Mihajlovic, Siyuan Li, Irem Demir, Emircan Gündoğdu, Lei Ke, Sergey Prokudin, Marc Pollefeys, and Siyu Tang. Multi-view 3d point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

  45. [45]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. Sam 2: Segment anything in images and videos. In ICLR, 2025.

  46. [46]

    L4GM: Large 4D Gaussian Reconstruction Model

    Jiawei Ren, Kevin Xie, Ashkan Mirzaei, Hanxue Liang, Xiaohui Zeng, Karsten Kreis, Ziwei Liu, Antonio Torralba, Sanja Fidler, Seung Wook Kim, and Huan Ling. L4GM: Large 4D Gaussian Reconstruction Model. In NeurIPS,

  47. [47]

    Grounded sam: Assembling open-world models for diverse visual tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded sam: Assembling open-world models for diverse visual tasks,

  48. [48]

    Dynamic camera poses and where to find them

    Chris Rockwell, Joseph Tung, Tsung-Yi Lin, Ming-Yu Liu, David F. Fouhey, and Chen-Hsuan Lin. Dynamic camera poses and where to find them. In CVPR, 2025.

  49. [49]

    Dataset and pipeline for multi-view light-field video

    Neus Sabater, Guillaume Boisson, Benoit Vandame, Paul Kerbiriou, Frederic Babon, Matthieu Hog, Tristan Langlois, Remy Gendrot, Olivier Bureller, Arno Schubert, and Valerie Allie. Dataset and pipeline for multi-view light-field video. In CVPRW, 2017.

  50. [50]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In CVPR, pages 4104–4113, 2016.

  51. [51]

    Gim: Learning generalizable image matcher from internet videos

    Xuelun Shen, Zhipeng Cai, Wei Yin, Matthias Müller, Zijun Li, Kaixuan Wang, Xiaozhi Chen, and Cheng Wang. Gim: Learning generalizable image matcher from internet videos. In ICLR, 2024.

  52. [52]

    SONY. α7S III. https://www.sony.jp/ichigan/products/ILCE-7SM3/, 2025.

  53. [53]

    FX3

    SONY. FX3. https://www.sony.jp/pro-cam/products/ILME-FX3A/, 2025.

  54. [54]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in percepti...

  55. [55]

    Splatter a Video: Video Gaussian Representation for Versatile Processing

    Yang-Tian Sun, Yi-Hua Huang, Lin Ma, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. Splatter a Video: Video Gaussian Representation for Versatile Processing. In NeurIPS, pages 50401–50425, 2024.

  56. [56]

    Recovering accurate 3d human pose in the wild using imus and a moving camera

    Timo von Marcard, Roberto Henschel, Michael Black, Bodo Rosenhahn, and Gerard Pons-Moll. Recovering accurate 3d human pose in the wild using imus and a moving camera. In ECCV, 2018.

  57. [57]

    Freeman: Towards benchmarking 3d human pose estimation under real-world conditions

    Jiong Wang, Fengyu Yang, Bingliang Li, Wenbo Gou, Danqi Yan, Ailing Zeng, Yijun Gao, Junle Wang, Yanqing Jing, and Ruimao Zhang. Freeman: Towards benchmarking 3d human pose estimation under real-world conditions. In CVPR, 2024.

  58. [58]

    SplatVoxel: History-Aware Novel View Streaming without Temporal Training

    Yiming Wang, Lucy Chai, Xuan Luo, Michael Niemeyer, Manuel Lagunas, Stephen Lombardi, Siyu Tang, and Tiancheng Sun. SplatVoxel: History-Aware Novel View Streaming without Temporal Training. arXiv preprint arXiv:2503.14698, 2025.

  59. [59]

    FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction

    Yifan Wang, Peishan Yang, Zhen Xu, Jiaming Sun, Zhan- hua Zhang, Yong Chen, Hujun Bao, Sida Peng, and Xiaowei Zhou. FreeTimeGS: Free Gaussian Primitives at Anytime Anywhere for Dynamic Scene Reconstruction. InCVPR,

  60. [60]

    Monofusion: Sparse-view 4d reconstruction via monocular fusion

    Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, and Deva Ramanan. Monofusion: Sparse-view 4d reconstruction via monocular fusion. InICCV, 2025. 5

  61. [61]

    4D-Fly: Fast 4D Recon- struction from a Single Monocular Video

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yue Qian, Xi- aohang Zhan, and Yueqi Duan. 4D-Fly: Fast 4D Recon- struction from a Single Monocular Video. InCVPR, pages 16663–16673, 2025. 3

  62. [62]

    4D Gaussian Splatting for Real-Time Dynamic Scene Ren- dering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian Splatting for Real-Time Dynamic Scene Ren- dering. InCVPR, pages 20310–20320, 2024. 3, 5

  63. [63]

    Spatialtracker: Tracking any 2d pixels in 3d space

    Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 4

  64. [64]

    Mitracker: Multi-view integration for visual object tracking

    Mengjie Xu, Yitao Zhu, Haotian Jiang, Jiaming Li, Zhen- rong Shen, Sheng Wang, Haolin Huang, Xinyu Wang, Qing Yang, Han Zhang, and Qian Wang. Mitracker: Multi-view integration for visual object tracking. InCVPR, 2025. 3

  65. [65]

    Representing long volumet- ric video with temporal gaussian hierarchy.ACM TOG, 43 (6):1–18, 2024

    Zhen Xu, Yinghao Xu, Zhiyuan Yu, Sida Peng, Jiaming Sun, Hujun Bao, and Xiaowei Zhou. Representing long volumet- ric video with temporal gaussian hierarchy.ACM TOG, 43 (6):1–18, 2024. 3

  66. [66]

    Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting

    Jinbo Yan, Rui Peng, Zhiyan Wang, Luyang Tang, Jiayu Yang, Jie Liang, Jiahao Wu, and Ronggang Wang. Instant Gaussian Stream: Fast and Generalizable Streaming of Dynamic Scene Reconstruction via Gaussian Splatting. In CVPR, pages 16520–16531, 2025. 3

  67. [67]

    Depth anything v2

    Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth anything v2. arXiv:2406.09414, 2024. 4

  68. [68]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In CVPR, pages 20331–20341, 2024. 1, 3, 5, 6

  69. [69]

    4d gaussian splatting: Modeling dynamic scenes with native 4d primitives

    Zeyu Yang, Zijie Pan, Xiatian Zhu, Li Zhang, Jianfeng Feng, Yu-Gang Jiang, and Philip HS Torr. 4d gaussian splatting: Modeling dynamic scenes with native 4d primitives. arXiv preprint, 2024. 3

  70. [70]

    Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting

    Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time Photorealistic Dynamic Scene Representation and Rendering with 4D Gaussian Splatting. In ICLR, 2024. 3

  71. [71]

    Imvid: Immersive volumetric videos for enhanced vr engagement

    Zhengxian Yang, Shi Pan, Shengqi Wang, Haoxiang Wang, Li Lin, Guanjun Li, Zhengqi Wen, Borong Lin, Jianhua Tao, and Tao Yu. Imvid: Immersive volumetric videos for enhanced vr engagement. In CVPR, 2025. 3, 1

  72. [72]

    SplineGS: Learning smooth trajectories in gaussian splatting for dynamic scene reconstruction

    Jihwan Yoon, Sangbeom Han, Jaeseok Oh, and Minsik Lee. SplineGS: Learning smooth trajectories in gaussian splatting for dynamic scene reconstruction. In ICLR poster, 2025. 3

  73. [73]

    EvolvingGS: High-Fidelity Streamable Volumetric Video via Evolving 3D Gaussian Representation

    Chao Zhang, Yifeng Zhou, Shuheng Wang, Wenfa Li, Degang Wang, Yi Xu, and Shaohui Jiao. EvolvingGS: High-Fidelity Streamable Volumetric Video via Evolving 3D Gaussian Representation. arXiv preprint arXiv:2503.05162,

  74. [74]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018. 6

  75. [75]

    Zetong Zhang, Manuel Kaufmann, Lixin Xue, Jie Song, and Martin R. Oswald. ODHSR: Online Dense 3D Reconstruction of Humans and Scenes from Monocular Videos. In CVPR, pages 21824–21835, 2025. 3

  76. [76]

    GauSTAR: Gaussian Surface Tracking and Reconstruction

    Chengwei Zheng, Lixin Xue, Juan Zarate, and Jie Song. GauSTAR: Gaussian Surface Tracking and Reconstruction. In CVPR, pages 16543–16553, 2025. 3

  77. [77]

    GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis

    Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis. In CVPR, 2024. 3

4D Reconstruction from Sparse Dynamic Cameras
Supplementary Material

Details of LetCamsGo

8.1. Data Acquisition

LetCamsGo is captured using three independently moving cameras that follow the subjects during recording, mimicking realistic handheld or operator-driven capture scenarios. We use two Sony FX3 cameras [53] and two Sony α7S III cameras [52], where three cameras act as dynamic cameras and one α7S III serves as a...