pith. machine review for the scientific record.

arxiv: 2603.26481 · v3 · submitted 2026-03-27 · 💻 cs.CV

Recognition: 1 theorem link

· Lean Theorem

SparseCam4D: Spatio-Temporally Consistent 4D Reconstruction from Sparse Cameras

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 23:38 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: 4D reconstruction · sparse cameras · spatio-temporal consistency · dynamic scenes · generative observations · distortion field · uncalibrated cameras

The pith

Spatio-Temporal Distortion Field corrects inconsistencies in generative observations to enable 4D reconstruction from sparse uncalibrated cameras.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a single learned Spatio-Temporal Distortion Field can unify correction of spatial and temporal inconsistencies across generative observations. This correction supports a full pipeline for producing high-fidelity, consistent 4D models of dynamic scenes. A reader would care because conventional 4D capture requires tens or hundreds of synchronized cameras in costly lab setups, while this approach works with far fewer inputs. The result is scalable reconstruction that still achieves spatio-temporal consistency and outperforms prior methods on standard benchmarks.

Core claim

Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs.

What carries the argument

The Spatio-Temporal Distortion Field, a learned model that corrects spatial and temporal inconsistencies present in generative observations from sparse cameras.
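
To make the load-bearing machinery concrete, here is a minimal sketch of what such a distortion field could look like, assuming a K-planes-style set of learnable 2D feature planes over (x, y, z, t, s), bilinear sampling, and a small MLP decoder, as suggested by the method overview in Figure 3. The plane set, feature sizes, and the exact deformation quantities are assumptions for illustration, not details confirmed by the material above.

# Minimal sketch of a plane-factored Spatio-Temporal Distortion Field.
# Assumptions: one 2D feature plane per coordinate pair from (x, y, z, t, s),
# bilinear sampling, and a small MLP decoding a per-Gaussian 3D offset.
import itertools
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatioTemporalDistortionField(nn.Module):
    def __init__(self, feat_dim=16, plane_res=64, out_dim=3):
        super().__init__()
        # One learnable 2D feature plane per coordinate pair (10 planes here).
        self.pairs = list(itertools.combinations(range(5), 2))
        self.planes = nn.ParameterList(
            [nn.Parameter(0.1 * torch.randn(1, feat_dim, plane_res, plane_res))
             for _ in self.pairs]
        )
        # Small MLP decoding the summed plane features into a deformation value.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, out_dim)
        )

    def forward(self, coords):
        # coords: (N, 5) with columns (x, y, z, t, s), each normalized to [-1, 1].
        feats = 0.0
        for (i, j), plane in zip(self.pairs, self.planes):
            grid = coords[:, [i, j]].reshape(1, -1, 1, 2)              # (1, N, 1, 2)
            sampled = F.grid_sample(plane, grid, align_corners=True)   # (1, C, N, 1)
            feats = feats + sampled.squeeze(0).squeeze(-1).T           # (N, C)
        return self.decoder(feats)                                     # (N, out_dim)

# Usage: nudge 4D Gaussian centers for one generated frame at (t, s).
field = SpatioTemporalDistortionField()
centers = torch.rand(1024, 3) * 2 - 1            # normalized Gaussian centers
t = torch.full((1024, 1), 0.25)                  # normalized temporal index
s = torch.full((1024, 1), -0.5)                  # normalized pose index
corrected = centers + field(torch.cat([centers, t, s], dim=1))

In this reading the field is queried once per 4D Gaussian and per generated frame, with real frames bypassing the correction, mirroring the separate photometric losses described in the Figure 3 caption.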

If this is right

  • 4D reconstruction of dynamic scenes becomes practical without dense synchronized camera arrays.
  • Spatio-temporally consistent high-fidelity renderings are produced directly from inconsistent sparse inputs.
  • The pipeline outperforms existing methods on multi-camera dynamic scene benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distortion-field idea could be tested on other generative inputs such as video diffusion outputs for static scenes.
  • Removing the need for calibration opens the door to casual capture setups using handheld devices.
  • Extending the field to handle lighting changes or view-dependent effects would be a direct next step.

Load-bearing premise

Generative observations contain enough reliable signal that one learned distortion field can remove their inconsistencies without introducing new artifacts or losing detail in the final 4D model.

What would settle it

A test set of dynamic scenes where the generative observations contain inconsistencies that vary independently across space and time in ways no single field can model, leading to visible artifacts or loss of fidelity in the reconstructed output.
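
One way to construct such a test set, sketched below under the assumption that synthetic perturbations are an acceptable stand-in for real generative inconsistencies: redraw a dense per-pixel warp independently for every (pose, time) pair, so that by construction the distortions are uncorrelated across space and time and too high-frequency for a single low-capacity smooth field to absorb. The array layout and amplitude are illustrative only.

# Hedged sketch of the stress test described above; nothing here comes from the paper.
import numpy as np

def stress_test_observations(frames, amp=4.0, seed=0):
    # frames: (num_poses, num_times, H, W, 3) ground-truth frames.
    rng = np.random.default_rng(seed)
    P, T, H, W, _ = frames.shape
    ys, xs = np.mgrid[0:H, 0:W]
    out = np.empty_like(frames)
    for p in range(P):
        for t in range(T):
            # Independent per-pixel shifts, redrawn for every (pose, time) pair,
            # so the distortion is uncorrelated across space and time.
            dy = rng.normal(scale=amp, size=(H, W))
            dx = rng.normal(scale=amp, size=(H, W))
            sy = np.clip(ys + dy, 0, H - 1).astype(np.int64)
            sx = np.clip(xs + dx, 0, W - 1).astype(np.int64)
            out[p, t] = frames[p, t][sy, sx]
    return out  # perturbed "generative observations" for the stress test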

Figures

Figures reproduced from arXiv: 2603.26481 by Guofeng Zhang, Haomin Liu, Nan Wang, Weihong Pan, Xiaoyu Zhang, Zhichao Ye, Zhuang Zhang.

Figure 1
Figure 1: Novel view rendering comparison. With as few as 2-3 cameras, our approach reconstructs high-quality dynamic scenes with spatio-temporal consistency and photorealistic quality. Please refer to our project page for additional dynamic results.
Figure 2
Figure 2: Spatio-temporal inconsistency. Real cameras (grey) capture consistent content of a multi-view dynamic scene, while generative results (orange) include additional observations at different poses and times. Inconsistencies across poses at the same time are referred to as spatial inconsistencies, and inconsistencies across time at the same pose are referred to as temporal inconsistencies.
Figure 3
Figure 3: Method overview. Given a generated frame at temporal index t and pose index s, each 4D Gaussian at c = (x, y, z) is projected onto the planes of the Spatio-Temporal Distortion Field to obtain deformation features, which are then decoded by a small MLP to produce the deformation values. We use separate photometric losses for real and generated frames, and additionally introduce regularization terms on pose, …
Figure 4
Figure 4: Qualitative comparisons of different methods on the Technicolor [26], Neural 3D Video [13], and Nvidia Dynamic Scenes [46] datasets. We conduct comparisons with representative dynamic scene reconstruction methods: MonoFusion [35], 4DGS [38], 4D-Rotor [5], and Realtime4DGS [43]. MonoFusion∗ is our reproduced version. Our method significantly outperforms other baselines, producing visually reliable results with …
Figure 5
Figure 5: Spatio-Temporal Consistency. Rendering results (top) and space-time slices (bottom), constructed by concatenating the red pixel locations across all time steps, demonstrate that direct reconstruction from diffusion observations leads to severe blur and temporal instability (e.g., the moving hand at the bottom right).
Figure 7
Figure 7: Spatio-Temporal consistency on ReCamMaster (left). Heatmaps highlight regions with pronounced distortions in the generated observations, such as facial features and the wine bottle. This illustrates that, in the diffusion-generated observations along the trajectory, different contents undergo varying degrees of distortion, which relates to how the diffusion model perceives the physical world. …
Figure 8
Figure 8: Visualization of selected training cameras. The non-red frustums denote the cameras used for training, corresponding to the images shown at the bottom and on the right respectively. The remaining red frustums indicate the cameras used for evaluation. We construct a minimal subset of cameras Cs from the original set C (usually containing 12-21 cameras) to serve as the training views. The subset Cs is requi…
Figure 9
Figure 9: Reliability of PSNR vs. LPIPS in Ablation Experiments. We find that directly using generated-view reconstruction introduces severe oversmoothness that, paradoxically, favors PSNR computation, preventing PSNR from accurately reflecting changes in reconstruction quality. In contrast, LPIPS effectively captures the variations in reconstruction quality under these ablation settings.
Figure 10
Figure 10: Visualization of the point-cloud rendering condition, generation prior, and final results.
Figure 11
Figure 11: Visualization of one failure case. From left to right are the source image and point-cloud–rendering condition provided to ViewCrafter, the corresponding generated image, and a Gaussian rendering near that generated image. Since ViewCrafter is primarily trained on scene-centric data with few human subjects, this example suffers from an out-of-domain issue. In addition, the quality of the point-cloud–rende…
Figure 12
Figure 12: Visualization of Spatio-Temporal Distortion Field.
Figure 13
Figure 13: Additional qualitative comparison results on Technicolor Dataset. We highlight the geometric completeness of background structures, bottom decorations, and the red bridge across different viewpoints, as well as the temporal consistency of dynamic regions such as human faces, the ‘Happy Birthday’ text, and the toy train.
Figure 14
Figure 14: Additional qualitative comparison results on Neural 3D Video Dataset (part I).
Figure 15
Figure 15: Additional qualitative comparison results on Neural 3D Video Dataset (part II).
Figure 16
Figure 16: Additional qualitative comparison results on Neural 3D Video Dataset (part III).
Figure 17
Figure 17: Additional qualitative comparison results on Neural 3D Video Dataset (part IV).
Figure 18
Figure 18: Additional qualitative comparison results on Nvidia Dynamic Scenes Dataset (part I). In the top panel, we emphasize the geometric completeness of the left archway and the right escalator across different viewpoints, as well as the temporal consistency of dynamic regions such as the human body, balloon, and the blue tether. In the bottom panel, we highlight the fine details of background grass, shoes, and …
Figure 19
Figure 19: Additional qualitative comparison results on Nvidia Dynamic Scenes Dataset (part II).
read the original abstract

High-quality 4D reconstruction enables photorealistic and immersive rendering of the dynamic real world. However, unlike static scenes that can be fully captured with a single camera, high-quality dynamic scenes typically require dense arrays of tens or even hundreds of synchronized cameras. Dependence on such costly lab setups severely limits practical scalability. To this end, we propose a sparse-camera dynamic reconstruction framework that exploits abundant yet inconsistent generative observations. Our key innovation is the Spatio-Temporal Distortion Field, which provides a unified mechanism for modeling inconsistencies in generative observations across both spatial and temporal dimensions. Building on this, we develop a complete pipeline that enables 4D reconstruction from sparse and uncalibrated camera inputs. We evaluate our method on multi-camera dynamic scene benchmarks, achieving spatio-temporally consistent high-fidelity renderings and significantly outperforming existing approaches. Project page available at https://inspatio.github.io/sparse-cam4d/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes SparseCam4D, a framework for high-quality 4D reconstruction of dynamic scenes from sparse and uncalibrated cameras by exploiting inconsistent generative observations. Its key innovation is the Spatio-Temporal Distortion Field, a unified model for correcting spatial and temporal inconsistencies in these observations, which is integrated into a complete pipeline for consistent 4D output. The method is evaluated on multi-camera dynamic scene benchmarks and reported to outperform existing approaches in producing spatio-temporally consistent high-fidelity renderings.

Significance. If the central claim holds, the work would be significant for practical 4D reconstruction by reducing dependence on expensive dense synchronized camera arrays. The Spatio-Temporal Distortion Field offers a potentially general mechanism for handling generative observation noise, which could extend to other sparse-view dynamic modeling tasks if supported by rigorous sparse-input validation.

major comments (1)
  1. [Experiments] Experiments section (as summarized in the abstract): the central claim is that the Spatio-Temporal Distortion Field enables high-fidelity 4D reconstruction specifically from sparse, uncalibrated cameras. However, evaluation is reported only on standard 'multi-camera dynamic scene benchmarks' with no camera-count ablations, no results for 2–5 camera subsets, and no analysis of performance degradation as view density decreases. Standard benchmarks typically employ 10–20+ views, so the reported outperformance may derive from residual overlap rather than the proposed correction mechanism, leaving the sparse-input claim unverified.
minor comments (2)
  1. [Abstract] Abstract: the claim of 'significantly outperforming existing approaches' is stated without any quantitative metrics, error bars, or ablation details, which weakens the ability to assess result strength from the summary alone.
  2. [Abstract] Abstract and methods: the optimization procedure for the Spatio-Temporal Distortion Field (e.g., loss terms, training schedule, or how it interacts with the 4D reconstruction pipeline) is not described, even at a high level.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We agree that the sparse-input claim requires stronger experimental support and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Experiments] Experiments section (as summarized in the abstract): the central claim is that the Spatio-Temporal Distortion Field enables high-fidelity 4D reconstruction specifically from sparse, uncalibrated cameras. However, evaluation is reported only on standard 'multi-camera dynamic scene benchmarks' with no camera-count ablations, no results for 2–5 camera subsets, and no analysis of performance degradation as view density decreases. Standard benchmarks typically employ 10–20+ views, so the reported outperformance may derive from residual overlap rather than the proposed correction mechanism, leaving the sparse-input claim unverified.

    Authors: We acknowledge the validity of this observation. The current experiments follow the standard protocol of the cited multi-camera dynamic scene benchmarks to enable direct comparison with prior work. However, these benchmarks do not explicitly isolate performance under 2–5 view regimes. To address this, we will add a dedicated camera-count ablation study in the revised manuscript, reporting quantitative metrics (PSNR, SSIM, LPIPS, and temporal consistency) for subsets of 2, 3, 4, and 5 cameras drawn from the same benchmark sequences. We will also include a plot showing performance degradation as view density decreases and discuss how the Spatio-Temporal Distortion Field mitigates inconsistencies even in these low-view regimes. This addition will directly verify the sparse-input claim. revision: yes
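
A rough sketch of how the promised camera-count sweep could be scripted. load_views and train_and_render below are hypothetical placeholders standing in for the paper's actual data loader and training pipeline, and only PSNR and LPIPS are shown; nothing here reflects the authors' real evaluation code.

# Hedged sketch of a camera-count ablation sweep over 2-5 training views.
import torch
import lpips  # pip install lpips

def psnr(pred, gt):
    # pred, gt: (3, H, W) tensors in [0, 1].
    mse = torch.mean((pred - gt) ** 2)
    return -10.0 * torch.log10(mse.clamp_min(1e-10))

def camera_count_ablation(scene, load_views, train_and_render, counts=(2, 3, 4, 5)):
    # load_views(scene, n) -> training view subset                     (hypothetical)
    # train_and_render(scene, views) -> (renders, gts) of (3, H, W)    (hypothetical)
    lpips_fn = lpips.LPIPS(net="alex")
    results = {}
    for n in counts:
        renders, gts = train_and_render(scene, load_views(scene, n))
        results[n] = {
            "psnr": float(torch.stack([psnr(r, g) for r, g in zip(renders, gts)]).mean()),
            # LPIPS expects NCHW inputs scaled to [-1, 1].
            "lpips": float(torch.stack([
                lpips_fn(r[None] * 2 - 1, g[None] * 2 - 1) for r, g in zip(renders, gts)
            ]).mean()),
        }
    return results  # e.g. {2: {...}, 3: {...}, ...} for the degradation plot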

Circularity Check

0 steps flagged

No circularity: Spatio-Temporal Distortion Field introduced as independent modeling construct

full rationale

The paper's central claim introduces the Spatio-Temporal Distortion Field as a new unified mechanism for correcting inconsistencies in generative observations, without any quoted equations or steps that reduce the field definition to a reparameterization of the input data itself. No self-citations are invoked as load-bearing uniqueness theorems, no fitted parameters are relabeled as predictions, and no ansatz is smuggled via prior work. The derivation chain remains self-contained because the field is presented as an additive modeling tool whose parameters are learned from the observations rather than presupposed by them. Evaluation concerns about camera sparsity are separate from circularity and do not affect the score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the effectiveness of a newly postulated Spatio-Temporal Distortion Field whose ability to unify inconsistent generative observations is asserted without independent external validation beyond the reported benchmark improvements.

axioms (1)
  • domain assumption: Generative observations supply abundant data whose spatial and temporal inconsistencies can be corrected by a single learned distortion field.
    This is the load-bearing premise stated in the abstract as the key innovation enabling the pipeline.
invented entities (1)
  • Spatio-Temporal Distortion Field (no independent evidence)
    purpose: Unified modeling of inconsistencies in generative observations across space and time
    New entity introduced by the paper; no independent falsifiable evidence provided in the abstract.

pith-pipeline@v0.9.0 · 5477 in / 1328 out tokens · 51509 ms · 2026-05-14T23:38:03.062842+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 4 internal anchors

  1. [1]

    Hyperreel: High-fidelity 6-dof video with ray-conditioned sampling

    Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Zollhoefer, Johannes Kopf, Matthew O’Toole, and Changil Kim. Hyperreel: High-fidelity 6-dof video with ray-conditioned sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16610–16620, 2023.

  2. [2]

    Ac3d: Analyzing and improving 3d camera control in video diffusion transformers

    Sherwin Bahmani, Ivan Skorokhodov, Guocheng Qian, Aliaksandr Siarohin, Willi Menapace, Andrea Tagliasacchi, David B Lindell, and Sergey Tulyakov. Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22875–22889, 2025.

  3. [3]

    Recammaster: Camera-controlled generative rendering from a single video

    Jianhong Bai, Menghan Xia, Xiao Fu, Xintao Wang, Lianrui Mu, Jinwen Cao, Zuozhu Liu, Haoji Hu, Xiang Bai, Pengfei Wan, et al. Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647, 2025.

  4. [4]

    Hexplane: A fast representation for dynamic scenes

    Ang Cao and Justin Johnson. Hexplane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.

  5. [5]

    4d-rotor gaussian splatting: towards efficient novel view synthesis for dynamic scenes

    Yuanxing Duan, Fangyin Wei, Qiyu Dai, Yuhang He, Wenzheng Chen, and Baoquan Chen. 4d-rotor gaussian splatting: towards efficient novel view synthesis for dynamic scenes. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  6. [6]

    Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds

    Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2(3):4, 2024.

  7. [7]

    K-planes: Explicit radiance fields in space, time, and appearance

    Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.

  8. [8]

    CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101, 2024.

  9. [9]

    Gauhuman: Articulated gaussian splatting from monocular human videos

    Shoukang Hu, Tao Hu, and Ziwei Liu. Gauhuman: Articulated gaussian splatting from monocular human videos. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20418–20431, 2024.

  10. [10]

    Diffuman4d: 4d consistent human view synthesis from sparse-view videos with spatio-temporal diffusion models

    Yudong Jin, Sida Peng, Xuan Wang, Tao Xie, Zhen Xu, Yifan Yang, Yujun Shen, Hujun Bao, and Xiaowei Zhou. Diffuman4d: 4d consistent human view synthesis from sparse-view videos with spatio-temporal diffusion models. arXiv preprint arXiv:2507.13344, 2025.

  11. [11]

    Frnerf: Fusion and regularization fields for dynamic view synthesis

    Xinyi Jing, Tao Yu, Renyuan He, Yu-Kun Lai, and Kun Li. Frnerf: Fusion and regularization fields for dynamic view synthesis. Computational Visual Media, 2025.

  12. [12]

    Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds

    Jiahui Lei, Yijia Weng, Adam Harley, Leonidas Guibas, and Kostas Daniilidis. Mosca: Dynamic gaussian fusion from casual videos via 4d motion scaffolds. arXiv preprint arXiv:2405.17421, 2024.

  13. [13]

    Neural 3d video synthesis from multi-view video

    Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3d video synthesis from multi-view video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5521–5531, 2022.

  14. [14]

    Spacetime gaussian feature splatting for real-time dynamic view synthesis

    Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime gaussian feature splatting for real-time dynamic view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024.

  15. [15]

    Himor: Monocular deformable gaussian reconstruction with hierarchical motion representation

    Yiming Liang, Tianhan Xu, and Yuta Kikuchi. Himor: Monocular deformable gaussian reconstruction with hierarchical motion representation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 886–895, 2025.

  16. [16]

    Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle

    Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21136–21145, 2024.

  17. [17]

    Sherpa3d: Boosting high-fidelity text-to-3d generation via coarse 3d prior

    Fangfu Liu, Diankun Wu, Yi Wei, Yongming Rao, and Yueqi Duan. Sherpa3d: Boosting high-fidelity text-to-3d generation via coarse 3d prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20763–20774, 2024.

  18. [18]

    Modgs: Dynamic gaussian splatting from casually-captured monocular videos with depth priors

    Qingming Liu, Yuan Liu, Jiepeng Wang, Xianqiang Lyv, Peng Wang, Wenping Wang, and Junhui Hou. Modgs: Dynamic gaussian splatting from casually-captured monocular videos with depth priors, 2025.

  19. [19]

    Wonder3d: Single image to 3d using cross-domain diffusion

    Xiaoxiao Long, Yuan-Chen Guo, Cheng Lin, Yuan Liu, Zhiyang Dou, Lingjie Liu, Yuexin Ma, Song-Hai Zhang, Marc Habermann, Christian Theobalt, et al. Wonder3d: Single image to 3d using cross-domain diffusion. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9970–9980, 2024.

  20. [20]

    Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time

    Richard A Newcombe, Dieter Fox, and Steven M Seitz. Dynamicfusion: Reconstruction and tracking of non-rigid scenes in real-time. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 343–352, 2015.

  21. [21]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  22. [22]

    Splinegs: Robust motion-adaptive spline for real-time dynamic 3d gaussians from monocular video

    Jongmin Park, Minh-Quan Viet Bui, Juan Luis Gonzalez Bello, Jaeho Moon, Jihyong Oh, and Munchurl Kim. Splinegs: Robust motion-adaptive spline for real-time dynamic 3d gaussians from monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26866–26875, 2025.

  23. [23]

    Pytorch: An imperative style, high-performance deep learning library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

  24. [24]

    Animatable neural radiance fields for modeling dynamic human bodies

    Sida Peng, Junting Dong, Qianqian Wang, Shangzhan Zhang, Qing Shuai, Xiaowei Zhou, and Hujun Bao. Animatable neural radiance fields for modeling dynamic human bodies. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14314–14323, 2021.

  25. [25]

    Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans

    Sida Peng, Yuanqing Zhang, Yinghao Xu, Qianqian Wang, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Neural body: Implicit neural representations with structured latent codes for novel view synthesis of dynamic humans. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9054–9063, 2021.

  26. [26]

    Dataset and pipeline for multi-view light-field video

    Neus Sabater, Guillaume Boisson, Benoit Vandame, Paul Kerbiriou, Frederic Babon, Matthieu Hog, Remy Gendrot, Tristan Langlois, Olivier Bureller, Arno Schubert, et al. Dataset and pipeline for multi-view light-field video. In Proceedings of the IEEE conference on computer vision and pattern recognition Workshops, pages 30–40, 2017.

  27. [27]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  28. [28]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.

  29. [29]

    MVDream: Multi-view Diffusion for 3D Generation

    Yichun Shi, Peng Wang, Jianglong Ye, Mai Long, Kejie Li, and Xiao Yang. Mvdream: Multi-view diffusion for 3d generation. arXiv preprint arXiv:2308.16512, 2023.

  30. [30]

    Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion

    Wenqiang Sun, Shuo Chen, Fangfu Liu, Zilong Chen, Yueqi Duan, Jun Zhang, and Yikai Wang. Dimensionx: Create any 3d and 4d scenes from a single image with controllable video diffusion. arXiv preprint arXiv:2411.04928, 2024.

  31. [31]

    4real-video-v2: Fused view-time attention and feedforward reconstruction for 4d scene generation

    Chaoyang Wang, Ashkan Mirzaei, Vidit Goel, Willi Menapace, Aliaksandr Siarohin, Avalon Vinella, Michael Vasilkovsky, Ivan Skorokhodov, Vladislav Shakhrai, Sergey Korolev, et al. 4real-video-v2: Fused view-time attention and feedforward reconstruction for 4d scene generation. arXiv preprint arXiv:2506.18839, 2025.

  32. [32]

    Shape of motion: 4d reconstruction from a single video

    Qianqian Wang, Vickie Ye, Hang Gao, Jake Austin, Zhengqi Li, and Angjoo Kanazawa. Shape of motion: 4d reconstruction from a single video, 2024.

  33. [33]

    Image quality assessment: from error visibility to structural similarity

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.

  34. [34]

    Motionctrl: A unified and flexible motion controller for video generation

    Zhouxia Wang, Ziyang Yuan, Xintao Wang, Yaowei Li, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation. In ACM SIGGRAPH 2024 Conference Papers, pages 1–11, 2024.

  35. [35]

    Monofusion: Sparse-view 4d reconstruction via monocular fusion

    Zihan Wang, Jeff Tan, Tarasha Khurana, Neehar Peri, and Deva Ramanan. Monofusion: Sparse-view 4d reconstruction via monocular fusion. arXiv preprint arXiv:2507.23782, 2025.

  36. [36]

    Humannerf: Free-viewpoint rendering of moving people from monocular video

    Chung-Yi Weng, Brian Curless, Pratul P Srinivasan, Jonathan T Barron, and Ira Kemelmacher-Shlizerman. Humannerf: Free-viewpoint rendering of moving people from monocular video. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16210–16220, 2022.

  37. [37]

    4d-fly: Fast 4d reconstruction from a single monocular video

    Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yue Qian, Xiaohang Zhan, and Yueqi Duan. 4d-fly: Fast 4d reconstruction from a single monocular video. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16663–16673, 2025.

  38. [38]

    4d gaussian splatting for real-time dynamic scene rendering

    Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20310–20320, 2024.

  39. [39]

    Unique3d: High-quality and efficient 3d mesh generation from a single image

    Kailu Wu, Fangfu Liu, Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, and Kaisheng Ma. Unique3d: High-quality and efficient 3d mesh generation from a single image. Advances in Neural Information Processing Systems, 37:125116–125141, 2024.

  40. [40]

    Cat4d: Create anything in 4d with multi-view video diffusion models

    Rundi Wu, Ruiqi Gao, Ben Poole, Alex Trevithick, Changxi Zheng, Jonathan T Barron, and Aleksander Holynski. Cat4d: Create anything in 4d with multi-view video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26057–26068, 2025.

  41. [41]

    Recent advances in 3d gaussian splatting

    Tong Wu, Yu-Jie Yuan, Ling-Xiao Zhang, Jie Yang, Yan-Pei Cao, Ling-Qi Yan, and Lin Gao. Recent advances in 3d gaussian splatting. Computational Visual Media, 10(4):613–642, 2024.

  42. [42]

    Camco: Camera-controllable 3d-consistent image-to-video generation

    Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera-controllable 3d-consistent image-to-video generation. arXiv preprint arXiv:2406.02509, 2024.

  43. [43]

    Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting

    Zeyu Yang, Hongye Yang, Zijie Pan, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.

  44. [44]

    Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction

    Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20331–20341, 2024.

  45. [45]

    Dreamreward: Text-to-3d generation with human preference

    Junliang Ye, Fangfu Liu, Qixiu Li, Zhengyi Wang, Yikai Wang, Xinzhou Wang, Yueqi Duan, and Jun Zhu. Dreamreward: Text-to-3d generation with human preference. In European Conference on Computer Vision, pages 259–276. Springer, 2024.

  46. [46]

    Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera

    Jae Shin Yoon, Kihwan Kim, Orazio Gallo, Hyun Soo Park, and Jan Kautz. Novel view synthesis of dynamic scenes with globally coherent depths from a monocular camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5336–5345, 2020.

  47. [47]

    Sparse-view 3d reconstruction: Recent advances and open challenges

    Tanveer Younis and Zhanglin Cheng. Sparse-view 3d reconstruction: Recent advances and open challenges, 2025.

  48. [48]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Wangbo Yu, Jinbo Xing, Li Yuan, Wenbo Hu, Xiaoyu Li, Zhipeng Huang, Xiangjun Gao, Tien-Tsin Wong, Ying Shan, and Yonghong Tian. Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048, 2024.

  49. [49]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.

  50. [50]

    To assess the versatility of our method, we did not apply dataset-specific tuning; instead, a unified training schedule was adopted across all datasets

    Implementation Details Our framework is implemented in PyTorch, using the RealTime4DGS codebase as the foundation. To assess the versatility of our method, we did not apply dataset-specific tuning; instead, a unified training schedule was adopted across all datasets. For the optimization of 4D Gaussians, we strictly follow the official RealTime4DGS ...

  51. [51]

    This choice is motivated by two considerations

    Metrics selection in Ablation Study In the ablation study, SSIM and LPIPS are adopted as the primary evaluation metrics instead of PSNR. This choice is motivated by two considerations. First, SSIM and LPIPS are more sensitive to structural details, with LPIPS in particular capturing perceptual differences in texture, sharpness, and local geometry, where...

  52. [52]

    Evaluation under Different Sparsity Levels

    Additional Evaluation Results 8.1. Evaluation under Different Sparsity Levels. We compare our method with the baseline 4DGS approach, i.e. 4DGaussian, under different sparsity levels, and the results are reported in Tab. 5. Our method outperforms the baseline across all camera-view configurations, further demonstrating its practical robustness. Table 5. Pe...

  53. [53]

    ViewCrafter conditions its diffusion model on point-cloud renderings obtained from Dust3R, given two input images along with the target trajectory

    More Details for Generated Images In our main experiments, we adopt ViewCrafter to provide additional observations, from which 20–25 generated views are uniformly sampled for training. ViewCrafter conditions its diffusion model on point-cloud renderings obtained from Dust3R, given two input images along with the target trajectory. However, we find that ...

  54. [54]

    More visualization To gain an intuitive understanding of the STDF training results, we visualize the full feature maps (16 channels) of each plane in the CoffeMartini scene in Fig. 12. The activated regions vary across different dimensions, indicating the intertwined nature of spatial and temporal deformations. To further interpret the outputs, we ren...