pith. machine review for the scientific record.

arxiv: 2604.13746 · v1 · submitted 2026-04-15 · 💻 cs.CV

Recognition: unknown

ClipGStream: Clip-Stream Gaussian Splatting for Any Length and Any Motion Multi-View Dynamic Scene Reconstruction

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 13:39 UTC · model grok-4.3

classification 💻 cs.CV
keywords dynamic scene reconstruction · Gaussian splatting · multi-view video · temporal coherence · clip-based optimization · 3D dynamic scenes · memory-efficient reconstruction · long video 3D

The pith

ClipGStream divides long multi-view dynamic videos into short clips and optimizes Gaussian splatting with inherited anchors to achieve scalable, flicker-free 3D reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dynamic 3D scene reconstruction from multi-view videos struggles with long sequences that contain large motion: frame-by-frame methods scale but flicker, while full-clip methods stay consistent but demand too much memory. The paper introduces a hybrid Clip-Stream framework that splits the input into short clips, models local motion with clip-independent spatio-temporal fields and residual anchor compensation, and carries structural information forward via inherited anchors and decoders between clips. This design targets high temporal coherence across arbitrary lengths while cutting memory use. A reader would care because it makes extended immersive content practical for VR, MR, and XR without the previous trade-offs in stability or resources.

Core claim

ClipGStream performs stream optimization at the clip level by modeling dynamic motion through clip-independent spatio-temporal fields and residual anchor compensation for local variations, while inter-clip inherited anchors and decoders maintain structural consistency across the full sequence, enabling scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead.

What carries the argument

Clip-Stream Gaussian Splatting, which combines clip-level spatio-temporal fields and residual anchors with inter-clip inherited anchors to balance local motion capture and global consistency.
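
As a control-flow illustration of that combination, here is a minimal Python sketch. This is Pith's reading, not the authors' code: fit_clip, the array shapes, and the 50-frame clip length are hypothetical stand-ins. Only the inheritance pattern, in which converged anchors and decoder seed the next clip while residual offsets stay clip-local, is taken from the abstract.

    import numpy as np

    def fit_clip(clip, anchors, decoder):
        """Hypothetical per-clip optimization. Returns the fine-tuned global
        state plus clip-local residual anchor offsets; placeholder noise here
        stands in for gradient descent on a rendering loss."""
        offsets = 0.01 * np.random.randn(*anchors.shape)              # residual anchor compensation
        decoder = decoder + 0.001 * np.random.randn(*decoder.shape)   # inherited, lightly updated
        return anchors, decoder, offsets

    def clip_stream(frames, clip_len=50, n_anchors=1000):
        clips = [frames[i:i + clip_len] for i in range(0, len(frames), clip_len)]
        anchors = np.random.randn(n_anchors, 3)    # would come from an SfM point cloud
        decoder = np.zeros(128)                    # stand-in decoder parameters
        for k, clip in enumerate(clips):
            # Clip-local quantities are discarded after each clip, so peak
            # memory tracks clip_len rather than total sequence length.
            anchors, decoder, offsets = fit_clip(clip, anchors, decoder)
            yield k, anchors + offsets             # compensated anchors render clip k
            # only anchors and decoder cross the boundary into clip k + 1

On a 1,400-frame input like the paper's Long 360 sequence, this loop would visit 28 clips of 50 frames each, and the only state crossing a boundary is the anchor array and the decoder parameters.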

If this is right

  • Reconstruction of videos of any length becomes feasible without memory growing linearly with duration.
  • Temporal coherence improves enough to eliminate flicker in scenes with large or complex motion.
  • Memory overhead drops relative to methods that optimize entire clips or full sequences at once.
  • State-of-the-art reconstruction quality holds while processing efficiency rises on multi-view dynamic data.
  • The approach handles arbitrary motion patterns by localizing optimization within each clip.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The clip-boundary inheritance mechanism could be extended to streaming scenarios where new clips arrive continuously.
  • Combining this with other Gaussian variants might allow hybrid static-dynamic scene handling without full re-optimization.
  • Experiments on sequences longer than those tested could reveal practical upper limits on clip size choices.

Load-bearing premise

That splitting the sequence into clips and passing anchors and decoders between them preserves full structural consistency without losing local motion details or creating new boundary artifacts.

What would settle it

Reconstructed long sequences that display visible flickering, discontinuities at clip boundaries, or peak memory that grows with video length instead of staying near the per-clip cost.
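
One concrete probe for the memory half of that test, sketched in Python. This is an evaluation idea, not the paper's protocol; reconstruct is a hypothetical entry point standing in for the full pipeline.

    import torch

    def peak_memory_mb(reconstruct, frames):
        # Reset the allocator's high-water mark, run the pipeline on a
        # prefix of the sequence, and report peak GPU memory in MiB.
        torch.cuda.reset_peak_memory_stats()
        reconstruct(frames)                        # hypothetical pipeline call
        return torch.cuda.max_memory_allocated() / 2**20

    # If the load-bearing premise holds, these peaks should plateau near the
    # per-clip cost instead of growing with prefix length:
    # for n in (100, 400, 1600):
    #     print(n, peak_memory_mb(reconstruct, frames[:n]))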

Figures

Figures reproduced from arXiv: 2604.13746 by Chao Wang, Feng Gao, Jiahao Wu, Jiayu Yang, Jie Liang, Jinbo Yan, Kaiqiang Xiong, Ronggang Wang, Xiaoyun Zheng, Zhanke Wang.

Figure 1. ClipGStream enables scalable and temporally stable dynamic scene reconstruction for sequences of any length and any motion.

Figure 2. Our training process is divided into two stages: the…

Figure 3. (a) Static features fs learn all background information, so sharing fs across clips ensures temporal consistency. (b) Dynamic features learn residual information that controls the visibility of dynamic content, and are thus clip-independent.

Figure 5. Ablation study on the Residual Anchors Compensation Module (RAC) and the Anchors Inheritance Module (AI). As seen from the residual heatmaps between adjacent clips, removing either module leads to strong responses in static regions, as shown in (a)(b)(c), while enabling them, as illustrated in (d), significantly suppresses flicker and preserves smooth clip transitions, which demonstrates that both com…

Figure 6. Qualitative results on Long 360 (1,400 frames; extreme motion amplitudes and long-sequence challenges) and VRU GZ [31] (complex dynamic interactions). Compared to 4DGaussian [29] and LocalDyGS [31], our method produces sharper renderings in both dynamic regions (e.g., athletes) and static areas (e.g., court floor), with more stable and temporally coherent reconstructions. Additional results are provided in…

Figure 7. Qualitative results of flame salmon from the N3DV dataset [14], a fine-scale motion sequence. Our method outperforms SOTA approaches (Grid4D [42], LocalDyGS [31]) by better preserving details in dynamic regions such as the dog's face and flames.

Figure 8. Ablation study on decoder inheritance. (a) Without inheriting the decoder, the rendered image exhibits noticeable blurriness. In contrast, employing decoder inheritance yields clear details, as visualized in (b).
read the original abstract

Dynamic 3D scene reconstruction is essential for immersive media such as VR, MR, and XR, yet remains challenging for long multi-view sequences with large-scale motion. Existing dynamic Gaussian approaches are either Frame-Stream, offering scalability but poor temporal stability, or Clip, achieving local consistency at the cost of high memory and limited sequence length. We propose ClipGStream, a hybrid reconstruction framework that performs stream optimization at the clip level rather than the frame level. The sequence is divided into short clips, where dynamic motion is modeled using clip-independent spatio-temporal fields and residual anchor compensation to capture local variations efficiently, while inter-clip inherited anchors and decoders maintain structural consistency across clips. This Clip-Stream design enables scalable, flicker-free reconstruction of long dynamic videos with high temporal coherence and reduced memory overhead. Extensive experiments demonstrate that ClipGStream achieves state-of-the-art reconstruction quality and efficiency. The project page is available at: https://liangjie1999.github.io/ClipGStreamWeb/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes ClipGStream, a hybrid Clip-Stream Gaussian Splatting framework for reconstructing long multi-view dynamic 3D scenes. Sequences are partitioned into short clips; within each clip, motion is modeled via clip-independent spatio-temporal fields and residual anchor compensation, while inter-clip inherited anchors and decoders are used to propagate structural consistency. The design is claimed to deliver scalable, flicker-free reconstruction of arbitrary-length videos with large motions, high temporal coherence, and lower memory footprint than prior clip-based methods, with experiments asserted to demonstrate state-of-the-art quality and efficiency.

Significance. If the central claims are substantiated, the work would meaningfully advance dynamic scene reconstruction by offering a practical middle ground between memory-efficient but temporally unstable frame-stream methods and locally consistent but length-limited clip methods. Successful validation of drift-free inheritance across clips could enable reliable long-sequence modeling for VR/MR/XR applications.

major comments (3)
  1. [Abstract and §3] The inter-clip inheritance of anchors and decoders is presented only at a high level as the mechanism for global consistency and flicker-free output. No explicit alignment loss, shared canonical space, drift-correction term, or error-propagation bound is defined, leaving open whether residual compensation remains purely local and therefore permits cumulative misalignment on long sequences with large motions. This directly underpins the central 'flicker-free' and 'high temporal coherence' guarantees.
  2. [§4] The abstract asserts state-of-the-art reconstruction quality and efficiency, yet the provided text supplies no quantitative metrics, baseline comparisons, ablation studies on clip length or inheritance, or error analysis across clip boundaries. Without these, the empirical support for the Clip-Stream design cannot be evaluated.
  3. [§3.1–3.2] The claim that clip-independent spatio-temporal fields plus residual anchor compensation capture local variations 'efficiently' while inheritance preserves global structure requires a concrete formulation (e.g., the precise definition of the residual term and how inherited decoders are initialized or updated; one illustrative shape is sketched at the end of this report) to confirm it does not trade one form of artifact for another at clip transitions.
minor comments (2)
  1. [§3] Notation for 'clip-independent spatio-temporal fields' and 'residual anchor compensation' should be introduced with explicit equations or pseudocode on first use to improve readability.
  2. [Abstract] The project page URL is given but no supplementary video or long-sequence qualitative results are referenced in the text; adding these would help illustrate the claimed temporal coherence.
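
To make major comment 3 concrete, one illustrative shape for residual anchor compensation (Pith's notation, not the paper's) would place each anchor of clip k at its position inherited from clip k−1 plus a clip-local offset, with only the offset and a lightly fine-tuned decoder optimized:

    x_i^{(k)} = \hat{x}_i^{(k-1)} + \Delta x_i^{(k)}, \qquad
    \min_{\Delta x^{(k)},\,\theta_k} \; \mathcal{L}_{\mathrm{render}}\!\left(\Delta x^{(k)}, \theta_k\right) + \lambda \left\lVert \Delta x^{(k)} \right\rVert_2^2

Whether the paper's residual term actually takes this form, and how any regularization weight \lambda is chosen, is what the report asks the authors to state explicitly.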

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the current presentation of the inter-clip inheritance mechanism and the experimental validation require expansion to fully support the central claims. We will revise the manuscript to provide more explicit formulations, quantitative results, and analyses. Below we respond to each major comment.

read point-by-point responses
  1. Referee: [Abstract and §3] The inter-clip inheritance of anchors and decoders is presented only at a high level as the mechanism for global consistency and flicker-free output. No explicit alignment loss, shared canonical space, drift-correction term, or error-propagation bound is defined, leaving open whether residual compensation remains purely local and therefore permits cumulative misalignment on long sequences with large motions. This directly underpins the central 'flicker-free' and 'high temporal coherence' guarantees.

    Authors: We acknowledge that the abstract and §3 describe the inheritance at a high level. In the method, anchors and decoder parameters optimized at the end of one clip are directly inherited as initialization for the subsequent clip, while residual anchor compensation optimizes only local deviations within the current clip. This design aims to propagate global structure while allowing efficient local motion modeling. However, the comment correctly identifies the absence of an explicit alignment loss or error-propagation bound. In the revision we will expand §3 with a precise description of the inheritance process, the residual term definition, and a discussion of temporal coherence at clip boundaries (including qualitative evidence from our results). We will also consider adding a lightweight consistency regularizer if it improves stability without harming efficiency. revision: yes

  2. Referee: [§4] The abstract asserts state-of-the-art reconstruction quality and efficiency, yet the provided text supplies no quantitative metrics, baseline comparisons, ablation studies on clip length or inheritance, or error analysis across clip boundaries. Without these, the empirical support for the Clip-Stream design cannot be evaluated.

    Authors: The full experiments section contains quantitative comparisons (PSNR, SSIM, LPIPS) against prior dynamic Gaussian methods on multiple multi-view datasets, together with runtime and memory measurements. We agree, however, that ablations specifically on clip length, the effect of inheritance, and error metrics at clip boundaries are not presented with sufficient detail or tables. We will add these elements in the revised §4, including new ablation tables, boundary-specific error plots (one such analysis is sketched after these responses), and direct comparisons that isolate the contribution of the Clip-Stream components. revision: yes

  3. Referee: [§3.1–3.2] The claim that clip-independent spatio-temporal fields plus residual anchor compensation capture local variations 'efficiently' while inheritance preserves global structure requires a concrete formulation (e.g., the precise definition of the residual term and how inherited decoders are initialized or updated) to confirm it does not trade one form of artifact for another at clip transitions.

    Authors: The spatio-temporal fields are modeled independently per clip, and the residual compensation is realized by optimizing anchor offsets relative to the inherited anchors from the prior clip. Inherited decoders are initialized with the converged parameters of the previous clip and then fine-tuned at a lower learning rate. We will make these definitions explicit with equations in the revised §3.1–3.2, clarify the initialization and update rules, and add discussion showing that the design avoids introducing new transition artifacts (supported by our visual results across clip boundaries). revision: yes
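
The initialization and update rule described in response 3 maps onto a standard PyTorch idiom; a minimal sketch follows (Pith's illustration, with hypothetical names and placeholder learning rates rather than the paper's values).

    import copy
    import torch

    def start_new_clip(prev_decoder, anchors, lr_residual=1e-2, lr_inherit=1e-3):
        decoder = copy.deepcopy(prev_decoder)      # inherit the converged weights
        offsets = torch.zeros_like(anchors, requires_grad=True)   # residual anchor offsets
        optimizer = torch.optim.Adam([
            {"params": decoder.parameters(), "lr": lr_inherit},   # low-LR fine-tuning
            {"params": [offsets], "lr": lr_residual},             # trained from scratch per clip
        ])
        return decoder, offsets, optimizer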
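
And for the boundary-specific error plots promised in response 2, a sketch of one such analysis (the PSNR definition is standard; the plotting protocol is Pith's assumption): per-frame PSNR over the whole sequence with clip boundaries marked, so transition flicker appears as dips aligned with boundary indices.

    import numpy as np

    def psnr(img, ref, peak=1.0):
        # Standard definition: 10 * log10(peak^2 / MSE).
        mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
        return 10.0 * np.log10(peak ** 2 / mse)

    def boundary_curve(renders, gts, clip_len):
        curve = [psnr(r, g) for r, g in zip(renders, gts)]
        boundaries = list(range(clip_len, len(curve), clip_len))
        return curve, boundaries   # plot curve with vertical lines at the boundaries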

Circularity Check

0 steps flagged

No circularity: architectural description with no self-referential equations or fitted predictions.

full rationale

The abstract and provided text describe a hybrid Clip-Stream framework that divides sequences into clips, uses clip-independent spatio-temporal fields plus residual anchor compensation for local motion, and employs inter-clip inherited anchors/decoders for consistency. No equations, parameter-fitting steps, or derivation chains are exhibited that would reduce any claimed performance metric (e.g., flicker-free reconstruction or temporal coherence) to quantities defined by the inputs themselves. The central claims are presented as consequences of the design choices rather than tautological re-statements. No self-citations, uniqueness theorems, or ansatzes are invoked in the given material. This is a standard non-circular finding for a methods paper whose contributions are algorithmic rather than derived from closed-form identities.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

Abstract-only review; the method rests on several introduced components whose grounding in prior literature cannot be audited without full text and citations.

free parameters (1)
  • clip_length
    Sequences are divided into short clips whose specific duration is chosen but not quantified in the abstract.
axioms (1)
  • domain assumption: Gaussian splatting can be extended to dynamic scenes via spatio-temporal fields and anchor mechanisms
    The framework assumes the base 3D Gaussian representation supports the added clip-level dynamic modeling.
invented entities (3)
  • clip-independent spatio-temporal fields (no independent evidence)
    purpose: Model dynamic motion within each clip
    New modeling component introduced to handle local temporal variations efficiently.
  • residual anchor compensation (no independent evidence)
    purpose: Capture local variations within clips
    Additional mechanism postulated to fit clip-specific details without full re-optimization.
  • inter-clip inherited anchors and decoders (no independent evidence)
    purpose: Maintain structural consistency across clips
    Mechanism introduced to propagate information between clips for temporal coherence.

pith-pipeline@v0.9.0 · 5504 in / 1469 out tokens · 43270 ms · 2026-05-10T13:39:18.947026+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

66 extracted references · 14 canonical work pages · 2 internal anchors

  1. [1] Philipp A. Rauschnabel, Reto Felix, Chris Hinsch, Hamza Shahab, and Florian Alt. What is XR? Towards a framework for augmented and virtual reality. Computers in Human Behavior, 133:107289, 2022.
  2. [2] Grigore C. Burdea and Philippe Coiffet. Virtual Reality Technology. John Wiley & Sons, 2003.
  3. [3] Maximilian Speicher, Brian D. Hall, and Michael Nebeling. What is mixed reality? In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2019.
  4. [4] Jonathan T. Barron, Ben Mildenhall, Matthew Tancik, Peter Hedman, Ricardo Martin-Brualla, and Pratul P. Srinivasan. Mip-NeRF: A multiscale representation for anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5855–5864, 2021.
  5. [5] Wenbo Hu, Yuling Wang, Lin Ma, Bangbang Yang, Lin Gao, Xiao Liu, and Yuewen Ma. Tri-MipRF: Tri-Mip representation for efficient anti-aliasing neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19774–19783, 2023.
  6. [6] Kaiqiang Xiong, Rui Peng, Zhe Zhang, Tianxing Feng, Jianbo Jiao, Feng Gao, and Ronggang Wang. CL-MVSNet: Unsupervised multi-view stereo with dual-level contrastive learning, 2025.
  7. [7] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Zip-NeRF: Anti-aliased grid-based neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19697–19705, 2023.
  8. [8] Anpei Chen, Zexiang Xu, Andreas Geiger, Jingyi Yu, and Hao Su. TensoRF: Tensorial radiance fields. In European Conference on Computer Vision, pages 333–350. Springer, 2022.
  9. [9] Rui Peng, Wangze Xu, Luyang Tang, Liwei Liao, Jianbo Jiao, and Ronggang Wang. Structure consistent Gaussian splatting with matching prior for few-shot novel view synthesis, 2024.
  10. [10] Yang Deng, Zhanke Wang, Jiahao Wu, Jie Liang, Jingui Ma, Yang Hu, and Ronggang Wang. Pano-GS: Perception-aware Gaussian optimization with gradient consistency and multi-criteria densification for high-quality rendering. Proceedings of the AAAI Conference on Artificial Intelligence, 40(5):3560–3568, 2026.
  11. [11] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.
  12. [12] Yili Jin, Kaiyuan Hu, Junhua Liu, Fangxin Wang, and Xue Liu. From capture to display: A survey on volumetric video. arXiv preprint arXiv:2309.05658, 2023.
  13. [13] Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, and Huaping Liu. Mixed neural voxels for fast multi-view video synthesis. arXiv preprint arXiv:2212.00190, 2022.
  14. [14] Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, et al. Neural 3D video synthesis from multi-view video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5521–5531, 2022.
  15. [15] Aayush Bansal, Minh Vo, Yaser Sheikh, Deva Ramanan, and Srinivasa Narasimhan. 4D visualization of dynamic events from unconstrained multi-view videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5366–5375, 2020.
  16. [16] Stephen Lombardi, Tomas Simon, Jason Saragih, Gabriel Schwartz, Andreas Lehrmann, and Yaser Sheikh. Neural volumes: Learning dynamic renderable volumes from images. arXiv preprint arXiv:1906.07751, 2019.
  17. [17] C. Lawrence Zitnick, Sing Bing Kang, Matthew Uyttendaele, Simon Winder, and Richard Szeliski. High-quality video view interpolation using a layered representation. ACM Transactions on Graphics (TOG), 23(3):600–608, 2004.
  18. [18] Jinbo Yan, Rui Peng, Zhiyan Wang, Luyang Tang, Jiayu Yang, Jie Liang, Jiahao Wu, and Ronggang Wang. Instant Gaussian Stream: Fast and generalizable streaming of dynamic scene reconstruction via Gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16520–16531, 2025.
  19. [19] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5501–5510, 2022.
  20. [20] Lukas Radl, Michael Steiner, Mathias Parger, Alexander Weinrauch, Bernhard Kerbl, and Markus Steinberger. StopThePop: Sorted Gaussian splatting for view-consistent real-time rendering. ACM Transactions on Graphics (TOG), 43(4):1–17, 2024.
  21. [21] Zehao Yu, Torsten Sattler, and Andreas Geiger. Gaussian opacity fields: Efficient and compact surface reconstruction in unbounded scenes. arXiv preprint arXiv:2404.10772, 2024.
  22. [22] Kerui Ren, Lihan Jiang, Tao Lu, Mulin Yu, Linning Xu, Zhangkai Ni, and Bo Dai. Octree-GS: Towards consistent real-time rendering with LOD-structured 3D Gaussians. arXiv preprint arXiv:2403.17898, 2024.
  23. [23] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian splatting for geometrically accurate radiance fields. arXiv preprint arXiv:2403.17888, 2024.
  24. [24] Baowen Zhang, Chuan Fang, Rakesh Shrestha, Yixun Liang, Xiaoxiao Long, and Ping Tan. RaDe-GS: Rasterizing depth in Gaussian splatting. arXiv preprint arXiv:2406.01467, 2024.
  25. [25] Kaiqiang Xiong, Rui Peng, Jiahao Wu, Zhanke Wang, Jie Liang, Xiaoyun Zheng, Feng Gao, and Ronggang Wang. Intrinsic geometry-appearance consistency optimization for sparse-view Gaussian splatting, 2026.
  26. [26] Jonathon Luiten, Georgios Kopanas, Bastian Leibe, and Deva Ramanan. Dynamic 3D Gaussians: Tracking by persistent dynamic view synthesis. In 3DV, 2024.
  27. [27] Jiakai Sun, Han Jiao, Guangyuan Li, Zhanjie Zhang, Lei Zhao, and Wei Xing. 3DGStream: On-the-fly training of 3D Gaussians for efficient streaming of photo-realistic free-viewpoint videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20675–20685, 2024.
  28. [28] Zeyu Yang, Hongye Yang, Zijie Pan, Xiatian Zhu, and Li Zhang. Real-time photorealistic dynamic scene representation and rendering with 4D Gaussian splatting. arXiv preprint arXiv:2310.10642, 2023.
  29. [29] Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4D Gaussian splatting for real-time dynamic scene rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20310–20320, 2024.
  30. [30] Zhan Li, Zhang Chen, Zhong Li, and Yi Xu. Spacetime Gaussian feature splatting for real-time dynamic view synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8508–8520, 2024.
  31. [31] Jiahao Wu, Rui Peng, Jianbo Jiao, Jiayu Yang, Luyang Tang, Kaiqiang Xiong, Jie Liang, Jinbo Yan, Runling Liu, and Rong Wang. LocalDyGS: Multi-view global dynamic scene modeling via adaptive local implicit feature decoupling. arXiv preprint arXiv:2507.02363, 2025.
  32. [32] Zhiwen Yan, Chen Li, and Gim Hee Lee. NeRF-DS: Neural radiance fields for dynamic specular objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8285–8295, 2023.
  33. [33] Jiahao Wu, Yunfei Liu, Lijian Lin, Ye Zhu, Lei Zhu, Jingyi Li, and Yu Li. PEAR: Pixel-aligned expressive human mesh recovery, 2026.
  34. [34] Ziyi Yang, Xinyu Gao, Wen Zhou, Shaohui Jiao, Yuqing Zhang, and Xiaogang Jin. Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20331–20341, 2024.
  35. [35] Keunhong Park, Utkarsh Sinha, Jonathan T. Barron, Sofien Bouaziz, Dan B. Goldman, Steven M. Seitz, and Ricardo Martin-Brualla. Nerfies: Deformable neural radiance fields. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5865–5874, 2021.
  36. [36] Benjamin Attal, Jia-Bin Huang, Christian Richardt, Michael Zollhoefer, Johannes Kopf, Matthew O'Toole, and Changil Kim. HyperReel: High-fidelity 6-DoF video with ray-conditioned sampling. arXiv preprint arXiv:2301.02238, 2023.
  37. [37] Liangchen Song, Anpei Chen, Zhong Li, Zhang Chen, Lele Chen, Junsong Yuan, Yi Xu, and Andreas Geiger. NeRFPlayer: A streamable dynamic scene representation with decomposed neural radiance fields. IEEE Transactions on Visualization and Computer Graphics, 29(5):2732–2742, 2023.
  38. [38] Feng Wang, Sinan Tan, Xinghang Li, Zeyue Tian, Yafei Song, and Huaping Liu. Mixed neural voxels for fast multi-view video synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19706–19716, 2023.
  39. [39] Jiahao Wu, Rui Peng, Zhiyan Wang, Lu Xiao, Luyang Tang, Jinbo Yan, Kaiqiang Xiong, and Ronggang Wang. Swift4D: Adaptive divide-and-conquer Gaussian splatting for compact and efficient reconstruction of dynamic scene.
  40. [40] Lingzhi Li, Zhen Shen, Zhongshu Wang, Li Shen, and Ping Tan. Streaming radiance fields for 3D video synthesis. Advances in Neural Information Processing Systems, 35:13485–13498, 2022.
  41. [41] Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, and Xiaowei Zhou. 4K4D: Real-time 4D view synthesis at 4K resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20029–20040, 2024.
  42. [42] Xu Jiawei, Fan Zexin, Yang Jian, and Xie Jin. Grid4D: 4D decomposed hash encoding for high-fidelity dynamic Gaussian splatting. The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
  43. [43] Sara Fridovich-Keil, Alex Yu, Matthew Tancik, Qinhong Chen, Benjamin Recht, and Angjoo Kanazawa. Plenoxels: Radiance fields without neural networks. In CVPR, 2022.
  44. [44] Luyang Tang, Jiayu Yang, Rui Peng, Yongqi Zhai, Shihe Shen, and Ronggang Wang. Compressing streamable free-viewpoint videos to 0.1 MB per frame. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 7257–7265, 2025.
  45. [45] Qiankun Gao, Jiarui Meng, Chengxiang Wen, Jie Chen, and Jian Zhang. HiCoM: Hierarchical coherent motion for dynamic streamable scenes with 3D Gaussian splatting. Advances in Neural Information Processing Systems, 37:80609–80633, 2024.
  46. [46] Sara Fridovich-Keil, Giacomo Meanti, Frederik Rahbæk Warburg, Benjamin Recht, and Angjoo Kanazawa. K-Planes: Explicit radiance fields in space, time, and appearance. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12479–12488, 2023.
  47. [47] Ang Cao and Justin Johnson. HexPlane: A fast representation for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 130–141, 2023.
  48. [48] Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural radiance fields for dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10318–10327, 2021.
  49. [49] Yi-Hua Huang, Yang-Tian Sun, Ziyi Yang, Xiaoyang Lyu, Yan-Pei Cao, and Xiaojuan Qi. SC-GS: Sparse-controlled Gaussian splatting for editable dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4220–4230, 2024.
  50. [50] Agelos Kratimenos, Jiahui Lei, and Kostas Daniilidis. DynMF: Neural motion factorization for real-time dynamic view synthesis with 3D Gaussian splatting. arXiv preprint arXiv:2312.00112, 2023.
  51. [51] Youtian Lin, Zuozhuo Dai, Siyu Zhu, and Yao Yao. Gaussian-Flow: 4D reconstruction with dynamic 3D Gaussian particle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21136–21145, 2024.
  52. [52] Quankai Gao, Qiangeng Xu, Zhe Cao, Ben Mildenhall, Wenchao Ma, Le Chen, Danhang Tang, and Ulrich Neumann. GaussianFlow: Splatting Gaussian dynamics for 4D content creation. arXiv preprint arXiv:2403.12365, 2024.
  53. [53] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
  54. [54] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4):139:1–139:14, 2023.
  55. [55] Matthias Zwicker, Hanspeter Pfister, Jeroen van Baar, and Markus Gross. EWA splatting. IEEE Transactions on Visualization and Computer Graphics, 8(3):223–238, 2002.
  56. [56] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-GS: Structured 3D Gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20654–20664, 2024.
  57. [57] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.
  58. [58] Jie Liang, Rongjie Wang, Rui Peng, Zhe Zhang, Kaiqiang Xiong, and Ronggang Wang. High fidelity aggregated planar prior assisted PatchMatch multi-view stereo. In Proceedings of the 32nd ACM International Conference on Multimedia (MM '24), pages 3141–3150. Association for Computing Machinery, 2024.
  59. [59] Qian-Yi Zhou, Jaesik Park, and Vladlen Koltun. Open3D: A modern library for 3D data processing. arXiv preprint arXiv:1801.09847, 2018.
  60. [60] Stephen Lombardi, Tomas Simon, Gabriel Schwartz, Michael Zollhoefer, Yaser Sheikh, and Jason Saragih. Mixture of volumetric primitives for efficient neural rendering. ACM Transactions on Graphics (TOG), 40(4):1–13, 2021.
  61. [61] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  62. [62] AVS. https://www.avs.org.cn/, 2024.
  63. [63] Zhengqi Li, Simon Niklaus, Noah Snavely, and Oliver Wang. Neural scene flow fields for space-time view synthesis of dynamic scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6498–6508, 2021.
  64. [64] Qiang Hu, Zihan Zheng, Houqiang Zhong, Sihua Fu, Li Song, Xiaoyun Zhang, Guangtao Zhai, and Yanfeng Wang. 4DGC: Rate-aware 4D Gaussian compression for efficient streamable free-viewpoint video. In Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 875–885, 2025.
  65. [65] Zhen Xu, Sida Peng, Haotong Lin, Guangzhao He, Jiaming Sun, Yujun Shen, Hujun Bao, and Xiaowei Zhou. 4K4D: Real-time 4D view synthesis at 4K resolution. In CVPR, 2024.
  66. [66] Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia Conference Proceedings, 2022.