pith. machine review for the scientific record.

arxiv: 2605.12119 · v2 · submitted 2026-05-12 · 💻 cs.CV · cs.GR

Recognition: 2 Lean theorem links

MoCam: Unified Novel View Synthesis via Structured Denoising Dynamics

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:53 UTC · model grok-4.3

classification 💻 cs.CV cs.GR
keywords novel view synthesis · diffusion models · denoising dynamics · geometric priors · appearance priors · point cloud · generative modeling · view synthesis

The pith

MoCam unifies novel view synthesis by conditioning early diffusion denoising steps on geometric priors, then switching to appearance priors in later steps to correct errors and refine detail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that novel view synthesis from point clouds can be improved by structuring the diffusion denoising process to first anchor coarse structures with geometric priors and then switch to appearance priors for correction and detail refinement. This matters because prior methods either spread geometric inaccuracies through the entire generation or encounter signal conflicts when combining both priors at once. The approach tolerates incompleteness in the input geometry and unifies handling of static and dynamic scenes by temporally separating alignment from refinement inside the diffusion steps. A sympathetic reader would care because it offers a way to generate consistent views from imperfect real-world captures without manual fusion rules.

Core claim

MoCam employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process: it first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process.

What carries the argument

Structured denoising dynamics that temporally decouple geometric alignment from appearance refinement inside the diffusion process.
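
To make the load-bearing mechanism concrete, here is a minimal sketch of a timestep-gated prior switch inside a DDIM-style reverse loop, assuming a generic denoiser `eps_model` and two conditioning signals. The names, the hard switch, and the `app_frac` boundary are illustrative assumptions, not the paper's published interface.

```python
import numpy as np

def staged_denoise(eps_model, z_T, geom_cond, app_cond, alphas_cumprod,
                   app_frac=0.6):
    """Reverse diffusion with a staged prior switch (illustrative only).

    High-noise steps condition on geometry to anchor coarse structure;
    once t drops below the boundary, conditioning switches to appearance
    to refine detail and correct residual geometric error. `app_frac`,
    the fraction of late steps given to the appearance prior, is a
    hypothetical knob, not a value from the paper.
    """
    T = len(alphas_cumprod)
    boundary = int(app_frac * T)
    z = z_T
    for t in reversed(range(T)):
        cond = app_cond if t < boundary else geom_cond  # stage gate
        eps = eps_model(z, t, cond)                     # predicted noise
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else 1.0
        z0_hat = (z - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        z = np.sqrt(a_prev) * z0_hat + np.sqrt(1.0 - a_prev) * eps  # DDIM, eta=0
    return z

# Toy usage with a dummy denoiser (a stand-in, not a trained model).
alphas = np.cumprod(np.linspace(0.999, 0.95, 50))
out = staged_denoise(lambda z, t, c: 0.1 * (z - c),
                     np.random.randn(8, 8), np.zeros((8, 8)),
                     np.ones((8, 8)), alphas)
```

A hard switch is only one realization; a soft weighting that decays the geometric signal over timesteps would be a natural variant, and the abstract alone does not pin down which the authors use.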

If this is right

  • MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions.
  • It achieves robust geometry-appearance disentanglement.
  • The method tolerates incompleteness in geometric priors by using them only for initial anchoring.
  • It provides a natural unification for both static and dynamic novel view synthesis tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The staged prior switch could be tested on other generative tasks that combine sparse 3D structure with dense image cues.
  • This suggests potential gains in real-world capture pipelines where input geometry is often incomplete.
  • Further experiments on video sequences could check whether the temporal decoupling extends to motion without introducing temporal artifacts.

Load-bearing premise

That switching from geometric priors in early diffusion stages to appearance priors in later stages will actively correct geometric errors without introducing new inconsistencies or artifacts.

What would settle it

Apply the method to point clouds with deliberately added large holes and compare the synthesized novel views against ground-truth geometry and appearance metrics to check whether errors are corrected rather than propagated.
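
A hedged sketch of that stress test, as a reviewer might script it. `synthesize_views` (the method under test), `cameras`, and `gt_views` are hypothetical placeholders; only the hole injection and the PSNR comparison are concrete here.

```python
import numpy as np

def inject_holes(points, n_holes=5, radius=0.1, seed=0):
    """Carve spherical holes out of an (N, 3) point cloud to simulate
    severe capture incompleteness."""
    rng = np.random.default_rng(seed)
    keep = np.ones(len(points), dtype=bool)
    for _ in range(n_holes):
        center = points[rng.integers(len(points))]
        keep &= np.linalg.norm(points - center, axis=1) > radius
    return points[keep]

def psnr(img_a, img_b, max_val=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((img_a - img_b) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def hole_robustness_curve(points, cameras, synthesize_views, gt_views):
    """PSNR of synthesized views as geometric damage grows. A curve that
    stays flat under increasing hole radius suggests errors are being
    corrected by late-stage appearance priors rather than propagated."""
    scores = {}
    for radius in (0.0, 0.05, 0.1, 0.2):  # 0.0 = undamaged baseline
        degraded = points if radius == 0.0 else inject_holes(points, radius=radius)
        preds = synthesize_views(degraded, cameras)
        scores[radius] = float(np.mean([psnr(p, g)
                                        for p, g in zip(preds, gt_views)]))
    return scores
```

Pairing this with a depth-error version of the same sweep would separate appearance fidelity from actual geometric correction.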

Figures

Figures reproduced from arXiv: 2605.12119 by Haofeng Liu, Jie Ma, Jing Li, Jun Liang, Shengfeng He, Yang Zhou, Zhan Peng, Zhengbo Xu, Ziheng Wang.

Figure 1. We propose MoCam, a method that unifies novel view synthesis through structured denoising dynamics. Existing methods rely on static guidance that entangles geometry and appearance, often resulting in geometric collapse and visual artifacts. MoCam introduces structured denoising dynamics that guide generation from motion alignment to appearance refinement, producing coherent and photorealistic results. view at source ↗
Figure 2. Overview of the MoCam framework. Given a source video x_src (or a single image repeated for N frames), a geometrically aligned but imperfect scaffold video x_tgt_ren is first rendered along the target trajectory ψ_tgt. After encoding these conditions into latent space, the model processes the initial noise z_0 via the proposed structured denoising dynamics. view at source ↗
Figure 3. Results of different guiding methods. view at source ↗
Figure 4. Qualitative results for single-view 3D reconstruction. view at source ↗
Figure 5. Qualitative results from in-the-wild videos. The first example illustrates an 'orbit-to-left' trajectory; the second demonstrates a camera motion that first moves to the top-left with zoom-in, then transitions to the bottom-right with a corresponding zoom-out. view at source ↗
Figure 6. Quantitative results of VBench metrics on various motion magnitudes. view at source ↗
Figure 7. Qualitative results on various motion scales. The models are inferred under camera trajectories with three different scales of orbit degree. view at source ↗
Figure 8. Qualitative results on the iPhone dataset. view at source ↗
Figure 9. Ablation results on structured denoising dynamics. view at source ↗
Figure 10. Depth robustness. view at source ↗
Figure 11. Qualitative results for single-view 3D reconstruction. view at source ↗
read the original abstract

Generative novel view synthesis faces a fundamental dilemma: geometric priors provide spatial alignment but become sparse and inaccurate under view changes, while appearance priors offer visual fidelity but lack geometric correspondence. Existing methods either propagate geometric errors throughout generation or suffer from signal conflicts when fusing both statically. We introduce MoCam, which employs structured denoising dynamics to orchestrate a coordinated progression from geometry to appearance within the diffusion process. MoCam first leverages geometric priors in early stages to anchor coarse structures and tolerate their incompleteness, then switches to appearance priors in later stages to actively correct geometric errors and refine details. This design naturally unifies static and dynamic view synthesis by temporally decoupling geometric alignment and appearance refinement within the diffusion process. Experiments demonstrate that MoCam significantly outperforms prior methods, particularly when point clouds contain severe holes or distortions, achieving robust geometry-appearance disentanglement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes MoCam, a diffusion-based framework for novel view synthesis that employs structured denoising dynamics to temporally decouple geometric and appearance priors: geometric priors anchor coarse structure in early denoising stages, while appearance priors are introduced later to refine details and correct errors arising from incomplete or distorted input point clouds. The approach is presented as unifying static and dynamic view synthesis through this staged orchestration within the diffusion process, with claims of significant outperformance over prior methods, particularly under severe geometric degradation.

Significance. If the staged prior-switching mechanism can be shown to enable genuine error correction rather than simple texture overlay, the work would offer a practical advance for novel view synthesis on real-world data with noisy or incomplete geometry, reducing reliance on perfect point-cloud inputs and providing a unified treatment of static and dynamic scenes.

major comments (2)
  1. [Abstract / Experiments] The central claim of significant outperformance and robust geometry-appearance disentanglement is unsupported: no quantitative metrics, baselines, ablation studies, or experimental details are supplied, leaving the reader unable to assess whether the reported gains are real or attributable to the proposed dynamics.
  2. [Method] The claim that appearance priors in later stages actively correct geometric errors (rather than merely masking them) is load-bearing for the contribution, yet no timestep-specific analysis, error maps, or ablation on switch timing is provided to isolate correction from propagation of artifacts or new inconsistencies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough and constructive review. The feedback highlights important areas where additional evidence and analysis will strengthen the manuscript. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim of significant outperformance and robust geometry-appearance disentanglement is unsupported: no quantitative metrics, baselines, ablation studies, or experimental details are supplied, leaving the reader unable to assess whether the reported gains are real or attributable to the proposed dynamics.

    Authors: We agree that the current manuscript version does not provide sufficient quantitative details to fully support the claims in the abstract. In the revision we will expand the Experiments section with quantitative metrics (PSNR, SSIM, LPIPS), comparisons to relevant baselines on standard benchmarks, and ablation studies on the staged denoising components. These additions will allow readers to evaluate the reported gains, especially under severe geometric degradation. revision: yes

  2. Referee: [Method] The claim that appearance priors in later stages actively correct geometric errors (rather than merely masking them) is load-bearing for the contribution, yet no timestep-specific analysis, error maps, or ablation on switch timing is provided to isolate correction from propagation of artifacts or new inconsistencies.

    Authors: We acknowledge that the active correction claim requires stronger empirical isolation. In the revised manuscript we will include timestep-specific visualizations, geometric error maps across denoising stages, and an ablation on the geometry-to-appearance switch timing. These elements will demonstrate that later-stage appearance priors reduce errors from incomplete point clouds rather than merely overlaying details or introducing new inconsistencies. revision: yes
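
For concreteness, a minimal sketch of how the promised switch-timing ablation and error-map analysis could be organized. `run_model` and `eval_views` are hypothetical hooks standing in for the authors' pipeline and metric suite (e.g. PSNR/SSIM/LPIPS); nothing here is from the paper.

```python
import numpy as np

def depth_error_map(pred_depth, gt_depth):
    """Per-pixel absolute depth error: comparing these maps across switch
    settings shows whether late appearance guidance reduces geometric
    error or merely hides it under plausible texture."""
    return np.abs(pred_depth - gt_depth)

def switch_timing_ablation(run_model, eval_views, gt_rgb, gt_depth,
                           app_fracs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Sweep the geometry-to-appearance switch point.

    app_frac = 0.0 means geometry-only guidance, 1.0 appearance-only;
    intermediate values realize the staged schedule. `run_model(app_frac)`
    is assumed to return (pred_rgb, pred_depth) per view; `eval_views`
    returns image metrics. Both hooks are hypothetical.
    """
    results = {}
    for f in app_fracs:
        pred_rgb, pred_depth = run_model(app_frac=f)
        results[f] = {
            "image": eval_views(pred_rgb, gt_rgb),
            "depth_mae": float(np.mean([depth_error_map(d, g).mean()
                                        for d, g in zip(pred_depth, gt_depth)])),
        }
    return results
```

Under this reading, falling depth error at intermediate `app_frac` values would support active correction, while improving image scores with flat depth error would suggest masking.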

Circularity Check

0 steps flagged

No significant circularity in MoCam's derivation chain

full rationale

The paper introduces MoCam via a design choice of temporally decoupling geometric priors (early diffusion stages) from appearance priors (later stages) within structured denoising dynamics. This orchestration is presented as an independent mechanism to unify static and dynamic novel view synthesis and tolerate point-cloud holes, without any equations or self-citations that reduce the central claim to fitted inputs, self-definitions, or prior author results by construction. The abstract and described method contain no load-bearing steps where a 'prediction' collapses to a renamed fit or where uniqueness is imported from overlapping citations. The derivation remains self-contained as a proposed scheduling strategy whose validity is asserted through experimental comparison rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on standard diffusion-model assumptions about progressive denoising; the one invented entity is the structured-dynamics orchestration itself, and the abstract describes no explicit free parameters or new physical entities.

axioms (1)
  • domain assumption: diffusion models allow effective conditioning on different priors at successive denoising stages
    Invoked to justify the early-geometry to late-appearance switch.
invented entities (1)
  • structured denoising dynamics (no independent evidence)
    purpose: orchestrate a coordinated progression from geometry to appearance within the diffusion process
    New mechanism introduced to unify static and dynamic view synthesis.

pith-pipeline@v0.9.0 · 5458 in / 1182 out tokens · 38501 ms · 2026-05-14T21:53:18.890055+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 7 internal anchors

  1. Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., et al.: Vd3d: Taming large video diffusion transformers for 3d camera control. arXiv preprint arXiv:2407.12781 (2024)
  2. Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., et al.: Recammaster: Camera-controlled generative rendering from a single video. arXiv preprint arXiv:2503.11647 (2025)
  3. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  4. Cao, A., Johnson, J.: Hexplane: A fast representation for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 130–141 (2023)
  5. Chen, K., Khurana, T., Ramanan, D.: Reconstruct, inpaint, finetune: Dynamic novel-view synthesis from monocular videos. In: Advances in Neural Information Processing Systems (NeurIPS) (2025)
  6. Chung, J., Lee, S., Nam, H., Lee, J., Lee, K.M.: Luciddreamer: Domain-free generation of 3d gaussian splatting scenes. IEEE Transactions on Visualization and Computer Graphics (2025)
  7. Duan, Y., Wei, F., Dai, Q., He, Y., Chen, W., Chen, B.: 4d-rotor gaussian splatting: Towards efficient novel view synthesis for dynamic scenes. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)
  8. Fan, C.D., Chang, C.W., Liu, Y.R., Lee, J.Y., Huang, J.L., Tseng, Y.C., Liu, Y.L.: Spectromotion: Dynamic 3d reconstruction of specular scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 21328–21338 (2025)
  9. Fridovich-Keil, S., Meanti, G., Warburg, F.R., Recht, B., Kanazawa, A.: K-planes: Explicit radiance fields in space, time, and appearance. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12479–12488 (2023)
  10. Gao, C., Saraf, A., Kopf, J., Huang, J.B.: Dynamic view synthesis from dynamic monocular video. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5712–5721 (2021)
  11. Gao, H., Li, R., Tulsiani, S., Russell, B., Kanazawa, A.: Dynamic novel-view synthesis: A reality check. In: NeurIPS (2022)
  12. Ham, S., Woo, S., Kim, J.Y., Go, H., Park, B., Kim, C.: Diffusion model patching via mixture-of-prompts. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 17023–17031 (2025)
  13. He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: Enabling camera control for text-to-video generation. arXiv preprint arXiv:2404.02101 (2024)
  14. Huang, J., Zhou, Q., Rabeti, H., Korovko, A., Ling, H., Ren, X., Shen, T., Gao, J., Slepichev, D., Lin, C.H., et al.: Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934 (2025)
  15. Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
  16. Jeong, H., Lee, S., Ye, J.C.: Reangle-a-video: 4d video generation as video-to-video translation. arXiv preprint arXiv:2503.09151 (2025)
  17. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 139:1–139:14 (2023)
  18. Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)
  19. Lei, G., Wang, C., Wang, Y., Li, H., Song, Y., Xu, W.: Motionflow: Learning implicit motion flow for complex camera trajectory control in video generation. arXiv preprint arXiv:2509.21119 (2025)
  20. Li, T., Slavcheva, M., Zollhoefer, M., Green, S., Lassner, C., Kim, C., Schmidt, T., Lovegrove, S., Goesele, M., Newcombe, R., et al.: Neural 3d video synthesis from multi-view video. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5521–5531 (2022)
  21. Li, Z., Chen, Z., Li, Z., Xu, Y.: Spacetime gaussian feature splatting for real-time dynamic view synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 8508–8520 (2024)
  22. Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6498–6508 (2021)
  23. Lin, Y., Dai, Z., Zhu, S., Yao, Y.: Gaussian-flow: 4d reconstruction with dynamic 3d gaussian particle. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 21136–21145 (2024)
  24. Liu, Q., Liu, Y., Wang, J., Lyu, X., Wang, P., Wang, W., Hou, J.: MoDGS: Dynamic gaussian splatting from casually-captured monocular videos with depth priors. In: The Thirteenth International Conference on Learning Representations (2025), https://openreview.net/forum?id=2prShxdLkX
  25. Luo, Z., Ran, H., Lu, L.: Instant4d: 4d gaussian splatting in minutes. In: Advances in Neural Information Processing Systems (2025)
  26. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. In: European Conference on Computer Vision. pp. 405–421. Springer (2020)
  27. Mou, C., Wang, X., Xie, L., Wu, Y., Zhang, J., Qi, Z., Shan, Y.: T2i-adapter: Learning adapters to dig out more controllable ability for text-to-image diffusion models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4296–4304 (2024)
  28. Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. arXiv preprint arXiv:2407.02371 (2024)
  29. Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-nerf: Neural radiance fields for dynamic scenes. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10313–10322 (2021)
  30. Ren, X., Shen, T., Huang, J., Ling, H., Lu, Y., Nimier-David, M., Müller, T., Keller, A., Fidler, S., Gao, J.: Gen3c: 3d-informed world-consistent video generation with precise camera control. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 6121–6132 (2025)
  31. Shriram, J., Trevithick, A., Liu, L., Ramamoorthi, R.: Realmdreamer: Text-driven 3d scene generation with inpainting and depth diffusion. In: 2025 International Conference on 3D Vision (3DV). pp. 882–892. IEEE (2025)
  32. Song, R., Liang, C., Xia, Y., Zimmer, W., Cao, H., Caesar, H., Festag, A., Knoll, A.: Coda-4dgs: Dynamic gaussian splatting with context and deformation awareness for autonomous driving. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 28031–28041 (2025)
  33. Van Hoorick, B., Wu, R., Ozguroglu, E., Sargent, K., Liu, R., Tokmakov, P., Dave, A., Zheng, C., Vondrick, C.: Generative camera dolly: Extreme monocular dynamic novel view synthesis. In: European Conference on Computer Vision. pp. 313–331. Springer (2024)
  34. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models (2025)
  35. Wang, H., Liu, Y., Liu, Z., Wang, W., Dong, Z., Yang, B.: Vistadream: Sampling multiview consistent images for single-view scene reconstruction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 26772–26782 (2025)
  36. Wu, G., Yi, T., Fang, J., Xie, L., Zhang, X., Wei, W., Liu, W., Tian, Q., Wang, X.: 4d gaussian splatting for real-time dynamic scene rendering. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20310–20320 (2024)
  37. Wu, P., Zhu, K., Liu, Y., Zhao, L., Zhai, W., Cao, Y., Zha, Z.J.: Improved video vae for latent video diffusion model. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 18124–18133 (2025)
  38. Wu, R., Gao, R., Poole, B., Trevithick, A., Zheng, C., Barron, J.T., Holynski, A.: Cat4d: Create anything in 4d with multi-view video diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 26057–26068 (2025)
  39. Yang, Z., Yang, H., Pan, Z., Zhang, L.: Real-time photorealistic dynamic scene representation and rendering with 4d gaussian splatting. arXiv preprint arXiv:2310.10642 (2023)
  40. Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072 (2024)
  41. Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20331–20341 (2024)
  42. Ye, H., Zhang, J., Liu, S., Han, X., Yang, W.: Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721 (2023)
  43. You, M., Zhu, Z., Liu, H., Hou, J.: Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. arXiv preprint arXiv:2405.15364 (2024)
  44. Yu, M., Hu, W., Xing, J., Shan, Y.: Trajectorycrafter: Redirecting camera trajectory for monocular videos via diffusion models. arXiv preprint arXiv:2503.05638 (2025)
  45. Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
  46. Zhang, D.J., Paiss, R., Zada, S., Karnad, N., Jacobs, D.E., Pritch, Y., Mosseri, I., Shou, M.Z., Wadhwa, N., Ruiz, N.: Recapture: Generative video camera controls for user-provided videos using masked video fine-tuning. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 2050–2062 (2025)
  47. Zhang, J., Li, X., Wan, Z., Wang, C., Liao, J.: Text2nerf: Text-driven 3d scene generation with neural radiance fields. IEEE Transactions on Visualization and Computer Graphics 30(12), 7749–7762 (2024)
  48. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (2023)
  49. Zhang, S., Xu, H., Guo, S., Xie, Z., Bao, H., Xu, W., Zou, C.: Spatialcrafter: Unleashing the imagination of video diffusion models for scene reconstruction from limited observations. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 27794–27805 (2025)
  50. Zhang, X., Liu, Z., Zhang, Y., Ge, X., He, D., Xu, T., Wang, Y., Lin, Z., Yan, S., Zhang, J.: Mega: Memory-efficient 4d gaussian splatting for dynamic scenes. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 27828–27838 (2025)
  51. Zhu, J., Tang, H.: Dynamic scene reconstruction: Recent advance in real-time rendering and streaming. arXiv preprint arXiv:2503.08166 (2025)
  52. Zhuang, S., Guo, Y., Ding, Y., Li, K., Chen, X., Wang, Y., Wang, F., Zhang, Y., Li, C., Wang, Y.: Timestep master: Asymmetrical mixture of timestep lora experts for versatile and efficient diffusion models in vision. arXiv preprint arXiv:2503.07416 (2025)