pith. sign in

arxiv: 2607.01962 · v1 · pith:UFTSXKZ3new · submitted 2026-07-02 · 💻 cs.CV · cs.AI· cs.GR· cs.RO

NeoMap: Training-free Novel-View Synthesis from Single Images and Videos

Pith reviewed 2026-07-03 15:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.GRcs.RO
keywords novel view synthesistraining-free generationvideo manifoldalternating projectionview consistencysingle-image to videomonocular video
0
0 comments X

The pith

NeoMap finds high-fidelity novel views inside the manifold of any pre-trained video model by alternating projections on initial noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that novel-view solutions already exist inside the data manifold learned by ordinary pre-trained video models. Rather than fine-tuning the model or adding camera conditions, the method repeatedly projects candidate solutions onto that manifold until convergence. The process is presented as a way to extract view-consistent videos directly from a single image or monocular video. If the claim holds, novel-view synthesis would no longer require task-specific training or architectural changes to the underlying generator.

Core claim

NeoMap demonstrates that convergent manifold alternating projection iterations applied to initial noise can locate high-fidelity, view-consistent novel-view solutions that are already encoded inside the natural video data manifold of unmodified pre-trained models.

What carries the argument

convergent manifold alternating projection iterations that optimize the initial noise to locate solutions on the learned video manifold

Load-bearing premise

Promising novel-view solutions are already present inside the data manifold of any general pre-trained video model and can be found by alternating projections.

What would settle it

A controlled test in which the alternating-projection iterations produce lower fidelity or less consistent novel views than ordinary sampling from the same pre-trained model on the Tanks-and-Temples, LLFF, or DAVIS benchmarks.

Figures

Figures reproduced from arXiv: 2607.01962 by Bo Yang, Jinxi Li, Koon Wing Macgyver Lin, Peng Huang, Tianyi Zhang, Yafei Yang, Zihui Zhang.

Figure 1
Figure 1. Figure 1: Different NVS paradigms: a) Learning new mapping relation would shrink the output data manifold, leading to overfitting and OOD problem; b) Guiding generation trajectory with bad condition would lead the generation output away from data man￾ifold; c) Different related output possibilities already exist in the data manifold. One line of work learns a new conditional video generation mapping by train￾ing mod… view at source ↗
Figure 2
Figure 2. Figure 2: An illustration of the three modules in our framework. NVS problem due to the noise of the estimated depth and the incompleteness of the object. Our framework elegantly overcomes these limitations by propagat￾ing valid semantic priors to meaningfully constrain and hallucinate the unseen regions and adaptively trusting only the reliable visible regions. 3 NeoMap We introduce a novel view synthesis (NVS) fra… view at source ↗
Figure 3
Figure 3. Figure 3: Qualitative examples show that our alternative projections iteratively refine the noise quality, leading the generated results to higher-fidelity, better view consistency, and more continuous semantics. 3.4 Pixel-Constrained Projection Module While the AMP module tries to project the state zˆ k M toward the data manifold, the similarity mask in the early iterations could be too sparse due to excessively st… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results of top 4 baselines and our methods on Tanks-and-Temples dataset and LLFF dataset. respectively. We note that baselines perform differently with different feature dimensions for FID, so we report both FID-192 and FID-2048. CLIP Similarity refers to the average of per-frame CLIP [45] similarity between generated frames and ground-truth frames. 2) For camera accuracy, we report Absolute Tr… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results of top 4 baselines and our methods on DAVIS dataset. (2) The number of alternating projection iterations K: we choose K = 15 in our main experiments, and we ablate the performance with K = 5 and K = 10. (3) The threshold used in the similarity mask in AMP: we choose λ = 0.8 in our main experiments, and we ablate the performance with λ = 0.7 and λ = 0.9. (4) Removing AMP: we remove the A… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results of video-to-video future extrapolation on DAVIS dataset (upper two scenes) and PhysInOne dataset (bottom two scenes). Our method can provide smooth motion prediction based on the given videos. a pre-trained video generation model. Our core contribution lies in the mani￾fold alternating projection scheme: iterating between Anchored Manifold Projec￾tion (AMP) for plausible semantic comple… view at source ↗
Figure 7
Figure 7. Figure 7: Data distribution and almost-invisible results. E More details about the baselines. We list the adaptations of the baselines below: – For camera-controlled video generation baselines, i.e. TrajectoryAttention [58] and ReCamMaster [3], we keep everything as the official implementations instead of changing the length of input views into 49. – For warping-and-inpainting NVS baselines, i.e. NVS-Solver [62], Vi… view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative results of all baselines and our methods on Tanks-and-Temples dataset [PITH_FULL_IMAGE:figures/full_fig_p028_8.png] view at source ↗
Figure 9
Figure 9. Figure 9 [PITH_FULL_IMAGE:figures/full_fig_p029_9.png] view at source ↗
Figure 10
Figure 10. Figure 10 [PITH_FULL_IMAGE:figures/full_fig_p030_10.png] view at source ↗
read the original abstract

We study the challenging problem of novel view video synthesis from single images or monocular videos. Existing methods, which operate under the assumption that pre-trained video models lack native novel view synthesis capability and enforce view alignment via camera conditioning, task-specific fine-tuning, or stepwise hard denoising guidance, often suffer from artifacts and compromised global scene consistency. In this paper, we introduce NeoMap, a novel training-free framework designed to locate high-fidelity, view-consistent novel view solutions from general pre-trained video models. The key to our approach is the core insight that promising novel view solutions are inherently encoded within the natural video data manifold learned by pre-trained models, and the core challenge is simply to locate this optimal solution. We solve this via our core mechanism: convergent manifold alternating projection iterations that optimize the initial noise. Extensive experiments demonstrate that NeoMap significantly outperforms all existing methods across 3 standard novel view synthesis benchmarks, including the challenging Tanks-and-Temples, LLFF and DAVIS datasets, achieving state-of-the-art generation fidelity and top-tier view consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces NeoMap, a training-free framework for novel-view video synthesis from single images or monocular videos. It claims that high-fidelity, view-consistent novel views are inherently encoded in the natural video data manifold of pre-trained models and can be located via convergent manifold alternating projection iterations that optimize the initial noise, without camera conditioning, fine-tuning, or hard denoising guidance. Extensive experiments are reported to show significant outperformance over existing methods on the Tanks-and-Temples, LLFF, and DAVIS benchmarks in generation fidelity and view consistency.

Significance. If the central claims hold with rigorous validation, the work would be significant for enabling training-free novel view synthesis that leverages existing video diffusion models directly, avoiding the artifacts and consistency issues of prior conditioning- or fine-tuning-based approaches. The parameter-free nature of the manifold projection mechanism (no free_parameters listed) and the focus on locating solutions already present in the model would be notable strengths if empirically supported.

major comments (2)
  1. [Abstract (core insight and mechanism)] The central claim that 'promising novel view solutions are inherently encoded within the natural video data manifold' for arbitrary (OOD) viewpoints is load-bearing but unsupported. Video models are trained on observed trajectories, and the alternating projection iterations provide no explicit mechanism (e.g., 3D consistency loss or multi-view conditioning) to ensure the target view lies on the manifold rather than producing temporally coherent but geometrically inconsistent outputs.
  2. [Abstract (experiments paragraph)] The outperformance claims on Tanks-and-Temples, LLFF, and DAVIS (including 'state-of-the-art generation fidelity and top-tier view consistency') cannot be assessed, as no method equations, quantitative results, ablation studies, or experimental details are provided to support them.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address each major point below with clarifications drawn from the full manuscript.

read point-by-point responses
  1. Referee: [Abstract (core insight and mechanism)] The central claim that 'promising novel view solutions are inherently encoded within the natural video data manifold' for arbitrary (OOD) viewpoints is load-bearing but unsupported. Video models are trained on observed trajectories, and the alternating projection iterations provide no explicit mechanism (e.g., 3D consistency loss or multi-view conditioning) to ensure the target view lies on the manifold rather than producing temporally coherent but geometrically inconsistent outputs.

    Authors: The manuscript (Section 3) explains that video diffusion models trained on diverse real-world trajectories learn a manifold that encodes 3D scene variations beyond observed paths; the convergent alternating projection iterations optimize initial noise to locate points on this manifold, with consistency emerging from the data distribution rather than explicit 3D losses. Experiments on Tanks-and-Temples and LLFF (which include significant viewpoint shifts) show geometrically consistent outputs, supporting the claim. We will add a short clarifying paragraph in the introduction to make this implicit mechanism more explicit. revision: partial

  2. Referee: [Abstract (experiments paragraph)] The outperformance claims on Tanks-and-Temples, LLFF, and DAVIS (including 'state-of-the-art generation fidelity and top-tier view consistency') cannot be assessed, as no method equations, quantitative results, ablation studies, or experimental details are provided to support them.

    Authors: The full manuscript contains the method equations and algorithm in Section 3, quantitative comparisons in Tables 1–3 on all three benchmarks, ablation studies in Section 4.2, and experimental details plus implementation specifics in the appendix. These directly support the reported outperformance. No changes are required on this point. revision: no

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's central claim rests on an explicit modeling assumption (novel-view solutions lie on the pre-trained video manifold) followed by a proposed optimization procedure (manifold alternating projections on initial noise). No equations reduce a fitted parameter or output to an input by construction, no self-citations are invoked as load-bearing uniqueness theorems, and no ansatz is smuggled via prior author work. The approach is presented as training-free and uses off-the-shelf models, making the derivation independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the stated core insight about solutions being encoded in the pre-trained manifold.

axioms (1)
  • domain assumption Promising novel view solutions are inherently encoded within the natural video data manifold learned by pre-trained models
    Explicitly stated as the core insight in the abstract.

pith-pipeline@v0.9.1-grok · 5740 in / 1099 out tokens · 28322 ms · 2026-07-03T15:44:05.607807+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 17 canonical work pages · 10 internal anchors

  1. [1]

    CVPR (2025)

    Bahmani, S., Skorokhodov, I., Qian, G., Siarohin, A., Menapace, W., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Ac3d: Analyzing and improving 3d camera control in video diffusion transformers. CVPR (2025)

  2. [2]

    ICLR (2025)

    Bahmani, S., Skorokhodov, I., Siarohin, A., Menapace, W., Qian, G., Vasilkovsky, M., Lee, H.Y., Wang, C., Zou, J., Tagliasacchi, A., Lindell, D.B., Tulyakov, S.: Vd3d: Taming large video diffusion transformers for 3d camera control. ICLR (2025)

  3. [3]

    ICCV (2025)

    Bai, J., Xia, M., Fu, X., Wang, X., Mu, L., Cao, J., Liu, Z., Hu, H., Bai, X., Wan, P., Zhang, D.: Recammaster: Camera-controlled generative rendering from a single video. ICCV (2025)

  4. [4]

    Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., Jampani, V., Rombach, R.: Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. arXiv preprint arXiv:2311.15127 (2023)

  5. [5]

    CVPR (2023)

    Blattmann, A., Rombach, R., Ling, H., Dockhorn, T., Kim, S.W., Fidler, S., Kreis, K.: Align your latents: High-resolution video synthesis with latent diffusion models. CVPR (2023)

  6. [6]

    CVPR (2025)

    Burgert, R., Xu, Y., Xian, W., Pilarski, O., Clausen, P., He, M., Ma, L., Deng, Y., Li, L., Mousavi, M., Ryoo, M., Debevec, P., Yu, N.: Go-with-the-flow: Motion- controllable video diffusion models using real-time warped noise. CVPR (2025)

  7. [7]

    ICCV (2023)

    Ceylan, D., Huang, C.H., Mitra, N.J.: Pix2video: Video editing using image diffu- sion. ICCV (2023)

  8. [8]

    ICCV (2023)

    Chai, W., Guo, X., Wang, G., Lu, Y.: Stablevideo: Text-driven consistency-aware diffusion video editing. ICCV (2023)

  9. [9]

    ICLR (2024)

    Chang, P., Tang, J., Gross, M., Azevedo, V.C.: How i warped your noise: a temporally-correlated noise prior for diffusion models. ICLR (2024)

  10. [10]

    ICML (2025)

    Chefer, H., Singer, U., Zohar, A., Kirstain, Y., Polyak, A., Taigman, Y., Wolf, L., Sheynin, S.: Videojam: Joint appearance-motion representations for enhanced motion generation in video models. ICML (2025)

  11. [11]

    CVPR (2024)

    Chen, H., Zhang, Y., Cun, X., Xia, M., Wang, X., Weng, C., Shan, Y.: Videocrafter2: Overcoming data limitations for high-quality video diffusion models. CVPR (2024)

  12. [12]

    arXiv preprint arXiv:2503.13265 (2025)

    Chen, L., Zhou, Z., Zhao, M., Wang, Y., Zhang, G., Huang, W., Sun, H., Wen, J.R., Li, C.: Flexworld: Progressively expanding 3d scenes for flexiable-view synthesis. arXiv preprint arXiv:2503.13265 (2025)

  13. [13]

    TMLR (2024)

    Couairon, P., Rambour, C., HAUGEARD, J.E., THOME, N.: Videdit: Zero-shot and spatially aware text-driven video editing. TMLR (2024)

  14. [14]

    ICCV (2023)

    Ge, S., Nah, S., Liu, G., Poon, T., Tao, A., Catanzaro, B., Jacobs, D., Huang, J.B., Liu, M.Y., Balaji, Y.: Preserve your own correlation: A noise prior for video diffusion models. ICCV (2023)

  15. [15]

    ICLR (2024)

    Geyer, M., Bar-Tal, O., Bagon, S., Dekel, T.: Tokenflow: Consistent diffusion fea- tures for consistent video editing. ICLR (2024)

  16. [16]

    Human and Environment Friendly Robots with High Intelligence and Emotional Quotients (Cat

    Goel, P.K., Roumeliotis, S.I., Sukhatme, G.S.: Robust localization using relative andabsolutepositionestimates.Proceedings1999IEEE/RSJInternationalConfer- ence on Intelligent Robots and Systems. Human and Environment Friendly Robots with High Intelligence and Emotional Quotients (Cat. No.99CH36289)2, 1134– 1140 vol.2 (1999),https://api.semanticscholar.org...

  17. [17]

    AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    Guo, Y., Yang, C., Rao, A., Liang, Z., Wang, Y., Qiao, Y., Agrawala, M., Lin, D., Dai, B.: Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725 (2023)

  18. [18]

    He, H., Xu, Y., Guo, Y., Wetzstein, G., Dai, B., Li, H., Yang, C.: Cameractrl: En- ablingcameracontrolfortext-to-videogeneration.arXivpreprintarXiv:2404.02101 (2024)

  19. [19]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Hertz, A., Mokady, R., Tenenbaum, J., Aberman, K., Pritch, Y., Cohen-Or, D.: Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626 (2022)

  20. [20]

    NeurIPS (2017)

    Heusel,M.,Ramsauer,H.,Unterthiner,T.,Nessler,B.,Hochreiter,S.:Ganstrained by a two time-scale update rule converge to a local nash equilibrium. NeurIPS (2017)

  21. [21]

    ICLR (2025)

    Hou, C., Chen, Z.: Training-free camera control for video generation. ICLR (2025)

  22. [22]

    arXiv preprint arXiv:2404.15789 (2024)

    Hu, T., Zhang, J., Yi, R., Wang, Y., Huang, H., Weng, J., Wang, Y., Ma, L.: Motionmaster: Training-free camera motion transfer for video generation. arXiv preprint arXiv:2404.15789 (2024)

  23. [23]

    ViPE: Video Pose Engine for 3D Geometric Perception

    Huang, J., Zhou, Q., Rabeti, H., Korovko, A., Ling, H., Ren, X., Shen, T., Gao, J., Slepichev,D.,Lin,C.H.,Ren,J.,Xie,K.,Biswas,J.,Leal-Taixe,L.,Fidler,S.:Vipe: Video pose engine for 3d geometric perception. arXiv preprint arXiv:2508.10934 (2025)

  24. [24]

    CVPR (2024)

    Huberman-Spiegelglas, I., Kulikov, V., Michaeli, T.: An edit friendly DDPM noise space: Inversion and manipulations. CVPR (2024)

  25. [25]

    Jang, S., Ki, T., Jo, J., Yoon, J., Kim, S.Y., Lin, Z., Hwang, S.J.: Frame guidance: Training-freeguidanceforframe-levelcontrolinvideodiffusionmodel.ICLR(2026)

  26. [26]

    TOG40(6), 1–12 (2021)

    Kasten, Y., Ofri, D., Wang, O., Dekel, T.: Layered neural atlases for consistent video editing. TOG40(6), 1–12 (2021)

  27. [27]

    ACM Transactions on Graphics42(4) (July 2023),https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

    Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics42(4) (July 2023),https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  28. [28]

    ICCV (2023)

    Khachatryan, L., Movsisyan, A., Tadevosyan, V., Henschel, R., Wang, Z., Navasardyan, S., Shi, H.: Text2video-zero: Text-to-image diffusion models are zero- shot video generators. ICCV (2023)

  29. [29]

    CVPR (2025)

    Kim, S., Suh, S., Lee, M.: Rad: Region-aware diffusion models for image inpainting. CVPR (2025)

  30. [30]

    TOG36(4) (2017)

    Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V.: Tanks and temples: Benchmarking large-scale scene reconstruction. TOG36(4) (2017)

  31. [31]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Kong, W., Tian, Q., Zhang, Z., Min, R., Dai, Z., Zhou, J., Xiong, J., Li, X., Wu, B., Zhang, J., et al.: Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603 (2024)

  32. [32]

    arXiv preprint arXiv:2405.17414 (2024)

    Kuang, Z., Cai, S., He, H., Xu, Y., Li, H., Guibas, L., Wetzstein, G.: Collabora- tive video diffusion: Consistent multi-video generation with camera control. arXiv preprint arXiv:2405.17414 (2024)

  33. [33]

    arXiv preprint arXiv:2406.05338 (2024)

    Ling, P., Bu, J., Zhang, P., Dong, X., Zang, Y., Wu, T., Chen, H., Wang, J., Jin, Y.: Motionclone: Training-free motion cloning for controllable video generation. arXiv preprint arXiv:2406.05338 (2024)

  34. [34]

    Flow Matching for Generative Modeling

    Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2023)

  35. [35]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)

  36. [36]

    NeurIPS (2022) 18 J

    Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. NeurIPS (2022) 18 J. Li et al

  37. [37]

    CVPR (2022)

    Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Gool, L.V.: Repaint: Inpainting using denoising diffusion probabilistic models. CVPR (2022)

  38. [38]

    TOG (2019)

    Mildenhall, B., Srinivasan, P.P., Ortiz-Cayon, R., Kalantari, N.K., Ramamoorthi, R., Ng, R., Kar, A.: Local light field fusion: Practical view synthesis with prescrip- tive sampling guidelines. TOG (2019)

  39. [39]

    ECCV (2020)

    Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. ECCV (2020)

  40. [40]

    arXiv preprint arXiv:2211.09794 (2022)

    Mokady, R., Hertz, A., Aberman, K., Pritch, Y., Cohen-Or, D.: Null-text in- version for editing real images using guided diffusion models. arXiv preprint arXiv:2211.09794 (2022)

  41. [41]

    CVPR (2023)

    Ni, H., Shi, C., Li, K., Huang, S.X., Min, M.R.: Conditional image-to-video gen- eration with latent flow diffusion models. CVPR (2023)

  42. [42]

    ECCV (2024)

    Pan, L., Barath, D., Pollefeys, M., Schönberger, J.L.: Global Structure-from- Motion Revisited. ECCV (2024)

  43. [43]

    SIGGRAPH (2023)

    Parmar, G., Singh, K.K., Zhang, R., Li, Y., Lu, J., Zhu, J.Y.: Zero-shot image-to- image translation. SIGGRAPH (2023)

  44. [44]

    CVPR (2016)

    Perazzi, F., Pont-Tuset, J., McWilliams, B., Gool, L.V., Gross, M., Sorkine- Hornung, A.: A benchmark dataset and evaluation methodology for video object segmentation. CVPR (2016)

  45. [45]

    ICML (2021)

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. ICML (2021)

  46. [46]

    CVPR (2022)

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. CVPR (2022)

  47. [47]

    arXiv preprint arXiv:2405.17251 (2024)

    Seo, J., Fukuda, K., Shibuya, T., Narihira, T., Murata, N., Hu, S., Lai, C.H., Kim, S., Mitsufuji, Y.: Genwarp: Single image to novel views with semantic-preserving generative warping. arXiv preprint arXiv:2405.17251 (2024)

  48. [48]

    arXiv preprint arXiv:2506.01144 (2025)

    Shaulov, A., Hazan, I., Wolf, L., Chefer, H.: Flowmo: Variance-based flow guidance for coherent motion in video generation. arXiv preprint arXiv:2506.01144 (2025)

  49. [49]

    CVPR (2026)

    Song, C., Yang, Y., Zhao, T., Li, R., Zhang, C.: Taming video models for 3d and 4d generation via zero-shot camera control. CVPR (2026)

  50. [50]

    CVPR (2023)

    Tumanyan, N., Geyer, M., Bagon, S., Dekel, T.: Plug-and-play diffusion features for text-driven image-to-image translation. CVPR (2023)

  51. [51]

    Openreview (2019)

    Unterthiner, T., van Steenkiste, S., Kurach, K., Marinier, R., Michalski, M., Gelly, S.: Fvd: A new metric for video generation. Openreview (2019)

  52. [52]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Feng, M., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W.,...

  53. [53]

    ICCV (2025)

    Wang, H., Liu, Y., Liu, Z., Wang, W., Dong, Z., Yang, B.: Vistadream: Sampling multiview consistent images for single-view scene reconstruction. ICCV (2025)

  54. [54]

    CVPR (2025) NeoMap 19

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. CVPR (2025) NeoMap 19

  55. [55]

    IEEE transactions on image processing 13(4), 600–612 (2004)

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing 13(4), 600–612 (2004)

  56. [56]

    SIGGRAPH pp

    Wang, Z., Yuan, Z., Wang, X., Li, Y., Chen, T., Xia, M., Luo, P., Shan, Y.: Mo- tionctrl: A unified and flexible motion controller for video generation. SIGGRAPH pp. 1–11 (2024)

  57. [57]

    ICCV (2023)

    Wu, C.H., la Torre, F.D.: A latent space of stochastic diffusion models for zero-shot image editing and guidance. ICCV (2023)

  58. [58]

    ICLR (2025)

    Xiao, Z., Ouyang, W., Zhou, Y., Yang, S., Yang, L., Si, J., Pan, X.: Trajectory attention for fine-grained video motion control. ICLR (2025)

  59. [59]

    SIGGRAPH pp

    Yang, S., Hou, L., Huang, H., Ma, C., Wan, P., Zhang, D., Chen, X., Liao, J.: Direct-a-video: Customized video generation with user-directed camera movement and object motion. SIGGRAPH pp. 1–12 (2024)

  60. [60]

    ACM SIGGRAPH Asia Conference Proceedings (2023)

    Yang, S., Zhou, Y., Liu, Z., Loy, C.C.: Rerender a video: Zero-shot text-guided video-to-video translation. ACM SIGGRAPH Asia Conference Proceedings (2023)

  61. [61]

    ICLR (2025)

    Yang, Z., Teng, J., Zheng, W., Ding, M., Huang, S., Xu, J., Yang, Y., Hong, W., Zhang, X., Feng, G., et al.: Cogvideox: Text-to-video diffusion models with an expert transformer. ICLR (2025)

  62. [62]

    ICLR (2025)

    You, M., Zhu, Z., Liu, H., Hou, J.: Nvs-solver: Video diffusion model as zero-shot novel view synthesizer. ICLR (2025)

  63. [63]

    ICCV (2025)

    Yu, M., Hu, W., Xing, J., Shan, Y.: Trajectorycrafter: Redirecting camera trajec- tory for monocular videos via diffusion models. ICCV (2025)

  64. [64]

    ViewCrafter: Taming Video Diffusion Models for High-fidelity Novel View Synthesis

    Yu, W., Xing, J., Yuan, L., Hu, W., Li, X., Huang, Z., Gao, X., Wong, T.T., Shan, Y., Tian, Y.: Viewcrafter: Taming video diffusion models for high-fidelity novel view synthesis. arXiv preprint arXiv:2409.02048 (2024)

  65. [65]

    CVPR (2018)

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. CVPR (2018)

  66. [66]

    NeurIPS (2023)

    Zhao, W., Bai, L., Rao, Y., Zhou, J., Lu, J.: Unipc: A unified predictor-corrector framework for fast sampling of diffusion models. NeurIPS (2023)

  67. [67]

    TMLR (2025)

    Zheng, C., Lan, Y., Wang, Y.: Lanpaint: Training-free diffusion inpainting with asymptotically exact and fast conditional sampling. TMLR (2025)

  68. [68]

    CVPR (2026) 20 J

    Zhou, S., Wang, H., Cheng, H., Li, J., Wang, D., Jiang, J., Jin, Y., Huang, J., Mao, S., Liu, S., Yang, Y., Song, H., Wei, S., Zhang, Z., Huang, P., Liu, S., Hao, Z., Li, H., Li, Y., Zhou, W., Zhao, Z., He, Z., Wen, H., Huang, S., Yun, P., Cheng, B., Fu, P.K., Lai, W.K., Chen, J., Wang, K., Sun, Z., Li, Z., Hu, H., Zhang, D., Yuen, C.H., Wang, B., Wang, Z...