pith. machine review for the scientific record.

arxiv: 2604.06339 · v1 · submitted 2026-04-07 · 💻 cs.CV

Recognition: no theorem link

Evolution of Video Generative Foundations

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:37 UTC · model grok-4.3

classification 💻 cs.CV
keywords video generation · generative adversarial networks · diffusion models · autoregressive models · multimodal integration · AIGC · world models · temporal coherence

The pith

Video generation has evolved from early GAN methods through dominant diffusion models to emerging autoregressive and multimodal techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey establishes a full timeline of video generation progress, moving from initial GAN approaches to today's leading diffusion systems and onward to emerging autoregressive and multimodal techniques. It fills gaps left by narrower reviews by examining the core principles behind each stage, their key improvements, and how strengths in one approach compare to limitations in others. A reader would care because this path points toward systems capable of simulating real-world dynamics more accurately. The analysis also highlights how blending text, images, and other signals boosts the contextual quality of output videos. By connecting past steps to current work, the survey aims to direct efforts toward building more capable models for practical uses.

Core claim

The paper states that video generation technology follows a traceable development sequence beginning with GAN-based systems, advancing to diffusion models as the prevailing method, and progressing toward autoregressive and multimodal frameworks. It delivers a detailed breakdown of the underlying mechanisms, major technical steps, and side-by-side evaluation of advantages versus shortcomings for each category. The survey then examines trends in combining diverse data sources to improve temporal consistency and semantic depth, ultimately linking these historical and present elements to steer upcoming work on applications such as world models and virtual environments.

What carries the argument

The evolutionary taxonomy of video generative models, arranged by core technique from GANs through diffusion to autoregressive and multimodal integration; this single scheme structures the entire analysis of principles and trade-offs.
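
As a concrete picture of that organizing scheme, here is a minimal sketch of the taxonomy as a data structure. The three paradigm names follow the abstract and Figure 3; the example systems are drawn from the reference list below, and the structure itself is our illustration, not the paper's exhaustive breakdown.

```python
# Minimal sketch of the survey's evolutionary taxonomy (illustrative, not exhaustive).
TAXONOMY = {
    "GAN": {
        "objective": "implicit adversarial learning",
        "examples": ["MoCoGAN", "StyleGAN-V"],
    },
    "Diffusion": {
        "objective": "approximate likelihood via denoising (optimizes the ELBO)",
        "examples": ["Stable Video Diffusion", "HunyuanVideo"],
    },
    "Auto-Regressive": {
        "objective": "exact likelihood via next-token prediction (chain rule)",
        "examples": ["VideoGPT", "VideoPoet", "Emu3"],
    },
}

def paradigm_of(model_name: str) -> str | None:
    """Return the paradigm a model is filed under, or None if unlisted."""
    for paradigm, info in TAXONOMY.items():
        if model_name in info["examples"]:
            return paradigm
    return None

print(paradigm_of("VideoPoet"))  # -> Auto-Regressive
```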

If this is right

  • The progression supports creation of world models that can simulate physical dynamics more faithfully.
  • Multimodal integration improves how well generated videos match surrounding context and meaning.
  • The traced path informs development of tools for virtual and augmented reality, personalized learning, driving simulations, and entertainment.
  • Comparative analysis of limitations in current methods highlights specific areas needing improvement for higher coherence over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This timeline suggests that combining autoregressive prediction with diffusion denoising could yield hybrids stronger than either alone in handling long sequences (a toy sketch of the idea follows this list).
  • Similar staged evolution may appear in related generative tasks such as audio or 3D scene creation, allowing cross-domain lessons.
  • The survey's structure could serve as a template for evaluating whether new proprietary systems like Sora truly advance beyond prior categories or mainly optimize within them.
  • Future experiments might test if the multimodal emphasis leads to measurable gains in semantic accuracy when models receive extra input modalities.
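
A minimal sketch of the first point above, assuming a hypothetical chunk-wise denoiser: frames are produced in autoregressive chunks, while each chunk is refined by a diffusion-style inner loop conditioned on previously generated context. Nothing here comes from the paper; the stand-in denoiser only makes the hybrid control flow concrete.

```python
import numpy as np

# Hypothetical shapes: a "video" is (num_frames, FEATURE_DIM) latents.
FEATURE_DIM, CHUNK, STEPS = 64, 8, 20

def denoise_step(x, context, t):
    """Stand-in for a learned denoiser: one reverse-diffusion update for chunk x,
    conditioned on previously generated frames (context). Purely illustrative."""
    drift = 0.0 if context is None else 0.1 * (context[-1] - x)  # pull toward last context frame
    return x + drift - 0.05 * x * (t / STEPS)  # toy contraction toward a clean sample

def generate(num_frames: int, seed: int = 0) -> np.ndarray:
    """Autoregressive outer loop over chunks; diffusion-style inner loop per chunk."""
    rng = np.random.default_rng(seed)
    frames = []
    while len(frames) < num_frames:
        context = np.stack(frames[-CHUNK:]) if frames else None
        chunk = rng.normal(size=(CHUNK, FEATURE_DIM))  # each chunk starts from noise
        for t in reversed(range(STEPS)):               # inner denoising loop
            chunk = denoise_step(chunk, context, t)
        frames.extend(chunk)                           # commit, then condition the next chunk on it
    return np.stack(frames[:num_frames])

video = generate(24)
print(video.shape)  # (24, 64)
```

The design point is the nesting: the outer loop gives the unbounded horizon of AR models, while each chunk's quality comes from the iterative refinement that diffusion models are known for.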

Load-bearing premise

The body of papers chosen for review fully represents the field's main advances, and the comparisons of their strengths and limits are complete and even-handed.

What would settle it

A major new video generation method that cannot be placed in the GAN-diffusion-AR-multimodal sequence or that shows performance patterns contradicting the survey's listed advantages and drawbacks.

Figures

Figures reproduced from arXiv: 2604.06339 by Dacheng Tao, Hongrui Huang, Jiangning Zhang, Jieyu Weng, Lizhuang Ma, Ming-Hsuan Yang, Ran Yi, Teng Hu, Zhucun Xue, Zihan Su.

Figure 1: The evolution of generative paradigms in video generation, illustrating the shift from implicit adversarial learning (GANs) to explicit density estimation via denoising (Diffusion) and next-token prediction (Auto-Regressive). Diffusion models optimize an approximate likelihood (the ELBO), while AR models define a tractable likelihood by leveraging the chain rule of probability, enabling direct and exact likelihood estimation.
Figure 2: The trend of video generation models in top-tier conferences.
Figure 3: A comprehensive taxonomy of video generation methodologies. The framework categorizes existing works based on three dominant generative paradigms (GANs, Diffusion Models, and Auto-Regressive Models) and further structures them according to specific architectural designs and functional objectives.
Figure 4: Comparison between different model structures for …
Figure 5: Timeline of the leading video generation models.
Figure 6: Benchmarks of video generation models, covering general capabilities (motion, temporality, narrative) and specialized aspects: VMBench [177] evaluates motion dynamics from a human perception standpoint, including smoothness, continuity, and camera stability, while ChronoMagic-Bench [178] tests long-range temporal reasoning by simulating gradual phenomena like aging or weathering.
Figure 7: The downstream video generation tasks can be primarily classified into 4 groups according to the modality and …
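
Figure 1's three-way contrast reduces to how each paradigm treats the data likelihood. As a quick recap of the standard objectives behind the caption's terms (standard formulations, not the paper's notation):

```latex
% Auto-regressive: exact likelihood over frames/tokens x_1, ..., x_T via the chain rule
p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t})

% Diffusion: approximate likelihood via the evidence lower bound (ELBO)
\log p_\theta(x_0) \geq \mathbb{E}_{q}\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right]

% GAN: no explicit likelihood; an implicit generator G trained against a discriminator D
\min_G \max_D \; \mathbb{E}_{x \sim p_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]
```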
read the original abstract

The rapid advancement of Artificial Intelligence Generated Content (AIGC) has revolutionized video generation, enabling systems ranging from proprietary pioneers like OpenAI's Sora, Google's Veo3, and Bytedance's Seedance to powerful open-source contenders like Wan and HunyuanVideo to synthesize temporally coherent and semantically rich videos. These advancements pave the way for building "world models" that simulate real-world dynamics, with applications spanning entertainment, education, and virtual reality. However, existing reviews on video generation often focus on narrow technical fields, e.g., Generative Adversarial Networks (GAN) and diffusion models, or specific tasks (e.g., video editing), lacking a comprehensive perspective on the field's evolution, especially regarding Auto-Regressive (AR) models and integration of multimodal information. To address these gaps, this survey firstly provides a systematic review of the development of video generation technology, tracing its evolution from early GANs to dominant diffusion models, and further to emerging AR-based and multimodal techniques. We conduct an in-depth analysis of the foundational principles, key advancements, and comparative strengths/limitations. Then, we explore emerging trends in multimodal video generation, emphasizing the integration of diverse data types to enhance contextual awareness. Finally, by bridging historical developments and contemporary innovations, this survey offers insights to guide future research in video generation and its applications, including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and advanced world models, in this rapidly evolving field. For more details, please refer to the project at https://github.com/sjtuplayer/Awesome-Video-Foundations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 1 minor

Summary. This survey traces the evolution of video generation technology from early GAN-based methods through dominant diffusion models to emerging autoregressive (AR) and multimodal approaches. It analyzes foundational principles, key advancements, and comparative strengths/limitations of each paradigm; explores trends in multimodal integration for enhanced contextual awareness; and offers forward-looking insights for applications including virtual/augmented reality, personalized education, autonomous driving simulations, digital entertainment, and world models. The work is supported by a linked GitHub repository (https://github.com/sjtuplayer/Awesome-Video-Foundations) containing an extended list of references.

Significance. If the literature coverage proves representative, the survey would fill a documented gap by moving beyond narrow prior reviews (GAN-only or diffusion-only) to connect historical foundations with contemporary AR and multimodal developments. The explicit discussion of strengths/limitations and the GitHub resource constitute concrete strengths that enhance utility and reproducibility for the community.

major comments (1)
  1. [Abstract] Abstract and opening sections: the claim of a 'systematic review' of the development from GANs to diffusion models to AR-based and multimodal techniques is not supported by any description of literature search protocol, databases, keywords, inclusion/exclusion criteria, or date range. This directly affects the central assertion that the selected body of work is representative and that the comparative strengths/limitations analysis is unbiased and comprehensive.
minor comments (1)
  1. [Abstract] Abstract: the sentence 'this survey firstly provides a systematic review...' reads awkwardly; a simpler phrasing such as 'this survey provides a systematic review...' would improve readability without changing meaning.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the survey's scope and utility. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract and opening sections: the claim of a 'systematic review' of the development from GANs to diffusion models to AR-based and multimodal techniques is not supported by any description of literature search protocol, databases, keywords, inclusion/exclusion criteria, or date range. This directly affects the central assertion that the selected body of work is representative and that the comparative strengths/limitations analysis is unbiased and comprehensive.

    Authors: We acknowledge that the manuscript asserts a 'systematic review' without providing an explicit literature search protocol, which is a valid observation. To resolve this, we will revise the abstract and introduction to include a dedicated 'Review Methodology' subsection. This addition will detail the databases consulted (arXiv, Google Scholar, IEEE Xplore, ACM Digital Library), search keywords (e.g., 'video generation GAN', 'diffusion models video synthesis', 'autoregressive video models', 'multimodal video generation'), date range (primarily 2014–2024 with key foundational works), inclusion criteria (peer-reviewed publications and high-impact preprints demonstrating technical novelty or empirical results), and exclusion criteria (non-English works, purely theoretical papers without empirical validation). We will also clarify how the comparative strengths/limitations analysis was derived from the collected literature. The existing GitHub repository already aggregates an extended reference list; we will update it to reflect the search process for transparency. These changes will directly support the representativeness claim. revision: yes
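
As a concrete rendering of the search step this response promises, here is a small sketch against the public arXiv API, using the rebuttal's stated keywords and date range. The script is our illustration of the protocol, not part of the paper or the planned revision.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace used by the arXiv API feed
KEYWORDS = [
    "video generation GAN",
    "diffusion models video synthesis",
    "autoregressive video models",
    "multimodal video generation",
]

def search_arxiv(query: str, max_results: int = 25) -> list[dict]:
    """Query the public arXiv API and return title/date pairs for one keyword."""
    url = ("http://export.arxiv.org/api/query?search_query="
           + urllib.parse.quote(f'all:"{query}"')
           + f"&start=0&max_results={max_results}")
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    return [
        {
            "title": " ".join(entry.findtext(f"{ATOM}title", "").split()),
            "published": entry.findtext(f"{ATOM}published", ""),
        }
        for entry in feed.iter(f"{ATOM}entry")
    ]

# Apply the rebuttal's stated date range (primarily 2014-2024) as an inclusion filter.
hits = [h for kw in KEYWORDS for h in search_arxiv(kw)
        if "2014" <= h["published"][:4] <= "2024"]
print(len(hits))
```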

Circularity Check

0 steps flagged

No circularity: descriptive survey with no derivations or self-referential predictions

full rationale

This paper is a literature review tracing video generation history from GANs through diffusion models to AR and multimodal approaches. It contains no equations, no fitted parameters, no predictions derived from internal data or models, and no derivation chain that reduces to its own inputs by construction. Claims rest on narrative summaries of external cited works rather than any self-definitional or fitted-input structure. Absence of explicit search protocol is a methodological limitation but does not create circularity under the defined patterns, as no load-bearing mathematical or predictive step collapses to the paper's own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a literature survey the paper introduces no new free parameters, axioms, or invented entities. It relies on standard domain assumptions about the progression of generative models in computer vision.

pith-pipeline@v0.9.0 · 5615 in / 1033 out tokens · 46995 ms · 2026-05-10T18:37:47.194741+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

297 extracted references · 145 canonical work pages · 25 internal anchors

  1. [1] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," NeurIPS, 2020.

  2. [2] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in CVPR, 2022.

  3. [3] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English, V. Voleti, A. Letts et al., "Stable Video Diffusion: Scaling latent video diffusion models to large datasets," arXiv preprint arXiv:2311.15127, 2023.

  4. [4] T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh, "Video generation models as world simulators," 2024. [Online]. Available: https://openai.com/research/video-generation-models-as-world-simulators

  5. [5] Google, "Veo3," https://deepmind.google/models/veo/, 2025.

  6. [6] Y. Gao, H. Guo, T. Hoang, W. Huang et al., "Seedance 1.0: Exploring the boundaries of video generation models," arXiv preprint arXiv:2506.09113, 2025.

  7. [7] W. Team, "Wan: Open and advanced large-scale video generative models," 2025.

  8. [8] W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang et al., "HunyuanVideo: A systematic framework for large video generative models," arXiv preprint arXiv:2412.03603, 2024.

  9. [9] C. Li, D. Huang, Z. Lu, Y. Xiao, Q. Pei, and L. Bai, "A survey on long video generation: Challenges, methods, and prospects," arXiv preprint arXiv:2403.16407, 2024.

  10. [10] W. Lei, J. Wang, F. Ma, G. Huang, and L. Liu, "A comprehensive survey on human video generation: Challenges, methods, and insights," arXiv preprint arXiv:2407.08428, 2024.

  11. [11] N. Aldausari, A. Sowmya, N. Marcus, and G. Mohammadi, "Video generative adversarial networks: A review," ACM Computing Surveys (CSUR), 2022.

  12. [12] Z. Xing, Q. Feng, H. Chen, Q. Dai, H. Hu, H. Xu, Z. Wu, and Y.-G. Jiang, "A survey on video diffusion models," ACM Computing Surveys, 2024.

  13. [13] A. Melnik, M. Ljubljanac, C. Lu, Q. Yan, W. Ren, and H. Ritter, "Video diffusion models: A survey," arXiv preprint arXiv:2405.03150, 2024.

  14. [14] W. Sun, R.-C. Tu, J. Liao, and D. Tao, "Diffusion model-based video editing: A survey," arXiv preprint arXiv:2407.07111, 2024.

  15. [15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in NeurIPS, 2014.

  16. [16] T. Karras, S. Laine, and T. Aila, "A style-based generator architecture for generative adversarial networks," in CVPR, 2019.

  17. [17] D. Kondratyuk, L. Yu, X. Gu, J. Lezama, J. Huang et al., "VideoPoet: A large language model for zero-shot video generation," in ICML, 2024.

  18. [18] X. Wang, X. Zhang, Z. Luo, Q. Sun, Y. Cui, J. Wang, F. Zhang, Y. Wang, Z. Li, Q. Yu et al., "Emu3: Next-token prediction is all you need," arXiv preprint arXiv:2409.18869, 2024.

  19. [19] C. Vondrick, H. Pirsiavash, and A. Torralba, "Generating videos with scene dynamics," 2016.

  20. [20] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, "Decomposing motion and content for natural video sequence prediction," arXiv preprint arXiv:1706.08033, 2017.

  21. [21] S. Tulyakov, M.-Y. Liu, X. Yang, and J. Kautz, "MoCoGAN: Decomposing motion and content for video generation," in CVPR, 2018.

  22. [22] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro, "Video-to-video synthesis," arXiv preprint arXiv:1808.06601, 2018.

  23. [23] A. Van Den Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," in ICML, 2016.

  24. [24] N. Kalchbrenner, A. van den Oord, K. Simonyan, I. Danihelka, O. Vinyals, A. Graves, and K. Kavukcuoglu, "Video pixel networks," in ICML, 2017.

  25. [25] S. Reed, A. van den Oord, N. Kalchbrenner, S. G. Colmenarejo, Z. Wang, Y. Chen, D. Belov, and N. de Freitas, "Parallel multiscale autoregressive density estimation," in ICML, 2017.

  26. [26] D. Weissenborn, O. Täckström, and J. Uszkoreit, "Scaling autoregressive video models," in ICLR, 2020.

  27. [27] W. Yan, Y. Zhang, P. Abbeel, and A. Srinivas, "VideoGPT: Video generation using VQ-VAE and transformers," 2021.

  28. [28] L. Yu, J. Lezama, N. B. Gundavarapu, L. Versari, K. Sohn, D. Minnen, Y. Cheng, A. Gupta, X. Gu, A. G. Hauptmann, B. Gong, M.-H. Yang, I. Essa, D. A. Ross, and L. Jiang, "Language model beats diffusion - tokenizer is key to visual generation," in ICLR, 2024.

  29. [29] A. Van Den Oord, O. Vinyals et al., "Neural discrete representation learning," NeurIPS, 2017.

  30. [30] L. Yu, Y. Cheng, K. Sohn, J. Lezama, H. Zhang, H. Chang, A. G. Hauptmann, M.-H. Yang, Y. Hao, I. Essa, and L. Jiang, "MAGVIT: Masked generative video transformer," in CVPR, 2023.

  31. [31] I. Skorokhodov, S. Tulyakov, and M. Elhoseiny, "StyleGAN-V: A continuous video generator with the price, image quality and perks of StyleGAN2," in CVPR, 2022.

  32. [32] H. Chen, Y. Zhang, X. Cun, M. Xia, X. Wang, C. Weng, and Y. Shan, "VideoCrafter2: Overcoming data limitations for high-quality video diffusion models," in CVPR, 2024.

  33. [33] W. Peebles and S. Xie, "Scalable diffusion models with transformers," in ICCV, 2023.

  34. [34] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in ICCV, 2015.

  35. [35] A. Clark, J. Donahue, and K. Simonyan, "Adversarial video generation on complex datasets," arXiv preprint arXiv:1907.06571, 2019.

  36. [36] M. Saito, E. Matsumoto, and S. Saito, "Temporal generative adversarial nets with singular value clipping," in ICCV, 2017.

  37. [37] J. L. Elman, "Finding structure in time," Cognitive Science, 1990.

  38. [38] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.

  39. [39] Y. Tian, J. Ren, M. Chai, K. Olszewski, X. Peng, D. N. Metaxas, and S. Tulyakov, "A good image generator is what you need for high-resolution video synthesis," arXiv preprint arXiv:2104.15069, 2021.

  40. [40] A. Munoz, M. Zolfaghari, M. Argus, and T. Brox, "Temporal shift GAN for large scale video generation," in WACV, 2021.

  41. [41] X. Liang, L. Lee, W. Dai, and E. P. Xing, "Dual motion GAN for future-flow embedded video prediction," in ICCV, 2017.

  42. [42] K. Ohnishi, S. Yamamoto, Y. Ushiku, and T. Harada, "Hierarchical video generation from orthogonal information: Optical flow and texture," in AAAI, 2018.

  43. [43] H. Dong, X. Liang, X. Shen, B. Wu, B.-C. Chen, and J. Yin, "FW-GAN: Flow-navigated warping GAN for video virtual try-on," in ICCV, 2019.

  44. [44] T. Karras, T. Aila, S. Laine, and J. Lehtinen, "Progressive growing of GANs for improved quality, stability, and variation," arXiv preprint arXiv:1710.10196, 2017.

  45. [45] D. Acharya, Z. Huang, D. P. Paudel, and L. Van Gool, "Towards high resolution video generation with progressive growing of sliced Wasserstein GANs," arXiv preprint arXiv:1810.02419, 2018.

  46. [46] S. Zeng, H. Zhang, Y. Chen, Z. Sheng, Z. Kang, and H. Li, "SWGAN: A new algorithm of adhesive rice image segmentation based on improved generative adversarial networks," Computers and Electronics in Agriculture.

  47. [47] J. Wu, Z. Huang, D. Acharya, W. Li, J. Thoma, D. P. Paudel, and L. V. Gool, "Sliced Wasserstein generative models," in CVPR, 2019.

  48. [48] S. Gur, S. Benaim, and L. Wolf, "Hierarchical patch VAE-GAN: Generating diverse videos from a single sample," in NeurIPS, 2020.

  49. [49] G. Fox, A. Tewari, M. Elgharib, and C. Theobalt, "StyleVideoGAN: A temporal generative model using a pretrained StyleGAN," arXiv preprint arXiv:2107.07224, 2021.

  50. [50] H. Qiu, Y. Jiang, H. Zhou, W. Wu, and Z. Liu, "StyleFaceV: Face video generation via decomposing and recomposing pretrained StyleGAN3," arXiv preprint arXiv:2208.07862, 2022.

  51. [51] F. Yin, Y. Zhang, X. Cun, M. Cao, Y. Fan, X. Wang, Q. Bai, B. Wu, J. Wang, and Y. Yang, "StyleHEAT: One-shot high-resolution editable talking face generation via pre-trained StyleGAN," in ECCV, 2022.

  52. [52] U. Singer, A. Polyak, T. Hayes, X. Yin, J. An, S. Zhang, Q. Hu, H. Yang, O. Ashual, O. Gafni et al., "Make-A-Video: Text-to-video generation without text-video data," arXiv preprint arXiv:2209.14792, 2022.

  53. [53] J. Ho, W. Chan, C. Saharia, J. Whang, R. Gao, A. A. Gritsenko, D. P. Kingma, B. Poole, M. Norouzi, D. J. Fleet, and T. Salimans, "Imagen Video: High definition video generation with diffusion models," arXiv, 2022.

  54. [54] Y. Wang, X. Chen, X. Ma, S. Zhou, Z. Huang, Y. Wang, C. Yang, Y. He, J. Yu, P. Yang et al., "LaVie: High-quality video generation with cascaded latent diffusion models," IJCV, 2024.

  55. [55] Y. He, T. Yang, Y. Zhang, Y. Shan, and Q. Chen, "Latent video diffusion models for high-fidelity long video generation," arXiv preprint arXiv:2211.13221, 2022.

  56. [56] A. Blattmann, R. Rombach, H. Ling, T. Dockhorn, S. W. Kim, S. Fidler, and K. Kreis, "Align your latents: High-resolution video synthesis with latent diffusion models," in CVPR, 2023.

  57. [57] D. J. Zhang, J. Z. Wu, J.-W. Liu, R. Zhao, L. Ran, Y. Gu, D. Gao, and M. Z. Shou, "Show-1: Marrying pixel and latent diffusion models for text-to-video generation," IJCV, 2024.

  58. [58] W. Wang, H. Yang, Z. Tuo, H. He, J. Zhu, J. Fu, and J. Liu, "VideoFactory: Swap attention in spatiotemporal diffusions for text-to-video generation," 2023.

  59. [59] J. Wang, H. Yuan, D. Chen, Y. Zhang, X. Wang, and S. Zhang, "ModelScope text-to-video technical report," arXiv preprint arXiv:2308.06571, 2023.

  60. [61] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen et al., "LoRA: Low-rank adaptation of large language models," ICLR, 2022.

  61. [62] J. Xing, M. Xia, Y. Zhang, H. Chen, W. Yu, H. Liu, G. Liu, X. Wang, Y. Shan, and T.-T. Wong, "DynamiCrafter: Animating open-domain images with video diffusion priors," in ECCV, 2024.

  62. [63] R. Girdhar, M. Singh, A. Brown, Q. Duval, S. Azadi, S. Rambhatla, A. Shah, X. Yin, D. Parikh, and I. Misra, "Emu Video: Factorizing text-to-video generation by explicit image conditioning," arXiv preprint arXiv:2311.10709, 2023.

  63. [64] Y. Wang, J. Bao, W. Weng, R. Feng, D. Yin, T. Yang, J. Zhang, Q. Dai, Z. Zhao, C. Wang et al., "MicroCinema: A divide-and-conquer approach for text-to-video generation," in CVPR, 2024.

  64. [65] Y. Zeng, G. Wei, J. Zheng, J. Zou, Y. Wei, Y. Zhang, and H. Li, "Make pixels dance: High-dynamic video generation," in CVPR, 2024.

  65. [66] H. Fei, S. Wu, W. Ji, H. Zhang, and T.-S. Chua, "Dysen-VDM: Empowering dynamics-aware text-to-video diffusion with LLMs," in CVPR, 2024.

  66. [67] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat et al., "GPT-4 technical report," arXiv preprint arXiv:2303.08774, 2023.

  67. [68] H. Lin, A. Zala, J. Cho, and M. Bansal, "VideoDirectorGPT: Consistent multi-scene video generation via LLM-guided planning," arXiv preprint arXiv:2309.15091, 2023.

  68. [69] D. Zhou, W. Wang, H. Yan, W. Lv, Y. Zhu, and J. Feng, "MagicVideo: Efficient video generation with latent diffusion models," arXiv preprint arXiv:2211.11018, 2022.

  69. [70] Z. Xing, Q. Dai, H. Hu, Z. Wu, and Y.-G. Jiang, "SimDA: Simple diffusion adapter for efficient video generation," in CVPR, 2024.

  70. [71] J. An, S. Zhang, H. Yang, S. Gupta, J.-B. Huang, J. Luo, and X. Yin, "Latent-Shift: Latent diffusion with temporal shift for efficient text-to-video generation," arXiv preprint arXiv:2304.08477, 2023.

  71. [72] W. Wang, J. Liu, Z. Lin, J. Yan, S. Chen, C. Low, T. Hoang, J. Wu, J. H. Liew, H. Yan et al., "MagicVideo-V2: Multi-stage high-aesthetic video generation," arXiv preprint arXiv:2401.04468, 2024.

  72. [73] S. Ge, S. Nah, G. Liu, T. Poon, A. Tao, B. Catanzaro, D. Jacobs, J.-B. Huang, M.-Y. Liu, and Y. Balaji, "Preserve your own correlation: A noise prior for video diffusion models," in ICCV, 2023.

  73. [74] Y. Ouyang, H. Zhao, G. Wang et al., "FlexiFilm: Long video generation with flexible conditions," arXiv preprint arXiv:2404.18620, 2024.

  74. [75] H. Lu, G. Yang, N. Fei, Y. Huo, Z. Lu, P. Luo, and M. Ding, "VDT: General-purpose video diffusion transformers via mask modeling," arXiv preprint arXiv:2305.13311, 2023.

  75. [76] X. Ma, Y. Wang, G. Jia, X. Chen, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao, "Latte: Latent diffusion transformer for video generation," arXiv preprint arXiv:2401.03048, 2024.

  76. [77] S. Chen, M. Xu, J. Ren, Y. Cong, S. He, Y. Xie, A. Sinha, P. Luo, T. Xiang, and J.-M. Perez-Rua, "GenTron: Diffusion transformers for image and video generation," in CVPR, 2024.

  77. [78] A. Gupta, L. Yu, K. Sohn, X. Gu, M. Hahn, F.-F. Li, I. Essa, L. Jiang, and J. Lezama, "Photorealistic video generation with diffusion models," in ECCV, 2024.

  78. [79] Y. HaCohen, N. Chiprut, B. Brazowski, D. Shalem, D. Moshe, E. Richardson, E. Levin, G. Shiran, N. Zabari, O. Gordon et al., "LTX-Video: Realtime video latent diffusion," arXiv preprint arXiv:2501.00103, 2024.

  79. [80] W. Menapace, A. Siarohin, I. Skorokhodov, E. Deyneka, T.-S. Chen, A. Kag, Y. Fang, A. Stoliar, E. Ricci, J. Ren et al., "Snap Video: Scaled spatiotemporal transformers for text-to-video synthesis," in CVPR, 2024.

  80. [81] T. Karras, M. Aittala, T. Aila, and S. Laine, "Elucidating the design space of diffusion-based generative models," NeurIPS, 2022.

Showing first 80 references.