pith. machine review for the scientific record.

arxiv: 2211.11018 · v2 · submitted 2022-11-20 · 💻 cs.CV

Recognition: 2 theorem links

MagicVideo: Efficient Video Generation With Latent Diffusion Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 18:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords text-to-video generation · latent diffusion models · efficient video synthesis · 3D U-Net · temporal attention · VAE latent space · single-GPU inference

The pith

MagicVideo generates 256x256 text-to-video clips on a single GPU using 64 times fewer computations than prior video diffusion models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MagicVideo as a text-to-video framework that first encodes input clips into a compressed latent space with a pre-trained VAE instead of operating directly on RGB pixels. A diffusion model then learns the distribution of these latent codes through a modified 3D U-Net that incorporates a frame-wise adaptor to adjust from image pre-training and a directed temporal attention module to enforce coherence across frames. This design exploits existing image-model weights and avoids full high-dimensional training, allowing the entire process to run on one GPU card. A separate VideoVAE is added to improve reconstruction and reduce pixel-level artifacts in the final output. The result is claimed to produce smooth, text-aligned video clips at 256x256 resolution while cutting FLOPs by a factor of roughly 64 relative to earlier video diffusion approaches.
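The pipeline described above can be traced end-to-end with toy stand-ins. The pooling "encoder", the 4-channel 32x32 latent shape, and the shape-preserving "denoiser" below are illustrative assumptions chosen to make the tensor shapes concrete; they are not the paper's actual modules:

```python
import numpy as np

def vae_encode(frames):
    """Stand-in per-frame encoder: 8x spatial average-pooling, 4 latent channels (assumed)."""
    t, c, h, w = frames.shape
    pooled = frames.reshape(t, c, h // 8, 8, w // 8, 8).mean(axis=(3, 5))
    # Fake 3 -> 4 channel projection by appending a mean channel.
    return np.concatenate([pooled, pooled.mean(axis=1, keepdims=True)], axis=1)

def denoise_step(latents):
    """Placeholder for the 3D U-Net denoiser: shape-preserving only."""
    return latents * 0.9  # a real model would predict and subtract noise here

def vae_decode(latents):
    """Stand-in decoder: drop the extra channel, nearest-neighbour 8x upsample."""
    return latents[:, :3].repeat(8, axis=2).repeat(8, axis=3)

video = np.random.rand(16, 3, 256, 256)  # 16 frames of 256x256 RGB
z = vae_encode(video)                    # (16, 4, 32, 32): 48x fewer elements
z = denoise_step(z)                      # diffusion runs entirely in this space
out = vae_decode(z)                      # back to (16, 3, 256, 256)
print(z.shape, out.shape)
```

The point of the sketch is the shape arithmetic: every diffusion step touches a (16, 4, 32, 32) tensor instead of a (16, 3, 256, 256) one, which is where the claimed savings come from.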

Core claim

By mapping video clips to a low-dimensional latent space via a pre-trained VAE and training a diffusion model on that space with a 3D U-Net augmented by a frame-wise lightweight adaptor and directed temporal attention, MagicVideo can synthesize 256x256-resolution video clips from text prompts on a single GPU card, using approximately 64 times fewer FLOPs than Video Diffusion Models while maintaining temporal coherence and visual quality.
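The 64x figure is consistent with simple convolution accounting: per-layer conv cost scales with the number of spatial positions, and an 8x-per-axis downsampling (256 to 32) leaves 1/64 as many positions. The kernel size and channel counts below are arbitrary placeholders; under this accounting the ratio is independent of them:

```python
def conv_flops(frames, h, w, c_in, c_out, k=3):
    """Multiply-accumulates for one kxk conv applied to every frame (toy accounting)."""
    return frames * h * w * c_in * c_out * k * k

frames = 16
pixel_space = conv_flops(frames, 256, 256, 64, 64)  # denoiser at RGB resolution
latent_space = conv_flops(frames, 32, 32, 64, 64)   # same conv after 8x VAE downsampling
ratio = pixel_space / latent_space
print(ratio)  # 64.0 = (256 / 32) ** 2
```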

What carries the argument

3D U-Net denoiser operating in VAE-compressed latent space, extended with a frame-wise adaptor for image-to-video adjustment and directed temporal attention to model frame dependencies.
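"Directed" is most naturally read as a causal constraint: each frame attends to itself and earlier frames only. A minimal masked-softmax sketch of that reading (the mask direction, single-head layout, and per-pixel factorization are assumptions, since the paper's exact formulation is not reproduced here):

```python
import numpy as np

def directed_temporal_attention(q, k, v):
    """Attention over the frame axis with a causal (lower-triangular) mask."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)                # (t, t) frame-to-frame affinities
    mask = np.tril(np.ones((t, t), dtype=bool))  # frame i sees frames 0..i only
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

rng = np.random.default_rng(0)
t, d = 8, 16
out, w = directed_temporal_attention(rng.normal(size=(t, d)),
                                     rng.normal(size=(t, d)),
                                     rng.normal(size=(t, d)))
```

In a real denoiser this would run per spatial location inside the 3D U-Net; here a single (t, d) sequence stands in for one pixel's features across frames.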

If this is right

  • Text-to-video training can reuse weights from large image diffusion models, shortening the video-specific training phase.
  • Single-GPU inference makes on-device or consumer-level video generation practical for short clips.
  • The VideoVAE reconstruction step can be swapped or refined independently to target specific artifact types such as dithering.
  • The latent-space approach scales the same U-Net architecture to higher resolutions without a proportional explosion in memory or compute.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same latent-space plus adaptor pattern could be applied to other high-dimensional generation tasks such as 3D scene synthesis or longer video sequences.
  • If the VAE latent representation proves robust across domains, the method may generalize beyond the reported realistic and imaginary content examples without retraining the core diffusion backbone.
  • Direct comparison of reconstruction fidelity between the proposed VideoVAE and standard image VAEs on video data would quantify how much the temporal consistency gains come from the new auto-encoder.

Load-bearing premise

A pre-trained VAE can compress video clips into a low-dimensional latent space that still preserves enough spatial and temporal information for the diffusion model to reconstruct high-fidelity, coherent videos without major artifacts.
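This premise is directly testable with a reconstruction round-trip: encode, decode, and measure PSNR against the input. The deliberately crude stand-in below (average-pool down, nearest-neighbour up, no learned VAE) illustrates the failure mode the premise rules out: low-frequency content survives 8x compression, high-frequency detail does not:

```python
import numpy as np

def roundtrip_psnr(video, factor=8):
    """Encode by average-pooling, decode by nearest-neighbour upsampling, report PSNR."""
    t, c, h, w = video.shape
    z = video.reshape(t, c, h // factor, factor, w // factor, factor).mean(axis=(3, 5))
    rec = z.repeat(factor, axis=2).repeat(factor, axis=3)
    mse = np.mean((video - rec) ** 2)
    return 10 * np.log10(1.0 / mse)  # pixel values assumed scaled to [0, 1]

rng = np.random.default_rng(0)
smooth = np.tile(np.linspace(0, 1, 256), (4, 3, 256, 1))  # low-frequency gradient frames
noisy = rng.random((4, 3, 256, 256))                      # high-frequency content
print(roundtrip_psnr(smooth), roundtrip_psnr(noisy))      # the gradient survives, noise does not
```

A real check would substitute the pre-trained VAE and the proposed VideoVAE for the pooling pair and compare their PSNR gap on actual video frames.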

What would settle it

Measure actual FLOPs and visual quality (temporal coherence scores or side-by-side user ratings) when generating identical 256x256 clips from the same text prompts on the same single GPU hardware using both MagicVideo and the original Video Diffusion Models.
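A temporal-coherence score of the kind mentioned here does not need a learned metric to be useful as a first pass; mean frame-to-frame distance already separates smooth clips from incoherent ones. This is an illustrative proxy, not a metric from the paper:

```python
import numpy as np

def temporal_coherence(video):
    """Mean L2 distance between consecutive frames; lower means smoother motion."""
    diffs = np.diff(video, axis=0)  # (t-1, c, h, w)
    return float(np.sqrt((diffs ** 2).mean(axis=(1, 2, 3))).mean())

rng = np.random.default_rng(1)
base = rng.random((3, 64, 64))
smooth_clip = np.stack([base + 0.01 * i for i in range(8)])  # slow uniform drift
jumpy_clip = rng.random((8, 3, 64, 64))                      # independent frames
print(temporal_coherence(smooth_clip), temporal_coherence(jumpy_clip))
```

Lower is smoother; in practice this would be paired with a text-alignment score, since a frozen clip is maximally "coherent" by this measure.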

read the original abstract

We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are concordant with the given text descriptions. Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, which takes around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs. In specific, unlike existing works that directly train video models in the RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of videos' latent codes via a diffusion model. Besides, we introduce two new designs to adapt the U-Net denoiser trained on image tasks to video data: a frame-wise lightweight adaptor for the image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies across frames. Thus, we can exploit the informative weights of convolution operators from a text-to-image model for accelerating video training. To ameliorate the pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. Refer to \url{https://magicvideo.github.io/#} for more examples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces MagicVideo, a text-to-video generation framework based on latent diffusion models. It encodes video clips into a low-dimensional latent space using a pre-trained VAE, then trains a diffusion model on these latents with a custom 3D U-Net that incorporates a frame-wise lightweight adaptor and a directed temporal attention module to adapt image-pretrained weights for video. A novel VideoVAE is proposed to reduce pixel dithering in reconstruction. The central claim is that this enables synthesis of 256x256 video clips on a single GPU with approximately 64x fewer FLOPs than Video Diffusion Models (VDM), while producing high-quality outputs concordant with text prompts.

Significance. If the efficiency and quality claims are substantiated, the work would represent a meaningful advance in making high-resolution video generation computationally accessible, by extending latent diffusion techniques from images to video via targeted architectural adaptations. The reuse of image-pretrained convolutions and the explicit handling of temporal dependencies address practical barriers in scaling video diffusion. The project page with examples aids qualitative assessment, though the absence of reported quantitative metrics limits immediate impact assessment.

major comments (3)
  1. [Abstract] Abstract: The headline claim of ~64x fewer FLOPs versus VDM is presented without any explicit calculation, baseline FLOPs values, or model configuration details (e.g., number of frames, latent dimensions, or U-Net channel counts). This renders the central efficiency result unverifiable from the manuscript and directly affects the soundness of the efficiency contribution.
  2. [Method] Method (latent space modeling): The pipeline assumes a pre-trained image VAE maps video clips into a latent space that preserves sufficient spatial-temporal information for coherent synthesis. However, the introduction of a separate VideoVAE to ameliorate pixel dithering indicates that standard VAE reconstruction already discards fine details; no ablation quantifies how much temporal dynamics are lost in the latent codes or whether the adaptor + temporal attention fully compensates.
  3. [Experiments] Experiments: The abstract states that 'extensive experiments' demonstrate high-quality generation, yet no quantitative metrics (FID, FVD, CLIP similarity), baseline comparisons, or training details (dataset size, epochs, learning rate) are referenced. This absence makes it impossible to assess whether the claimed quality holds relative to pixel-space or other latent video diffusion baselines.
minor comments (1)
  1. [Abstract] Abstract and method: The term 'directed temporal attention' is introduced without a precise equation or diagram reference; adding a short formal definition or pseudocode would improve clarity for readers familiar with standard attention mechanisms.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments on the efficiency claim, latent modeling assumptions, and experimental reporting. We address each point below and will revise the manuscript to improve verifiability and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The headline claim of ~64x fewer FLOPs versus VDM is presented without any explicit calculation, baseline FLOPs values, or model configuration details (e.g., number of frames, latent dimensions, or U-Net channel counts). This renders the central efficiency result unverifiable from the manuscript and directly affects the soundness of the efficiency contribution.

    Authors: We agree the abstract states the ~64x FLOPs reduction without a supporting breakdown. In the revised manuscript we will add an explicit calculation (in the main text or appendix) that reports the FLOPs for both MagicVideo and VDM under identical settings, including the number of frames, latent spatial-temporal dimensions, U-Net channel counts, and the precise formula used. This will make the efficiency claim directly verifiable. revision: yes

  2. Referee: [Method] Method (latent space modeling): The pipeline assumes a pre-trained image VAE maps video clips into a latent space that preserves sufficient spatial-temporal information for coherent synthesis. However, the introduction of a separate VideoVAE to ameliorate pixel dithering indicates that standard VAE reconstruction already discards fine details; no ablation quantifies how much temporal dynamics are lost in the latent codes or whether the adaptor + temporal attention fully compensates.

    Authors: The pre-trained image VAE is chosen for computational efficiency, and the VideoVAE is introduced precisely to mitigate visible reconstruction artifacts such as dithering. We did not include a dedicated quantitative ablation measuring temporal information loss in the latent codes. The directed temporal attention module is intended to recover temporal coherence; we will expand the method discussion to clarify this design rationale and add qualitative evidence of motion consistency from our experiments. A full numerical ablation on temporal loss would require new experiments and is noted as a possible extension. revision: partial

  3. Referee: [Experiments] Experiments: The abstract states that 'extensive experiments' demonstrate high-quality generation, yet no quantitative metrics (FID, FVD, CLIP similarity), baseline comparisons, or training details (dataset size, epochs, learning rate) are referenced. This absence makes it impossible to assess whether the claimed quality holds relative to pixel-space or other latent video diffusion baselines.

    Authors: We recognize that quantitative metrics would allow direct comparison with prior work. The current manuscript prioritizes qualitative demonstration and efficiency, with additional examples on the project page. In the revision we will report quantitative results (FVD, CLIP text-video similarity) together with baseline comparisons where feasible, and we will include the missing training details: dataset size and source, number of epochs, batch size, and learning rate schedule. revision: yes

Circularity Check

0 steps flagged

No circularity; efficiency follows directly from latent-space design choice

full rationale

The derivation chain consists of an explicit architectural decision (run 3D U-Net diffusion inside a pre-trained VAE latent space instead of pixel space) plus two new modules (frame-wise adaptor and directed temporal attention) whose roles are described without reference to fitted parameters or self-citations. The 64x FLOPs reduction is a straightforward consequence of the dimensionality reduction performed by the VAE, not a quantity that is fitted and then re-labeled as a prediction. The introduction of VideoVAE is presented as an empirical remedy for observed dithering rather than a hidden definitional step. No equations, uniqueness theorems, or ansatzes reduce to the paper's own inputs by construction; the method therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the effectiveness of a pre-trained VAE for video compression and on the transferability of image diffusion weights; these are treated as given rather than re-derived.

axioms (1)
  • domain assumption A pre-trained VAE can compress video clips into a low-dimensional latent space while retaining sufficient information for high-quality reconstruction.
    Invoked when the authors map videos to latent codes before diffusion training.

pith-pipeline@v0.9.0 · 5566 in / 1153 out tokens · 32798 ms · 2026-05-15T18:43:06.063034+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI

    cs.CV 2026-04 unverdicted novelty 7.0

    NeuroQuant is a modality-aware 3D VQ-VAE that uses dual-stream encoding, a shared anatomical codebook, and FiLM to achieve superior multi-modal brain MRI reconstruction.

  2. Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards

    cs.CV 2026-03 unverdicted novelty 7.0

    SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.

  3. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  4. FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity

    cs.CV 2026-05 unverdicted novelty 6.0

    FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.

  5. Stream-T1: Test-Time Scaling for Streaming Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...

  6. From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation

    cs.CV 2026-04 unverdicted novelty 6.0

    Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.

  7. VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness

    cs.CV 2025-03 accept novelty 6.0

    VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...

  8. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  9. VideoPoet: A Large Language Model for Zero-Shot Video Generation

    cs.CV 2023-12 unverdicted novelty 6.0

    VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.

  10. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  11. VideoCrafter1: Open Diffusion Models for High-Quality Video Generation

    cs.CV 2023-10 unverdicted novelty 6.0

    Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.

  12. MVDream: Multi-view Diffusion for 3D Generation

    cs.CV 2023-08 conditional novelty 6.0

    MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.

  13. Latent Video Diffusion Models for High-Fidelity Long Video Generation

    cs.CV 2022-11 unverdicted novelty 6.0

    Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.

  14. Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

    cs.CV 2026-05 unverdicted novelty 5.0

    Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.

  15. Not all tokens contribute equally to diffusion learning

    cs.CV 2026-04 unverdicted novelty 5.0

    DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.

  16. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  17. ModelScope Text-to-Video Technical Report

    cs.CV 2023-08 unverdicted novelty 4.0

    ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

  18. Evolution of Video Generative Foundations

    cs.CV 2026-04 unverdicted novelty 2.0

    This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.

  19. Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models

    cs.CV 2024-02 unverdicted novelty 2.0

    The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · cited by 19 Pith papers · 14 internal anchors

  1. [1]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  2. [2]

    FitVid: Overfitting in Pixel-Level Video Prediction

    Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. FitVid: Overfitting in Pixel-Level Video Prediction, June 2021. arXiv:2106.13195

  3. [3]

    Frozen in time: A joint video and image encoder for end-to-end retrieval

    Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021

  4. [4]

    Multimodal datasets: misogyny, pornography, and malignant stereotypes

    Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021

  5. [5]

    Quo vadis, action recognition? A new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

  6. [6]

    Adversarial Video Generation on Complex Datasets

    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial Video Generation on Complex Datasets, Sept. 2019. arXiv:1907.06571

  7. [7]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  8. [8]

    Sigmoid-weighted linear units for neural network function approximation in reinforcement learning

    Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018

  9. [9]

    Flexible Diffusion Modeling of Long Videos

    William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible Diffusion Modeling of Long Videos. Technical Report arXiv:2205.11495, arXiv, May 2022. arXiv:2205.11495

  10. [10]

    Prompt-to-Prompt Image Editing with Cross Attention Control

    Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-Prompt Image Editing with Cross Attention Control, Aug. 2022. arXiv:2208.01626

  11. [11]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High Definition Video Generation with Diffusion Models, Oct.

  12. [12]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models, Dec. 2020. arXiv:2006.11239

  13. [13]

    Video Diffusion Models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022

  14. [14]

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video Diffusion Models. 2022

  15. [15]

    CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

  16. [16]

    Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022

    Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers, 2022

  17. [17]

    Video Pixel Networks

    Nal Kalchbrenner, Aäron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video Pixel Networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1771–1779. PMLR, July 2017. ISSN: 2640-3498

  18. [18]

    Imagic: Text-Based Real Image Editing with Diffusion Models

    Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-Based Real Image Editing with Diffusion Models, Oct. 2022. arXiv:2210.09276

  19. [19]

    VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation

    Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A Conditional Flow-Based Model for Stochastic Video Generation. Mar. 2020

  20. [20]

    MagicMix: Semantic Mixing with Diffusion Models

    Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. MagicMix: Semantic Mixing with Diffusion Models, Oct. 2022. arXiv:2210.16056

  21. [21]

    Frozen clip models are efficient video learners

    Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen clip models are efficient video learners. arXiv preprint arXiv:2208.03550, 2022

  22. [22]

    RePaint: Inpainting Using Denoising Diffusion Probabilistic Models

    Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting Using Denoising Diffusion Probabilistic Models. page 11

  23. [23]

    Deep multi-scale video prediction beyond mean square error

    Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error, Feb. 2016. arXiv:1511.05440

  24. [24]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations. Technical Report arXiv:2108.01073, arXiv, Jan. 2022. arXiv:2108.01073

  25. [25]

    On aliased resizing and surprising subtleties in gan evaluation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in gan evaluation. In CVPR, 2022

  26. [26]

    DreamFusion: Text-to-3D using 2D Diffusion

    Ben Poole, Ajay Jain, Jonathan T Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv preprint arXiv:2209.14988, 2022

  27. [27]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021

  28. [28]

    Learning transferable visual models from natural language supervision, 2021

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021

  29. [29]

    Hierarchical text-conditional image generation with clip latents, 2022

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents, 2022

  30. [30]

    Video (language) modeling: a baseline for generative models of natural videos

    Marc'Aurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos, May 2016. arXiv:1412.6604

  31. [31]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. Advances in neural information processing systems, 32, 2019

  33. [33]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–

  34. [34]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models, Apr. 2022. arXiv:2112.10752

  35. [35]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022

  36. [36]

    Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022

  37. [37]

    Photorealistic text-to-image diffusion models with deep language understanding, 2022

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding, 2022

  38. [38]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022

  39. [39]

    LAION-5B: A new era of open large-scale multi-modal datasets

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Theo Coombes, Cade Gordon, Aarush Katta, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: A new era of open large-scale multi-modal datasets. https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/, 2022

  40. [40]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-Video Generation without Text-Video Data, Sept. 2022. arXiv:2209.14792

  41. [41]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. page 36, 2021

  42. [42]

    UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012

  43. [43]

    A good image generator is what you need for high-resolution video synthesis

    Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis. ICLR, 2021

  44. [44]

    A closer look at spatiotemporal convolutions for action recognition

    Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018

  45. [45]

    MoCoGAN: Decomposing Motion and Content for Video Generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing Motion and Content for Video Generation, Dec. 2017. arXiv:1707.04993

  46. [46]

    Fvd: A new metric for video generation

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Fvd: A new metric for video generation. 2019

  47. [47]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  48. [48]

    Generating Videos with Scene Dynamics

    Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating Videos with Scene Dynamics, Oct.

  49. [49]

    Nüwa: Visual synthesis pre-training for neural visual world creation

    Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. Nüwa: Visual synthesis pre-training for neural visual world creation. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI, pages 720–736. Springer, 2022

  50. [50]

    Msr-vtt: A large video description dataset for bridging video and language

    Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016

  51. [51]

    Advancing high-resolution video-language representation with large-scale video transcriptions

    Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, pages 5036–5045, 2022

  52. [52]

    Understanding the robustness in vision transformers

    Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International Conference on Machine Learning, pages 27378–27394. PMLR, 2022