pith. machine review for the scientific record

arxiv: 2204.03458 · v2 · submitted 2022-04-07 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 14:33 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords diffusion models · video generation · text-to-video · generative modeling · video prediction · unconditional generation · conditional sampling

The pith

A diffusion model extended from images generates high-fidelity coherent videos using joint training and conditional sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that video generation can be achieved by extending standard image diffusion models without major new architectures for motion. Joint training on images and videos reduces gradient variance and accelerates learning. A novel conditional sampling method enables extending videos in space and time more effectively than prior techniques. These advances produce the first strong results on text-conditioned video generation at scale and set new records on prediction and unconditional generation tasks. The work matters because coherent video synthesis is a foundational capability for applications in media, simulation, and creative tools.

Core claim

The central discovery is that a diffusion model for video, built as a natural extension of image diffusion architectures, supports joint training on image and video data, which reduces minibatch gradient variance and speeds optimization. Combined with a new conditional sampling technique for spatial and temporal extension that outperforms previous methods, this yields the first results on large-scale text-conditioned video generation and state-of-the-art performance on video prediction and unconditional video generation benchmarks.
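The joint-training mechanism is simple to state: independent images are packed into each batch as extra "frames" and temporal attention is masked so they never mix with real video frames. A minimal sketch of such a mask — an illustrative reconstruction under that assumption, not the authors' code; `temporal_attention_mask` is a hypothetical helper:

```python
import numpy as np

def temporal_attention_mask(num_video_frames: int, num_image_frames: int) -> np.ndarray:
    """(T, T) boolean mask for temporal attention in a joint image-video batch.

    Independent images are appended as extra "frames" after the video clip;
    masking temporal attention keeps them isolated, so image data trains the
    spatial layers without introducing spurious motion.
    """
    T = num_video_frames + num_image_frames
    mask = np.zeros((T, T), dtype=bool)
    mask[:num_video_frames, :num_video_frames] = True  # video frames attend to each other
    for i in range(num_video_frames, T):
        mask[i, i] = True                              # each image attends only to itself
    return mask

mask = temporal_attention_mask(num_video_frames=4, num_image_frames=2)
```

Because the appended images are drawn independently of the video clip, mixing them into the minibatch is one plausible reading of how the gradient-variance reduction arises.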

What carries the argument

The video diffusion model, which applies the image diffusion process to video sequences with added conditioning and a specialized sampling procedure for extending clips.
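The extension procedure can be pictured as a reconstruction-guided denoising step: the estimate for the new frames is nudged downhill on the reconstruction error of the conditioning frames. The toy numpy rendering below is a sketch under stated assumptions (a finite-difference gradient and a made-up `toy_denoise`, both for illustration), not the paper's implementation:

```python
import numpy as np

def guided_sample_step(z_t, x_a, denoise, alpha_t, n_cond, weight=2.0, eps=1e-4):
    """One reconstruction-guided adjustment of the denoised estimate.

    z_t     : noisy latents for all T frames, shape (T, D)
    x_a     : clean conditioning frames, shape (n_cond, D)
    denoise : the learned denoiser z_t -> x_hat (a toy stand-in below)

    Sketch of the idea only: the estimate for the frames being generated is
    pushed down the gradient of the conditioning frames' reconstruction error.
    The gradient is taken by finite differences purely for illustration.
    """
    x_hat = denoise(z_t)
    base = np.sum((x_a - x_hat[:n_cond]) ** 2)
    grad = np.zeros_like(z_t[n_cond:])
    for i in range(grad.shape[0]):
        for j in range(grad.shape[1]):
            z_pert = z_t.copy()
            z_pert[n_cond + i, j] += eps
            grad[i, j] = (np.sum((x_a - denoise(z_pert)[:n_cond]) ** 2) - base) / eps
    return x_hat[n_cond:] - (weight * alpha_t / 2.0) * grad

def toy_denoise(z):
    # Hypothetical denoiser that couples frames through their mean, so the
    # conditioning error actually depends on the latents being generated.
    return 0.5 * z + 0.5 * z.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
z_t = rng.normal(size=(4, 3))     # 2 conditioning frames + 2 new frames
x_a = rng.normal(size=(2, 3))
guided = guided_sample_step(z_t, x_a, toy_denoise, alpha_t=0.9, n_cond=2)
```

The same step applied along the temporal axis extends clips and, along the spatial axis, upsamples them, which is how one procedure covers both kinds of extension.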

Load-bearing premise

The assumption that standard image diffusion architectures, with only joint training and new sampling, suffice to produce temporally coherent video without dedicated motion modeling components.

What would settle it

Observing persistent temporal inconsistencies, such as flickering or object disappearance, in generated videos involving complex motions like human actions or camera movements would indicate the extension is insufficient.

read the original abstract

Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes extending image diffusion models to video generation via a 3D U-Net architecture that employs space-time factorized convolutions and attention. It shows that joint training on image and video data stabilizes gradients and accelerates optimization, introduces a conditional sampling procedure for spatial and temporal video extension, and reports state-of-the-art results on video prediction (BAIR) and unconditional generation (Kinetics) benchmarks together with the first results on a large-scale text-conditioned video generation task.
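The factorized attention the summary refers to interleaves a spatial pass with a temporal pass. A minimal single-head sketch, without the learned query/key/value projections or the U-Net around it — illustrative only, not the paper's architecture:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Single-head self-attention over the second-to-last axis of x,
    with learned projections omitted for brevity."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_space_time_attention(x):
    """Space-time factorized attention: attend over the S spatial positions
    within each frame, then over the T frames at each spatial position.
    x has shape (T, S, D)."""
    x = attend(x)                   # spatial pass: mixes over S per frame
    x = np.swapaxes(x, 0, 1)        # -> (S, T, D)
    x = attend(x)                   # temporal pass: mixes over T per position
    return np.swapaxes(x, 0, 1)     # -> (T, S, D)

out = factorized_space_time_attention(np.random.default_rng(1).normal(size=(3, 4, 8)))
```

Skipping the temporal pass for appended image "frames" is what makes the joint image-video training described in the pith drop in naturally.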

Significance. If the empirical claims hold, the work demonstrates that diffusion models can produce temporally coherent high-fidelity video with only modest architectural extensions from image models, joint training provides measurable optimization benefits, and the new sampling technique outperforms prior extension methods. These outcomes would establish a strong baseline for text-to-video generation and influence subsequent multimodal diffusion research.

major comments (2)
  1. [§4.2, Table 2] The SOTA claims on BAIR and Kinetics rest on single-run FVD and PSNR numbers without reported standard deviations or multiple random seeds; this makes it impossible to determine whether the reported margins over prior methods are statistically reliable.
  2. [§3.3, §4.3] The ablation on joint image-video training shows reduced gradient variance but provides no quantitative comparison of final sample quality (FVD or human preference) between joint and video-only training; this leaves the central claim that joint training is beneficial for generation quality unverified.
minor comments (2)
  1. [Figure 3] The caption does not specify the exact conditioning strength or number of extension steps used for the long-video examples, making reproduction difficult.
  2. [§2.2] The description of the space-time factorized attention should explicitly state its computational cost relative to full 3D attention.
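To make the second minor point concrete, here is a back-of-envelope count of attention-score entries at an illustrative feature-map size (the sizes are chosen for illustration, not taken from the paper):

```python
# Illustrative sizes: 16 frames of a 64x64 feature map.
T, H, W = 16, 64, 64
S = H * W                          # spatial positions per frame

full_3d = (T * S) ** 2             # one attention matrix over all T*S tokens
factorized = T * S**2 + S * T**2   # per-frame spatial + per-position temporal

print(full_3d / factorized)        # roughly 16x fewer score entries at these sizes
```

The quadratic term in the token count is what makes full 3D attention prohibitive as clips get longer, which is presumably why the referee wants the trade-off stated explicitly.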

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4.2, Table 2] The SOTA claims on BAIR and Kinetics rest on single-run FVD and PSNR numbers without reported standard deviations or multiple random seeds; this makes it impossible to determine whether the reported margins over prior methods are statistically reliable.

    Authors: We agree that multiple independent runs with reported standard deviations would provide stronger statistical grounding for the SOTA claims. Training these large-scale video diffusion models is computationally intensive, which is why we followed the standard practice in the field of reporting single-run results for such experiments. The observed margins are substantial (e.g., large FVD reductions on both BAIR and Kinetics), making it unlikely that run-to-run variance would alter the rankings. In the revised manuscript we will add an explicit discussion of this limitation in §4.2, including a note on the single-run nature of the results and the size of the reported improvements. revision: partial

  2. Referee: [§3.3, §4.3] The ablation on joint image-video training shows reduced gradient variance but provides no quantitative comparison of final sample quality (FVD or human preference) between joint and video-only training; this leaves the central claim that joint training is beneficial for generation quality unverified.

    Authors: The central claim in §3.3 and the abstract is that joint image-video training reduces minibatch gradient variance and accelerates optimization; we did not claim or demonstrate a direct improvement in final sample quality metrics such as FVD. The optimization benefit is presented as the primary advantage. We will revise §4.3 to clarify this scope and add a short discussion of how faster convergence can indirectly support higher-quality generation within fixed compute budgets. No new quantitative FVD comparison between joint and video-only training will be added, as that would require additional large-scale experiments beyond the scope of the current work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks

full rationale

The paper extends standard image diffusion to video via a 3D U-Net with space-time factorized convolutions/attention, joint image-video training for gradient stability, and a conditional sampling procedure for spatial/temporal extension. All reported outcomes (FVD, PSNR on BAIR/Kinetics, first text-to-video results) are measured against independent external benchmarks and prior methods, with no equations or claims reducing performance to internally fitted parameters, self-defined quantities, or load-bearing self-citations. The derivation of the reverse process and conditioning follows the established diffusion framework without internal reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard diffusion model assumptions from prior image work plus the domain assumption that video can be handled by the same forward/reverse process with added temporal conditioning.

free parameters (1)
  • noise schedule and conditioning hyperparameters
    Typical diffusion training choices that are selected or tuned for the video task.
axioms (1)
  • domain assumption: The diffusion forward process and learned reverse process can be directly applied to video frames while preserving temporal coherence.
    Invoked when stating the model is a natural extension of the image diffusion architecture.
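Written out in the continuous-time notation that the paper's diffusion framework inherits (a standard formulation, not quoted from the paper), the axiom amounts to applying the image forward process to the whole stack of frames $x$ at once:

```latex
% Forward process applied unchanged to a video x (all frames jointly):
q(z_t \mid x) = \mathcal{N}\!\left(z_t;\ \alpha_t\, x,\ \sigma_t^2 I\right),
\qquad 0 \le t \le 1 .
```

Nothing in $q$ itself models motion: the corruption is per-pixel Gaussian, so temporal coherence must come entirely from the learned reverse process, which is exactly what the load-bearing premise asserts.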

pith-pipeline@v0.9.0 · 5430 in / 1159 out tokens · 38654 ms · 2026-05-13T14:33:29.620042+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  4. AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

    cs.MM 2026-04 unverdicted novelty 7.0

    AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...

  5. Speculative Decoding for Autoregressive Video Generation

    cs.CV 2026-04 conditional novelty 7.0

    A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...

  6. Score Shocks: The Burgers Equation Structure of Diffusion Generative Models

    cond-mat.stat-mech 2026-04 unverdicted novelty 7.0

    The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.

  7. Physics-Aware Video Instance Removal Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.

  8. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  9. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  10. Human Motion Diffusion Model

    cs.CV 2022-09 unverdicted novelty 7.0

    MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.

  11. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  12. Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

  13. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

  14. Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

    cs.AI 2026-05 unverdicted novelty 6.0

    Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.

  15. DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...

  16. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  17. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  18. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  19. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  20. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  21. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  22. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    cs.CV 2022-05 unverdicted novelty 5.0

    CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.

  23. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  24. Watching Physics: the Generative Science of Matter and Motion

    cs.CE 2026-04 unverdicted novelty 4.0

    Generative video models recover physical quantities like surface strain from visible motion when coupled with experiments and simulations, but fail when internal variables dominate, defining a new Generative Science o...

  25. Discrete Meanflow Training Curriculum

    cs.LG 2026-04 unverdicted novelty 4.0

    A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.

  26. ModelScope Text-to-Video Technical Report

    cs.CV 2023-08 unverdicted novelty 4.0

    ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 26 Pith papers · 5 internal anchors

  1. [1]

    TensorFlow Datasets

    TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets, 2022

  2. [2]

    ViViT: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021

  3. [3]

    Stochastic Variational Video Prediction

    Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017

  4. [4]

    FitVid: Overfitting in pixel-level video prediction

    Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. FitVid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2021

  5. [5]

    Is space-time attention all you need for video understanding

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding. arXiv preprint arXiv:2102.05095, 2(3):4, 2021

  6. [6]

    Gender shades: Intersectional accuracy disparities in commercial gender classification

    Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA , Proceedings of Machine Learning Research. PMLR, 2018

  7. [7]

    Women also snowboard: Overcoming bias in captioning models

    Kaylee Burns, Lisa Hendricks, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision (ECCV), 2018

  8. [8]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

  9. [9]

    A short note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

  10. [10]

    WaveGrad: Estimating gradients for waveform generation

    Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. International Conference on Learning Representations, 2021

  11. [11]

    PixelSNAIL: An improved autoregressive generative model

    Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In International Conference on Machine Learning , pages 863–871, 2018

  12. [12]

    Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers

    Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arxiv:2202.04053, 2022

  13. [13]

    3d u-net: learning dense volumetric segmentation from sparse annotation

    Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pages 424–432. Springer, 2016

  14. [14]

    Adversarial video generation on complex datasets

    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019

  15. [15]

    BERT: pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 4171–4186. Association for Computational Linguistics, 2019

  16. [16]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 2021

  17. [17]

    Self-supervised visual planning with temporal skip connections

    Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In CoRL, pages 344–356, 2017

  18. [18]

    Latent neural differential equations for video generation

    Cade Gordon and Natalie Parde. Latent neural differential equations for video generation. In NeurIPS 2020 Workshop on Pre-registration in Machine Learning, pages 73–86. PMLR, 2021

  19. [19]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017

  20. [20]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021

  21. [21]

    Axial attention in multidimensional transformers

    Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019

  22. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020

  23. [23]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021

  24. [24]

    Stochastic solutions for linear inverse problems using the prior implicit in a denoiser

    Zahra Kadkhodaie and Eero Simoncelli. Stochastic solutions for linear inverse problems using the prior implicit in a denoiser. Advances in Neural Information Processing Systems, 34, 2021

  25. [25]

    Solving linear inverse problems using the prior implicit in a denoiser

    Zahra Kadkhodaie and Eero P Simoncelli. Solving linear inverse problems using the prior implicit in a denoiser. arXiv preprint arXiv:2007.13640, 2020

  26. [26]

    Lower dimensional kernels for video discriminators

    Emmanuel Kahembwe and Subramanian Ramamoorthy. Lower dimensional kernels for video discriminators. Neural Networks, 132:506–520, 2020

  27. [27]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  28. [28]

    Variational diffusion models

    Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630, 2021

  29. [29]

    DiffWave: A versatile diffusion model for audio synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR, 2021

  30. [30]

    VideoFlow: A flow-based generative model for video

    Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019

  31. [31]

    Ccvs: Context-aware controllable video synthesis

    Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. Ccvs: Context-aware controllable video synthesis. Advances in Neural Information Processing Systems, 34, 2021

  32. [32]

    Stochastic adversarial video prediction

    Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018

  33. [33]

    Transformation-based adversarial video prediction on large-scale data

    Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020

  34. [34]

    Generating high fidelity images with subscale pixel networks and multidimensional upscaling

    Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2019

  35. [35]

    Transframer: Arbitrary frame prediction with generative models

    Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter Battaglia. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494, 2022

  36. [36]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  37. [37]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML, 2021

  38. [38]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015

  39. [39]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris A Lee, Jonathan Ho, Tim Salimans, David J Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. arXiv preprint arXiv:2111.05826, 2021

  40. [40]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021

  41. [41]

    Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN

    Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN. International Journal of Computer Vision, 128(10):2586–2606, 2020

  42. [42]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021

  43. [43]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016

  44. [44]

    PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications

    Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017

  45. [45]

    Self-attention with relative position representations

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018

  46. [46]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015

  47. [47]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019

  48. [48]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021

  49. [49]

    A dataset of 101 human actions classes from videos in the wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012

  50. [50]

    Image representations learned with unsupervised pre-training contain human-like biases

    Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 701–713. Association for Computing Machinery, 2021

  51. [51]

    Learning spatiotemporal features with 3d convolutional networks

    Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015

  52. [52]

    MoCoGAN: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

  53. [53]

    Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit

    Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019

  54. [54]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  55. [55]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017

  56. [56]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011

  57. [57]

    Predicting video with VQVAE

    Jacob Walker, Ali Razavi, and Aäron van den Oord. Predicting video with VQVAE. arXiv preprint arXiv:2103.01950, 2021

  58. [58]

    Non-local neural networks

    Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018

  59. [59]

    Scaling autoregressive video models

    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In International Conference on Learning Representations, 2019

  60. [60]

    Deblurring via stochastic refinement

    Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar. Deblurring via stochastic refinement. arXiv preprint arXiv:2112.02475, 2021

  61. [61]

    NÜWA: Visual synthesis pre-training for neural visual world creation

    Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. arXiv preprint arXiv:2111.12417, 2021

  62. [62]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021

  63. [63]

    Diffusion probabilistic modeling for video generation

    Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481, 2022

  64. [64]

    Markov decision process for video generation

    Vladyslav Yushchenko, Nikita Araslanov, and Stefan Roth. Markov decision process for video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019

  65. [65]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016

A Details and hyperparameters

Figure 5: More samples accompanying Fig. 2.

Here, we list the hyperparameters, training details, and compute resources used for each model.

A.1 UCF101

Base channels: 256
Optimizer: Adam (β1 = 0.9, β2 = 0.99)
Channel multip...
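The appendix's Adam settings (β1 = 0.9, β2 = 0.99) can be sketched as a single scalar update step. The learning rate, epsilon, and toy parameter below are illustrative assumptions, since this excerpt truncates before listing them; only the two beta values come from the paper.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.99, eps=1e-8):
    """One scalar Adam update using the betas from Appendix A.1.

    lr and eps are placeholders, not values taken from the paper.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) EMA
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) EMA
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy usage: one step on a scalar parameter with gradient 0.5.
p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, 0.5, m, v, t=1)
```

Note that β2 = 0.99 is lower than the common default of 0.999, which shortens the second-moment averaging window and lets the effective step size adapt faster to changes in gradient scale.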