pith. machine review for the scientific record

arxiv: 2204.03458 · v2 · submitted 2022-04-07 · 💻 cs.CV · cs.AI · cs.LG

Recognition: 2 theorem links · Lean Theorem

Video Diffusion Models

Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, David J. Fleet

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 14:33 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords diffusion models · video generation · text-to-video · generative modeling · video prediction · unconditional generation · conditional sampling

The pith

A diffusion model extended from images generates high-fidelity coherent videos using joint training and conditional sampling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that video generation can be achieved by extending standard image diffusion models without major new architectures for motion. Joint training on images and videos reduces gradient variance and accelerates learning. A novel conditional sampling method enables extending videos in space and time more effectively than prior techniques. These advances produce the first strong results on text-conditioned video generation at scale and set new records on prediction and unconditional generation tasks. The work matters because coherent video synthesis is a foundational capability for applications in media, simulation, and creative tools.

Core claim

The central discovery is that a diffusion model for video, built as a natural extension of image diffusion architectures, supports joint training on image and video data, which reduces minibatch gradient variance and speeds optimization. Combined with a new conditional sampling technique for spatial and temporal extension that outperforms previous methods, this yields the first results on large-scale text-conditioned video generation and state-of-the-art performance on video prediction and unconditional video generation benchmarks.
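The joint-training mechanism is simple to state: independent images are packed into each batch as extra "frames" and temporal attention is masked so they never mix with real video frames. A minimal sketch of such a mask — an illustrative reconstruction under that assumption, not the authors' code; `temporal_attention_mask` is a hypothetical helper:

```python
import numpy as np

def temporal_attention_mask(num_video_frames: int, num_image_frames: int) -> np.ndarray:
    """(T, T) boolean mask for temporal attention in a joint image-video batch.

    Independent images are appended as extra "frames" after the video clip;
    masking temporal attention keeps them isolated, so image data trains the
    spatial layers without introducing spurious motion.
    """
    T = num_video_frames + num_image_frames
    mask = np.zeros((T, T), dtype=bool)
    mask[:num_video_frames, :num_video_frames] = True  # video frames attend to each other
    for i in range(num_video_frames, T):
        mask[i, i] = True                              # each image attends only to itself
    return mask

mask = temporal_attention_mask(num_video_frames=4, num_image_frames=2)
```

Because the appended images are drawn independently of the video clip, mixing them into the minibatch is one plausible reading of how the gradient-variance reduction arises.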

What carries the argument

The video diffusion model, which applies the image diffusion process to video sequences with added conditioning and a specialized sampling procedure for extending clips.
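The extension procedure can be pictured as a reconstruction-guided denoising step: the estimate for the new frames is nudged downhill on the reconstruction error of the conditioning frames. The toy numpy rendering below is a sketch under stated assumptions (a finite-difference gradient and a made-up `toy_denoise`, both for illustration), not the paper's implementation:

```python
import numpy as np

def guided_sample_step(z_t, x_a, denoise, alpha_t, n_cond, weight=2.0, eps=1e-4):
    """One reconstruction-guided adjustment of the denoised estimate.

    z_t     : noisy latents for all T frames, shape (T, D)
    x_a     : clean conditioning frames, shape (n_cond, D)
    denoise : the learned denoiser z_t -> x_hat (a toy stand-in below)

    Sketch of the idea only: the estimate for the frames being generated is
    pushed down the gradient of the conditioning frames' reconstruction error.
    The gradient is taken by finite differences purely for illustration.
    """
    x_hat = denoise(z_t)
    base = np.sum((x_a - x_hat[:n_cond]) ** 2)
    grad = np.zeros_like(z_t[n_cond:])
    for i in range(grad.shape[0]):
        for j in range(grad.shape[1]):
            z_pert = z_t.copy()
            z_pert[n_cond + i, j] += eps
            grad[i, j] = (np.sum((x_a - denoise(z_pert)[:n_cond]) ** 2) - base) / eps
    return x_hat[n_cond:] - (weight * alpha_t / 2.0) * grad

def toy_denoise(z):
    # Hypothetical denoiser that couples frames through their mean, so the
    # conditioning error actually depends on the latents being generated.
    return 0.5 * z + 0.5 * z.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
z_t = rng.normal(size=(4, 3))     # 2 conditioning frames + 2 new frames
x_a = rng.normal(size=(2, 3))
guided = guided_sample_step(z_t, x_a, toy_denoise, alpha_t=0.9, n_cond=2)
```

The same step applied along the temporal axis extends clips and, along the spatial axis, upsamples them, which is how one procedure covers both kinds of extension.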

Load-bearing premise

The assumption that standard image diffusion architectures, with only joint training and new sampling, suffice to produce temporally coherent video without dedicated motion modeling components.

What would settle it

Observing persistent temporal inconsistencies, such as flickering or object disappearance, in generated videos involving complex motions like human actions or camera movements would indicate the extension is insufficient.

read the original abstract

Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes extending image diffusion models to video generation via a 3D U-Net architecture that employs space-time factorized convolutions and attention. It shows that joint training on image and video data stabilizes gradients and accelerates optimization, introduces a conditional sampling procedure for spatial and temporal video extension, and reports state-of-the-art results on video prediction (BAIR) and unconditional generation (Kinetics) benchmarks together with the first results on a large-scale text-conditioned video generation task.
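The factorized attention the summary refers to interleaves a spatial pass with a temporal pass. A minimal single-head sketch, without the learned query/key/value projections or the U-Net around it — illustrative only, not the paper's architecture:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attend(x):
    """Single-head self-attention over the second-to-last axis of x,
    with learned projections omitted for brevity."""
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_space_time_attention(x):
    """Space-time factorized attention: attend over the S spatial positions
    within each frame, then over the T frames at each spatial position.
    x has shape (T, S, D)."""
    x = attend(x)                   # spatial pass: mixes over S per frame
    x = np.swapaxes(x, 0, 1)        # -> (S, T, D)
    x = attend(x)                   # temporal pass: mixes over T per position
    return np.swapaxes(x, 0, 1)     # -> (T, S, D)

out = factorized_space_time_attention(np.random.default_rng(1).normal(size=(3, 4, 8)))
```

Skipping the temporal pass for appended image "frames" is what makes the joint image-video training described in the pith drop in naturally.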

Significance. If the empirical claims hold, the work demonstrates that diffusion models can produce temporally coherent high-fidelity video with only modest architectural extensions from image models, joint training provides measurable optimization benefits, and the new sampling technique outperforms prior extension methods. These outcomes would establish a strong baseline for text-to-video generation and influence subsequent multimodal diffusion research.

major comments (2)
  1. [§4.2, Table 2] The SOTA claims on BAIR and Kinetics rest on single-run FVD and PSNR numbers without reported standard deviations or multiple random seeds; this makes it impossible to determine whether the reported margins over prior methods are statistically reliable.
  2. [§3.3, §4.3] The ablation on joint image-video training shows reduced gradient variance but provides no quantitative comparison of final sample quality (FVD or human preference) between joint and video-only training; this leaves the central claim that joint training is beneficial for generation quality unverified.
minor comments (2)
  1. [Figure 3] The caption does not specify the exact conditioning strength or number of extension steps used for the long-video examples, making reproduction difficult.
  2. [§2.2] The description of the space-time factorized attention should explicitly state its computational cost relative to full 3D attention.
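To make the second minor point concrete, here is a back-of-envelope count of attention-score entries at an illustrative feature-map size (the sizes are chosen for illustration, not taken from the paper):

```python
# Illustrative sizes: 16 frames of a 64x64 feature map.
T, H, W = 16, 64, 64
S = H * W                          # spatial positions per frame

full_3d = (T * S) ** 2             # one attention matrix over all T*S tokens
factorized = T * S**2 + S * T**2   # per-frame spatial + per-position temporal

print(full_3d / factorized)        # roughly 16x fewer score entries at these sizes
```

The quadratic term in the token count is what makes full 3D attention prohibitive as clips get longer, which is presumably why the referee wants the trade-off stated explicitly.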

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4.2, Table 2] The SOTA claims on BAIR and Kinetics rest on single-run FVD and PSNR numbers without reported standard deviations or multiple random seeds; this makes it impossible to determine whether the reported margins over prior methods are statistically reliable.

    Authors: We agree that multiple independent runs with reported standard deviations would provide stronger statistical grounding for the SOTA claims. Training these large-scale video diffusion models is computationally intensive, which is why we followed the standard practice in the field of reporting single-run results for such experiments. The observed margins are substantial (e.g., large FVD reductions on both BAIR and Kinetics), making it unlikely that run-to-run variance would alter the rankings. In the revised manuscript we will add an explicit discussion of this limitation in §4.2, including a note on the single-run nature of the results and the size of the reported improvements. revision: partial

  2. Referee: [§3.3, §4.3] The ablation on joint image-video training shows reduced gradient variance but provides no quantitative comparison of final sample quality (FVD or human preference) between joint and video-only training; this leaves the central claim that joint training is beneficial for generation quality unverified.

    Authors: The central claim in §3.3 and the abstract is that joint image-video training reduces minibatch gradient variance and accelerates optimization; we did not claim or demonstrate a direct improvement in final sample quality metrics such as FVD. The optimization benefit is presented as the primary advantage. We will revise §4.3 to clarify this scope and add a short discussion of how faster convergence can indirectly support higher-quality generation within fixed compute budgets. No new quantitative FVD comparison between joint and video-only training will be added, as that would require additional large-scale experiments beyond the scope of the current work. revision: partial

Circularity Check

0 steps flagged

No significant circularity; results rest on external benchmarks

full rationale

The paper extends standard image diffusion to video via a 3D U-Net with space-time factorized convolutions/attention, joint image-video training for gradient stability, and a conditional sampling procedure for spatial/temporal extension. All reported outcomes (FVD, PSNR on BAIR/Kinetics, first text-to-video results) are measured against independent external benchmarks and prior methods, with no equations or claims reducing performance to internally fitted parameters, self-defined quantities, or load-bearing self-citations. The derivation of the reverse process and conditioning follows the established diffusion framework without internal reduction to the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on standard diffusion model assumptions from prior image work plus the domain assumption that video can be handled by the same forward/reverse process with added temporal conditioning.

free parameters (1)
  • noise schedule and conditioning hyperparameters
    Typical diffusion training choices that are selected or tuned for the video task.
axioms (1)
  • domain assumption: The diffusion forward process and learned reverse process can be directly applied to video frames while preserving temporal coherence.
    Invoked when stating the model is a natural extension of the image diffusion architecture.
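Written out in the continuous-time notation that the paper's diffusion framework inherits (a standard formulation, not quoted from the paper), the axiom amounts to applying the image forward process to the whole stack of frames $x$ at once:

```latex
% Forward process applied unchanged to a video x (all frames jointly):
q(z_t \mid x) = \mathcal{N}\!\left(z_t;\ \alpha_t\, x,\ \sigma_t^2 I\right),
\qquad 0 \le t \le 1 .
```

Nothing in $q$ itself models motion: the corruption is per-pixel Gaussian, so temporal coherence must come entirely from the learned reverse process, which is exactly what the load-bearing premise asserts.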

pith-pipeline@v0.9.0 · 5430 in / 1159 out tokens · 38654 ms · 2026-05-13T14:33:29.620042+00:00 · methodology

discussion (0)


Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  4. AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

    cs.MM 2026-04 unverdicted novelty 7.0

    AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...

  5. Speculative Decoding for Autoregressive Video Generation

    cs.CV 2026-04 conditional novelty 7.0

    A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...

  6. Score Shocks: The Burgers Equation Structure of Diffusion Generative Models

    cond-mat.stat-mech 2026-04 unverdicted novelty 7.0

    The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.

  7. Physics-Aware Video Instance Removal Benchmark

    cs.CV 2026-04 unverdicted novelty 7.0

    The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.

  8. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  9. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  10. Human Motion Diffusion Model

    cs.CV 2022-09 unverdicted novelty 7.0

    MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.

  11. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  12. Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.

  13. UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

    cs.CV 2026-05 unverdicted novelty 6.0

    UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.

  14. Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

    cs.AI 2026-05 unverdicted novelty 6.0

    Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.

  15. DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion

    cs.CV 2026-04 unverdicted novelty 6.0

    DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...

  16. Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction

    cs.CV 2026-04 unverdicted novelty 6.0

    Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.

  17. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  18. Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    cs.RO 2025-04 unverdicted novelty 6.0

    Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...

  19. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  20. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  21. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  22. CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers

    cs.CV 2022-05 unverdicted novelty 5.0

    CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.

  23. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  24. Watching Physics: the Generative Science of Matter and Motion

    cs.CE 2026-04 unverdicted novelty 4.0

    Generative video models recover physical quantities like surface strain from visible motion when coupled with experiments and simulations, but fail when internal variables dominate, defining a new Generative Science o...

  25. Discrete Meanflow Training Curriculum

    cs.LG 2026-04 unverdicted novelty 4.0

    A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.

  26. ModelScope Text-to-Video Technical Report

    cs.CV 2023-08 unverdicted novelty 4.0

    ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · cited by 26 Pith papers · 5 internal anchors

  1. [1]

    TensorFlow Datasets

    TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets, 2022

  2. [2]

    ViViT: A video vision transformer

    Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021

  3. [3]

    Stochastic Variational Video Prediction

    Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017

  4. [4]

    FitVid: Overfitting in pixel-level video prediction

    Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. FitVid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2021

  5. [5]

    Is space-time attention all you need for video understanding

    Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding. arXiv preprint arXiv:2102.05095, 2(3):4, 2021

  6. [6]

    Gender shades: Intersectional accuracy disparities in commercial gender classification

    Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA , Proceedings of Machine Learning Research. PMLR, 2018

  7. [7]

    Women also snowboard: Overcoming bias in captioning models

    Kaylee Burns, Lisa Hendricks, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision (ECCV), 2018

  8. [8]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017

  9. [9]

    A short note about Kinetics-600

    Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018

  10. [10]

    WaveGrad: Estimating gradients for waveform generation

    Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. International Conference on Learning Representations, 2021

  11. [11]

    PixelSNAIL: An improved autoregressive generative model

    Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In International Conference on Machine Learning , pages 863–871, 2018

  12. [12]

    Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers

    Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arxiv:2202.04053, 2022

  13. [13]

    3d u-net: learning dense volumetric segmentation from sparse annotation

    Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pages 424–432. Springer, 2016

  14. [14]

    Adversarial video generation on complex datasets

    Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019

  15. [15]

    BERT: pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 4171–4186. Association for Computational Linguistics, 2019

  16. [16]

    Diffusion models beat GANs on image synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 2021

  17. [17]

    Self-supervised visual planning with temporal skip connections

    Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In CoRL, pages 344–356, 2017

  18. [18]

    Latent neural differential equations for video generation

    Cade Gordon and Natalie Parde. Latent neural differential equations for video generation. In NeurIPS 2020 Workshop on Pre-registration in Machine Learning, pages 73–86. PMLR, 2021

  19. [19]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017

  20. [20]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021

  21. [21]

    Axial attention in multidimensional transformers

    Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019

  22. [22]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020

  23. [23]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021

  24. [24]

    Stochastic solutions for linear inverse problems using the prior implicit in a denoiser

    Zahra Kadkhodaie and Eero Simoncelli. Stochastic solutions for linear inverse problems using the prior implicit in a denoiser. Advances in Neural Information Processing Systems, 34, 2021

  25. [25]

    Solving linear inverse problems using the prior implicit in a denoiser

    Zahra Kadkhodaie and Eero P Simoncelli. Solving linear inverse problems using the prior implicit in a denoiser. arXiv preprint arXiv:2007.13640, 2020

  26. [26]

    Lower dimensional kernels for video discriminators

    Emmanuel Kahembwe and Subramanian Ramamoorthy. Lower dimensional kernels for video discriminators. Neural Networks, 132:506–520, 2020

  27. [27]

    The Kinetics Human Action Video Dataset

    Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017

  28. [28]

    Variational diffusion models

    Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630, 2021

  29. [29]

    DiffWave: A versatile diffusion model for audio synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR, 2021

  30. [30]

    VideoFlow: A flow-based generative model for video

    Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019

  31. [31]

    Ccvs: Context-aware controllable video synthesis

    Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. Ccvs: Context-aware controllable video synthesis. Advances in Neural Information Processing Systems, 34, 2021

  32. [32]

    Stochastic adversarial video prediction

    Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018

  33. [33]

    Transformation-based adversarial video prediction on large-scale data

    Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020

  34. [34]

    Generating high fidelity images with subscale pixel networks and multidimensional upscaling

    Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2019

  35. [35]

    Transframer: Arbitrary frame prediction with generative models

    Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter Battaglia. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494, 2022

  36. [36]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  37. [37]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML, 2021

  38. [38]

    U-Net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015

  39. [39]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris A Lee, Jonathan Ho, Tim Salimans, David J Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. arXiv preprint arXiv:2111.05826, 2021

  40. [40]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021

  41. [41]

    Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN

    Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN. International Journal of Computer Vision, 128(10):2586–2606, 2020

  42. [42]

    Progressive distillation for fast sampling of diffusion models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021

  43. [43]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016

  44. [44]

    PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications

    Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017

  45. [45]

    Self-attention with relative position representations

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018

  46. [46]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015

  47. [47]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019

  48. [48]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021

  49. [49]

    A dataset of 101 human actions classes from videos in the wild

    Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012

  50. [50]

    Image representations learned with unsupervised pre-training contain human-like biases

    Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 701–713. Association for Computing Machinery, 2021

  51. [51]

    Learning spatiotemporal features with 3d convolutional networks

    Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pages 4489–4497, 2015

  52. [52]

    MoCoGAN: Decomposing motion and content for video generation

    Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1526–1535, 2018

  53. [53]

    Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit

    Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019

  54. [54]

    Towards Accurate Generative Models of Video: A New Metric & Challenges

    Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018

  55. [55]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017

  56. [56]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011

  57. [57]

    Predicting video with VQVAE

    Jacob Walker, Ali Razavi, and Aäron van den Oord. Predicting video with VQVAE. arXiv preprint arXiv:2103.01950, 2021

  58. [58]

    Non-local neural networks

    Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018

  59. [59]

    Scaling autoregressive video models

    Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In International Conference on Learning Representations, 2019

  60. [60]

    Deblurring via stochastic refinement

    Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar. Deblurring via stochastic refinement. arXiv preprint arXiv:2112.02475, 2021

  61. [61]

    NÜWA: Visual synthesis pre-training for neural visual world creation

    Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. arXiv preprint arXiv:2111.12417, 2021

  62. [62]

    VideoGPT: Video Generation using VQ-VAE and Transformers

    Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021

  63. [63]

    Diffusion probabilistic modeling for video generation

    Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481, 2022

  64. [64]

    Markov decision process for video generation

    Vladyslav Yushchenko, Nikita Araslanov, and Stefan Roth. Markov decision process for video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, pages 0–0, 2019

  65. [65]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016

A Details and hyperparameters

Figure 5: More samples accompanying Fig. 2.

Here, we list the hyperparameters, training details, and compute resources used for each model.

A.1 UCF101

Base channels: 256
Optimizer: Adam (β1 = 0.9, β2 = 0.99)
Channel multip...
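The appendix's Adam settings (β1 = 0.9, β2 = 0.99) can be sketched as a single scalar update step. The learning rate, epsilon, and toy parameter below are illustrative assumptions, since this excerpt truncates before listing them; only the two beta values come from the paper.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.99, eps=1e-8):
    """One scalar Adam update using the betas from Appendix A.1.

    lr and eps are placeholders, not values taken from the paper.
    """
    m = beta1 * m + (1 - beta1) * grad        # first-moment (mean) EMA
    v = beta2 * v + (1 - beta2) * grad ** 2   # second-moment (uncentered variance) EMA
    m_hat = m / (1 - beta1 ** t)              # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Toy usage: one step on a scalar parameter with gradient 0.5.
p, m, v = 1.0, 0.0, 0.0
p, m, v = adam_step(p, 0.5, m, v, t=1)
```

Note that β2 = 0.99 is lower than the common default of 0.999, which shortens the second-moment averaging window and lets the effective step size adapt faster to changes in gradient scale.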