Video Diffusion Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-13 14:33 UTC · model grok-4.3
The pith
A diffusion model extended from images generates high-fidelity, temporally coherent videos using joint training and conditional sampling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that a diffusion model for video, built as a natural extension of image diffusion architectures, supports joint training on image and video data, which reduces minibatch gradient variance and speeds up optimization. Combined with a new conditional sampling technique for spatial and temporal extension that outperforms previous methods, this yields the first results on large text-conditioned video generation and state-of-the-art performance on video prediction and unconditional video generation benchmarks.
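To pin down the machinery this claim rests on, here is the standard diffusion setup it extends, in common notation; the $\epsilon$-prediction parameterization below is a conventional choice for illustration, not necessarily the paper's exact one:

$$z_t = \alpha_t x + \sigma_t \epsilon, \quad \epsilon \sim \mathcal{N}(0, I), \qquad \mathcal{L}(\theta) = \mathbb{E}_{x,\,\epsilon,\,t}\!\left[\,\lVert \epsilon_\theta(z_t, t) - \epsilon \rVert_2^2\,\right].$$

For video, $x$ is simply a frame stack in $\mathbb{R}^{T \times H \times W \times 3}$ noised jointly by the same process; everything video-specific lives in the architecture of $\epsilon_\theta$ and in the sampling procedure.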
What carries the argument
The video diffusion model, which applies the image diffusion process to video sequences with added conditioning and a specialized sampling procedure for extending clips.
Load-bearing premise
The assumption that standard image diffusion architectures, with only joint training and new sampling, suffice to produce temporally coherent video without dedicated motion modeling components.
What would settle it
Observing persistent temporal inconsistencies, such as flickering or object disappearance, in generated videos involving complex motions like human actions or camera movements would indicate the extension is insufficient.
original abstract
Generating temporally coherent high fidelity video is an important milestone in generative modeling research. We make progress towards this milestone by proposing a diffusion model for video generation that shows very promising initial results. Our model is a natural extension of the standard image diffusion architecture, and it enables jointly training from image and video data, which we find to reduce the variance of minibatch gradients and speed up optimization. To generate long and higher resolution videos we introduce a new conditional sampling technique for spatial and temporal video extension that performs better than previously proposed methods. We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on established benchmarks for video prediction and unconditional video generation. Supplementary material is available at https://video-diffusion.github.io/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes extending image diffusion models to video generation via a 3D U-Net architecture that employs space-time factorized convolutions and attention. It shows that joint training on image and video data stabilizes gradients and accelerates optimization, introduces a conditional sampling procedure for spatial and temporal video extension, and reports state-of-the-art results on video prediction (BAIR) and unconditional generation (Kinetics) benchmarks together with the first results on a large-scale text-conditioned video generation task.
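To make "space-time factorized attention" concrete, here is a minimal numpy sketch; it is illustrative only, with learned projections, multiple heads, and normalization omitted, and the function names are ours rather than the paper's. A spatial pass attends within each frame, then a temporal pass attends across frames at each spatial position.

```python
# Minimal sketch of space-time factorized attention on a video tensor of
# shape (T, S, C), where S = H*W flattened spatial positions. Spatial
# attention mixes within each frame; temporal attention mixes the same
# position across frames.
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention(x):
    # x: (..., N, C); single-head self-attention with identity Q/K/V
    # projections kept for brevity (real blocks learn these weights).
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores) @ x

def factorized_space_time_block(video):
    # video: (T, S, C)
    video = video + attention(video)      # spatial: attend over S per frame
    vt = np.swapaxes(video, 0, 1)         # (S, T, C)
    vt = vt + attention(vt)               # temporal: attend over T per position
    return np.swapaxes(vt, 0, 1)

out = factorized_space_time_block(np.random.randn(16, 8 * 8, 32))
print(out.shape)  # (16, 64, 32)
```

The factorization reduces the attention cost from $O((TS)^2)$ for full space-time attention to $O(T S^2 + S T^2)$, which is exactly the comparison the referee asks to see stated in §2.2.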
Significance. If the empirical claims hold, the work demonstrates that diffusion models can produce temporally coherent, high-fidelity video with only modest architectural extensions from image models, that joint training provides measurable optimization benefits, and that the new sampling technique outperforms prior extension methods. These outcomes would establish a strong baseline for text-to-video generation and influence subsequent multimodal diffusion research.
major comments (2)
- [§4.2, Table 2] The SOTA claims on BAIR and Kinetics rest on single-run FVD and PSNR numbers with no reported standard deviations or multiple random seeds, making it impossible to tell whether the reported margins over prior methods are statistically reliable (see the FVD definition after this list).
- [§3.3, §4.3] The ablation on joint image-video training shows reduced gradient variance but gives no quantitative comparison of final sample quality (FVD or human preference) between joint and video-only training, leaving the central claim that joint training benefits generation quality unverified.
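For context on the metric at issue, FVD is the Fréchet distance between Gaussians $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ fitted to features of real and generated videos from a pretrained video classifier, so seed-to-seed variance enters through both the samples and the estimated statistics:

$$\mathrm{FVD} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right).$$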
minor comments (2)
- [Figure 3] The caption does not specify the exact conditioning strength or number of extension steps used for the long-video examples, making reproduction difficult.
- [§2.2] The description of the space-time factorized attention should explicitly state its computational cost relative to full 3D attention.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating where revisions will be made to strengthen the manuscript.
point-by-point responses
- Referee: [§4.2, Table 2] The SOTA claims on BAIR and Kinetics rest on single-run FVD and PSNR numbers with no reported standard deviations or multiple random seeds, making it impossible to tell whether the reported margins over prior methods are statistically reliable.
Authors: We agree that multiple independent runs with reported standard deviations would provide stronger statistical grounding for the SOTA claims. Training these large-scale video diffusion models is computationally intensive, which is why we followed the field's standard practice of reporting single-run results for such experiments. The observed margins are substantial (e.g., large FVD reductions on both BAIR and Kinetics), making it unlikely that run-to-run variance would alter the rankings. The revised manuscript will add an explicit discussion of this limitation in §4.2, noting the single-run nature of the results and the size of the reported improvements. (revision: partial)
- Referee: [§3.3, §4.3] The ablation on joint image-video training shows reduced gradient variance but gives no quantitative comparison of final sample quality (FVD or human preference) between joint and video-only training, leaving the central claim that joint training benefits generation quality unverified.
Authors: The central claim in §3.3 and the abstract is that joint image-video training reduces minibatch gradient variance and accelerates optimization; we did not claim or demonstrate a direct improvement in final sample-quality metrics such as FVD. The optimization benefit is presented as the primary advantage. We will revise §4.3 to clarify this scope and add a short discussion of how faster convergence can indirectly support higher-quality generation within a fixed compute budget. No new quantitative FVD comparison between joint and video-only training will be added, as that would require additional large-scale experiments beyond the scope of the current work. (revision: partial)
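For readers, a hedged sketch of the joint-training mechanism under discussion, as we understand §3.3: independent images are appended to each video as extra "frames," and a temporal attention mask keeps them from exchanging information across time, so they contribute image-only gradient signal to the shared weights. Names and shapes below are illustrative, not taken from the paper's code.

```python
# Hedged sketch of joint image-video training: random independent images are
# appended to each video as extra frames, and a temporal attention mask stops
# information from flowing between those frames. Names are illustrative.
import numpy as np

def build_joint_batch(video, images):
    # video: (T, H, W, C); images: (K, H, W, C) independent stills.
    frames = np.concatenate([video, images], axis=0)  # (T + K, H, W, C)
    T, K = video.shape[0], images.shape[0]
    n = T + K
    mask = np.zeros((n, n), dtype=bool)
    mask[:T, :T] = True                            # video frames attend to each other
    mask[np.arange(T, n), np.arange(T, n)] = True  # each image attends only to itself
    return frames, mask                            # mask feeds the temporal attention

frames, mask = build_joint_batch(np.zeros((16, 8, 8, 3)), np.zeros((4, 8, 8, 3)))
print(frames.shape, mask.sum())  # (20, 8, 8, 3) 260
```

Averaging gradients over both sources within a minibatch is what the paper credits with the reduced gradient variance.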
Circularity Check
No significant circularity; results rest on external benchmarks
full rationale
The paper extends standard image diffusion to video via a 3D U-Net with space-time factorized convolutions/attention, joint image-video training for gradient stability, and a conditional sampling procedure for spatial/temporal extension. All reported outcomes (FVD, PSNR on BAIR/Kinetics, first text-to-video results) are measured against independent external benchmarks and prior methods, with no equations or claims reducing performance to internally fitted parameters, self-defined quantities, or load-bearing self-citations. The derivation of the reverse process and conditioning follows the established diffusion framework without internal reduction to the paper's own inputs.
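Spelled out, the reconstruction-guided conditional sampling step usually takes the following form, where $x^a$ are the conditioning frames, $x^b$ the frames being generated, $\hat{x}_\theta$ the model's denoised estimate, and $w_r$ a guidance weight; this is our rendering of the commonly cited form, not a verbatim quote of the paper's equation:

$$\tilde{x}^{\,b}_\theta(z_t) = \hat{x}^{\,b}_\theta(z_t) - \frac{w_r\,\alpha_t}{2}\,\nabla_{z^b_t} \bigl\lVert x^a - \hat{x}^{\,a}_\theta(z_t) \bigr\rVert_2^2 .$$

The gradient term is what couples the generated frames to the conditioning frames; replacement-style baselines, which simply overwrite $z^a_t$ with a noised copy of $x^a$, lack it.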
Axiom & Free-Parameter Ledger
free parameters (1)
- noise schedule and conditioning hyperparameters
axioms (1)
- domain assumption: The diffusion forward process and learned reverse process can be applied directly to video frames while preserving temporal coherence.
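The axiom can be stated precisely: the forward process factorizes independently over frames, so the corruption stage enforces no temporal structure and coherence must come entirely from the learned reverse model,

$$q\!\left(z_t \mid x^{1:T}\right) = \prod_{k=1}^{T} \mathcal{N}\!\left(z^k_t;\ \alpha_t x^k,\ \sigma_t^2 I\right).$$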
Forward citations
Cited by 26 Pith papers
- MusicLM: Generating Music From Text
  MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
  Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
- $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
  Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
- AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
  AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
- Speculative Decoding for Autoregressive Video Generation
  A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...
- Score Shocks: The Burgers Equation Structure of Diffusion Generative Models
  The score in diffusion models obeys viscous Burgers dynamics, with binary mode boundaries producing a universal tanh interfacial profile whose sharpening marks speciation transitions.
- Physics-Aware Video Instance Removal Benchmark
  The PVIR benchmark tests video object removal on physical consistency using 95 annotated videos and shows that existing methods struggle with complex interactions like lingering shadows.
- Imagen Video: High Definition Video Generation with Diffusion Models
  Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
- DreamFusion: Text-to-3D using 2D Diffusion
  Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
- Human Motion Diffusion Model
  MDM is a classifier-free diffusion model that generates expressive human motions by predicting clean samples rather than noise, supporting text and action conditioning and outperforming prior methods on standard benchmarks.
- Diffusion Posterior Sampling for General Noisy Inverse Problems
  Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.
- Bridging Brain and Semantics: A Hierarchical Framework for Semantically Enhanced fMRI-to-Video Reconstruction
  CineNeuron improves fMRI-to-video reconstruction by combining bottom-up semantic enrichment with top-down Mixture-of-Memories integration and outperforms prior methods on benchmarks.
- UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
  UniVidX unifies diverse video generation tasks into one conditional diffusion model using stochastic condition masking, decoupled gated LoRAs, and cross-modal self-attention.
- Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling
  Hamiltonian World Models structure latent dynamics around energy-conserving Hamiltonian evolution to produce physically grounded, action-controllable predictions for embodied decision making.
- DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
  DynamicRad achieves 1.7x-2.5x inference speedups in long video diffusion with over 80% sparsity by grounding adaptive selection in a radial locality prior, using dual-mode static/dynamic strategies and offline BO with...
- Representations Before Pixels: Semantics-Guided Hierarchical Video Prediction
  Re2Pix decomposes video prediction into semantic feature forecasting followed by representation-conditioned diffusion synthesis, with nested dropout and mixed supervision to handle prediction errors.
- MAGI-1: Autoregressive Video Generation at Scale
  MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets
  Unified World Models couple video and action diffusion inside one transformer with independent timesteps, enabling pretraining on heterogeneous robot datasets that include action-free video and producing more generali...
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
  Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
- Make-A-Video: Text-to-Video Generation without Text-Video Data
  Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
- CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
  CogVideo is a large-scale transformer pretrained for text-to-video generation that outperforms public models in evaluations.
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- Watching Physics: the Generative Science of Matter and Motion
  Generative video models recover physical quantities like surface strain from visible motion when coupled with experiments and simulations, but fail when internal variables dominate, defining a new Generative Science o...
- Discrete Meanflow Training Curriculum
  A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.
- ModelScope Text-to-Video Technical Report
  ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
Reference graph
Works this paper leans on
[1] TensorFlow Datasets, a collection of ready-to-use datasets. https://www.tensorflow.org/datasets, 2022.
[2] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. ViViT: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6836–6846, 2021.
[3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. arXiv preprint arXiv:1710.11252, 2017.
[4] Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. FitVid: Overfitting in pixel-level video prediction. arXiv preprint arXiv:2106.13195, 2021.
[5] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv preprint arXiv:2102.05095, 2021.
[6] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency (FAT 2018), Proceedings of Machine Learning Research. PMLR, 2018.
[7] Kaylee Burns, Lisa Hendricks, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision (ECCV), 2018.
[8] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308, 2017.
[9]
[10] Nanxin Chen, Yu Zhang, Heiga Zen, Ron J Weiss, Mohammad Norouzi, and William Chan. WaveGrad: Estimating gradients for waveform generation. International Conference on Learning Representations, 2021.
[11] Xi Chen, Nikhil Mishra, Mostafa Rohaninejad, and Pieter Abbeel. PixelSNAIL: An improved autoregressive generative model. In International Conference on Machine Learning, pages 863–871, 2018.
[12] Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-Eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arXiv preprint arXiv:2202.04053, 2022.
[13] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 424–432. Springer, 2016.
[14] Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv preprint arXiv:1907.06571, 2019.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, pages 4171–4186. Association for Computational Linguistics, 2019.
[16] Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34, 2021.
[17] Frederik Ebert, Chelsea Finn, Alex X Lee, and Sergey Levine. Self-supervised visual planning with temporal skip connections. In CoRL, pages 344–356, 2017.
[18] Cade Gordon and Natalie Parde. Latent neural differential equations for video generation. In NeurIPS 2020 Workshop on Pre-registration in Machine Learning, pages 73–86. PMLR, 2021.
[19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.
[20] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
[21] Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. Axial attention in multidimensional transformers. arXiv preprint arXiv:1912.12180, 2019.
[22] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, pages 6840–6851, 2020.
[23] Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. arXiv preprint arXiv:2106.15282, 2021.
[24] Zahra Kadkhodaie and Eero Simoncelli. Stochastic solutions for linear inverse problems using the prior implicit in a denoiser. Advances in Neural Information Processing Systems, 34, 2021.
[25] Zahra Kadkhodaie and Eero P Simoncelli. Solving linear inverse problems using the prior implicit in a denoiser. arXiv preprint arXiv:2007.13640, 2020.
[26] Emmanuel Kahembwe and Subramanian Ramamoorthy. Lower dimensional kernels for video discriminators. Neural Networks, 132:506–520, 2020.
[27] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al. The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
[28] Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630, 2021.
[29] Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In 9th International Conference on Learning Representations, ICLR, 2021.
[30] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A flow-based generative model for video. arXiv preprint arXiv:1903.01434, 2019.
[31] Guillaume Le Moing, Jean Ponce, and Cordelia Schmid. CCVS: Context-aware controllable video synthesis. Advances in Neural Information Processing Systems, 34, 2021.
[32] Alex X Lee, Richard Zhang, Frederik Ebert, Pieter Abbeel, Chelsea Finn, and Sergey Levine. Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523, 2018.
[33] Pauline Luc, Aidan Clark, Sander Dieleman, Diego de Las Casas, Yotam Doron, Albin Cassirer, and Karen Simonyan. Transformation-based adversarial video prediction on large-scale data. arXiv preprint arXiv:2003.04035, 2020.
[34] Jacob Menick and Nal Kalchbrenner. Generating high fidelity images with subscale pixel networks and multidimensional upscaling. In International Conference on Learning Representations, 2019.
[35] Charlie Nash, João Carreira, Jacob Walker, Iain Barr, Andrew Jaegle, Mateusz Malinowski, and Peter Battaglia. Transframer: Arbitrary frame prediction with generative models. arXiv preprint arXiv:2203.09494, 2022.
[36] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
[37] Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In Proceedings of the 38th International Conference on Machine Learning, ICML, 2021.
[38] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.
[39] Chitwan Saharia, William Chan, Huiwen Chang, Chris A Lee, Jonathan Ho, Tim Salimans, David J Fleet, and Mohammad Norouzi. Palette: Image-to-image diffusion models. arXiv preprint arXiv:2111.05826, 2021.
[40] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021.
[41] Masaki Saito, Shunta Saito, Masanori Koyama, and Sosuke Kobayashi. Train sparsely, generate densely: Memory-efficient unsupervised training of high-resolution temporal GAN. International Journal of Computer Vision, 128(10):2586–2606, 2020.
[42] Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2021.
[43] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.
[44] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In International Conference on Learning Representations, 2017.
[45] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.
[46] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265, 2015.
[47] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pages 11895–11907, 2019.
[48] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations, 2021.
[49] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human actions classes from videos in the wild. CRCV-TR-12-01, 2012.
[50] Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 701–713. Association for Computing Machinery, 2021.
[51] Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 4489–4497, 2015.
[52] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1526–1535, 2018.
[53] Belinda Tzen and Maxim Raginsky. Neural stochastic differential equations: Deep latent Gaussian models in the diffusion limit. arXiv preprint arXiv:1905.09883, 2019.
[54] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
[55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[56] Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.
[57] Jacob Walker, Ali Razavi, and Aäron van den Oord. Predicting video with VQVAE. arXiv preprint arXiv:2103.01950, 2021.
[58] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[59] Dirk Weissenborn, Oscar Täckström, and Jakob Uszkoreit. Scaling autoregressive video models. In International Conference on Learning Representations, 2019.
[60] Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar. Deblurring via stochastic refinement. arXiv preprint arXiv:2112.02475, 2021.
[61] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. arXiv preprint arXiv:2111.12417, 2021.
[62] Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. VideoGPT: Video generation using VQ-VAE and transformers. arXiv preprint arXiv:2104.10157, 2021.
[63] Ruihan Yang, Prakhar Srivastava, and Stephan Mandt. Diffusion probabilistic modeling for video generation. arXiv preprint arXiv:2203.09481, 2022.
[64] Vladyslav Yushchenko, Nikita Araslanov, and Stefan Roth. Markov decision process for video generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
[65] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.