MagicVideo: Efficient Video Generation With Latent Diffusion Models
Pith reviewed 2026-05-15 18:43 UTC · model grok-4.3
The pith
MagicVideo generates 256x256 text-to-video clips on a single GPU using 64 times fewer computations than prior video diffusion models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By mapping video clips to a low-dimensional latent space via a pre-trained VAE and training a diffusion model on that space with a 3D U-Net augmented by a frame-wise lightweight adaptor and directed temporal attention, MagicVideo can synthesize 256x256-resolution video clips from text prompts on a single GPU card, using approximately 64 times fewer FLOPs than Video Diffusion Models while maintaining temporal coherence and visual quality.
What carries the argument
3D U-Net denoiser operating in VAE-compressed latent space, extended with a frame-wise adaptor for image-to-video adjustment and directed temporal attention to model frame dependencies.
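The abstract names these two modules but gives no equations. A minimal PyTorch sketch of one common way to realize them, under our own assumptions (module names, tensor shapes, and the exact causal masking scheme are illustrative, not the authors' code):

```python
# Illustrative sketch, not the paper's implementation.
import torch
import torch.nn as nn

class DirectedTemporalAttention(nn.Module):
    """Self-attention across frames at each spatial location, with a
    causal (directed) mask so frame t attends only to frames <= t."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, C, H, W) latent codes for T frames
        b, t, c, h, w = z.shape
        x = z.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        # True entries are masked out: frame t cannot see frames > t
        mask = torch.triu(torch.ones(t, t, dtype=torch.bool,
                                     device=z.device), diagonal=1)
        out, _ = self.attn(x, x, x, attn_mask=mask)
        out = out.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
        return z + out  # residual connection

class FrameWiseAdaptor(nn.Module):
    """Lightweight per-frame scale-and-shift that nudges image-pretrained
    features toward the video distribution (our reading of 'adaptor')."""
    def __init__(self, dim: int, num_frames: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(num_frames, dim))
        self.shift = nn.Parameter(torch.zeros(num_frames, dim))

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (B, T, C, H, W); broadcast per-frame parameters over H, W
        s = self.scale[None, :, :, None, None]
        b = self.shift[None, :, :, None, None]
        return z * s + b
```

Both modules add few parameters relative to the U-Net, which is consistent with the paper's claim that the image-pretrained convolution weights do most of the work.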
If this is right
- Text-to-video training can reuse weights from large image diffusion models, shortening the video-specific training phase.
- Single-GPU inference makes on-device or consumer-level video generation practical for short clips.
- The VideoVAE reconstruction step can be swapped or refined independently to target specific artifact types such as dithering.
- The latent-space approach scales the same U-Net architecture to higher resolutions without a proportional explosion in memory or compute.
Where Pith is reading between the lines
- The same latent-space plus adaptor pattern could be applied to other high-dimensional generation tasks such as 3D scene synthesis or longer video sequences.
- If the VAE latent representation proves robust across domains, the method may generalize beyond the reported realistic and imaginary content examples without retraining the core diffusion backbone.
- Direct comparison of reconstruction fidelity between the proposed VideoVAE and standard image VAEs on video data would quantify how much of the temporal-consistency gain comes from the new auto-encoder; a minimal protocol for such a comparison is sketched below.
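A minimal protocol for that comparison, under assumed metrics: per-frame PSNR for fidelity and a frame-difference residual as a flicker proxy. These metric choices, and the encode/decode API, are our assumptions, not the paper's.

```python
# Sketch of a VideoVAE-vs-image-VAE reconstruction comparison.
import torch

def psnr(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    # Inputs assumed in [-1, 1], so peak-to-peak range is 2 and MAX^2 = 4
    mse = torch.mean((a - b) ** 2)
    return 10 * torch.log10(4.0 / mse)

@torch.no_grad()
def reconstruction_report(vae, video: torch.Tensor) -> dict:
    # video: (T, 3, H, W); encode/decode frame-wise (hypothetical API)
    recon = vae.decode(vae.encode(video))
    fidelity = psnr(video, recon)
    # Flicker proxy: do frame-to-frame changes survive the round trip?
    flicker = torch.mean(((recon[1:] - recon[:-1])
                          - (video[1:] - video[:-1])) ** 2)
    return {"psnr": fidelity.item(), "flicker_mse": flicker.item()}
```

Running this for both auto-encoders on the same clips would separate reconstruction error from diffusion-model error.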
Load-bearing premise
A pre-trained VAE can compress video clips into a low-dimensional latent space that still preserves enough spatial and temporal information for the diffusion model to reconstruct high-fidelity, coherent videos without major artifacts.
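As a concrete reading of this premise: the image VAE is applied frame by frame, so all temporal structure must survive in the stack of per-frame latents. A sketch, assuming a Stable-Diffusion-style 8x spatial downsampling (the abstract does not state the factor, and vae.encode is a hypothetical API):

```python
# Frame-wise encoding of a video clip with a pre-trained image VAE.
import torch

@torch.no_grad()
def encode_clip(vae, video: torch.Tensor) -> torch.Tensor:
    # video: (B, T, 3, H, W) in [-1, 1]; each frame is encoded
    # independently, then the frames are restacked along T.
    b, t, c, h, w = video.shape
    frames = video.reshape(b * t, c, h, w)
    latents = vae.encode(frames)              # hypothetical encode() API
    return latents.reshape(b, t, -1, h // 8, w // 8)  # assumed 8x factor
```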
What would settle it
Measure actual FLOPs and visual quality (temporal coherence scores or side-by-side user ratings) when generating identical 256x256 clips from the same text prompts on the same single GPU hardware using both MagicVideo and the original Video Diffusion Models.
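One concrete way to run the FLOPs half of this test, sketched with fvcore's FLOP counter. The model handle and latent shape below are placeholders, not values from the paper; actual checkpoints for both methods would be needed.

```python
# Measuring denoiser FLOPs for one forward pass with fvcore.
import torch
from fvcore.nn import FlopCountAnalysis

def measure_flops(model: torch.nn.Module, sample: torch.Tensor) -> int:
    model.eval()
    with torch.no_grad():
        return FlopCountAnalysis(model, (sample,)).total()

# e.g. a hypothetical 16-frame latent clip at 256x256 with 8x downsampling:
# sample = torch.randn(1, 16, 4, 32, 32)
# print(measure_flops(magicvideo_denoiser, sample))
```

Total generation cost would then be per-pass FLOPs times the number of sampling steps, so step counts must also be matched across methods.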
Original abstract
We present an efficient text-to-video generation framework based on latent diffusion models, termed MagicVideo. MagicVideo can generate smooth video clips that are concordant with the given text descriptions. Due to a novel and efficient 3D U-Net design and modeling video distributions in a low-dimensional space, MagicVideo can synthesize video clips with 256x256 spatial resolution on a single GPU card, which takes around 64x fewer computations than the Video Diffusion Models (VDM) in terms of FLOPs. In specific, unlike existing works that directly train video models in the RGB space, we use a pre-trained VAE to map video clips into a low-dimensional latent space and learn the distribution of videos' latent codes via a diffusion model. Besides, we introduce two new designs to adapt the U-Net denoiser trained on image tasks to video data: a frame-wise lightweight adaptor for the image-to-video distribution adjustment and a directed temporal attention module to capture temporal dependencies across frames. Thus, we can exploit the informative weights of convolution operators from a text-to-image model for accelerating video training. To ameliorate the pixel dithering in the generated videos, we also propose a novel VideoVAE auto-encoder for better RGB reconstruction. We conduct extensive experiments and demonstrate that MagicVideo can generate high-quality video clips with either realistic or imaginary content. Refer to \url{https://magicvideo.github.io/#} for more examples.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MagicVideo, a text-to-video generation framework based on latent diffusion models. It encodes video clips into a low-dimensional latent space using a pre-trained VAE, then trains a diffusion model on these latents with a custom 3D U-Net that incorporates a frame-wise lightweight adaptor and a directed temporal attention module to adapt image-pretrained weights for video. A novel VideoVAE is proposed to reduce pixel dithering in reconstruction. The central claim is that this enables synthesis of 256x256 video clips on a single GPU with approximately 64x fewer FLOPs than Video Diffusion Models (VDM), while producing high-quality outputs concordant with text prompts.
Significance. If the efficiency and quality claims are substantiated, the work would represent a meaningful advance in making high-resolution video generation computationally accessible, by extending latent diffusion techniques from images to video via targeted architectural adaptations. The reuse of image-pretrained convolutions and the explicit handling of temporal dependencies address practical barriers in scaling video diffusion. The project page with examples aids qualitative assessment, though the absence of reported quantitative metrics limits immediate impact assessment.
major comments (3)
- [Abstract] Abstract: The headline claim of ~64x fewer FLOPs versus VDM is presented without any explicit calculation, baseline FLOPs values, or model configuration details (e.g., number of frames, latent dimensions, or U-Net channel counts). This renders the central efficiency result unverifiable from the manuscript and directly affects the soundness of the efficiency contribution.
- [Method] Method (latent space modeling): The pipeline assumes a pre-trained image VAE maps video clips into a latent space that preserves sufficient spatial-temporal information for coherent synthesis. However, the introduction of a separate VideoVAE to ameliorate pixel dithering indicates that standard VAE reconstruction already discards fine details; no ablation quantifies how much temporal dynamics are lost in the latent codes or whether the adaptor + temporal attention fully compensates.
- [Experiments] Experiments: The abstract states that 'extensive experiments' demonstrate high-quality generation, yet no quantitative metrics (FID, FVD, CLIP similarity), baseline comparisons, or training details (dataset size, epochs, learning rate) are referenced. This absence makes it impossible to assess whether the claimed quality holds relative to pixel-space or other latent video diffusion baselines.
minor comments (1)
- [Abstract] Abstract and method: The term 'directed temporal attention' is introduced without a precise equation or diagram reference; adding a short formal definition or pseudocode would improve clarity for readers familiar with standard attention mechanisms.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on the efficiency claim, latent modeling assumptions, and experimental reporting. We address each point below and will revise the manuscript to improve verifiability and completeness.
Point-by-point responses
-
Referee: [Abstract] Abstract: The headline claim of ~64x fewer FLOPs versus VDM is presented without any explicit calculation, baseline FLOPs values, or model configuration details (e.g., number of frames, latent dimensions, or U-Net channel counts). This renders the central efficiency result unverifiable from the manuscript and directly affects the soundness of the efficiency contribution.
Authors: We agree the abstract states the ~64x FLOPs reduction without a supporting breakdown. In the revised manuscript we will add an explicit calculation (in the main text or appendix) that reports the FLOPs for both MagicVideo and VDM under identical settings, including the number of frames, latent spatial-temporal dimensions, U-Net channel counts, and the precise formula used. This will make the efficiency claim directly verifiable. revision: yes
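For intuition on where such a calculation could land, here is a back-of-envelope check. It rests on our assumptions, not the authors': that convolutional FLOPs scale linearly with the number of spatial positions, and that the VAE downsamples by 8x per side.

```python
# Spatial-position count in pixel space vs. an assumed 8x-downsampled
# latent space; the ratio alone already yields the headline 64x figure.
H, W, f = 256, 256, 8                       # resolution, assumed factor
pixel_positions = H * W                     # 65536
latent_positions = (H // f) * (W // f)      # 1024
print(pixel_positions / latent_positions)   # 64.0
```

The real comparison would also have to account for channel widths, attention layers, and sampling-step counts, which is why the referee's request for an explicit breakdown matters.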
-
Referee: [Method] Method (latent space modeling): The pipeline assumes a pre-trained image VAE maps video clips into a latent space that preserves sufficient spatial-temporal information for coherent synthesis. However, the introduction of a separate VideoVAE to ameliorate pixel dithering indicates that standard VAE reconstruction already discards fine details; no ablation quantifies how much temporal dynamics are lost in the latent codes or whether the adaptor + temporal attention fully compensates.
Authors: The pre-trained image VAE is chosen for computational efficiency, and the VideoVAE is introduced precisely to mitigate visible reconstruction artifacts such as dithering. We did not include a dedicated quantitative ablation measuring temporal information loss in the latent codes. The directed temporal attention module is intended to recover temporal coherence; we will expand the method discussion to clarify this design rationale and add qualitative evidence of motion consistency from our experiments. A full numerical ablation on temporal loss would require new experiments and is noted as a possible extension. revision: partial
-
Referee: [Experiments] Experiments: The abstract states that 'extensive experiments' demonstrate high-quality generation, yet no quantitative metrics (FID, FVD, CLIP similarity), baseline comparisons, or training details (dataset size, epochs, learning rate) are referenced. This absence makes it impossible to assess whether the claimed quality holds relative to pixel-space or other latent video diffusion baselines.
Authors: We recognize that quantitative metrics would allow direct comparison with prior work. The current manuscript prioritizes qualitative demonstration and efficiency, with additional examples on the project page. In the revision we will report quantitative results (FVD, CLIP text-video similarity) together with baseline comparisons where feasible, and we will include the missing training details: dataset size and source, number of epochs, batch size, and learning rate schedule. revision: yes
Circularity Check
No circularity; efficiency follows directly from latent-space design choice
full rationale
The derivation chain consists of an explicit architectural decision (run 3D U-Net diffusion inside a pre-trained VAE latent space instead of pixel space) plus two new modules (frame-wise adaptor and directed temporal attention) whose roles are described without reference to fitted parameters or self-citations. The 64x FLOPs reduction is a straightforward consequence of the dimensionality reduction performed by the VAE, not a quantity that is fitted and then re-labeled as a prediction. The introduction of VideoVAE is presented as an empirical remedy for observed dithering rather than a hidden definitional step. No equations, uniqueness theorems, or ansatzes reduce to the paper's own inputs by construction; the method therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A pre-trained VAE can compress video clips into a low-dimensional latent space while retaining sufficient information for high-quality reconstruction.
Forward citations
Cited by 19 Pith papers
-
Modality-Aware and Anatomical Vector-Quantized Autoencoding for Multimodal Brain MRI
NeuroQuant is a modality-aware 3D VQ-VAE that uses dual-stream encoding, a shared anatomical codebook, and FiLM to achieve superior multi-modal brain MRI reconstruction.
-
Improving Text-to-Image Generation with Intrinsic Self-Confidence Rewards
SOLACE improves text-to-image generation by using intrinsic self-confidence rewards from noise reconstruction accuracy during reinforcement learning post-training without external supervision.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
From Synchrony to Sequence: Exo-to-Ego Generation via Interpolation
Interpolating exo and ego videos into a single continuous sequence lets diffusion sequence models generate more coherent first-person videos than direct conditioning, even without pose interpolation.
-
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness
VBench-2.0 is a benchmark suite that automatically evaluates video generative models on five dimensions of intrinsic faithfulness: Human Fidelity, Controllability, Creativity, Physics, and Commonsense using VLMs, LLMs...
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
VideoPoet: A Large Language Model for Zero-Shot Video Generation
VideoPoet is a large language model that performs zero-shot video generation with audio from diverse multimodal conditioning signals.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Open-source text-to-video and image-to-video diffusion models generate high-quality 1024x576 videos, with the I2V variant claimed as the first to strictly preserve reference image content.
-
MVDream: Multi-view Diffusion for 3D Generation
MVDream is a multi-view diffusion model that functions as a generalizable 3D prior, enabling more consistent text-to-3D generation and few-shot 3D concept learning from 2D examples.
-
Latent Video Diffusion Models for High-Fidelity Long Video Generation
Latent-space hierarchical diffusion models with targeted error-correction techniques generate realistic videos exceeding 1000 frames while using less compute than prior pixel-space approaches.
-
Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers
Diffusion-APO synchronizes training noise with inference trajectories in video diffusion models to improve preference alignment and visual quality.
-
Not all tokens contribute equally to diffusion learning
DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
-
Empowering Video Translation using Multimodal Large Language Models
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
- [1] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- [2] Mohammad Babaeizadeh, Mohammad Taghi Saffar, Suraj Nair, Sergey Levine, Chelsea Finn, and Dumitru Erhan. FitVid: Overfitting in pixel-level video prediction. arXiv:2106.13195, June 2021.
- [3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval. In ICCV, pages 1728–1738, 2021.
- [4] Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. arXiv preprint arXiv:2110.01963, 2021.
- [5] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In CVPR, pages 6299–6308, 2017.
- [6] Aidan Clark, Jeff Donahue, and Karen Simonyan. Adversarial video generation on complex datasets. arXiv:1907.06571, Sept. 2019.
- [7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [8] Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018.
- [9] William Harvey, Saeid Naderiparizi, Vaden Masrani, Christian Weilbach, and Frank Wood. Flexible diffusion modeling of long videos. arXiv:2205.11495, May 2022.
- [10] Amir Hertz, Ron Mokady, Jay Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv:2208.01626, Aug. 2022.
- [11] Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P. Kingma, Ben Poole, Mohammad Norouzi, David J. Fleet, and Tim Salimans. Imagen Video: High definition video generation with diffusion models. Oct. 2022.
- [12] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv:2006.11239, Dec. 2020.
- [13] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. arXiv preprint arXiv:2204.03458, 2022.
- [14] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. 2022.
- [15] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. 2022.
- [16] Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. CogVideo: Large-scale pretraining for text-to-video generation via transformers. 2022.
- [17] Nal Kalchbrenner, Aäron Oord, Karen Simonyan, Ivo Danihelka, Oriol Vinyals, Alex Graves, and Koray Kavukcuoglu. Video pixel networks. In ICML, pages 1771–1779. PMLR, July 2017.
- [18] Bahjat Kawar, Shiran Zada, Oran Lang, Omer Tov, Huiwen Chang, Tali Dekel, Inbar Mosseri, and Michal Irani. Imagic: Text-based real image editing with diffusion models. arXiv:2210.09276, Oct. 2022.
- [19] Manoj Kumar, Mohammad Babaeizadeh, Dumitru Erhan, Chelsea Finn, Sergey Levine, Laurent Dinh, and Durk Kingma. VideoFlow: A conditional flow-based model for stochastic video generation. Mar. 2020.
- [20] Jun Hao Liew, Hanshu Yan, Daquan Zhou, and Jiashi Feng. MagicMix: Semantic mixing with diffusion models. arXiv:2210.16056, Oct. 2022.
- [21] Ziyi Lin, Shijie Geng, Renrui Zhang, Peng Gao, Gerard de Melo, Xiaogang Wang, Jifeng Dai, Yu Qiao, and Hongsheng Li. Frozen CLIP models are efficient video learners. arXiv preprint arXiv:2208.03550, 2022.
- [22] Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. RePaint: Inpainting using denoising diffusion probabilistic models. In CVPR, 2022.
- [23] Michael Mathieu, Camille Couprie, and Yann LeCun. Deep multi-scale video prediction beyond mean square error. arXiv:1511.05440, Feb. 2016.
- [24] Chenlin Meng, Yutong He, Yang Song, Jiaming Song, Jiajun Wu, Jun-Yan Zhu, and Stefano Ermon. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv:2108.01073, Jan. 2022.
- [25] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In CVPR, 2022.
- [26] Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988, 2022.
- [27] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021.
- [28] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. 2021.
- [29] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. 2022.
- [30] MarcAurelio Ranzato, Arthur Szlam, Joan Bruna, Michael Mathieu, Ronan Collobert, and Sumit Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv:1412.6604, May 2016.
- [31–32] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. In Advances in Neural Information Processing Systems, 32, 2019.
- [33] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
- [34] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. arXiv:2112.10752, Apr. 2022.
- [35] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.
- [36] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022.
- [37] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily Denton, Seyed Kamyar Seyed Ghasemipour, Burcu Karagol Ayan, S. Sara Mahdavi, Rapha Gontijo Lopes, Tim Salimans, Jonathan Ho, David J. Fleet, and Mohammad Norouzi. Photorealistic text-to-image diffusion models with deep language understanding. 2022.
- [38] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
- [39] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Theo Coombes, Cade Gordon, Aarush Katta, Robert Kaczmarczyk, and Jenia Jitsev. LAION-5B: A new era of open large-scale multi-modal datasets. https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/, 2022.
- [40] Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. Make-A-Video: Text-to-video generation without text-video data. arXiv:2209.14792, Sept. 2022.
- [41] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021.
- [42] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
- [43] Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris N. Metaxas, and Sergey Tulyakov. A good image generator is what you need for high-resolution video synthesis. In ICLR, 2021.
- [44] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In CVPR, pages 6450–6459, 2018.
- [45] Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. MoCoGAN: Decomposing motion and content for video generation. arXiv:1707.04993, Dec. 2017.
- [46] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. FVD: A new metric for video generation. 2019.
- [47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 30, 2017.
- [48] Carl Vondrick, Hamed Pirsiavash, and Antonio Torralba. Generating videos with scene dynamics. Oct. 2016.
- [49] Chenfei Wu, Jian Liang, Lei Ji, Fan Yang, Yuejian Fang, Daxin Jiang, and Nan Duan. NÜWA: Visual synthesis pre-training for neural visual world creation. In ECCV 2022, Part XVI, pages 720–736. Springer, 2022.
- [50] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. MSR-VTT: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016.
- [51] Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan Yang, Jianlong Fu, and Baining Guo. Advancing high-resolution video-language representation with large-scale video transcriptions. In CVPR, pages 5036–5045, 2022.
- [52] Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M. Alvarez. Understanding the robustness in vision transformers. In ICML, pages 27378–27394. PMLR, 2022.