Make-A-Video: Text-to-Video Generation without Text-Video Data
Pith reviewed 2026-05-11 01:08 UTC · model grok-4.3
The pith
A method turns text into videos by extending image generators with motion learned separately from unlabeled footage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Make-A-Video decomposes the full temporal U-Net and attention tensors into separate spatial and temporal approximations, then runs a spatial-temporal pipeline that includes a video decoder, an interpolation model, and two super-resolution models. The system reuses a pre-trained text-to-image model for visual content and text alignment while adding motion learned from unsupervised video. The outcome is state-of-the-art text-to-video output in resolution, frame rate, text faithfulness, and overall quality, achieved without any paired text-video training data.
What carries the argument
A spatial-temporal decomposition of the U-Net and attention tensors, together with a multi-stage pipeline of a video decoder, an interpolation model, and super-resolution models.
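To make the decomposition concrete, below is a minimal sketch of a factorized spatial-temporal attention block in the spirit of what the paper describes: attention within each frame, then attention across frames at each spatial location. The class name, dimensions, and the zero-initialized temporal projection are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a factorized spatial-temporal attention block.
# Assumption: the temporal output projection is zero-initialized so the block
# initially behaves like the pretrained image model's spatial attention alone.
import torch
import torch.nn as nn


class FactorizedSpatioTemporalAttention(nn.Module):
    """Spatial attention per frame, then temporal attention per spatial location."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Zero-init the temporal output projection: its residual contribution
        # starts at zero, preserving the image model's behavior at the outset.
        nn.init.zeros_(self.temporal_attn.out_proj.weight)
        nn.init.zeros_(self.temporal_attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, dim)
        b, t, s, d = x.shape
        # Spatial attention: each frame attends over its own pixels.
        xs = x.reshape(b * t, s, d)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: each spatial location attends across frames.
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    block = FactorizedSpatioTemporalAttention(dim=64)
    video = torch.randn(2, 8, 16 * 16, 64)  # 2 clips, 8 frames, 16x16 latents
    print(block(video).shape)  # torch.Size([2, 8, 256, 64])
```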
If this is right
- Text-to-video training becomes faster because visual and language representations are reused rather than learned from scratch.
- Paired text-video datasets are no longer required to reach competitive performance.
- The generated videos carry over the aesthetic variety and fantastical content already present in current text-to-image systems.
- High-resolution and high-frame-rate results are produced by chaining the dedicated interpolation and super-resolution stages (a minimal pipeline sketch follows this list).
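A minimal sketch of how such a staged pipeline could be chained, with placeholder functions standing in for the video decoder, frame interpolation model, and two super-resolution stages; the interfaces, upsampling factors, and target resolutions are assumptions for illustration, not the paper's configuration.

```python
# Hypothetical staged pipeline: keyframe generation -> temporal interpolation
# -> two spatial super-resolution passes. All stages are placeholders.
import torch
import torch.nn.functional as F


def generate_keyframes(prompt: str, frames: int = 16, size: int = 64) -> torch.Tensor:
    """Stand-in for the text-conditioned video decoder (low res, low frame rate)."""
    return torch.rand(frames, 3, size, size)  # placeholder frames in [0, 1]


def interpolate_frames(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Stand-in for the temporal interpolation model: raise the frame rate."""
    t, c, h, w = video.shape
    video = video.permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, T, H, W)
    video = F.interpolate(video, size=(t * factor, h, w),
                          mode="trilinear", align_corners=False)
    return video.squeeze(0).permute(1, 0, 2, 3)


def upsample_spatial(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Stand-in for one super-resolution stage: raise the spatial resolution."""
    return F.interpolate(video, scale_factor=factor,
                         mode="bilinear", align_corners=False)


if __name__ == "__main__":
    clip = generate_keyframes("a dog riding a skateboard")
    clip = interpolate_frames(clip)          # more frames
    clip = upsample_spatial(clip, factor=4)  # first SR stage
    clip = upsample_spatial(clip, factor=4)  # second SR stage
    print(clip.shape)  # torch.Size([64, 3, 1024, 1024])
```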
Where Pith is reading between the lines
- The same separation of appearance learning from motion learning could be tried on other data-scarce generation tasks such as 3D or audio synthesis.
- Modular pipelines like this one may reduce the total compute needed when extending image models to new domains.
- The approach opens a route to video editing or animation tools that start from a single text prompt and then refine motion independently.
Load-bearing premise
Motion patterns taken from unlabeled video can be added to a text-to-image model through these modules without creating visible motion artifacts or weakening how well the output matches the original text prompt.
What would settle it
A side-by-side evaluation on the same text prompts where Make-A-Video outputs show more flickering, unnatural object trajectories, or lower text-video alignment scores than models trained directly on paired text-video data.
Original abstract
We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Make-A-Video, a text-to-video generation method that transfers progress from text-to-image (T2I) models by learning appearance and text alignment from paired text-image data while acquiring motion dynamics from unsupervised video footage. It introduces a spatial-temporal decomposition of the U-Net and attention tensors, combined with a multi-stage pipeline (video decoder, temporal interpolation, and super-resolution models) to produce high-resolution, high-frame-rate videos without requiring paired text-video data. The central claim is that this yields state-of-the-art results in spatial/temporal resolution, text faithfulness, and perceptual quality, as measured by both qualitative examples and quantitative metrics.
Significance. If the quantitative claims hold, the work is significant because it demonstrates a practical route to high-quality T2V generation that sidesteps the scarcity of paired text-video data, accelerates training by reusing T2I representations, and inherits the diversity of modern image generators. The decomposition approach and modular pipeline are reusable for other video synthesis tasks and could reduce compute barriers in the field.
major comments (2)
- [§4] §4 (Experiments): The SOTA claim is central but rests on quantitative comparisons whose details (specific metrics such as FVD, CLIP similarity, or human preference scores, exact baselines, and effect sizes) are not summarized in the abstract and must be verified against prior T2V methods; without these numbers and ablations on the spatial-temporal modules, the superiority cannot be assessed.
- [§3.2] §3.2 (Spatial-Temporal Decomposition): The approximation of full temporal U-Net and attention tensors in space and time is described at a high level; the paper must supply the precise tensor factorization or insertion points (e.g., which layers receive the temporal attention) to confirm that motion transfer occurs without degrading text conditioning or introducing systematic artifacts.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly list the quantitative metrics and baselines used to support the SOTA statement.
- [Figures] Figure captions for qualitative results should include the exact text prompts and frame counts to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications from the paper and propose targeted revisions to strengthen the presentation of our results and technical details.
Point-by-point responses
-
Referee: [§4] §4 (Experiments): The SOTA claim is central but rests on quantitative comparisons whose details (specific metrics such as FVD, CLIP similarity, or human preference scores, exact baselines, and effect sizes) are not summarized in the abstract and must be verified against prior T2V methods; without these numbers and ablations on the spatial-temporal modules, the superiority cannot be assessed.
Authors: We agree that a concise summary of the key quantitative results would improve accessibility. Section 4 reports FVD, CLIP similarity, and human preference scores against baselines including CogVideo and other recent T2V methods, with effect sizes and ablations on the spatial-temporal modules detailed in Tables 1-3 and Section 4.3 (plus appendix). The abstract states the SOTA outcome but does not list the numbers. We will revise the abstract to include a brief summary of the primary metrics and baselines while retaining the existing detailed comparisons in the experiments section. revision: partial
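For context on the metric family named above, a hedged sketch of a frame-averaged CLIP similarity score is shown below; the checkpoint, preprocessing, and averaging choices are assumptions about a typical evaluation harness, not the paper's exact protocol.

```python
# Frame-averaged CLIP similarity between a prompt and generated frames.
# The checkpoint and averaging scheme here are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_similarity(prompt: str, frames: list[Image.Image]) -> float:
    """Average cosine similarity between the prompt and each generated frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()


if __name__ == "__main__":
    dummy_frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
    print(clip_similarity("a dog riding a skateboard", dummy_frames))
```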
-
Referee: [§3.2] §3.2 (Spatial-Temporal Decomposition): The approximation of full temporal U-Net and attention tensors in space and time is described at a high level; the paper must supply the precise tensor factorization or insertion points (e.g., which layers receive the temporal attention) to confirm that motion transfer occurs without degrading text conditioning or introducing systematic artifacts.
Authors: We appreciate this request for greater precision. Section 3.2 describes the decomposition of the U-Net and attention tensors into separate spatial and temporal factors, with temporal attention inserted after spatial attention in the decoder blocks to enable motion modeling while preserving the pretrained text-image conditioning pathway. To address the comment directly, we will add a detailed diagram and explicit layer specifications (including tensor shapes and insertion points) in the revised Section 3.2. revision: yes
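A minimal sketch of the convolutional half of such a decomposition: a per-frame spatial convolution followed by an identity-initialized temporal convolution, so the stacked module initially reproduces the pretrained image model's behavior. The kernel sizes and the identity initialization are assumptions for illustration, not the paper's exact layer specification.

```python
# Factorized ("pseudo-3D") convolution sketch: spatial conv per frame, then a
# temporal conv per pixel, identity-initialized so the pretrained pathway is
# preserved at the start of video fine-tuning (assumed initialization).
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial 3x3 conv applied per frame (would come from the T2I model).
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal 1D conv applied per pixel, initialized as the identity.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        xs = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        xs = self.spatial(xs)
        xt = xs.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        xt = self.temporal(xt)
        return xt.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    conv = Pseudo3DConv(channels=8)
    x = torch.randn(1, 8, 4, 32, 32)
    print(conv(x).shape)  # torch.Size([1, 8, 4, 32, 32])
```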
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents Make-A-Video as a pipeline that inherits appearance from external pretrained T2I models and motion from separate unsupervised video data. It describes a spatial-temporal decomposition of U-Net/attention tensors plus a multi-stage generation pipeline (video decoder, interpolation, super-resolution). No load-bearing step reduces by construction to a self-fit, self-definition, or self-citation chain; the central claim is a concrete engineering combination of independent pretrained components rather than a tautological prediction. The SOTA assertion rests on external qualitative/quantitative evaluation, not internal re-derivation of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Decomposing the full temporal U-Net and attention tensors into separate spatial and temporal approximations preserves sufficient modeling capacity for coherent video generation.
Forward citations
Cited by 48 Pith papers
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
Structured diffusion bridges with alignment constraints achieve near fully-paired quality in modality translation while working effectively in unpaired and semi-paired regimes.
-
TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks
TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.
-
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
Training-Free Refinement of Flow Matching with Divergence-based Sampling
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
-
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
Imagen Video: High Definition Video Generation with Diffusion Models
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
Stage-adaptive audio diffusion modeling
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
A unified perspective on fine-tuning and sampling with diffusion and flow models
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
Deepfake Detection Generalization with Diffusion Noise
ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
-
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.
-
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...
-
SkyReels-V2: Infinite-length Film Generative Model
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrativ...
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.
-
DiffMagicFace: Identity Consistent Facial Editing of Real Videos
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
-
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
-
Not all tokens contribute equally to diffusion learning
DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.