Make-A-Video: Text-to-Video Generation without Text-Video Data
Pith reviewed 2026-05-11 01:08 UTC · model grok-4.3
The pith
A method turns text into videos by extending image generators with motion learned separately from unlabeled footage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Make-A-Video decomposes the full temporal U-Net and attention tensors into separate spatial and temporal approximations, then runs a spatial-temporal pipeline that includes a video decoder, an interpolation model, and two super-resolution models. The system reuses a pre-trained text-to-image model for visual content and text alignment while adding motion learned from unsupervised video. The outcome is state-of-the-art text-to-video output in resolution, frame rate, text faithfulness, and overall quality, achieved without any paired text-video training data.
What carries the argument
A spatial-temporal decomposition of the U-Net and attention tensors, together with a multi-stage pipeline of a video decoder, an interpolation model, and super-resolution models.
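To make the decomposition concrete, below is a minimal sketch of a factorized spatial-temporal attention block in the spirit of what the paper describes: attention within each frame, then attention across frames at each spatial location. The class name, dimensions, and the zero-initialized temporal projection are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a factorized spatial-temporal attention block.
# Assumption: the temporal output projection is zero-initialized so the block
# initially behaves like the pretrained image model's spatial attention alone.
import torch
import torch.nn as nn


class FactorizedSpatioTemporalAttention(nn.Module):
    """Spatial attention per frame, then temporal attention per spatial location."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Zero-init the temporal output projection: its residual contribution
        # starts at zero, preserving the image model's behavior at the outset.
        nn.init.zeros_(self.temporal_attn.out_proj.weight)
        nn.init.zeros_(self.temporal_attn.out_proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, height*width, dim)
        b, t, s, d = x.shape
        # Spatial attention: each frame attends over its own pixels.
        xs = x.reshape(b * t, s, d)
        xs = xs + self.spatial_attn(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: each spatial location attends across frames.
        xt = xs.reshape(b, t, s, d).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt = xt + self.temporal_attn(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(b, s, t, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    block = FactorizedSpatioTemporalAttention(dim=64)
    video = torch.randn(2, 8, 16 * 16, 64)  # 2 clips, 8 frames, 16x16 latents
    print(block(video).shape)  # torch.Size([2, 8, 256, 64])
```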
If this is right
- Text-to-video training becomes faster because visual and language representations are reused rather than learned from scratch.
- Paired text-video datasets are no longer required to reach competitive performance.
- The generated videos carry over the aesthetic variety and fantastical content already present in current text-to-image systems.
- High-resolution and high-frame-rate results are produced by chaining the dedicated interpolation and super-resolution stages (a minimal pipeline sketch follows this list).
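A minimal sketch of how such a staged pipeline could be chained, with placeholder functions standing in for the video decoder, frame interpolation model, and two super-resolution stages; the interfaces, upsampling factors, and target resolutions are assumptions for illustration, not the paper's configuration.

```python
# Hypothetical staged pipeline: keyframe generation -> temporal interpolation
# -> two spatial super-resolution passes. All stages are placeholders.
import torch
import torch.nn.functional as F


def generate_keyframes(prompt: str, frames: int = 16, size: int = 64) -> torch.Tensor:
    """Stand-in for the text-conditioned video decoder (low res, low frame rate)."""
    return torch.rand(frames, 3, size, size)  # placeholder frames in [0, 1]


def interpolate_frames(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Stand-in for the temporal interpolation model: raise the frame rate."""
    t, c, h, w = video.shape
    video = video.permute(1, 0, 2, 3).unsqueeze(0)  # (1, C, T, H, W)
    video = F.interpolate(video, size=(t * factor, h, w),
                          mode="trilinear", align_corners=False)
    return video.squeeze(0).permute(1, 0, 2, 3)


def upsample_spatial(video: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """Stand-in for one super-resolution stage: raise the spatial resolution."""
    return F.interpolate(video, scale_factor=factor,
                         mode="bilinear", align_corners=False)


if __name__ == "__main__":
    clip = generate_keyframes("a dog riding a skateboard")
    clip = interpolate_frames(clip)          # more frames
    clip = upsample_spatial(clip, factor=4)  # first SR stage
    clip = upsample_spatial(clip, factor=4)  # second SR stage
    print(clip.shape)  # torch.Size([64, 3, 1024, 1024])
```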
Where Pith is reading between the lines
- The same separation of appearance learning from motion learning could be tried on other data-scarce generation tasks such as 3D or audio synthesis.
- Modular pipelines like this one may reduce the total compute needed when extending image models to new domains.
- The approach opens a route to video editing or animation tools that start from a single text prompt and then refine motion independently.
Load-bearing premise
Motion patterns taken from unlabeled video can be added to a text-to-image model through these modules without creating visible motion artifacts or weakening how well the output matches the original text prompt.
What would settle it
A side-by-side evaluation on the same text prompts where Make-A-Video outputs show more flickering, unnatural object trajectories, or lower text-video alignment scores than models trained directly on paired text-video data.
Original abstract
We propose Make-A-Video -- an approach for directly translating the tremendous recent progress in Text-to-Image (T2I) generation to Text-to-Video (T2V). Our intuition is simple: learn what the world looks like and how it is described from paired text-image data, and learn how the world moves from unsupervised video footage. Make-A-Video has three advantages: (1) it accelerates training of the T2V model (it does not need to learn visual and multimodal representations from scratch), (2) it does not require paired text-video data, and (3) the generated videos inherit the vastness (diversity in aesthetic, fantastical depictions, etc.) of today's image generation models. We design a simple yet effective way to build on T2I models with novel and effective spatial-temporal modules. First, we decompose the full temporal U-Net and attention tensors and approximate them in space and time. Second, we design a spatial temporal pipeline to generate high resolution and frame rate videos with a video decoder, interpolation model and two super resolution models that can enable various applications besides T2V. In all aspects, spatial and temporal resolution, faithfulness to text, and quality, Make-A-Video sets the new state-of-the-art in text-to-video generation, as determined by both qualitative and quantitative measures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Make-A-Video, a text-to-video generation method that transfers progress from text-to-image (T2I) models by learning appearance and text alignment from paired text-image data while acquiring motion dynamics from unsupervised video footage. It introduces a spatial-temporal decomposition of the U-Net and attention tensors, combined with a multi-stage pipeline (video decoder, temporal interpolation, and super-resolution models) to produce high-resolution, high-frame-rate videos without requiring paired text-video data. The central claim is that this yields state-of-the-art results in spatial/temporal resolution, text faithfulness, and perceptual quality, as measured by both qualitative examples and quantitative metrics.
Significance. If the quantitative claims hold, the work is significant because it demonstrates a practical route to high-quality T2V generation that sidesteps the scarcity of paired text-video data, accelerates training by reusing T2I representations, and inherits the diversity of modern image generators. The decomposition approach and modular pipeline are reusable for other video synthesis tasks and could reduce compute barriers in the field.
major comments (2)
- [§4] §4 (Experiments): The SOTA claim is central but rests on quantitative comparisons whose details (specific metrics such as FVD, CLIP similarity, or human preference scores, exact baselines, and effect sizes) are not summarized in the abstract and must be verified against prior T2V methods; without these numbers and ablations on the spatial-temporal modules, the superiority cannot be assessed.
- [§3.2] §3.2 (Spatial-Temporal Decomposition): The approximation of full temporal U-Net and attention tensors in space and time is described at a high level; the paper must supply the precise tensor factorization or insertion points (e.g., which layers receive the temporal attention) to confirm that motion transfer occurs without degrading text conditioning or introducing systematic artifacts.
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly list the quantitative metrics and baselines used to support the SOTA statement.
- [Figures] Figure captions for qualitative results should include the exact text prompts and frame counts to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications from the paper and propose targeted revisions to strengthen the presentation of our results and technical details.
Point-by-point responses
-
Referee: [§4] §4 (Experiments): The SOTA claim is central but rests on quantitative comparisons whose details (specific metrics such as FVD, CLIP similarity, or human preference scores, exact baselines, and effect sizes) are not summarized in the abstract and must be verified against prior T2V methods; without these numbers and ablations on the spatial-temporal modules, the superiority cannot be assessed.
Authors: We agree that a concise summary of the key quantitative results would improve accessibility. Section 4 reports FVD, CLIP similarity, and human preference scores against baselines including CogVideo and other recent T2V methods, with effect sizes and ablations on the spatial-temporal modules detailed in Tables 1-3 and Section 4.3 (plus appendix). The abstract states the SOTA outcome but does not list the numbers. We will revise the abstract to include a brief summary of the primary metrics and baselines while retaining the existing detailed comparisons in the experiments section. revision: partial
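For context on the metric family named above, a hedged sketch of a frame-averaged CLIP similarity score is shown below; the checkpoint, preprocessing, and averaging choices are assumptions about a typical evaluation harness, not the paper's exact protocol.

```python
# Frame-averaged CLIP similarity between a prompt and generated frames.
# The checkpoint and averaging scheme here are assumptions for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_similarity(prompt: str, frames: list[Image.Image]) -> float:
    """Average cosine similarity between the prompt and each generated frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()


if __name__ == "__main__":
    dummy_frames = [Image.new("RGB", (224, 224)) for _ in range(4)]
    print(clip_similarity("a dog riding a skateboard", dummy_frames))
```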
-
Referee: [§3.2] §3.2 (Spatial-Temporal Decomposition): The approximation of full temporal U-Net and attention tensors in space and time is described at a high level; the paper must supply the precise tensor factorization or insertion points (e.g., which layers receive the temporal attention) to confirm that motion transfer occurs without degrading text conditioning or introducing systematic artifacts.
Authors: We appreciate this request for greater precision. Section 3.2 describes the decomposition of the U-Net and attention tensors into separate spatial and temporal factors, with temporal attention inserted after spatial attention in the decoder blocks to enable motion modeling while preserving the pretrained text-image conditioning pathway. To address the comment directly, we will add a detailed diagram and explicit layer specifications (including tensor shapes and insertion points) in the revised Section 3.2. revision: yes
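A minimal sketch of the convolutional half of such a decomposition: a per-frame spatial convolution followed by an identity-initialized temporal convolution, so the stacked module initially reproduces the pretrained image model's behavior. The kernel sizes and the identity initialization are assumptions for illustration, not the paper's exact layer specification.

```python
# Factorized ("pseudo-3D") convolution sketch: spatial conv per frame, then a
# temporal conv per pixel, identity-initialized so the pretrained pathway is
# preserved at the start of video fine-tuning (assumed initialization).
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Spatial 3x3 conv applied per frame (would come from the T2I model).
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Temporal 1D conv applied per pixel, initialized as the identity.
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        nn.init.dirac_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, t, h, w = x.shape
        xs = x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
        xs = self.spatial(xs)
        xt = xs.reshape(b, t, c, h, w).permute(0, 3, 4, 2, 1).reshape(b * h * w, c, t)
        xt = self.temporal(xt)
        return xt.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)


if __name__ == "__main__":
    conv = Pseudo3DConv(channels=8)
    x = torch.randn(1, 8, 4, 32, 32)
    print(conv(x).shape)  # torch.Size([1, 8, 4, 32, 32])
```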
Circularity Check
No significant circularity in derivation chain
Full rationale
The paper presents Make-A-Video as a pipeline that inherits appearance from external pretrained T2I models and motion from separate unsupervised video data. It describes a spatial-temporal decomposition of U-Net/attention tensors plus a multi-stage generation pipeline (video decoder, interpolation, super-resolution). No load-bearing step reduces by construction to a self-fit, self-definition, or self-citation chain; the central claim is a concrete engineering combination of independent pretrained components rather than a tautological prediction. The SOTA assertion rests on external qualitative/quantitative evaluation, not internal re-derivation of inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Decomposing the full temporal U-Net and attention tensors into separate spatial and temporal approximations preserves sufficient modeling capacity for coherent video generation.
Forward citations
Cited by 48 Pith papers
-
GTA: Advancing Image-to-3D World Generation via Geometry Then Appearance Video Diffusion
GTA generates 3D worlds from single images via a two-stage video diffusion process that prioritizes geometry before appearance to improve structural consistency.
-
Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
-
DCR: Counterfactual Attractor Guidance for Rare Compositional Generation
DCR uses a counterfactual attractor and projection-based repulsion to suppress default completion bias in diffusion models, improving fidelity for rare compositional prompts while preserving quality.
-
Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
Structured diffusion bridges with alignment constraints achieve near fully-paired quality in modality translation while working effectively in unpaired and semi-paired regimes.
-
TrajShield: Trajectory-Level Safety Mediation for Defending Text-to-Video Models Against Jailbreak Attacks
TrajShield is a training-free defense that reduces jailbreak success rates by 52.44% on average in text-to-video models by localizing and neutralizing risks through trajectory simulation and causal intervention.
-
Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Cutscene Agent uses a multi-agent LLM system and a new toolkit for game engine control to automate end-to-end 3D cutscene generation, evaluated on the introduced CutsceneBench.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Sparse Forcing: Native Trainable Sparse Attention for Real-time Autoregressive Diffusion Video Generation
Sparse Forcing adds a native trainable sparsity mechanism and PBSA kernel to autoregressive diffusion video models, yielding higher VBench scores and 1.1-1.27x speedups on 5s to 1min generations.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
HumANDiff: Articulated Noise Diffusion for Motion-Consistent Human Video Generation
HumANDiff improves motion consistency in human video generation by sampling diffusion noise on an articulated human body template and adding joint appearance-motion prediction plus a geometric consistency loss.
-
Training-Free Refinement of Flow Matching with Divergence-based Sampling
Flow Divergence Sampler refines flow matching by computing velocity field divergence to correct ambiguous intermediate states during inference, improving fidelity in text-to-image and inverse problem tasks.
-
Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding
Omni-NegCLIP improves CLIP's negation understanding by up to 52.65% on presence-based and 12.50% on absence-based tasks through front-layer fine-tuning with specialized contrastive losses.
-
Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation
A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.
-
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
-
Imagen Video: High Definition Video Generation with Diffusion Models
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
-
Beyond Text Prompts: Visual-to-Visual Generation as A Unified Paradigm
V2V-Zero adapts frozen VLMs for visual conditioning via hidden states from specification pages, scoring 0.85 on GenEval and 32.7 on a new seven-task benchmark while revealing capability hierarchies in attribute bindin...
-
FIS-DiT: Breaking the Few-Step Video Inference Barrier via Training-Free Frame Interleaved Sparsity
FIS-DiT achieves 2.11-2.41x speedup on video DiT models in few-step regimes with negligible quality loss by exploiting frame-wise sparsity and consistency through a training-free interleaved execution strategy.
-
Stage-adaptive audio diffusion modeling
A semantic progress signal from SSL discrepancy slope enables three stage-aware mechanisms that improve training efficiency and performance in audio diffusion models over static baselines.
-
Stream-T1: Test-Time Scaling for Streaming Video Generation
Stream-T1 is a test-time scaling framework for streaming video generation using scaled noise propagation from history, reward pruning across short and long windows, and feedback-guided memory sinking to improve tempor...
-
A unified perspective on fine-tuning and sampling with diffusion and flow models
A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...
-
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation
CityRAG generates minutes-long 3D-consistent videos of real-world cities by grounding outputs in geo-registered data and using temporally unaligned training to disentangle fixed scenes from transient elements like weather.
-
Deepfake Detection Generalization with Diffusion Noise
ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
-
DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
RTR-DiT distills a bidirectional DiT teacher into an autoregressive few-step model using Self Forcing and Distribution Matching Distillation, plus a reference-preserving KV cache, to enable stable real-time text- and ...
-
VGA-Bench: A Unified Benchmark and Multi-Model Framework for Video Aesthetics and Generation Quality Evaluation
VGA-Bench creates a three-tier taxonomy, 1,016-prompt dataset of 60k+ videos, and three multi-task neural models (VAQA-Net, VTag-Net, VGQA-Net) that align with human judgments for video aesthetics and generation quality.
-
Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories
A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.
-
ELT: Elastic Looped Transformers for Visual Generation
Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.
-
ImVideoEdit: Image-learning Video Editing via 2D Spatial Difference Attention Blocks
ImVideoEdit learns video editing from 13K image pairs by decoupling spatial modifications from frozen temporal dynamics in pretrained models, matching larger video-trained systems in fidelity and consistency.
-
INSPATIO-WORLD: A Real-Time 4D World Simulator via Spatiotemporal Autoregressive Modeling
INSPATIO-WORLD is a real-time framework for high-fidelity 4D scene generation and navigation from monocular videos via STAR architecture with implicit caching, explicit geometric constraints, and distribution-matching...
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
GENSERVE: Efficient Co-Serving of Heterogeneous Diffusion Model Workloads
GENSERVE improves SLO attainment by up to 44% for co-serving heterogeneous T2I and T2V diffusion workloads via step-level preemption, elastic parallelism, and joint scheduling.
-
HVG-3D: Bridging Real and Simulation Domains for 3D-Conditional Hand-Object Interaction Video Synthesis
HVG-3D uses a 3D-aware diffusion architecture with ControlNet to synthesize high-fidelity hand-object interaction videos from 3D control signals, achieving state-of-the-art spatial fidelity and temporal coherence on t...
-
SkyReels-V2: Infinite-length Film Generative Model
SkyReels-V2 produces infinite-length film videos via MLLM-based captioning, progressive pretraining, motion RL, and diffusion forcing with non-decreasing noise schedules.
-
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
CogVideoX generates coherent 10-second text-to-video outputs at high resolution using a 3D VAE, expert adaptive LayerNorm transformer, progressive training, and a custom data pipeline, claiming state-of-the-art results.
-
CameraCtrl: Enabling Camera Control for Text-to-Video Generation
CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
R-DMesh: Video-Guided 3D Animation via Rectified Dynamic Mesh Flow
R-DMesh uses a VAE with a learned rectification jump offset and Triflow Attention inside a rectified-flow diffusion transformer to produce video-aligned 4D meshes despite initial pose misalignment.
-
RealDiffusion: Physics-informed Attention for Multi-character Storybook Generation
RealDiffusion uses heat diffusion as a dissipative prior and a region-aware stochastic process inside a training-free physics-informed attention mechanism to improve multi-character coherence while preserving narrativ...
-
ST-Gen4D: Embedding 4D Spatiotemporal Cognition into World Model for 4D Generation
ST-Gen4D uses a world model that fuses global appearance and local dynamic graphs into a 4D cognition representation to guide consistent 4D Gaussian generation.
-
Structured Diffusion Bridges: Inductive Bias for Denoising Diffusion Bridges
A structured diffusion bridge method achieves near fully-paired modality translation quality using alignment constraints even in unpaired or semi-paired regimes.
-
DiffMagicFace: Identity Consistent Facial Editing of Real Videos
DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
-
FineEdit: Fine-Grained Image Edit with Bounding Box Guidance
FineEdit adds multi-level bounding box injection to diffusion image editing, releases a 1.2M-pair dataset with box annotations, and shows better instruction following and background consistency than prior open models ...
-
Not all tokens contribute equally to diffusion learning
DARE mitigates neglect of important tokens in conditional diffusion models via distribution-rectified guidance and spatial attention alignment.
-
Open-Sora: Democratizing Efficient Video Production for All
Open-Sora releases an open-source video generation model based on a Spatial-Temporal Diffusion Transformer that decouples spatial and temporal attention, supporting text-to-video, image-to-video, and text-to-image tas...
-
ModelScope Text-to-Video Technical Report
ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
-
Evolution of Video Generative Foundations
This survey traces video generation technology from GANs to diffusion models and then to autoregressive and multimodal approaches while analyzing principles, strengths, and future trends.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.