Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Pith reviewed 2026-05-12 07:34 UTC · model grok-4.3
The pith
Scaling the frozen, text-only language model that conditions an image diffusion model improves photorealism and text alignment more than scaling the diffusion model itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment.
What carries the argument
A cascaded diffusion model whose denoising steps are conditioned on embeddings from a large, frozen T5 language model pretrained only on text.
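To make that machinery concrete, below is a minimal sketch of the conditioning pattern: a frozen, text-only T5 encoder produces token embeddings, and the denoiser attends to them via cross-attention. This is a toy stand-in, not Imagen's Efficient U-Net; the layer sizes are illustrative, timestep conditioning and the super-resolution cascade are omitted, and it assumes PyTorch plus the Hugging Face transformers package.

```python
# Minimal sketch: one denoising step conditioned on frozen T5 embeddings.
# Illustrative only; not Imagen's actual architecture.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")
text_encoder.requires_grad_(False)  # the language model stays frozen

class ToyDenoiser(nn.Module):
    """Predicts noise for a flattened 64x64 image, attending to text tokens."""
    def __init__(self, img_dim=64 * 64 * 3, model_dim=256, text_dim=512):
        super().__init__()
        self.in_proj = nn.Linear(img_dim, model_dim)
        self.text_proj = nn.Linear(text_dim, model_dim)  # T5-small hidden size is 512
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(model_dim, img_dim)

    def forward(self, noisy_image, text_emb):
        h = self.in_proj(noisy_image).unsqueeze(1)  # (B, 1, D)
        ctx = self.text_proj(text_emb)              # (B, T, D)
        h, _ = self.cross_attn(h, ctx, ctx)         # image query attends to text tokens
        return self.out_proj(h.squeeze(1))          # predicted noise

tokens = tokenizer(["a corgi riding a bicycle"], return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state  # (1, T, 512)

denoiser = ToyDenoiser()
noisy = torch.randn(1, 64 * 64 * 3)
predicted_noise = denoiser(noisy, text_emb)  # one text-conditioned denoising step
```

The design point the paper leans on is visible here: the text encoder contributes only embeddings and receives no gradients, so all language understanding comes from text-only pretraining.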
If this is right
- Imagen sets a new FID record of 7.27 on COCO without any COCO training data.
- Human raters judge Imagen samples equal to real COCO images for image-text alignment.
- On the new DrawBench benchmark Imagen is preferred over VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2 in both quality and alignment.
Where Pith is reading between the lines
- The result suggests language understanding acquired from text alone transfers directly to visual generation tasks.
- Future scaling laws for text-to-image systems may prioritize language-model size over diffusion-model capacity.
- The approach could reduce reliance on large curated image-text datasets for training high-quality generators.
Load-bearing premise
A language model pretrained solely on text corpora can provide sufficiently rich conditioning signals for high-fidelity image synthesis without any image-text paired pretraining or fine-tuning.
What would settle it
A controlled experiment that scales the language model while holding the diffusion model fixed and shows no further gains in FID or human preference scores would falsify the central claim.
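A sketch of that settling experiment follows: scale only the frozen text encoder while the diffusion model and training budget stay fixed. `train_diffusion` and `evaluate` are hypothetical stubs standing in for a real training and evaluation pipeline, and the parameter values are illustrative, not from the paper.

```python
# Hypothetical settling experiment: vary only the text-encoder axis.
import random

def train_diffusion(text_encoder: str, diffusion_params: int, train_steps: int):
    """Hypothetical stand-in: would train a fixed-size diffusion model
    conditioned on the named frozen text encoder."""
    return {"text_encoder": text_encoder, "diffusion_params": diffusion_params}

def evaluate(model, benchmark: str):
    """Hypothetical stand-in: would compute zero-shot FID and a human
    preference rate on the named benchmark."""
    return random.uniform(7.0, 20.0), random.uniform(0.0, 1.0)

FIXED_DIFFUSION_PARAMS = 300_000_000  # illustrative; held constant across runs

for t5_variant in ["t5-small", "t5-large", "t5-xl", "t5-xxl"]:
    model = train_diffusion(t5_variant, FIXED_DIFFUSION_PARAMS, train_steps=500_000)
    fid, preference = evaluate(model, benchmark="COCO-30k")
    print(t5_variant, fid, preference)

# The central claim is falsified if FID and human preference stop improving
# as the language model grows while everything else is held fixed.
```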
read the original abstract
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Imagen, a text-to-image diffusion model that conditions a cascaded diffusion process on text embeddings from a large pretrained T5 language model. Its central claim is that scaling the language model size improves both sample fidelity and text-image alignment substantially more than scaling the diffusion model, yielding a new zero-shot SOTA FID of 7.27 on COCO (without any COCO training) and human preference over DALL-E 2 and other baselines on the introduced DrawBench benchmark.
Significance. If reproducible, the result establishes that text-only pretrained LMs supply sufficiently rich conditioning signals for photorealistic synthesis and that LM capacity is the dominant scaling axis. This supplies a concrete, falsifiable scaling observation together with strong external benchmarks (COCO FID, DrawBench human ratings) and could redirect architectural priorities in multimodal generation toward larger language encoders.
major comments (2)
- [Ablation studies] The scaling claim (LM size > diffusion size) is load-bearing yet the manuscript provides only qualitative statements; quantitative deltas in FID and human preference for matched compute increases in each component are needed to substantiate 'much more' (see ablation results).
- [Experimental evaluation] The zero-shot COCO FID of 7.27 is a headline result, but the text does not report the number of generated samples, the precise FID implementation (Inception-v3 features, etc.), or any data-exclusion criteria applied to the COCO validation set.
minor comments (3)
- [DrawBench] DrawBench prompt selection and rating protocol should be described in more detail (e.g., how many raters per pair, inter-rater agreement, exact side-by-side presentation order) to allow independent replication.
- [Model architecture and training] Training hyperparameters (learning rates, noise schedules, the classifier-free guidance scale, exact T5 variant sizes) are only partially listed; a complete table would improve reproducibility. A short sketch of the guidance rule follows these comments.
- [Figures] Figure captions for qualitative samples should explicitly state the prompt used and whether any post-processing was applied.
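On the classifier-free guidance scale referenced above: the blending rule itself is one line. Below is a minimal sketch in the parameterization used by the Imagen paper (after Ho and Salimans [27]), where w = 1 recovers the plain conditional model; the function name is ours, not from any released codebase.

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             w: float) -> torch.Tensor:
    """Blend conditional and unconditional noise predictions.

    w = 1.0 recovers the plain conditional model; w > 1 pushes samples
    toward the text condition, typically improving alignment at some
    cost in diversity. Imagen-style parameterization (after [27]).
    """
    return w * eps_cond + (1.0 - w) * eps_uncond
```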
Simulated Author's Rebuttal
We thank the referee for their positive recommendation of minor revision and for the constructive comments on our scaling claims and experimental reporting. We address each point below and will update the manuscript to incorporate the requested clarifications and additional quantitative details.
read point-by-point responses
-
Referee: [Ablation studies] The scaling claim (LM size > diffusion size) is load-bearing yet the manuscript provides only qualitative statements; quantitative deltas in FID and human preference for matched compute increases in each component are needed to substantiate 'much more' (see ablation results).
Authors: We agree that more explicit quantitative evidence would strengthen the presentation of the scaling results. While the manuscript already contains ablation experiments comparing language model and diffusion model scaling, we will expand this section in the revision with a new table that reports specific FID deltas and human preference score differences for matched compute increases in each component. This will provide the numerical substantiation requested for the claim that language model scaling yields substantially larger gains. revision: yes
-
Referee: [Experimental evaluation] The zero-shot COCO FID of 7.27 is a headline result, but the text does not report the number of generated samples, the precise FID implementation (Inception-v3 features, etc.), or any data-exclusion criteria applied to the COCO validation set.
Authors: We thank the referee for noting these omissions. In the revised manuscript we will explicitly report that the FID of 7.27 was computed using 30,000 generated samples, the standard Inception-v3 feature extractor (with the exact implementation and preprocessing details now stated), and that no images were excluded from the COCO validation set. Because Imagen was never trained on COCO data, the full validation set was used for the zero-shot evaluation. revision: yes
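For concreteness, here is a hedged sketch of the FID protocol described in the response: the Fréchet distance between Inception-v3 pool-feature statistics of generated samples and reference images. The feature-extraction step is stubbed with random arrays; in the actual protocol the (N, 2048) features would come from Inception-v3 run over 30,000 model samples and the COCO validation images.

```python
# Fréchet Inception Distance from feature statistics (sketch).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID between two (N, 2048) arrays of Inception-v3 pool features."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Stubbed usage; real features would come from an Inception-v3 forward pass.
feats_real = np.random.randn(5000, 2048)
feats_fake = np.random.randn(5000, 2048)
print(frechet_distance(feats_real, feats_fake))
```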
Circularity Check
No significant circularity; empirical results on external benchmarks
full rationale
The paper's central claims rest on empirical ablations and zero-shot evaluations (COCO FID of 7.27 without COCO training, DrawBench human preferences) rather than any derivation that reduces to fitted parameters or self-citations by construction. LM scaling benefits are reported as observed outcomes from varying model sizes, not tautological predictions. No self-definitional equations, fitted-input-as-prediction, or load-bearing self-citation chains appear in the reported methodology or results. The text-only LM conditioning assumption is tested directly by performance on benchmarks that are independent of the model's training data.
Axiom & Free-Parameter Ledger
free parameters (2)
- T5 language model size
- Diffusion model size
axioms (1)
- domain assumption: A transformer language model pretrained only on text can produce embeddings that are effective conditioning signals for a diffusion image generator.
Forward citations
Cited by 32 Pith papers
-
Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability
Diffusion posterior samplers produce biased outputs that can be expressed as an Ornstein-Uhlenbeck path expectation via a surrogate Gaussian path and Feynman-Kac representation, with STSL flattening the spatially vary...
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
Building Normalizing Flows with Stochastic Interpolants
Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Prompt-to-Prompt Image Editing with Cross Attention Control
Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
-
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
-
Adaptive Subspace Projection for Generative Personalization
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
-
Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling
SET detects input-level backdoors in T2I diffusion models by learning a benign cross-attention response space from clean samples and flagging deviations under multi-scale perturbations.
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
Scalable Diffusion Models with Transformers
DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
-
DreamFusion: Text-to-3D using 2D Diffusion
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
-
Deepfake Detection Generalization with Diffusion Noise
ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
-
Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation
Creo scaffolds text-to-image generation through progressive stages with editable abstractions and decision locking to improve controllability, agency, and output diversity.
-
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
Make-A-Video: Text-to-Video Generation without Text-Video Data
Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
-
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
-
KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation
KANMultiSign generates sign language poses from notation via coarse-to-fine multi-scale supervision and compact KAN-Transformer modules, achieving lower DTW joint error with fewer parameters than baselines on several ...
-
Towards Robust Sequential Decomposition for Complex Image Editing
Sequential decomposition trained on synthetic editing tasks improves robustness for complex image instructions and transfers to real images via co-training.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk
Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.
-
LTX-2: Efficient Joint Audio-Visual Foundation Model
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
-
Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint
In a single Parkinson's patient, gait conditions with comparable linear performance metrics showed different temporal organizations in dynamical state space and unsupervised latent embeddings when vertical occlusion d...
Reference graph
Works this paper leans on
- [1] Osman Aka, Ken Burke, Alex Bauerle, Christina Greer, and Margaret Mitchell. Measuring Model Biases in the Absence of Ground Truth. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021.
- [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv preprint arXiv:1607.06450, 2016.
- [3] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of FAccT 2021, 2021.
- [4] Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes. arXiv:2110.01963, 2021.
- [5] Shikha Bordia and Samuel R. Bowman. Identifying and Reducing Gender Bias in Word-Level Language Models. In NAACL, 2019.
- [6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv preprint arXiv:1809.11096, 2018.
- [7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi... 2020.
- [8] Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, Proceedings of Machine Learning Research. PMLR, 2018.
- [9] Kaylee Burns, Lisa Hendricks, Trevor Darrell, and Anna Rohrbach. Women Also Snowboard: Overcoming Bias in Captioning Models. In European Conference on Computer Vision (ECCV), 2018.
- [10] Jaemin Cho, Abhay Zala, and Mohit Bansal. DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers. arXiv:2202.04053, 2022.
- [11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...
- [12] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. arXiv preprint arXiv:2204.08583, 2022.
- [13] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling. Advances in Neural Information Processing Systems, 34, 2021.
- [14] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. In NIPS, 2015.
- [15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019.
- [16] Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS, 2021.
- [17] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering Text-to-Image Generation via Transformers. Advances in Neural Information Processing Systems, 34, 2021.
- [18] Chris Dulhanty. Issues in Computer Vision Data Collection: Bias, Consent, and Label Taxonomy. In UWSpace, 2020.
- [19] Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
- [20] Mary Anne Franks and Ari Ezra Waldman. Sex, Lies and Videotape: Deep Fakes and Free Speech Delusions. Maryland Law Review, 78(4):892–898, 2019.
- [21] Tsu-Jui Fu, Xin Eric Wang, and William Yang Wang. Language-Driven Image Style Transfer. arXiv preprint arXiv:2106.00178, 2021.
- [22] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv preprint arXiv:2203.13131, 2022.
- [23] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], March 2020.
- [24] Adam Harvey and Jules LaPlace. MegaPixels: Origins and Endpoints of Biometric Datasets "In the Wild". https://megapixels.cc, 2019.
- [25] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. arXiv preprint arXiv:2104.08718, 2021.
- [26] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv preprint arXiv:1706.08500, 2017.
- [27] Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- [28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
- [29] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. JMLR, 2022.
- [30] Rowan T. Hughes, Liming Zhu, and Tomasz Bednarz. Generative Adversarial Networks-Enabled Human-Artificial Intelligence Collaborative Applications for Creative and Design Industries: A Systematic Review of Current Approaches and Trends. Frontiers in Artificial Intelligence, 4, 2021.
- [31] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [32] Zahra Kadkhodaie and Eero P. Simoncelli. Solving Linear Inverse Problems Using the Prior Implicit in a Denoiser. arXiv preprint arXiv:2007.13640, 2020.
- [33] Zahra Kadkhodaie and Eero P. Simoncelli. Stochastic Solutions for Linear Inverse Problems Using the Prior Implicit in a Denoiser. Advances in Neural Information Processing Systems, 34, 2021.
- [34] Gwanghyun Kim and Jong Chul Ye. DiffusionCLIP: Text-Guided Image Manipulation Using Diffusion Models. arXiv preprint arXiv:2110.02711, 2021.
- [35] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational Diffusion Models. arXiv preprint arXiv:2107.00630, 2021.
- [36] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
- [37] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating Images from Captions with Attention. In ICLR, 2016.
- [38] Gary Marcus, Ernest Davis, and Scott Aaronson. A Very Preliminary Analysis of DALL-E 2. arXiv:2204.13807, 2022.
- [39] Jacob Menick and Nal Kalchbrenner. Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling. In ICLR, 2019.
- [40] Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2102.09672, 2021.
- [41] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021.
- [42] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In CVPR, 2022.
- [43] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. Data and Its (Dis)contents: A Survey of Dataset Development and Use in Machine Learning Research. Patterns, 2(11):100336, 2021.
- [44] Vinay Uday Prabhu and Abeba Birhane. Large Image Datasets: A Pyrrhic Win for Computer Vision? arXiv:2006.16923, 2020.
- [45] Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.
- [46] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to Generate Reviews and Discovering Sentiment. arXiv:1704.01444, 2017.
- [47] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. Preprint, 2018.
- [48] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. Preprint, 2019.
- [49] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
- [50] Jack Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George Driessche, Lisa Hendricks, Maribeth Rauh, Po-Sen Huang, and Geoffrey Irving. Scaling Language Models: Methods, Analysis & ... 2021.
- [51] Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In ICML, 2017.
- [52] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 2020.
- [53] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. In ICML, 2021.
- [54] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv, 2022.
- [55] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. arXiv preprint arXiv:1906.00446, 2019.
- [56] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
- [57] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
- [58] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-Image Diffusion Models. arXiv:2111.05826, 2021.
- [59] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image Super-Resolution via Iterative Refinement. arXiv preprint arXiv:2104.07636, 2021.
- [60] Morgan Klaus Scheuerman, Emily L. Denton, and A. Hanna. Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. Proceedings of the ACM on Human-Computer Interaction, 5:1–37, 2021.
- [61] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv preprint arXiv:2111.02114, 2021.
- [62] Lucas Sequeira, Bruno Moreschi, Amanda Jurno, and Vinicius Arruda dos Santos. Which Faces Can AI Generate? Normativity, Whiteness and Lack of Diversity in This Person Does Not Exist. In CVPR Workshop Beyond Fairness: Towards a Just, Equitable, and Accountable Computer Vision, 2021.
- [63] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- [64] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502, 2020.
- [65] Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS, 2019.
- [66] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In ICLR, 2021.
- [67] Ramya Srinivasan and Kanji Uchino. Biases in Generative Art: A Causal Look from the Lens of Art History. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 41–51, 2021.
- [68] Ryan Steed and Aylin Caliskan. Image Representations Learned with Unsupervised Pre-Training Contain Human-Like Biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 701–713. Association for Computing Machinery, 2021.
- [69] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis. arXiv preprint arXiv:2008.05865, 2020.
- [70] Belinda Tzen and Maxim Raginsky. Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit. arXiv:1905.09883, 2019.
- [71] Aaron van den Oord, Oriol Vinyals, et al. Neural Discrete Representation Learning. Advances in Neural Information Processing Systems, 30, 2017.
- [72] Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- [73] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
- [74] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling Up to Large Vocabulary Image Annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
- [75] Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G. Dimakis, and Peyman Milanfar. Deblurring via Stochastic Refinement. arXiv preprint arXiv:2112.02475, 2021.
- [76] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In CVPR, 2018.
- [77] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
- [78] Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji. Improving Text-to-Image Synthesis Using Contrastive Learning. arXiv preprint arXiv:2107.02423, 2021.
- [79] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-Quantized Image Modeling with Improved VQGAN. arXiv preprint arXiv:2110.04627, 2021.
- [80] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv preprint arXiv:2205.01917, 2022.
discussion (0)