pith. machine review for the scientific record.

arxiv: 2205.11487 · v1 · submitted 2022-05-23 · 💻 cs.CV · cs.LG

Recognition: 3 theorem links · Lean Theorem

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Burcu Karagol Ayan, Chitwan Saharia, David J Fleet, Emily Denton, Jay Whang, Jonathan Ho, Lala Li, Mohammad Norouzi, Rapha Gontijo Lopes, Saurabh Saxena, Seyed Kamyar Seyed Ghasemipour, S. Sara Mahdavi, Tim Salimans, William Chan

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 07:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords text-to-image synthesis · diffusion models · large language models · photorealistic generation · imagen · drawbench · coco benchmark

The pith

Large text-only language models condition diffusion models for photorealistic image generation more effectively than additional diffusion-model capacity does.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Imagen, a text-to-image diffusion model that reaches new levels of photorealism and language understanding. Its central finding is that scaling a generic language model pretrained only on text, such as T5, improves both image fidelity and text alignment more than scaling the diffusion model itself. Imagen records a state-of-the-art FID of 7.27 on the COCO benchmark despite never training on COCO images, and human evaluators judge its outputs comparable to real COCO photographs in alignment. The work also presents DrawBench, a new evaluation set on which human judges prefer Imagen over prior methods, including DALL-E 2, in side-by-side comparisons.

Core claim

Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment.

What carries the argument

A cascaded diffusion model whose denoising steps are conditioned on embeddings from a large, frozen T5 language model pretrained only on text.
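
The conditioning pathway can be sketched in miniature: at each denoising step, image tokens attend to text embeddings that were computed once by a frozen encoder and are never updated. The toy dimensions, random weights, and single-head attention below are illustrative stand-ins, not Imagen's actual architecture.

```python
import numpy as np

def cross_attention(x, text_emb, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens query frozen text embeddings."""
    q = x @ w_q                                   # queries from noisy image tokens
    k = text_emb @ w_k                            # keys from frozen text embeddings
    v = text_emb @ w_v                            # values from frozen text embeddings
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ v                           # residual update of image tokens

rng = np.random.default_rng(0)
d = 16
x = rng.normal(size=(64, d))                      # image tokens at one denoising step
text_emb = rng.normal(size=(8, d))                # stand-in for frozen T5 encoder outputs
w_q, w_k, w_v = (0.1 * rng.normal(size=(d, d)) for _ in range(3))

out = cross_attention(x, text_emb, w_q, w_k, w_v)
print(out.shape)  # (64, 16): image tokens are updated; the text encoder is never trained
```

Only the attention weights and the diffusion network learn; the text embedding table stays fixed, which is what lets a text-only pretrained encoder be swapped in wholesale.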

If this is right

  • Imagen sets a new FID record of 7.27 on COCO without any COCO training data.
  • Human raters judge Imagen samples equal to real COCO images for image-text alignment.
  • On the new DrawBench benchmark Imagen is preferred over VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2 in both quality and alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests language understanding acquired from text alone transfers directly to visual generation tasks.
  • Future scaling laws for text-to-image systems may prioritize language-model size over diffusion-model capacity.
  • The approach could reduce reliance on large curated image-text datasets for training high-quality generators.

Load-bearing premise

A language model pretrained solely on text corpora can provide sufficiently rich conditioning signals for high-fidelity image synthesis without any image-text paired pretraining or fine-tuning.

What would settle it

A controlled experiment that scales the language model while holding the diffusion model fixed and shows no further gains in FID or human preference scores would falsify the central claim.
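
Such a comparison amounts to two one-dimensional sweeps through a configuration grid, scaling one component while pinning the other. A minimal sketch of the experimental design, with illustrative size labels rather than the paper's actual configurations:

```python
# Illustrative size labels, not the paper's actual model configurations.
lm_sizes = ["small", "large", "xxl"]   # text-encoder scale
unet_sizes = [300, 500, 1000]          # diffusion U-Net size, millions of parameters

# Axis 1: scale the language model, diffusion model pinned at its middle size.
lm_axis = [(lm, unet_sizes[1]) for lm in lm_sizes]
# Axis 2: scale the diffusion model, language model pinned at its middle size.
unet_axis = [(lm_sizes[1], unet) for unet in unet_sizes]

runs = lm_axis + unet_axis
print(len(runs))  # 6 configurations; the central claim fails if axis 1 shows no FID gain
```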

read the original abstract

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Imagen, a text-to-image diffusion model that conditions a cascaded diffusion process on text embeddings from a large pretrained T5 language model. Its central claim is that scaling the language model size improves both sample fidelity and text-image alignment substantially more than scaling the diffusion model, yielding a new zero-shot SOTA FID of 7.27 on COCO (without any COCO training) and human preference over DALL-E 2 and other baselines on the introduced DrawBench benchmark.

Significance. If reproducible, the result establishes that text-only pretrained LMs supply sufficiently rich conditioning signals for photorealistic synthesis and that LM capacity is the dominant scaling axis. This supplies a concrete, falsifiable scaling observation together with strong external benchmarks (COCO FID, DrawBench human ratings) and could redirect architectural priorities in multimodal generation toward larger language encoders.

major comments (2)
  1. [Ablation studies] The scaling claim (LM size > diffusion size) is load-bearing yet the manuscript provides only qualitative statements; quantitative deltas in FID and human preference for matched compute increases in each component are needed to substantiate 'much more' (see ablation results).
  2. [Experimental evaluation] The zero-shot COCO FID of 7.27 is a headline result, but the text does not report the number of generated samples, the precise FID implementation (Inception-v3 features, etc.), or any data-exclusion criteria applied to the COCO validation set.
minor comments (3)
  1. [DrawBench] DrawBench prompt selection and rating protocol should be described in more detail (e.g., how many raters per pair, inter-rater agreement, exact side-by-side presentation order) to allow independent replication.
  2. [Model architecture and training] Training hyperparameters (learning rates, noise schedules, classifier-free guidance scale, exact T5 variant sizes) are only partially listed; a complete table would improve reproducibility.
  3. [Figures] Figure captions for qualitative samples should explicitly state the prompt used and whether any post-processing was applied.
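
One of the under-reported hyperparameters the referee flags, the classifier-free guidance scale, has a one-line definition worth pinning down: the sampler extrapolates the conditional noise prediction away from the unconditional one. A minimal sketch with illustrative toy vectors:

```python
import numpy as np

def cfg_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: push the conditional noise prediction
    away from the unconditional one by guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_c = np.array([1.0, 0.5])   # illustrative conditional prediction
eps_u = np.array([0.2, 0.1])   # illustrative unconditional prediction

assert np.allclose(cfg_noise(eps_c, eps_u, 1.0), eps_c)  # w=1: plain conditional sampling
print(cfg_noise(eps_c, eps_u, 3.0))  # w>1 amplifies the text-conditioning direction
```

Because sample quality and alignment are sensitive to w, reporting the exact scale (and any dynamic thresholding) is part of what the referee's reproducibility request covers.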

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our scaling claims and experimental reporting. We address each point below and will update the manuscript to incorporate the requested clarifications and additional quantitative details.

read point-by-point responses
  1. Referee: [Ablation studies] The scaling claim (LM size > diffusion size) is load-bearing yet the manuscript provides only qualitative statements; quantitative deltas in FID and human preference for matched compute increases in each component are needed to substantiate 'much more' (see ablation results).

    Authors: We agree that more explicit quantitative evidence would strengthen the presentation of the scaling results. While the manuscript already contains ablation experiments comparing language model and diffusion model scaling, we will expand this section in the revision with a new table that reports specific FID deltas and human preference score differences for matched compute increases in each component. This will provide the numerical substantiation requested for the claim that language model scaling yields substantially larger gains. revision: yes

  2. Referee: [Experimental evaluation] The zero-shot COCO FID of 7.27 is a headline result, but the text does not report the number of generated samples, the precise FID implementation (Inception-v3 features, etc.), or any data-exclusion criteria applied to the COCO validation set.

    Authors: We thank the referee for noting these omissions. In the revised manuscript we will explicitly report that the FID of 7.27 was computed using 30,000 generated samples, the standard Inception-v3 feature extractor (with the exact implementation and preprocessing details now stated), and that no images were excluded from the COCO validation set. Because Imagen was never trained on COCO data, the full validation set was used for the zero-shot evaluation. revision: yes
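
The FID protocol the authors promise to document compares Gaussian fits to real and generated feature sets. The sketch below uses a diagonal-covariance simplification of the Fréchet distance and random stand-in features; a faithful implementation would use full covariance matrices over Inception-v3 activations.

```python
import numpy as np

def fid_diag(feats_a, feats_b):
    """Frechet distance between diagonal Gaussians fit to two feature sets.

    Simplification for illustration: real FID uses full covariance matrices
    of Inception-v3 activations, not per-dimension variances.
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    var_a, var_b = feats_a.var(axis=0), feats_b.var(axis=0)
    mean_term = np.sum((mu_a - mu_b) ** 2)
    cov_term = np.sum(var_a + var_b - 2.0 * np.sqrt(var_a * var_b))
    return float(mean_term + cov_term)

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))            # stand-in "real" features
fake = rng.normal(loc=0.5, size=(1000, 8))   # stand-in "generated" features, shifted

gap = fid_diag(real, fake) - fid_diag(real, real)
print(gap > 0)  # a distribution shift raises the score above the self-comparison
```

The score depends on the sample count and the exact feature extractor, which is why the referee asks for both to be stated alongside the headline 7.27.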

Circularity Check

0 steps flagged

No significant circularity; the central claims rest on empirical results from external benchmarks.

full rationale

The paper's central claims rest on empirical ablations and zero-shot evaluations (COCO FID of 7.27 without COCO training, DrawBench human preferences) rather than any derivation that reduces to fitted parameters or self-citations by construction. LM scaling benefits are reported as observed outcomes from varying model sizes, not tautological predictions. No self-definitional equations, fitted-input-as-prediction, or load-bearing self-citation chains appear in the reported methodology or results. The text-only LM conditioning assumption is tested directly by the performance metrics, which are independent of the training data used for the reported numbers.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on standard diffusion training assumptions and the empirical observation that text-only LMs transfer to image conditioning; no new physical entities or ad-hoc constants are introduced beyond conventional model-size hyperparameters.

free parameters (2)
  • T5 language model size
    The paper varies the size of the pretrained T5 encoder as a hyperparameter and reports that larger sizes yield better results.
  • Diffusion model size
    The capacity of the image diffusion U-Net is also scaled and compared against language-model scaling.
axioms (1)
  • domain assumption: A transformer language model pretrained only on text can produce embeddings that are effective conditioning signals for a diffusion image generator.
    This is the central premise tested by the scaling experiments.

pith-pipeline@v0.9.0 · 5607 in / 1343 out tokens · 52293 ms · 2026-05-12T07:34:24.130359+00:00 · methodology

discussion (0)


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability

    cs.LG 2026-05 unverdicted novelty 8.0

    Diffusion posterior samplers produce biased outputs that can be expressed as an Ornstein-Uhlenbeck path expectation via a surrogate Gaussian path and Feynman-Kac representation, with STSL flattening the spatially vary...

  2. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  3. MusicLM: Generating Music From Text

    cs.SD 2023-01 conditional novelty 8.0

    MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

  4. Building Normalizing Flows with Stochastic Interpolants

    cs.LG 2022-09 conditional novelty 8.0

    Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.

  5. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  6. Prompt-to-Prompt Image Editing with Cross Attention Control

    cs.CV 2022-08 unverdicted novelty 8.0

    Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

  7. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  8. Adaptive Subspace Projection for Generative Personalization

    cs.CV 2026-05 unverdicted novelty 7.0

    A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.

  9. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  10. Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels

    cs.PL 2026-04 unverdicted novelty 7.0

    Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.

  11. Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling

    cs.CR 2026-04 unverdicted novelty 7.0

    SET detects input-level backdoors in T2I diffusion models by learning a benign cross-attention response space from clean samples and flagging deviations under multi-scale perturbations.

  12. SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation

    cs.CV 2026-04 conditional novelty 7.0

    SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.

  13. GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.

  14. Visual Instruction Tuning

    cs.CV 2023-04 unverdicted novelty 7.0

    LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.

  15. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  16. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  17. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  18. Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.

  19. Deepfake Detection Generalization with Diffusion Noise

    cs.CV 2026-04 unverdicted novelty 6.0

    ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.

  20. Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation

    cs.HC 2026-04 unverdicted novelty 6.0

    Creo scaffolds text-to-image generation through progressive stages with editable abstractions and decision locking to improve controllability, agency, and output diversity.

  21. PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  22. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  23. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  24. Training Diffusion Models with Reinforcement Learning

    cs.LG 2023-05 unverdicted novelty 6.0

    DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.

  25. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  26. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  27. KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation

    cs.CV 2026-05 unverdicted novelty 5.0

    KANMultiSign generates sign language poses from notation via coarse-to-fine multi-scale supervision and compact KAN-Transformer modules, achieving lower DTW joint error with fewer parameters than baselines on several ...

  28. Towards Robust Sequential Decomposition for Complex Image Editing

    cs.CV 2026-05 unverdicted novelty 5.0

    Sequential decomposition trained on synthetic editing tasks improves robustness for complex image instructions and transfers to real images via co-training.

  29. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  30. Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk

    cs.CL 2026-04 unverdicted novelty 5.0

    Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.

  31. LTX-2: Efficient Joint Audio-Visual Foundation Model

    cs.CV 2026-01 conditional novelty 5.0

    LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

  32. Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint

    cs.LG 2026-05 unverdicted novelty 3.0

    In a single Parkinson's patient, gait conditions with comparable linear performance metrics showed different temporal organizations in dynamical state space and unsupervised latent embeddings when vertical occlusion d...

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 32 Pith papers · 8 internal anchors

  1. [1]

    Measuring Model Biases in the Absence of Ground Truth

    Osman Aka, Ken Burke, Alex Bauerle, Christina Greer, and Margaret Mitchell. Measuring Model Biases in the Absence of Ground Truth. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021

  2. [2]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  3. [3]

    On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?

    Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of FAccT 2021, 2021

  4. [4]

    Multimodal datasets: misogyny, pornography, and malignant stereotypes

    Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal datasets: misogyny, pornography, and malignant stereotypes. In arXiv:2110.01963, 2021

  5. [5]

    Identifying and Reducing Gender Bias in Word-Level Language Models

    Shikha Bordia and Samuel R. Bowman. Identifying and Reducing Gender Bias in Word-Level Language Models. In NAACL, 2017

  6. [6]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018

  7. [7]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi...

  8. [8]

    Gender shades: Intersectional accuracy disparities in commercial gender classification

    Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA , Proceedings of Machine Learning Research. PMLR, 2018

  9. [9]

    Women also snowboard: Overcoming bias in captioning models

    Kaylee Burns, Lisa Hendricks, Trevor Darrell, and Anna Rohrbach. Women also snowboard: Overcoming bias in captioning models. In European Conference on Computer Vision (ECCV), 2018

  10. [10]

    Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers

    Jaemin Cho, Abhay Zala, and Mohit Bansal. Dall-eval: Probing the reasoning skills and social biases of text-to-image generative transformers. arxiv:2202.04053, 2022

  11. [11]

    PaLM: Scaling Language Modeling with Pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...

  12. [12]

    Vqgan-clip: Open domain image generation and editing with natural language guidance

    Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. Vqgan-clip: Open domain image generation and editing with natural language guidance. arXiv preprint arXiv:2204.08583, 2022

  13. [13]

    Diffusion schrödinger bridge with applications to score-based generative modeling

    Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34, 2021

  14. [14]

    Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks

    Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep Generative Image Models using a Laplacian Pyramid of Adversarial Networks. In NIPS, 2015

  15. [15]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019

  16. [16]

    Diffusion models beat gans on image synthesis

    Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In NeurIPS, 2021

  17. [17]

    Cogview: Mastering text-to-image generation via transformers

    Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34, 2021

  18. [18]

    Issues in Computer Vision Data Collection: Bias, Consent, and Label Taxonomy

    Chris Dulhanty. Issues in Computer Vision Data Collection: Bias, Consent, and Label Taxonomy. In UWSpace, 2020

  19. [19]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021

  20. [20]

    Sex, lies and videotape: deep fakes and free speech delusions

    Mary Anne Franks and Ari Ezra Waldman. Sex, lies and videotape: deep fakes and free speech delusions. Maryland Law Review, 78(4):892–898, 2019

  21. [21]

    Language-Driven Image Style Transfer

    Tsu-Jui Fu, Xin Eric Wang, and William Yang Wang. Language-Driven Image Style Transfer. arXiv preprint arXiv:2106.00178, 2021

  22. [22]

    Make-a-scene: Scene-based text-to-image generation with human priors

    Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-a-scene: Scene-based text-to-image generation with human priors. arXiv preprint arXiv:2203.13131, 2022

  23. [23]

    Datasheets for Datasets

    Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], March 2020

  24. [24]

    MegaPixels: Origins and endpoints of biometric datasets "In the Wild"

    Adam Harvey and Jules LaPlace. MegaPixels: Origins and endpoints of biometric datasets "In the Wild". https://megapixels.cc, 2019

  25. [25]

    CLIPScore: A Reference-free Evaluation Metric for Image Captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. arXiv preprint arXiv:2104.08718, 2021

  26. [26]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. arXiv preprint arXiv:1706.08500, 2017

  27. [27]

    Classifier-free diffusion guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021

  28. [28]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020

  29. [29]

    Cascaded diffusion models for high fidelity image generation

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation. JMLR, 2022

  30. [30]

    Generative adversarial networks-enabled human-artificial intelligence collaborative applications for creative and design industries: A systematic review of current approaches and trends

    Rowan T. Hughes, Liming Zhu, and Tomasz Bednarz. Generative adversarial networks-enabled human-artificial intelligence collaborative applications for creative and design industries: A systematic review of current approaches and trends. Frontiers in artificial intelligence, 4, 2021

  31. [31]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021

  32. [32]

    Solving linear inverse problems using the prior implicit in a denoiser

    Zahra Kadkhodaie and Eero P Simoncelli. Solving linear inverse problems using the prior implicit in a denoiser. arXiv preprint arXiv:2007.13640, 2020

  33. [33]

    Stochastic solutions for linear inverse problems using the prior implicit in a denoiser

    Zahra Kadkhodaie and Eero P Simoncelli. Stochastic solutions for linear inverse problems using the prior implicit in a denoiser. Advances in Neural Information Processing Systems, 34, 2021

  34. [34]

    Diffusionclip: Text-guided image manipulation using diffusion models

    Gwanghyun Kim and Jong Chul Ye. Diffusionclip: Text-guided image manipulation using diffusion models. arXiv preprint arXiv:2110.02711, 2021

  35. [35]

    Variational diffusion models

    Diederik P Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. arXiv preprint arXiv:2107.00630, 2021

  36. [36]

    Microsoft COCO: Common Objects in Context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. In ECCV, 2014

  37. [37]

    Generating Images from Captions with Attention

    Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating Images from Captions with Attention. In ICLR, 2016

  38. [38]

    A very preliminary analysis of DALL-E 2

    Gary Marcus, Ernest Davis, and Scott Aaronson. A very preliminary analysis of DALL-E 2. In arXiv:2204.13807, 2022

  39. [39]

    Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling

    Jacob Menick and Nal Kalchbrenner. Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling. In ICLR, 2019

  40. [40]

    Improved denoising diffusion probabilistic models

    Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv preprint arXiv:2102.09672, 2021

  41. [41]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Bob McGrew Pamela Mishkin, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. In arXiv:2112.10741, 2021

  42. [42]

    On Aliased Resizing and Surprising Subtleties in GAN Evaluation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In CVPR, 2022

  43. [43]

    Data and its (dis)contents: A survey of dataset development and use in machine learning research

    Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11):100336, 2021

  44. [44]

    Large image datasets: A pyrrhic win for computer vision?

    Vinay Uday Prabhu and Abeba Birhane. Large image datasets: A pyrrhic win for computer vision? arXiv:2006.16923, 2020

  45. [45]

    Data cards: Purposeful and transparent dataset documentation for responsible AI

    Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and transparent dataset documentation for responsible AI. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 2022

  46. [46]

    Learning to Generate Reviews and Discovering Sentiment

    Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to Generate Reviews and Discovering Sentiment. In arXiv:1704.01444, 2017

  47. [47]

    Improving Language Understanding by Generative Pre-Training

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. In preprint, 2018

  48. [48]

    Language Models are Unsupervised Multitask Learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. In preprint, 2019

  49. [49]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021

  50. [50]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George Driessche, Lisa Hendricks, Maribeth Rauh, Po-Sen Huang, and Geoffrey Irving. Scaling language models: Methods, analysis & insights from training Gopher

  51. [51]

    Online and Linear-Time Attention by Enforcing Monotonic Alignments

    Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In ICML, 2017

  52. [52]

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 2020

  53. [53]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. In ICML, 2021

  54. [54]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. In arXiv, 2022

  55. [55]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv preprint arXiv:1906.00446, 2019

  56. [56]

    Generative adversarial text to image synthesis

    Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In International conference on machine learning, pages 1060–1069. PMLR, 2016

  57. [57]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022

  58. [58]

    Palette: Image-to-image diffusion models

    Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-Image Diffusion Models. In arXiv:2111.05826, 2021

  59. [59]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv preprint arXiv:2104.07636, 2021

  60. [60]

    Do datasets have politics? Disciplinary values in computer vision dataset development

    Morgan Klaus Scheuerman, Emily L. Denton, and A. Hanna. Do datasets have politics? Disciplinary values in computer vision dataset development. Proceedings of the ACM on Human-Computer Interaction, 5:1–37, 2021

  61. [61]

    LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

    Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114, 2021

  62. [62]

    Which faces can AI generate? Normativity, whiteness and lack of diversity in This Person Does Not Exist

    Lucas Sequeira, Bruno Moreschi, Amanda Jurno, and Vinicius Arruda dos Santos. Which faces can AI generate? Normativity, whiteness and lack of diversity in This Person Does Not Exist. In CVPR Workshop Beyond Fairness: Towards a Just, Equitable, and Accountable Computer Vision, 2021

  63. [63]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015

  64. [64]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  65. [65]

    Generative Modeling by Estimating Gradients of the Data Distribution

    Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS, 2019

  66. [66]

    Score-based generative modeling through stochastic differential equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In ICLR, 2021

  67. [67]

    Biases in generative art: A causal look from the lens of art history

    Ramya Srinivasan and Kanji Uchino. Biases in generative art: A causal look from the lens of art history. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, page 41–51, 2021

  68. [68]

    Image representations learned with unsupervised pre-training contain human-like biases

    Ryan Steed and Aylin Caliskan. Image representations learned with unsupervised pre-training contain human-like biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT ’21, page 701–713. Association for Computing Machinery, 2021

  69. [69]

    Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis

    Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv preprint arXiv:2008.05865, 2020

  70. [70]

    Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit

    Belinda Tzen and Maxim Raginsky. Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit. In arXiv:1905.09883, 2019

  71. [71]

    Neural discrete representation learning

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  72. [72]

    A connection between score matching and denoising autoencoders

    Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011

  73. [73]

    High-resolution image synthesis and semantic manipulation with conditional gans

    Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-resolution image synthesis and semantic manipulation with conditional gans. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8798–8807, 2018

  74. [74]

    Wsabie: Scaling up to large vocabulary image annotation

    Jason Weston, Samy Bengio, and Nicolas Usunier. Wsabie: Scaling up to large vocabulary image annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011

  75. [75]

    Deblurring via stochastic refinement

    Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G Dimakis, and Peyman Milanfar. Deblurring via stochastic refinement. arXiv preprint arXiv:2112.02475, 2021

  76. [76]

    AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks

    Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In CVPR, 2018

  77. [77]

    Attngan: Fine-grained text to image generation with attentional generative adversarial networks

    Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1316–1324, 2018

  78. [78]

    Improving text-to-image synthesis using contrastive learning

    Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji. Improving text-to-image synthesis using contrastive learning. arXiv preprint arXiv:2107.02423, 2021

  79. [79]

    Vector-quantized image modeling with improved VQGAN

    Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021

  80. [80]

    Coca: Contrastive captioners are image-text foundation models

    Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022
