Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Pith reviewed 2026-05-12 07:34 UTC · model grok-4.3
The pith
Scaling the frozen, text-only language model that conditions an image diffusion model improves photorealism and text alignment more than scaling the diffusion model itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment.
What carries the argument
A cascaded diffusion model whose denoising steps are conditioned on embeddings from a large, frozen T5 language model pretrained only on text.
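To make that machinery concrete, below is a minimal sketch of the conditioning pattern: a frozen, text-only T5 encoder produces token embeddings, and the denoiser attends to them via cross-attention. This is a toy stand-in, not Imagen's Efficient U-Net; the layer sizes are illustrative, timestep conditioning and the super-resolution cascade are omitted, and it assumes PyTorch plus the Hugging Face transformers package.

```python
# Minimal sketch: one denoising step conditioned on frozen T5 embeddings.
# Illustrative only; not Imagen's actual architecture.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-small")
text_encoder = T5EncoderModel.from_pretrained("t5-small")
text_encoder.requires_grad_(False)  # the language model stays frozen

class ToyDenoiser(nn.Module):
    """Predicts noise for a flattened 64x64 image, attending to text tokens."""
    def __init__(self, img_dim=64 * 64 * 3, model_dim=256, text_dim=512):
        super().__init__()
        self.in_proj = nn.Linear(img_dim, model_dim)
        self.text_proj = nn.Linear(text_dim, model_dim)  # T5-small hidden size is 512
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads=4, batch_first=True)
        self.out_proj = nn.Linear(model_dim, img_dim)

    def forward(self, noisy_image, text_emb):
        h = self.in_proj(noisy_image).unsqueeze(1)  # (B, 1, D)
        ctx = self.text_proj(text_emb)              # (B, T, D)
        h, _ = self.cross_attn(h, ctx, ctx)         # image query attends to text tokens
        return self.out_proj(h.squeeze(1))          # predicted noise

tokens = tokenizer(["a corgi riding a bicycle"], return_tensors="pt")
with torch.no_grad():
    text_emb = text_encoder(**tokens).last_hidden_state  # (1, T, 512)

denoiser = ToyDenoiser()
noisy = torch.randn(1, 64 * 64 * 3)
predicted_noise = denoiser(noisy, text_emb)  # one text-conditioned denoising step
```

The design point the paper leans on is visible here: the text encoder contributes only embeddings and receives no gradients, so all language understanding comes from text-only pretraining.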
If this is right
- Imagen sets a new FID record of 7.27 on COCO without any COCO training data.
- Human raters judge Imagen samples equal to real COCO images for image-text alignment.
- On the new DrawBench benchmark Imagen is preferred over VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2 in both quality and alignment.
Where Pith is reading between the lines
- The result suggests language understanding acquired from text alone transfers directly to visual generation tasks.
- Future scaling laws for text-to-image systems may prioritize language-model size over diffusion-model capacity.
- The approach could reduce reliance on large curated image-text datasets for training high-quality generators.
Load-bearing premise
A language model pretrained solely on text corpora can provide sufficiently rich conditioning signals for high-fidelity image synthesis without any image-text paired pretraining or fine-tuning.
What would settle it
A controlled experiment that scales the language model while holding the diffusion model fixed and shows no further gains in FID or human preference scores would falsify the central claim.
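A sketch of that settling experiment follows: scale only the frozen text encoder while the diffusion model and training budget stay fixed. `train_diffusion` and `evaluate` are hypothetical stubs standing in for a real training and evaluation pipeline, and the parameter values are illustrative, not from the paper.

```python
# Hypothetical settling experiment: vary only the text-encoder axis.
import random

def train_diffusion(text_encoder: str, diffusion_params: int, train_steps: int):
    """Hypothetical stand-in: would train a fixed-size diffusion model
    conditioned on the named frozen text encoder."""
    return {"text_encoder": text_encoder, "diffusion_params": diffusion_params}

def evaluate(model, benchmark: str):
    """Hypothetical stand-in: would compute zero-shot FID and a human
    preference rate on the named benchmark."""
    return random.uniform(7.0, 20.0), random.uniform(0.0, 1.0)

FIXED_DIFFUSION_PARAMS = 300_000_000  # illustrative; held constant across runs

for t5_variant in ["t5-small", "t5-large", "t5-xl", "t5-xxl"]:
    model = train_diffusion(t5_variant, FIXED_DIFFUSION_PARAMS, train_steps=500_000)
    fid, preference = evaluate(model, benchmark="COCO-30k")
    print(t5_variant, fid, preference)

# The central claim is falsified if FID and human preference stop improving
# as the language model grows while everything else is held fixed.
```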
read the original abstract
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO dataset, without ever training on COCO, and human raters find Imagen samples to be on par with the COCO data itself in image-text alignment. To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models. With DrawBench, we compare Imagen with recent methods including VQ-GAN+CLIP, Latent Diffusion Models, and DALL-E 2, and find that human raters prefer Imagen over other models in side-by-side comparisons, both in terms of sample quality and image-text alignment. See https://imagen.research.google/ for an overview of the results.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Imagen, a text-to-image diffusion model that conditions a cascaded diffusion process on text embeddings from a large pretrained T5 language model. Its central claim is that scaling the language model size improves both sample fidelity and text-image alignment substantially more than scaling the diffusion model, yielding a new zero-shot SOTA FID of 7.27 on COCO (without any COCO training) and human preference over DALL-E 2 and other baselines on the introduced DrawBench benchmark.
Significance. If reproducible, the result establishes that text-only pretrained LMs supply sufficiently rich conditioning signals for photorealistic synthesis and that LM capacity is the dominant scaling axis. This supplies a concrete, falsifiable scaling observation together with strong external benchmarks (COCO FID, DrawBench human ratings) and could redirect architectural priorities in multimodal generation toward larger language encoders.
major comments (2)
- [Ablation studies] The scaling claim (LM size > diffusion size) is load-bearing yet the manuscript provides only qualitative statements; quantitative deltas in FID and human preference for matched compute increases in each component are needed to substantiate 'much more' (see ablation results).
- [Experimental evaluation] The zero-shot COCO FID of 7.27 is a headline result, but the text does not report the number of generated samples, the precise FID implementation (Inception-v3 features, etc.), or any data-exclusion criteria applied to the COCO validation set.
minor comments (3)
- [DrawBench] DrawBench prompt selection and rating protocol should be described in more detail (e.g., how many raters per pair, inter-rater agreement, exact side-by-side presentation order) to allow independent replication.
- [Model architecture and training] Training hyperparameters (learning rates, noise schedules, the classifier-free guidance scale, exact T5 variant sizes) are only partially listed; a complete table would improve reproducibility. A short sketch of the guidance rule follows these comments.
- [Figures] Figure captions for qualitative samples should explicitly state the prompt used and whether any post-processing was applied.
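On the classifier-free guidance scale referenced above: the blending rule itself is one line. Below is a minimal sketch in the parameterization used by the Imagen paper (after Ho and Salimans [27]), where w = 1 recovers the plain conditional model; the function name is ours, not from any released codebase.

```python
import torch

def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             w: float) -> torch.Tensor:
    """Blend conditional and unconditional noise predictions.

    w = 1.0 recovers the plain conditional model; w > 1 pushes samples
    toward the text condition, typically improving alignment at some
    cost in diversity. Imagen-style parameterization (after [27]).
    """
    return w * eps_cond + (1.0 - w) * eps_uncond
```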
Simulated Author's Rebuttal
We thank the referee for their positive recommendation of minor revision and for the constructive comments on our scaling claims and experimental reporting. We address each point below and will update the manuscript to incorporate the requested clarifications and additional quantitative details.
read point-by-point responses
-
Referee: [Ablation studies] The scaling claim (LM size > diffusion size) is load-bearing yet the manuscript provides only qualitative statements; quantitative deltas in FID and human preference for matched compute increases in each component are needed to substantiate 'much more' (see ablation results).
Authors: We agree that more explicit quantitative evidence would strengthen the presentation of the scaling results. While the manuscript already contains ablation experiments comparing language model and diffusion model scaling, we will expand this section in the revision with a new table that reports specific FID deltas and human preference score differences for matched compute increases in each component. This will provide the numerical substantiation requested for the claim that language model scaling yields substantially larger gains. revision: yes
-
Referee: [Experimental evaluation] The zero-shot COCO FID of 7.27 is a headline result, but the text does not report the number of generated samples, the precise FID implementation (Inception-v3 features, etc.), or any data-exclusion criteria applied to the COCO validation set.
Authors: We thank the referee for noting these omissions. In the revised manuscript we will explicitly report that the FID of 7.27 was computed using 30,000 generated samples, the standard Inception-v3 feature extractor (with the exact implementation and preprocessing details now stated), and that no images were excluded from the COCO validation set. Because Imagen was never trained on COCO data, the full validation set was used for the zero-shot evaluation. revision: yes
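For concreteness, here is a hedged sketch of the FID protocol described in the response: the Fréchet distance between Inception-v3 pool-feature statistics of generated samples and reference images. The feature-extraction step is stubbed with random arrays; in the actual protocol the (N, 2048) features would come from Inception-v3 run over 30,000 model samples and the COCO validation images.

```python
# Fréchet Inception Distance from feature statistics (sketch).
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    """FID between two (N, 2048) arrays of Inception-v3 pool features."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):  # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))

# Stubbed usage; real features would come from an Inception-v3 forward pass.
feats_real = np.random.randn(5000, 2048)
feats_fake = np.random.randn(5000, 2048)
print(frechet_distance(feats_real, feats_fake))
```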
Circularity Check
No significant circularity; empirical results on external benchmarks
full rationale
The paper's central claims rest on empirical ablations and zero-shot evaluations (COCO FID of 7.27 without COCO training, DrawBench human preferences) rather than any derivation that reduces to fitted parameters or self-citations by construction. LM scaling benefits are reported as observed outcomes from varying model sizes, not tautological predictions. No self-definitional equations, fitted-input-as-prediction, or load-bearing self-citation chains appear in the reported methodology or results. The text-only LM conditioning assumption is tested directly by performance on benchmarks that are independent of the model's training data.
Axiom & Free-Parameter Ledger
free parameters (2)
- T5 language model size
- Diffusion model size
axioms (1)
- domain assumption: A transformer language model pretrained only on text can produce embeddings that are effective conditioning signals for a diffusion image generator.
Forward citations
Cited by 32 Pith papers
-
Diffusion-Based Posterior Sampling: A Feynman-Kac Analysis of Bias and Stability
Diffusion posterior samplers produce biased outputs that can be expressed as an Ornstein-Uhlenbeck path expectation via a surrogate Gaussian path and Feynman-Kac representation, with STSL flattening the spatially vary...
-
Consistency Models
Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
-
MusicLM: Generating Music From Text
MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.
-
Building Normalizing Flows with Stochastic Interpolants
Normalizing flows are constructed by learning the velocity of a stochastic interpolant via a quadratic loss derived from its probability current, yielding an efficient ODE-based alternative to diffusion models.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Prompt-to-Prompt Image Editing with Cross Attention Control
Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
-
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
-
Adaptive Subspace Projection for Generative Personalization
A training-free adaptive subspace projection method mitigates semantic collapsing in generative personalization by isolating and adjusting drift in a low-dimensional subspace using the stable pre-trained embedding as anchor.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Nautilus: An Auto-Scheduling Tensor Compiler for Efficient Tiled GPU Kernels
Nautilus auto-compiles math-like tensor descriptions into optimized GPU kernels, delivering up to 42% higher throughput than prior compilers on transformer models across NVIDIA GPUs.
-
Scaling Exposes the Trigger: Input-Level Backdoor Detection in Text-to-Image Diffusion Models via Cross-Attention Scaling
SET detects input-level backdoors in T2I diffusion models by learning a benign cross-attention response space from clean samples and flagging deviations under multi-scale perturbations.
-
SurFITR: A Dataset for Surveillance Image Forgery Detection and Localisation
SurFITR is a new collection of 137k+ surveillance-style forged images that causes existing detectors to degrade while enabling substantial gains when used for training in both in-domain and cross-domain settings.
-
GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
-
Visual Instruction Tuning
LLaVA is trained on GPT-4 generated visual instruction data to achieve 85.1% relative performance to GPT-4 on synthetic multimodal tasks and 92.53% accuracy on Science QA.
-
Scalable Diffusion Models with Transformers
DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
-
DreamFusion: Text-to-3D using 2D Diffusion
Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
-
Deepfake Detection Generalization with Diffusion Noise
ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.
-
Creo: From One-Shot Image Generation to Progressive, Co-Creative Ideation
Creo scaffolds text-to-image generation through progressive stages with editable abstractions and decision locking to improve controllability, agency, and output diversity.
-
PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
Training Diffusion Models with Reinforcement Learning
DDPO uses policy gradients on the denoising process to optimize diffusion models for arbitrary rewards like human feedback or compressibility.
-
Make-A-Video: Text-to-Video Generation without Text-Video Data
Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
-
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
-
KAN Text to Vision? The Exploration of Kolmogorov-Arnold Networks for Multi-Scale Sequence-Based Pose Animation from Sign Language Notation
KANMultiSign generates sign language poses from notation via coarse-to-fine multi-scale supervision and compact KAN-Transformer modules, achieving lower DTW joint error with fewer parameters than baselines on several ...
-
Towards Robust Sequential Decomposition for Complex Image Editing
Sequential decomposition trained on synthetic editing tasks improves robustness for complex image instructions and transfers to real images via co-training.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
Seeing Is No Longer Believing: Frontier Image Generation Models, Synthetic Visual Evidence, and Real-World Risk
Frontier image models enable synthetic visual evidence that erodes trust in photos through combined realism, text, and identity features, calling for layered technical and policy controls.
-
LTX-2: Efficient Joint Audio-Visual Foundation Model
LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
-
Observable Performance Does Not Fully Reflect System Organization: A Multi-Level Analysis of Gait Dynamics Under Occlusal Constraint
In a single Parkinson's patient, gait conditions with comparable linear performance metrics showed different temporal organizations in dynamical state space and unsupervised latent embeddings when vertical occlusion d...
Reference graph
Works this paper leans on
- [1] Osman Aka, Ken Burke, Alex Bauerle, Christina Greer, and Margaret Mitchell. Measuring Model Biases in the Absence of Ground Truth. In Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 2021.
- [2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv preprint arXiv:1607.06450, 2016.
- [3] Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? In Proceedings of FAccT 2021, 2021.
- [4] Abeba Birhane, Vinay Uday Prabhu, and Emmanuel Kahembwe. Multimodal Datasets: Misogyny, Pornography, and Malignant Stereotypes. arXiv:2110.01963, 2021.
- [5] Shikha Bordia and Samuel R. Bowman. Identifying and Reducing Gender Bias in Word-Level Language Models. In NAACL, 2019.
- [6] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large Scale GAN Training for High Fidelity Natural Image Synthesis. arXiv preprint arXiv:1809.11096, 2018.
- [7] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwi... 2020.
- [8] Joy Buolamwini and Timnit Gebru. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Conference on Fairness, Accountability and Transparency, FAT 2018, 23-24 February 2018, New York, NY, USA, Proceedings of Machine Learning Research. PMLR, 2018.
- [9] Kaylee Burns, Lisa Hendricks, Trevor Darrell, and Anna Rohrbach. Women Also Snowboard: Overcoming Bias in Captioning Models. In European Conference on Computer Vision (ECCV), 2018.
- [10] Jaemin Cho, Abhay Zala, and Mohit Bansal. DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers. arXiv:2202.04053, 2022.
- [11] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, Parker Schuh, Kensen Shi, Sasha Tsvyashchenko, Joshua Maynez, Abhishek Rao, Parker Barnes, Yi Tay, Noam Shazeer, Vinodkumar Prabhakaran, Emily Reif, Nan Du, Ben Hutchinson, Reiner Pope, James Bradb...
- [12] Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, and Edward Raff. VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance. arXiv preprint arXiv:2204.08583, 2022.
- [13] Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling. Advances in Neural Information Processing Systems, 34, 2021.
- [14] Emily Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep Generative Image Models Using a Laplacian Pyramid of Adversarial Networks. In NIPS, 2015.
- [15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL, 2019.
- [16] Prafulla Dhariwal and Alex Nichol. Diffusion Models Beat GANs on Image Synthesis. In NeurIPS, 2021.
- [17] Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, et al. CogView: Mastering Text-to-Image Generation via Transformers. Advances in Neural Information Processing Systems, 34, 2021.
- [18] Chris Dulhanty. Issues in Computer Vision Data Collection: Bias, Consent, and Label Taxonomy. In UWSpace, 2020.
- [19] Patrick Esser, Robin Rombach, and Björn Ommer. Taming Transformers for High-Resolution Image Synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12873–12883, 2021.
- [20] Mary Anne Franks and Ari Ezra Waldman. Sex, Lies and Videotape: Deep Fakes and Free Speech Delusions. Maryland Law Review, 78(4):892–898, 2019.
- [21] Tsu-Jui Fu, Xin Eric Wang, and William Yang Wang. Language-Driven Image Style Transfer. arXiv preprint arXiv:2106.00178, 2021.
- [22] Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, and Yaniv Taigman. Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors. arXiv preprint arXiv:2203.13131, 2022.
- [23] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for Datasets. arXiv:1803.09010 [cs], March 2020.
- [24] Adam Harvey and Jules LaPlace. MegaPixels: Origins and Endpoints of Biometric Datasets "In the Wild". https://megapixels.cc, 2019.
- [25] Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. arXiv preprint arXiv:2104.08718, 2021.
- [26] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium. arXiv preprint arXiv:1706.08500, 2017.
- [27] Jonathan Ho and Tim Salimans. Classifier-Free Diffusion Guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.
- [28] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS, 2020.
- [29] Jonathan Ho, Chitwan Saharia, William Chan, David J. Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded Diffusion Models for High Fidelity Image Generation. JMLR, 2022.
- [30] Rowan T. Hughes, Liming Zhu, and Tomasz Bednarz. Generative Adversarial Networks-Enabled Human-Artificial Intelligence Collaborative Applications for Creative and Design Industries: A Systematic Review of Current Approaches and Trends. Frontiers in Artificial Intelligence, 4, 2021.
- [31] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
- [32] Zahra Kadkhodaie and Eero P. Simoncelli. Solving Linear Inverse Problems Using the Prior Implicit in a Denoiser. arXiv preprint arXiv:2007.13640, 2020.
- [33] Zahra Kadkhodaie and Eero P. Simoncelli. Stochastic Solutions for Linear Inverse Problems Using the Prior Implicit in a Denoiser. Advances in Neural Information Processing Systems, 34, 2021.
- [34] Gwanghyun Kim and Jong Chul Ye. DiffusionCLIP: Text-Guided Image Manipulation Using Diffusion Models. arXiv preprint arXiv:2110.02711, 2021.
- [35] Diederik P. Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational Diffusion Models. arXiv preprint arXiv:2107.00630, 2021.
- [36] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common Objects in Context. In ECCV, 2014.
- [37] Elman Mansimov, Emilio Parisotto, Jimmy Lei Ba, and Ruslan Salakhutdinov. Generating Images from Captions with Attention. In ICLR, 2016.
- [38] Gary Marcus, Ernest Davis, and Scott Aaronson. A Very Preliminary Analysis of DALL-E 2. arXiv:2204.13807, 2022.
- [39] Jacob Menick and Nal Kalchbrenner. Generating High Fidelity Images with Subscale Pixel Networks and Multidimensional Upscaling. In ICLR, 2019.
- [40] Alex Nichol and Prafulla Dhariwal. Improved Denoising Diffusion Probabilistic Models. arXiv preprint arXiv:2102.09672, 2021.
- [41] Alex Nichol, Prafulla Dhariwal, Aditya Ramesh, Pranav Shyam, Pamela Mishkin, Bob McGrew, Ilya Sutskever, and Mark Chen. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models. arXiv:2112.10741, 2021.
- [42] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On Aliased Resizing and Surprising Subtleties in GAN Evaluation. In CVPR, 2022.
- [43] Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. Data and Its (Dis)contents: A Survey of Dataset Development and Use in Machine Learning Research. Patterns, 2(11):100336, 2021.
- [44] Vinay Uday Prabhu and Abeba Birhane. Large Image Datasets: A Pyrrhic Win for Computer Vision? arXiv:2006.16923, 2020.
- [45] Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI. In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 2022.
- [46] Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to Generate Reviews and Discovering Sentiment. arXiv:1704.01444, 2017.
- [47] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving Language Understanding by Generative Pre-Training. Preprint, 2018.
- [48] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language Models are Unsupervised Multitask Learners. Preprint, 2019.
- [49] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning Transferable Visual Models From Natural Language Supervision. In ICML, 2021.
- [50] Jack Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, Tom Hennigan, Jacob Menick, Albin Cassirer, Richard Powell, George Driessche, Lisa Hendricks, Maribeth Rauh, Po-Sen Huang, and Geoffrey Irving. Scaling Language Models: Methods, Analysis & ... 2021.
- [51] Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In ICML, 2017.
- [52] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. JMLR, 21(140), 2020.
- [53] Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-Shot Text-to-Image Generation. In ICML, 2021.
- [54] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv, 2022.
- [55] Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating Diverse High-Fidelity Images with VQ-VAE-2. arXiv preprint arXiv:1906.00446, 2019.
- [56] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative Adversarial Text to Image Synthesis. In International Conference on Machine Learning, pages 1060–1069. PMLR, 2016.
- [57] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. In CVPR, 2022.
- [58] Chitwan Saharia, William Chan, Huiwen Chang, Chris A. Lee, Jonathan Ho, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Palette: Image-to-Image Diffusion Models. arXiv:2111.05826, 2021.
- [59] Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image Super-Resolution via Iterative Refinement. arXiv preprint arXiv:2104.07636, 2021.
- [60] Morgan Klaus Scheuerman, Emily L. Denton, and A. Hanna. Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development. Proceedings of the ACM on Human-Computer Interaction, 5:1–37, 2021.
- [61] Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs. arXiv preprint arXiv:2111.02114, 2021.
- [62] Lucas Sequeira, Bruno Moreschi, Amanda Jurno, and Vinicius Arruda dos Santos. Which Faces Can AI Generate? Normativity, Whiteness and Lack of Diversity in This Person Does Not Exist. In CVPR Workshop Beyond Fairness: Towards a Just, Equitable, and Accountable Computer Vision, 2021.
- [63] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep Unsupervised Learning Using Nonequilibrium Thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
- [64] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising Diffusion Implicit Models. arXiv preprint arXiv:2010.02502, 2020.
- [65] Yang Song and Stefano Ermon. Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS, 2019.
- [66] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-Based Generative Modeling through Stochastic Differential Equations. In ICLR, 2021.
- [67] Ramya Srinivasan and Kanji Uchino. Biases in Generative Art: A Causal Look from the Lens of Art History. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 41–51, 2021.
- [68] Ryan Steed and Aylin Caliskan. Image Representations Learned with Unsupervised Pre-Training Contain Human-Like Biases. In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, FAccT '21, pages 701–713. Association for Computing Machinery, 2021.
- [69] Ming Tao, Hao Tang, Songsong Wu, Nicu Sebe, Xiao-Yuan Jing, Fei Wu, and Bingkun Bao. DF-GAN: Deep Fusion Generative Adversarial Networks for Text-to-Image Synthesis. arXiv preprint arXiv:2008.05865, 2020.
- [70] Belinda Tzen and Maxim Raginsky. Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit. arXiv:1905.09883, 2019.
- [71] Aaron van den Oord, Oriol Vinyals, et al. Neural Discrete Representation Learning. Advances in Neural Information Processing Systems, 30, 2017.
- [72] Pascal Vincent. A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7):1661–1674, 2011.
- [73] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8798–8807, 2018.
- [74] Jason Weston, Samy Bengio, and Nicolas Usunier. WSABIE: Scaling Up to Large Vocabulary Image Annotation. In Twenty-Second International Joint Conference on Artificial Intelligence, 2011.
- [75] Jay Whang, Mauricio Delbracio, Hossein Talebi, Chitwan Saharia, Alexandros G. Dimakis, and Peyman Milanfar. Deblurring via Stochastic Refinement. arXiv preprint arXiv:2112.02475, 2021.
- [76] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In CVPR, 2018.
- [77] Tao Xu, Pengchuan Zhang, Qiuyuan Huang, Han Zhang, Zhe Gan, Xiaolei Huang, and Xiaodong He. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1316–1324, 2018.
- [78] Hui Ye, Xiulong Yang, Martin Takac, Rajshekhar Sunderraman, and Shihao Ji. Improving Text-to-Image Synthesis Using Contrastive Learning. arXiv preprint arXiv:2107.02423, 2021.
- [79] Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-Quantized Image Modeling with Improved VQGAN. arXiv preprint arXiv:2110.04627, 2021.
- [80] Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. CoCa: Contrastive Captioners are Image-Text Foundation Models. arXiv preprint arXiv:2205.01917, 2022.
discussion (0)