pith · machine review for the scientific record

arxiv: 2112.10741 · v3 · submitted 2021-12-20 · 💻 cs.CV · cs.GR · cs.LG

Recognition: 1 theorem link · Lean Theorem

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Aditya Ramesh, Alex Nichol, Bob McGrew, Ilya Sutskever, Mark Chen, Pamela Mishkin, Prafulla Dhariwal, Pranav Shyam

Pith reviewed 2026-05-11 05:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.LG
keywords text-to-image synthesis · diffusion models · classifier-free guidance · image generation · image editing · photorealistic images · inpainting

The pith

Text-conditional diffusion models using classifier-free guidance generate images humans prefer over DALL-E for photorealism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that diffusion models conditioned on text descriptions can synthesize high-quality images. It finds that classifier-free guidance is preferred by human evaluators over CLIP guidance for both photorealism and how well the image matches the caption. A 3.5 billion parameter model trained with this approach produces samples that people rate higher than those from DALL-E, even when DALL-E applies CLIP reranking. The models can also be fine-tuned to support text-guided inpainting for image editing tasks.

Core claim

A 3.5 billion parameter text-conditional diffusion model using classifier-free guidance generates samples that human evaluators favor over DALL-E outputs for both photorealism and caption similarity. The model can be fine-tuned to perform text-driven image inpainting.

What carries the argument

Classifier-free guidance applied to text-conditional diffusion models: the denoising process is steered by extrapolating from the model's unconditional prediction toward its text-conditional prediction, improving fidelity without an external classifier.
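The guidance rule itself is a one-line extrapolation between two noise predictions. A minimal sketch, where `eps_model` is a hypothetical noise-prediction network (not the paper's API) and passing `caption=None` stands in for the empty caption used part of the time during training, which gives the model an unconditional mode:

```python
import numpy as np

def classifier_free_guidance(eps_model, x_t, t, caption, guidance_scale=3.0):
    """One guided noise prediction, per classifier-free guidance.

    eps_model(x, t, c) is a hypothetical noise predictor; c=None denotes
    the unconditional (empty-caption) mode.
    """
    eps_uncond = eps_model(x_t, t, None)     # unconditional prediction
    eps_cond = eps_model(x_t, t, caption)    # text-conditional prediction
    # Extrapolate past the unconditional prediction toward the
    # conditional one; scale > 1 trades diversity for fidelity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With `guidance_scale = 1` this reduces to the plain conditional model; the guided samples in the paper use larger scales, which is exactly the diversity-for-fidelity trade the abstract describes.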

If this is right

  • Human evaluators consistently prefer classifier-free guidance to CLIP guidance in text-to-image generation tasks.
  • Large-scale diffusion models can achieve superior results to prior text-to-image systems like DALL-E in blind human comparisons.
  • Fine-tuning allows diffusion models to support practical editing capabilities such as text-based inpainting.
  • The open-sourced smaller model enables further research and applications in text-guided image synthesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Further increases in model scale and data quality could push photorealism even closer to real photographs.
  • Similar guidance techniques might improve conditional generation in related domains like video or 3D synthesis.
  • The success of classifier-free guidance suggests it could simplify training pipelines by removing the need for separate guidance models.

Load-bearing premise

The judgments of the human evaluators accurately capture photorealism and text similarity in a way that generalizes beyond the specific test conditions.

What would settle it

A follow-up experiment with a new group of raters, or an objective metric such as FID on a held-out test set, showing that DALL-E is preferred instead.
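An FID-style check of the kind proposed here reduces to the Fréchet distance between two Gaussians fit to image features (real FID extracts those features with an Inception-v3 network; that part is omitted). A minimal sketch of the distance itself:

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 Tr((sigma1 sigma2)^(1/2)).

    Tr((sigma1 sigma2)^(1/2)) equals the sum of square roots of the
    eigenvalues of sigma1 @ sigma2, which are real and nonnegative for
    valid covariance matrices.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * trace_sqrt)
```

Lower is better; identical feature distributions score 0.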

Original abstract

Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
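The inpainting capability in the abstract comes from fine-tuning in GLIDE's case. A common training-free baseline, useful for illustrating what per-step mask conditioning looks like, simply overwrites the known region with an appropriately noised copy of the original image at every denoising step; a sketch of that compositing step (`inpaint_composite_step` and its arguments are illustrative, not the paper's API):

```python
import numpy as np

def inpaint_composite_step(x_t, x0_known, mask, noise_level, rng=None):
    """One compositing step for diffusion inpainting (illustrative).

    mask == 1 marks the region being regenerated; everywhere else the
    current sample x_t is overwritten with a freshly noised copy of the
    known image, keeping the unmasked region consistent with the
    original at every step. GLIDE instead fine-tunes the model with the
    masked image as extra conditioning; this replacement trick is a
    training-free baseline, not the paper's method.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noised_known = x0_known + noise_level * rng.standard_normal(x0_known.shape)
    return mask * x_t + (1.0 - mask) * noised_known
```

The mask is applied at every denoising step, so the sampler only ever "fills in" the masked region while the rest of the image is pinned to the original.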

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces GLIDE, a text-conditional diffusion model for photorealistic image generation and editing. It compares two guidance strategies (CLIP guidance vs. classifier-free guidance) and reports that classifier-free guidance is preferred by human evaluators on both photorealism and caption similarity. A 3.5B-parameter model using classifier-free guidance produces samples that human raters favor over DALL-E outputs even when DALL-E employs CLIP reranking. The work further shows that the model can be fine-tuned for text-driven inpainting and releases code plus weights for a smaller filtered-data variant.

Significance. If the human-study results hold, the paper establishes classifier-free guidance as a strong, parameter-efficient alternative to CLIP-based guidance for text-to-image diffusion, with credible evidence of outperformance versus DALL-E on the same prompt set. The open release of the smaller model and code directly supports reproducibility. The inpainting fine-tuning result demonstrates a practical editing capability that extends the core generation contribution.

minor comments (3)
  1. Abstract: the central human-preference claim is stated without any mention of protocol details (rater count, question wording, or statistical controls). Although these details appear in the main text, a single sentence in the abstract would make the claim self-contained.
  2. Human-evaluation section: the manuscript reports judgments on the same 1000 prompts used by DALL-E and one sample per prompt per model, but does not explicitly state whether prompt order or model identity was blinded to raters; adding this sentence would strengthen the protocol description.
  3. Figure captions and qualitative results: several comparison figures would benefit from explicit indication of which guidance method and sampling steps were used for each panel, to allow readers to map visuals directly to the quantitative claims.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. No major comments were raised.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical ML study describing the training and evaluation of a text-conditional diffusion model (GLIDE), with comparisons to DALL-E via human raters on photorealism and caption similarity. There is no derivation chain, no first-principles prediction, and no fitted parameter renamed as an output; the central claims rest on experimental results using standard diffusion-model formulations and external baselines, with no self-referential reductions or load-bearing self-citations that collapse the argument to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning paper whose claims rest on training runs and human evaluations rather than new theoretical axioms or invented physical entities.

pith-pipeline@v0.9.0 · 5484 in / 1081 out tokens · 26353 ms · 2026-05-11T05:52:00.061358+00:00 · methodology

discussion (0)


Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. Prompt-to-Prompt Image Editing with Cross Attention Control

    cs.CV 2022-08 unverdicted novelty 8.0

    Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

  4. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  5. From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.

  6. Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...

  7. Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

    cs.CV 2026-04 conditional novelty 7.0

    Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.

  8. GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.

  9. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  10. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  11. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  12. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  13. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  14. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  15. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  16. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  17. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  18. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  19. Video Diffusion Models

    cs.CV 2022-04 unverdicted novelty 7.0

    A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...

  20. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  21. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...

  22. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.

  23. Intermediate Representations are Strong AI-Generated Image Detectors

    cs.CV 2026-05 unverdicted novelty 6.0

    Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.

  24. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  25. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  26. Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.

  27. PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  28. MuPPet: Multi-person 2D-to-3D Pose Lifting

    cs.CV 2026-04 unverdicted novelty 6.0

    MuPPet introduces person encoding, permutation augmentation, and dynamic multi-person attention to outperform prior single- and multi-person 2D-to-3D pose lifting methods on group interaction datasets while improving ...

  29. Controllable Image Generation with Composed Parallel Token Prediction

    cs.LG 2026-04 unverdicted novelty 6.0

    A new formulation for composing discrete generative processes enables precise control over novel condition combinations in image generation, cutting error rates by 63% and speeding up inference.

  30. Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.

  31. LTX-Video: Realtime Video Latent Diffusion

    cs.CV 2024-12 conditional novelty 6.0

    LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.

  32. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  33. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  34. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  35. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  36. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  37. Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models

    cs.LG 2026-05 unverdicted novelty 5.0

    SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.

  38. Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

    cs.CV 2026-05 unverdicted novelty 5.0

    MDMF detects AI-generated images by learning patch-level forensic signatures and quantifying their distributional discrepancies with MMD, yielding larger separation than global methods when micro-defects are present.

  39. AI-Generated Images: What Humans and Machines See When They Look at the Same Image

    cs.CV 2026-05 unverdicted novelty 5.0

    Researchers train AI detectors on a large photorealistic fake image dataset, apply 16 XAI methods, and use human survey feedback to assess alignment between machine explanations and human perception of AI-generated images.

  40. DiffMagicFace: Identity Consistent Facial Editing of Real Videos

    cs.CV 2026-04 unverdicted novelty 5.0

    DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.

  41. MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

    cs.CV 2026-04 unverdicted novelty 5.0

    A scalable pipeline generates an intra-consistent, inter-diverse 1.4M style image dataset from text-to-image models and uses it to train a style encoder and generalizable style transfer model.

  42. LTX-2: Efficient Joint Audio-Visual Foundation Model

    cs.CV 2026-01 conditional novelty 5.0

    LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

  43. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  44. Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation

    cs.RO 2026-05 unverdicted novelty 4.0

    A conditional flow matching model generates realistic safety-critical traffic scenarios by turning nominal scenes into dangerous rollouts using combined simulation and real data.

  45. Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

    cs.CV 2026-04 unverdicted novelty 4.0

    I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...

  46. Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

    cs.CV 2026-04 unverdicted novelty 4.0

    A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.

  47. ModelScope Text-to-Video Technical Report

    cs.CV 2023-08 unverdicted novelty 4.0

    ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 46 Pith papers · 11 internal anchors

  1. [1]

    Blended diffusion for text-driven editing of natural images

    Avrahami, O., Lischinski, D., and Fried, O. Blended diffusion for text-driven editing of natural images. arXiv:2111.14818,

  2. [2]

    Paint by word

    Bau, D., Andonian, A., Cui, A., Park, Y ., Jahanian, A., Oliva, A., and Torralba, A. Paint by word. arXiv:2103.10951,

  3. [3]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096,

  4. [4]

    Diffusion Models Beat GANs on Image Synthesis

    Crowson, K. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021a. Crowson, K. CLIP guided diffusion 512x512, secondary model method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021b. Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. arXiv:2105.05233,

  5. [5]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929,

  6. [6]

    Stylegan-nada: Clip-guided domain adaptation of image generators

    Gal, R., Patashnik, O., Maron, H., Chechik, G., and Cohen-Or, D. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv:2108.00946,

  7. [7]

    Generating images from caption and vice versa via CLIP-guided generative latent space search

    Galatolo, F. A., Cimino, M. G. C. A., and Vaglini, G. Generating images from caption and vice versa via CLIP-guided generative latent space search. arXiv:2102.01645,

  8. [8]

    Vector quantized diffusion model for text-to-image synthesis

    Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., and Guo, B. Vector quantized diffusion model for text-to-image synthesis. arXiv:2111.14822,

  9. [9]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017),

  10. [10]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,

  11. [11]

    Denoising Diffusion Probabilistic Models

    Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. arXiv:2006.11239,

  12. [12]

    Cascaded diffusion models for high fidelity image generation

    Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282,

  13. [13]

    A style-based generator architecture for generative adversarial networks

    Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2019a. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. arXiv:1912.04958, 2019b. Kim, G. and Ye, J. C. Diffusionclip: Text-guided image ma...

  14. [14]

    Improved precision and recall metric for assessing generative models

    Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. arXiv:1904.06991,

  15. [15]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv:2108.01073,

  16. [16]

    Mixed Precision Training

    Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. arXiv:1710.03740,

  17. [17]

    The big sleep

    Murdock, R. The big sleep. https://twitter.com/ advadnoun/status/1351038053033406468,

  18. [18]

    Improved Denoising Diffusion Probabilistic Models

    Nichol, A. and Dhariwal, P. Improved denoising diffusion probabilistic models. arXiv:2102.09672,

  19. [19]

    Styleclip: Text-driven manipulation of stylegan imagery

    Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and Lischinski, D. Styleclip: Text-driven manipulation of stylegan imagery. arXiv:2103.17249,

  20. [20]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv:2103.00020,

  21. [21]

    Zero-Shot Text-to-Image Generation

    Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv:2102.12092,

  22. [22]

    Generating diverse high-fidelity images with VQ-VAE-2

    Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446,

  23. [23]

    Palette: Image-to-image diffusion models

    Saharia, C., Chan, W., Chang, H., Lee, C. A., Ho, J., Salimans, T., Fleet, D. J., and Norouzi, M. Palette: Image-to-image diffusion models. arXiv:2111.05826, 2021a. Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. arXiv:2104.07636, 2021b. Salimans, T., Goodfellow, I., Zarem...

  24. [24]

    Image synthesis with a single (robust) classifier

    Santurkar, S., Tsipras, D., Tran, B., Ilyas, A., Engstrom, L., and Madry, A. Image synthesis with a single (robust) classifier. arXiv:1906.09453,

  25. [25]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585,

  26. [26]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv:2010.02502, 2020a. Song, Y. and Ermon, S. Improved techniques for training score-based generative models. arXiv:2006.09011, 2020a. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020b. Song, Y., Sohl-Dicks...

  27. [27]

    Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis

    Tao, M., Tang, H., Wu, S., Sebe, N., Jing, X.-Y., Wu, F., and Bao, B. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865,

  28. [28]

    Neural discrete representation learning

    van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv:1711.00937,

  29. [29]

    Attention Is All You Need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv:1706.03762,

  30. [30]

    Attngan: Fine-grained text to image generation with attentional generative adversarial networks

    Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. arXiv:1711.10485,

  31. [31]

    Improving text-to-image synthesis using contrastive learning

    Ye, H., Yang, X., Takac, M., Sunderraman, R., and Ji, S. Improving text-to-image synthesis using contrastive learning. arXiv:2107.02423,

  32. [32]

    Cross-modal contrastive learning for text-to-image generation

    Zhang, H., Koh, J. Y., Baldridge, J., Lee, H., and Yang, Y. Cross-modal contrastive learning for text-to-image generation. arXiv:2101.04702,

  33. [33]

    Hype: A benchmark for human eye perceptual evaluation of generative models

    Zhou, S., Gordon, M. L., Krishna, R., Narcomey, A., Fei-Fei, L., and Bernstein, M. S. Hype: A benchmark for human eye perceptual evaluation of generative models,

  34. [34]

    Lafite: Towards language-free training for text-to-image generation

    Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., and Sun, T. Lafite: Towards language-free training for text-to-image generation. arXiv:2111.13792,

  35. [35]

    Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis

    Zhu, M., Pan, P., Chen, W., and Yang, Y. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. arXiv:1904.01310,

  36. [36]

    By doing this, ties effectively dilute the wins of each model

    When computing wins and Elo scores, we count a tie as half of a win for each model. By doing this, ties effectively dilute the wins of each model. To compute Elo scores, we construct a matrix A such that entry A_ij is the number of times model i beats model j. We initialize Elo scores for all N models as σ_i = 0, i ∈ [1, N]. We compute Elo scores by minimizing the ...

  37. [37]

    (2021) and Ramesh et al

    We trained our CLIP models for 390K iterations with batch size 32K on a 50%-50% mixture of the datasets used by Radford et al. (2021) and Ramesh et al. (2021). For our final CLIP model, we trained a ViT-L with weight decay 0.0125. After training, we fine-tuned the final ViT-L for 30K iterations on an even broader dataset of internet images. We pre-trained GL...

  38. [38]

    a corgi in a field

    Comparing classifier-free guided samples from our large model (first row), a small version trained on the same data (second row), and our released small model trained on a smaller, filtered dataset. In the final row, we show samples using our small model guided by a CLIP model trained on filtered data. Samples are not cherry-picked. D. Comparison to Unnoised C...

  39. [39]

    Comparison of GLIDE to two CLIP guidance strategies applied to pre-trained ImageNet diffusion models. On the left, we use a vanilla CLIP model to guide the 256 × 256 diffusion model from Dhariwal & Nichol (2021), using a combination of engineered perceptual losses and data augmentations (Crowson, 2021a). In the middle, we use our noised ViT-B CLIP model t...

  40. [40]

    pink yarn ball

    is not yet available, we evaluate our model on a few of the prompts shown in the paper (Figure 11). We find that our fine-tuned model sometimes chooses to ignore the given text prompt and instead produces an image that seems influenced only by the surrounding context. To mitigate this phenomenon, we also evaluate our model with the context fully masked out. ...

  41. [41]

    weapon”, “violence

    Comparison of image inpainting quality on real images. (1) Local CLIP-guided diffusion (Crowson, 2021a), (2) PaintByWord++ (Bau et al., 2021; Avrahami et al., 2021), (3) Blended Diffusion (Avrahami et al., 2021). For our results, we follow Avrahami et al. (2021) and use CLIP to select the best of 64 samples. Our fine-tuned samples have more realistic light...