pith · machine review for the scientific record

arxiv: 2112.10741 · v3 · submitted 2021-12-20 · 💻 cs.CV · cs.GR · cs.LG

Recognition: 1 theorem link · Lean Theorem

GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

Aditya Ramesh, Alex Nichol, Bob McGrew, Ilya Sutskever, Mark Chen, Pamela Mishkin, Prafulla Dhariwal, Pranav Shyam

Pith reviewed 2026-05-11 05:52 UTC · model grok-4.3

classification 💻 cs.CV · cs.GR · cs.LG
keywords text-to-image synthesis · diffusion models · classifier-free guidance · image generation · image editing · photorealistic images · inpainting

The pith

Text-conditional diffusion models using classifier-free guidance generate images humans prefer over DALL-E for photorealism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that diffusion models conditioned on text descriptions can synthesize high-quality images. It finds that classifier-free guidance is preferred by human evaluators over CLIP guidance for both photorealism and how well the image matches the caption. A 3.5 billion parameter model trained with this approach produces samples that people rate higher than those from DALL-E, even when DALL-E applies CLIP reranking. The models can also be fine-tuned to support text-guided inpainting for image editing tasks.

Core claim

A 3.5 billion parameter text-conditional diffusion model using classifier-free guidance generates samples that human evaluators favor over DALL-E outputs for both photorealism and caption similarity. The model can be fine-tuned to perform text-driven image inpainting.

What carries the argument

Classifier-free guidance applied to text-conditional diffusion models: the denoising process is steered by extrapolating from the model's unconditional prediction toward its text-conditional prediction, improving fidelity without an external classifier.
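The guidance rule itself is a one-line extrapolation between two noise predictions. A minimal sketch, where `eps_model` is a hypothetical noise-prediction network (not the paper's API) and passing `caption=None` stands in for the empty caption used part of the time during training, which gives the model an unconditional mode:

```python
import numpy as np

def classifier_free_guidance(eps_model, x_t, t, caption, guidance_scale=3.0):
    """One guided noise prediction, per classifier-free guidance.

    eps_model(x, t, c) is a hypothetical noise predictor; c=None denotes
    the unconditional (empty-caption) mode.
    """
    eps_uncond = eps_model(x_t, t, None)     # unconditional prediction
    eps_cond = eps_model(x_t, t, caption)    # text-conditional prediction
    # Extrapolate past the unconditional prediction toward the
    # conditional one; scale > 1 trades diversity for fidelity.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

With `guidance_scale = 1` this reduces to the plain conditional model; the guided samples in the paper use larger scales, which is exactly the diversity-for-fidelity trade the abstract describes.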

If this is right

  • Human evaluators consistently prefer classifier-free guidance to CLIP guidance in text-to-image generation tasks.
  • Large-scale diffusion models can achieve superior results to prior text-to-image systems like DALL-E in blind human comparisons.
  • Fine-tuning allows diffusion models to support practical editing capabilities such as text-based inpainting.
  • The open-sourced smaller model enables further research and applications in text-guided image synthesis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Further increases in model scale and data quality could push photorealism even closer to real photographs.
  • Similar guidance techniques might improve conditional generation in related domains like video or 3D synthesis.
  • The success of classifier-free guidance suggests it could simplify training pipelines by removing the need for separate guidance models.

Load-bearing premise

The judgments of the human evaluators accurately capture photorealism and text similarity in a way that generalizes beyond the specific test conditions.

What would settle it

A follow-up experiment with a new group of raters, or an objective metric such as FID on a held-out test set, showing that DALL-E is preferred instead.
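An FID-style check of the kind proposed here reduces to the Fréchet distance between two Gaussians fit to image features (real FID extracts those features with an Inception-v3 network; that part is omitted). A minimal sketch of the distance itself:

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1) + Tr(sigma2) - 2 Tr((sigma1 sigma2)^(1/2)).

    Tr((sigma1 sigma2)^(1/2)) equals the sum of square roots of the
    eigenvalues of sigma1 @ sigma2, which are real and nonnegative for
    valid covariance matrices.
    """
    diff = mu1 - mu2
    eigvals = np.linalg.eigvals(sigma1 @ sigma2).real
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2.0 * trace_sqrt)
```

Lower is better; identical feature distributions score 0.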

Original abstract

Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
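The inpainting capability in the abstract comes from fine-tuning in GLIDE's case. A common training-free baseline, useful for illustrating what per-step mask conditioning looks like, simply overwrites the known region with an appropriately noised copy of the original image at every denoising step; a sketch of that compositing step (`inpaint_composite_step` and its arguments are illustrative, not the paper's API):

```python
import numpy as np

def inpaint_composite_step(x_t, x0_known, mask, noise_level, rng=None):
    """One compositing step for diffusion inpainting (illustrative).

    mask == 1 marks the region being regenerated; everywhere else the
    current sample x_t is overwritten with a freshly noised copy of the
    known image, keeping the unmasked region consistent with the
    original at every step. GLIDE instead fine-tunes the model with the
    masked image as extra conditioning; this replacement trick is a
    training-free baseline, not the paper's method.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    noised_known = x0_known + noise_level * rng.standard_normal(x0_known.shape)
    return mask * x_t + (1.0 - mask) * noised_known
```

The mask is applied at every denoising step, so the sampler only ever "fills in" the masked region while the rest of the image is pinned to the original.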

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces GLIDE, a text-conditional diffusion model for photorealistic image generation and editing. It compares two guidance strategies (CLIP guidance vs. classifier-free guidance) and reports that classifier-free guidance is preferred by human evaluators on both photorealism and caption similarity. A 3.5B-parameter model using classifier-free guidance produces samples that human raters favor over DALL-E outputs even when DALL-E employs CLIP reranking. The work further shows that the model can be fine-tuned for text-driven inpainting and releases code plus weights for a smaller filtered-data variant.

Significance. If the human-study results hold, the paper establishes classifier-free guidance as a strong, parameter-efficient alternative to CLIP-based guidance for text-to-image diffusion, with credible evidence of outperformance versus DALL-E on the same prompt set. The open release of the smaller model and code directly supports reproducibility. The inpainting fine-tuning result demonstrates a practical editing capability that extends the core generation contribution.

minor comments (3)
  1. Abstract: the central human-preference claim is stated without any mention of protocol details (rater count, question wording, or statistical controls). Although these details appear in the main text, a single sentence in the abstract would make the claim self-contained.
  2. Human-evaluation section: the manuscript reports judgments on the same 1000 prompts used by DALL-E and one sample per prompt per model, but does not explicitly state whether prompt order or model identity was blinded to raters; adding this sentence would strengthen the protocol description.
  3. Figure captions and qualitative results: several comparison figures would benefit from explicit indication of which guidance method and sampling steps were used for each panel, to allow readers to map visuals directly to the quantitative claims.

Simulated Authors' Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. No major comments were raised.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an empirical ML study describing the training and evaluation of a text-conditional diffusion model (GLIDE), with comparisons to DALL-E via human raters on photorealism and caption similarity. There is no derivation chain, no first-principles prediction, and no fitted parameter renamed as an output; the central claims rest on experimental results using standard diffusion-model formulations and external baselines, with no self-referential reductions or load-bearing self-citations that collapse the argument to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical machine-learning paper whose claims rest on training runs and human evaluations rather than new theoretical axioms or invented physical entities.

pith-pipeline@v0.9.0 · 5484 in / 1081 out tokens · 26353 ms · 2026-05-11T05:52:00.061358+00:00 · methodology

discussion (0)


Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Consistency Models

    cs.LG 2023-03 conditional novelty 8.0

    Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. Prompt-to-Prompt Image Editing with Cross Attention Control

    cs.CV 2022-08 unverdicted novelty 8.0

    Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.

  4. An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

    cs.CV 2022-08 unverdicted novelty 8.0

    Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.

  5. From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

    cs.CV 2026-05 unverdicted novelty 7.0

    RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.

  6. Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion

    cs.CV 2026-05 unverdicted novelty 7.0

    ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...

  7. Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings

    cs.CV 2026-04 conditional novelty 7.0

    Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.

  8. GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 7.0

    GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.

  9. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation

    cs.CV 2024-06 conditional novelty 7.0

    Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.

  10. ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment

    cs.CV 2024-03 unverdicted novelty 7.0

    ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.

  11. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  12. AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning

    cs.CV 2023-07 unverdicted novelty 7.0

    A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.

  13. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  14. LAION-5B: An open large-scale dataset for training next generation image-text models

    cs.CV 2022-10 accept novelty 7.0

    LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.

  15. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  16. Diffusion Posterior Sampling for General Noisy Inverse Problems

    stat.ML 2022-09 unverdicted novelty 7.0

    Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.

  17. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

    cs.CV 2022-05 accept novelty 7.0

    Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.

  18. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  19. Video Diffusion Models

    cs.CV 2022-04 unverdicted novelty 7.0

    A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...

  20. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  21. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...

  22. FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching

    cs.CV 2026-05 unverdicted novelty 6.0

    FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.

  23. Intermediate Representations are Strong AI-Generated Image Detectors

    cs.CV 2026-05 unverdicted novelty 6.0

    Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.

  24. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  25. VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations

    cs.CV 2026-04 unverdicted novelty 6.0

    VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.

  26. Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.

  27. PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios

    cs.CV 2026-04 unverdicted novelty 6.0

    PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...

  28. MuPPet: Multi-person 2D-to-3D Pose Lifting

    cs.CV 2026-04 unverdicted novelty 6.0

    MuPPet introduces person encoding, permutation augmentation, and dynamic multi-person attention to outperform prior single- and multi-person 2D-to-3D pose lifting methods on group interaction datasets while improving ...

  29. Controllable Image Generation with Composed Parallel Token Prediction

    cs.LG 2026-04 unverdicted novelty 6.0

    A new formulation for composing discrete generative processes enables precise control over novel condition combinations in image generation, cutting error rates by 63% and speeding up inference.

  30. Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.

  31. LTX-Video: Realtime Video Latent Diffusion

    cs.CV 2024-12 conditional novelty 6.0

    LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.

  32. CameraCtrl: Enabling Camera Control for Text-to-Video Generation

    cs.CV 2024-04 unverdicted novelty 6.0

    CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.

  33. IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models

    cs.CV 2023-08 unverdicted novelty 6.0

    IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.

  34. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  35. Make-A-Video: Text-to-Video Generation without Text-Video Data

    cs.CV 2022-09 unverdicted novelty 6.0

    Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.

  36. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  37. Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models

    cs.LG 2026-05 unverdicted novelty 5.0

    SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.

  38. Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts

    cs.CV 2026-05 unverdicted novelty 5.0

    MDMF detects AI-generated images by learning patch-level forensic signatures and quantifying their distributional discrepancies with MMD, yielding larger separation than global methods when micro-defects are present.

  39. AI-Generated Images: What Humans and Machines See When They Look at the Same Image

    cs.CV 2026-05 unverdicted novelty 5.0

    Researchers train AI detectors on a large photorealistic fake image dataset, apply 16 XAI methods, and use human survey feedback to assess alignment between machine explanations and human perception of AI-generated images.

  40. DiffMagicFace: Identity Consistent Facial Editing of Real Videos

    cs.CV 2026-04 unverdicted novelty 5.0

    DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.

  41. MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping

    cs.CV 2026-04 unverdicted novelty 5.0

    A scalable pipeline generates an intra-consistent, inter-diverse 1.4M style image dataset from text-to-image models and uses it to train a style encoder and generalizable style transfer model.

  42. LTX-2: Efficient Joint Audio-Visual Foundation Model

    cs.CV 2026-01 conditional novelty 5.0

    LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

  43. Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    cs.CV 2024-08 unverdicted novelty 5.0

    Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.

  44. Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation

    cs.RO 2026-05 unverdicted novelty 4.0

    A conditional flow matching model generates realistic safety-critical traffic scenarios by turning nominal scenes into dangerous rollouts using combined simulation and real data.

  45. Adaptive Forensic Feature Refinement via Intrinsic Importance Perception

    cs.CV 2026-04 unverdicted novelty 4.0

    I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...

  46. Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning

    cs.CV 2026-04 unverdicted novelty 4.0

    A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.

  47. ModelScope Text-to-Video Technical Report

    cs.CV 2023-08 unverdicted novelty 4.0

    ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 46 Pith papers · 11 internal anchors

  1. [1]

    Blended diffusion for text-driven editing of natural images

    Avrahami, O., Lischinski, D., and Fried, O. Blended diffusion for text-driven editing of natural images. arXiv:2111.14818,

  2. [2]

    Paint by word

    Bau, D., Andonian, A., Cui, A., Park, Y ., Jahanian, A., Oliva, A., and Torralba, A. Paint by word. arXiv:2103.10951,

  3. [3]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Brock, A., Donahue, J., and Simonyan, K. Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096,

  4. [4]

    Diffusion Models Beat GANs on Image Synthesis

    Crowson, K. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021a. Crowson, K. CLIP guided diffusion 512x512, secondary model method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021b. Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis. arXiv:2105.05233,

  5. [5]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929,

  6. [6]

    Stylegan-nada: Clip-guided domain adaptation of image generators

    Gal, R., Patashnik, O., Maron, H., Chechik, G., and Cohen-Or, D. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv:2108.00946,

  7. [7]

    Generating images from caption and vice versa via CLIP-guided generative latent space search

    Galatolo, F. A., Cimino, M. G. C. A., and Vaglini, G. Generating images from caption and vice versa via CLIP-guided generative latent space search. arXiv:2102.01645,

  8. [8]

    Vector quantized diffusion model for text-to-image synthesis

    Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., and Guo, B. Vector quantized diffusion model for text-to-image synthesis. arXiv:2111.14822,

  9. [9]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017),

  10. [10]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications,

  11. [11]

    Denoising Diffusion Probabilistic Models

    Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. arXiv:2006.11239,

  12. [12]

    Cascaded diffusion models for high fidelity image generation

    Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282,

  13. [13]

    A style-based generator architecture for generative adversarial networks

    Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2019a. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. arXiv:1912.04958, 2019b. Kim, G. and Ye, J. C. Diffusionclip: Text-guided image ma...

  14. [14]

    Improved precision and recall metric for assessing generative models

    Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. arXiv:1904.06991,

  15. [15]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv:2108.01073,

  16. [16]

    Mixed Precision Training

    Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. arXiv:1710.03740,

  17. [17]

    The big sleep

    Murdock, R. The big sleep. https://twitter.com/ advadnoun/status/1351038053033406468,

  18. [18]

    Improved Denoising Diffusion Probabilistic Models

    Nichol, A. and Dhariwal, P. Improved denoising diffusion probabilistic models. arXiv:2102.09672,

  19. [19]

    Styleclip: Text-driven manipulation of stylegan imagery

    Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and Lischinski, D. Styleclip: Text-driven manipulation of stylegan imagery. arXiv:2103.17249,

  20. [20]

    Learning Transferable Visual Models From Natural Language Supervision

    Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv:2103.00020,

  21. [21]

    Zero-Shot Text-to-Image Generation

    Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv:2102.12092,

  22. [22]

    Generating diverse high-fidelity images with VQ-VAE-2

    Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446,

  23. [23]

    Palette: Image-to-image diffusion models

    Saharia, C., Chan, W., Chang, H., Lee, C. A., Ho, J., Salimans, T., Fleet, D. J., and Norouzi, M. Palette: Image-to-image diffusion models. arXiv:2111.05826, 2021a. Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. arXiv:2104.07636, 2021b. Salimans, T., Goodfellow, I., Zarem...

  24. [24]

    Image synthesis with a single (robust) classifier

    Santurkar, S., Tsipras, D., Tran, B., Ilyas, A., Engstrom, L., and Madry, A. Image synthesis with a single (robust) classifier. arXiv:1906.09453,

  25. [25]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585,

  26. [26]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv:2010.02502, 2020a. Song, Y. and Ermon, S. Improved techniques for training score-based generative models. arXiv:2006.09011, 2020a. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020b. Song, Y., Sohl-Dicks...

  27. [27]

    Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis

    Tao, M., Tang, H., Wu, S., Sebe, N., Jing, X.-Y., Wu, F., and Bao, B. Df-gan: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865,

  28. [28]

    Neural discrete representation learning

    van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv:1711.00937,

  29. [29]

    Attention Is All You Need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv:1706.03762,

  30. [30]

    Attngan: Fine-grained text to image generation with attentional generative adversarial networks

    Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. Attngan: Fine-grained text to image generation with attentional generative adversarial networks. arXiv:1711.10485,

  31. [31]

    Improving text-to-image synthesis using contrastive learning

    Ye, H., Yang, X., Takac, M., Sunderraman, R., and Ji, S. Improving text-to-image synthesis using contrastive learning. arXiv:2107.02423,

  32. [32]

    Cross-modal contrastive learning for text-to-image generation

    Zhang, H., Koh, J. Y., Baldridge, J., Lee, H., and Yang, Y. Cross-modal contrastive learning for text-to-image generation. arXiv:2101.04702,

  33. [33]

    Hype: A benchmark for human eye perceptual evaluation of generative models

    Zhou, S., Gordon, M. L., Krishna, R., Narcomey, A., Fei-Fei, L., and Bernstein, M. S. Hype: A benchmark for human eye perceptual evaluation of generative models,

  34. [34]

    Lafite: Towards language-free training for text-to-image generation

    Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., and Sun, T. Lafite: Towards language-free training for text-to-image generation. arXiv:2111.13792,

  35. [35]

    Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis

    Zhu, M., Pan, P., Chen, W., and Yang, Y. Dm-gan: Dynamic memory generative adversarial networks for text-to-image synthesis. arXiv:1904.01310,

  36. [36]

    By doing this, ties effectively dilute the wins of each model

    When computing wins and Elo scores, we count a tie as half of a win for each model. By doing this, ties effectively dilute the wins of each model. To compute Elo scores, we construct a matrix A such that entry A_ij is the number of times model i beats model j. We initialize Elo scores for all N models as σ_i = 0, i ∈ [1, N]. We compute Elo scores by minimizing the ...

  37. [37]

    (2021) and Ramesh et al

    We trained our CLIP models for 390K iterations with batch size 32K on a 50%-50% mixture of the datasets used by Radford et al. (2021) and Ramesh et al. (2021). For our final CLIP model, we trained a ViT-L with weight decay 0.0125. After training, we fine-tuned the final ViT-L for 30K iterations on an even broader dataset of internet images. We pre-trained GL...

  38. [38]

    a corgi in a field

    Comparing classifier-free guided samples from our large model (first row), a small version trained on the same data (second row), and our released small model trained on a smaller, filtered dataset. In the final row, we show samples using our small model guided by a CLIP model trained on filtered data. Samples are not cherry-picked. D. Comparison to Unnoised C...

  39. [39]

    Comparison of GLIDE to two CLIP guidance strategies applied to pre-trained ImageNet diffusion models. On the left, we use a vanilla CLIP model to guide the 256 × 256 diffusion model from Dhariwal & Nichol (2021), using a combination of engineered perceptual losses and data augmentations (Crowson, 2021a). In the middle, we use our noised ViT-B CLIP model t...

  40. [40]

    pink yarn ball

    is not yet available, we evaluate our model on a few of the prompts shown in the paper (Figure 11). We find that our fine-tuned model sometimes chooses to ignore the given text prompt and instead produces an image that seems influenced only by the surrounding context. To mitigate this phenomenon, we also evaluate our model with the context fully masked out. ...

  41. [41]

    weapon”, “violence

    Comparison of image inpainting quality on real images. (1) Local CLIP-guided diffusion (Crowson, 2021a), (2) PaintByWord++ (Bau et al., 2021; Avrahami et al., 2021), (3) Blended Diffusion (Avrahami et al., 2021). For our results, we follow Avrahami et al. (2021) and use CLIP to select the best of 64 samples. Our fine-tuned samples have more realistic light...