GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models
Pith reviewed 2026-05-11 05:52 UTC · model grok-4.3
The pith
A text-conditional diffusion model using classifier-free guidance generates images that human evaluators prefer over DALL-E's for both photorealism and caption similarity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A 3.5 billion parameter text-conditional diffusion model using classifier-free guidance generates samples that human evaluators favor over DALL-E outputs for both photorealism and caption similarity. The model can be fine-tuned to perform text-driven image inpainting.
What carries the argument
Classifier-free guidance applied to text-conditional diffusion models: the denoising process is steered by the text condition directly, trading diversity for fidelity without any external classifier.
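The guided prediction has a simple closed form worth seeing. A minimal sketch, assuming a hypothetical `model(x_t, t, text_emb)` that returns the predicted noise: the network is evaluated once with the caption embedding and once with an empty-caption embedding, and the two predictions are extrapolated by a guidance scale s (s = 1 recovers ordinary conditional sampling; larger s trades diversity for fidelity).

```python
import torch

def classifier_free_guidance(model, x_t, t, caption_emb, empty_emb, scale: float):
    """Guided noise estimate for one denoising step.

    eps_hat = eps(x_t | empty) + scale * (eps(x_t | caption) - eps(x_t | empty)).
    `model` and its call signature are illustrative, not the paper's API.
    """
    eps_uncond = model(x_t, t, empty_emb)   # prediction with the caption dropped
    eps_cond = model(x_t, t, caption_emb)   # prediction with the real caption
    return eps_uncond + scale * (eps_cond - eps_uncond)
```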
If this is right
- Human evaluators consistently prefer classifier-free guidance to CLIP guidance in text-to-image generation tasks (a sketch of the CLIP-guidance update it displaces follows this list).
- Large-scale diffusion models can achieve superior results to prior text-to-image systems like DALL-E in blind human comparisons.
- Fine-tuning allows diffusion models to support practical editing capabilities such as text-based inpainting.
- The open-sourced smaller model enables further research and applications in text-guided image synthesis.
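For contrast with the losing strategy in that first comparison, here is a minimal sketch of the CLIP-guidance update the paper describes: the reverse-process mean is shifted along the gradient of the CLIP caption-image alignment, mu_hat = mu + s * Sigma * grad_x(f(x) . g(c)). The encoder object and signature below are illustrative; the paper additionally trains the CLIP model on noised images.

```python
import torch

def clip_guided_mean(mu, sigma, x_t, caption_emb, clip_image_encoder, scale: float):
    """Shift the denoising mean toward higher CLIP image-caption alignment.

    mu, sigma: reverse-process mean and variance at this step.
    clip_image_encoder: a (noised-image) CLIP image encoder; illustrative here.
    """
    x = x_t.detach().requires_grad_(True)
    score = (clip_image_encoder(x) * caption_emb).sum()   # f(x_t) . g(c)
    grad = torch.autograd.grad(score, x)[0]
    return mu + scale * sigma * grad
```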
Where Pith is reading between the lines
- Further increases in model scale and data quality could push photorealism even closer to real photographs.
- Similar guidance techniques might improve conditional generation in related domains like video or 3D synthesis.
- The success of classifier-free guidance suggests it could simplify training pipelines by removing the need for separate guidance models.
Load-bearing premise
The judgments of the human evaluators accurately capture photorealism and text similarity in a way that generalizes beyond the specific test conditions.
What would settle it
A replication with a new group of raters, or an objective metric such as FID on a held-out test set, that instead shows DALL-E's samples preferred.
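For the objective route, a minimal FID sketch following the Heusel et al. definition [9]. The function name is hypothetical, and it assumes feature matrices (one row of Inception features per image) have already been extracted upstream for each sample set.

```python
import numpy as np
from scipy import linalg

def fid(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Frechet Inception Distance between two sets of features (rows = samples).

    FID = ||mu_a - mu_b||^2 + Tr(Sa + Sb - 2 (Sa Sb)^{1/2}).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    sa = np.cov(feats_a, rowvar=False)
    sb = np.cov(feats_b, rowvar=False)
    covmean = linalg.sqrtm(sa @ sb)
    if np.iscomplexobj(covmean):      # numerical noise can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(sa + sb - 2.0 * covmean))
```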
Read the original abstract
Diffusion models have recently been shown to generate high-quality synthetic images, especially when paired with a guidance technique to trade off diversity for fidelity. We explore diffusion models for the problem of text-conditional image synthesis and compare two different guidance strategies: CLIP guidance and classifier-free guidance. We find that the latter is preferred by human evaluators for both photorealism and caption similarity, and often produces photorealistic samples. Samples from a 3.5 billion parameter text-conditional diffusion model using classifier-free guidance are favored by human evaluators to those from DALL-E, even when the latter uses expensive CLIP reranking. Additionally, we find that our models can be fine-tuned to perform image inpainting, enabling powerful text-driven image editing. We train a smaller model on a filtered dataset and release the code and weights at https://github.com/openai/glide-text2im.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces GLIDE, a text-conditional diffusion model for photorealistic image generation and editing. It compares two guidance strategies (CLIP guidance vs. classifier-free guidance) and reports that classifier-free guidance is preferred by human evaluators on both photorealism and caption similarity. A 3.5B-parameter model using classifier-free guidance produces samples that human raters favor over DALL-E outputs even when DALL-E employs CLIP reranking. The work further shows that the model can be fine-tuned for text-driven inpainting and releases code plus weights for a smaller filtered-data variant.
Significance. If the human-study results hold, the paper establishes classifier-free guidance as a strong, parameter-efficient alternative to CLIP-based guidance for text-to-image diffusion, with credible evidence of outperformance versus DALL-E on the same prompt set. The open release of the smaller model and code directly supports reproducibility. The inpainting fine-tuning result demonstrates a practical editing capability that extends the core generation contribution.
Minor comments (3)
- Abstract: the central human-preference claim is stated without any mention of protocol details (rater count, question wording, or statistical controls). Although these details appear in the main text, a single sentence in the abstract would make the claim self-contained.
- Human-evaluation section: the manuscript reports judgments on the same 1,000 prompts used to evaluate DALL-E, with one sample per prompt per model, but does not explicitly state whether prompt order or model identity was blinded to raters; adding this sentence would strengthen the protocol description (how such judgments aggregate into scores is sketched after this list).
- Figure captions and qualitative results: several comparison figures would benefit from explicit indication of which guidance method and sampling steps were used for each panel, to allow readers to map visuals directly to the quantitative claims.
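For context, the paper's appendix describes the aggregation: a tie counts as half a win for each model, a matrix A is built with A_ij the number of times model i beats model j, all Elo scores are initialized to zero, and the scores are fit by minimization. A minimal sketch under those assumptions; the excerpt truncates before giving the exact objective, so a standard base-10, scale-400 logistic likelihood fit by plain gradient descent is assumed here.

```python
import numpy as np

def elo_scores(wins: np.ndarray, ties: np.ndarray,
               iters: int = 5000, lr: float = 1.0) -> np.ndarray:
    """Fit Elo scores from pairwise human judgments.

    wins[i, j] = times model i beat model j; ties count as half a win for
    each side, per the appendix. The win model P(i beats j) =
    1 / (1 + 10**((s_j - s_i) / 400)) is an assumption, not quoted text.
    """
    A = wins + 0.5 * ties                 # effective win counts
    s = np.zeros(A.shape[0])              # scores initialized to zero
    c = np.log(10.0) / 400.0
    for _ in range(iters):
        d = s[:, None] - s[None, :]
        p = 1.0 / (1.0 + 10.0 ** (-d / 400.0))   # P(row beats column)
        # gradient of the negative log-likelihood sum_ij A_ij * log(p_ij)
        grad = -c * ((A * (1.0 - p)).sum(axis=1) - (A * (1.0 - p)).sum(axis=0))
        s -= lr * grad
        s -= s.mean()                     # fix the additive gauge freedom
    return s

# e.g. two models, 600 vs. 300 wins plus 100 ties counted each way:
# elo_scores(np.array([[0., 600.], [300., 0.]]),
#            np.array([[0., 100.], [100., 0.]]))
```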
Simulated Author's Rebuttal
We thank the referee for the positive review and the recommendation to accept. No major comments were raised.
Circularity Check
No significant circularity
Full rationale
The paper is an empirical ML study describing the training and evaluation of a text-conditional diffusion model (GLIDE), with comparisons to DALL-E via human raters on photorealism and caption similarity. There is no derivation chain, no first-principles prediction, and no fitted parameter renamed as an output; the central claims rest on experimental results using standard diffusion-model formulations and external baselines, with no self-referential reductions or load-bearing self-citations that would collapse the argument to its inputs.
Forward citations
Cited by 47 Pith papers
- Consistency Models
  Consistency models achieve fast one-step generation with SOTA FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 by directly mapping noise to data, outperforming prior distillation techniques.
- Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
  Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
- Prompt-to-Prompt Image Editing with Cross Attention Control
  Cross-attention control in text-conditioned models enables localized and global image edits by editing only the input text prompt.
- An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
  Textual Inversion learns a single embedding vector from a few images to represent personal concepts inside the text embedding space of a frozen text-to-image model, enabling their composition in natural language prompts.
- From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
  RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
- Active Sampling for Ultra-Low-Bit-Rate Video Compression via Conditional Controlled Diffusion
  ActDiff-VC achieves up to 64.6% bitrate reduction at matched NIQE and improves perceptual metrics like KID and FID by using content-adaptive keyframe selection and budget-aware sparse trajectory selection to condition...
- Benchmarking Layout-Guided Diffusion Models through Unified Semantic-Spatial Evaluation in Closed and Open Settings
  Introduces closed-set C-Bench and open-set O-Bench for layout-guided diffusion models, a unified semantic-spatial scoring protocol, and ranks six models after generating and evaluating 319,086 images.
- GENFIG1: Visual Summaries of Scholarly Work as a Challenge for Vision-Language Models
  GENFIG1 is a new benchmark that tests whether vision-language models can create effective Figure 1 visuals capturing the central scientific idea from paper text.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
  Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
  ELLA introduces a timestep-aware semantic connector to link LLMs with diffusion models for improved dense prompt following, validated on a new 1K-prompt benchmark.
- Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
  Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.
- AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
  A single motion module trained on videos adds temporally coherent animation to any personalized text-to-image model derived from the same base without additional tuning.
- Scalable Diffusion Models with Transformers
  DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.
- LAION-5B: An open large-scale dataset for training next generation image-text models
  LAION-5B is an openly released dataset of 5.85 billion CLIP-filtered image-text pairs that enables replication of foundational vision-language models.
- Imagen Video: High Definition Video Generation with Diffusion Models
  Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
- Diffusion Posterior Sampling for General Noisy Inverse Problems
  Diffusion models solve noisy (non)linear inverse problems via approximated posterior sampling that blends diffusion steps with manifold gradients without strict consistency projection.
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
  Imagen achieves state-of-the-art photorealistic text-to-image generation by scaling a text-only pretrained T5 language model within a diffusion framework, reaching FID 7.27 on COCO without training on it.
- Hierarchical Text-Conditional Image Generation with CLIP Latents
  A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.
- Video Diffusion Models
  A diffusion model for video generation extends image architectures with joint image-video training and improved conditional sampling, delivering first large-scale text-to-video results and state-of-the-art performance...
- High-Resolution Image Synthesis with Latent Diffusion Models
  Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...
- FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
  FlashClear achieves up to 8.26x speedup over its base diffusion model and 122x over OmniPaint for image object removal via region-aware adversarial distillation and foreground-prioritized caching while claiming to mai...
- FlashClear: Ultra-Fast Image Content Removal via Efficient Step Distillation and Feature Caching
  FlashClear delivers up to 122x faster object removal than prior diffusion models via adversarial step distillation and asymmetric attention caching while preserving visual quality.
- Intermediate Representations are Strong AI-Generated Image Detectors
  Intermediate layer embedding sensitivity to perturbations distinguishes AI-generated images from real ones, yielding higher AUROC on GenImage and Forensics Small benchmarks than prior methods.
- Learning to Theorize the World from Observation
  NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
- VibeToken: Scaling 1D Image Tokenizers and Autoregressive Models for Dynamic Resolution Generations
  VibeToken enables autoregressive image generation at arbitrary resolutions using 64 tokens for 1024x1024 images with 3.94 gFID, constant 179G FLOPs, and better efficiency than diffusion or fixed AR baselines.
- Rethinking Where to Edit: Task-Aware Localization for Instruction-Based Image Editing
  Task-aware localization via attention cues and feature centroids from source/target streams in IIE models improves non-edit consistency while preserving instruction following.
- PostureObjectstitch: Anomaly Image Generation Considering Assembly Relationships in Industrial Scenarios
  PostureObjectStitch generates assembly-aware anomaly images by decoupling multi-view features into high-frequency, texture and RGB components, modulating them temporally in a diffusion model, and applying conditional ...
- MuPPet: Multi-person 2D-to-3D Pose Lifting
  MuPPet introduces person encoding, permutation augmentation, and dynamic multi-person attention to outperform prior single- and multi-person 2D-to-3D pose lifting methods on group interaction datasets while improving ...
- Controllable Image Generation with Composed Parallel Token Prediction
  A new formulation for composing discrete generative processes enables precise control over novel condition combinations in image generation, cutting error rates by 63% and speeding up inference.
- Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models
  Unlearning methods that strongly erase concepts from text-to-image diffusion models consistently degrade performance on attribute binding, spatial reasoning, and counting tasks.
- LTX-Video: Realtime Video Latent Diffusion
  LTX-Video integrates Video-VAE and transformer for 1:192 latent compression and real-time video diffusion by moving patchifying to the VAE and letting the decoder finish denoising in pixel space.
- CameraCtrl: Enabling Camera Control for Text-to-Video Generation
  CameraCtrl enables accurate camera pose control in video diffusion models through a trained plug-and-play module and dataset choices emphasizing diverse camera trajectories with matching appearance.
- IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models
  IP-Adapter adds effective image prompting to text-to-image diffusion models using a lightweight decoupled cross-attention adapter that works alongside text prompts and other controls.
- SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
  SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
- Make-A-Video: Text-to-Video Generation without Text-Video Data
  Make-A-Video achieves state-of-the-art text-to-video generation by decomposing temporal U-Net and attention structures to add space-time modeling to text-to-image models, trained without any paired text-video data.
- Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
  Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
- Empty SPACE: Cross-Attention Sparsity for Concept Erasure in Diffusion Models
  SPACE induces sparsity in cross-attention parameters via closed-form iterative updates to erase target concepts more effectively than dense baselines in large diffusion models.
- Micro-Defects Expose Macro-Fakes: Detecting AI-Generated Images via Local Distributional Shifts
  MDMF detects AI-generated images by learning patch-level forensic signatures and quantifying their distributional discrepancies with MMD, yielding larger separation than global methods when micro-defects are present.
- AI-Generated Images: What Humans and Machines See When They Look at the Same Image
  Researchers train AI detectors on a large photorealistic fake image dataset, apply 16 XAI methods, and use human survey feedback to assess alignment between machine explanations and human perception of AI-generated images.
- DiffMagicFace: Identity Consistent Facial Editing of Real Videos
  DiffMagicFace uses concurrent fine-tuned text and image diffusion models plus a rendered multi-view dataset to achieve identity-consistent text-conditioned editing of real facial videos.
- MegaStyle: Constructing Diverse and Scalable Style Dataset via Consistent Text-to-Image Style Mapping
  A scalable pipeline generates an intra-consistent, inter-diverse 1.4M style image dataset from text-to-image models and uses it to train a style encoder and generalizable style transfer model.
- LTX-2: Efficient Joint Audio-Visual Foundation Model
  LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
- Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
  Show-o unifies autoregressive and discrete diffusion modeling inside one transformer to support multimodal understanding and generation tasks with competitive benchmark performance.
- Conditional Flow-VAE for Safety-Critical Traffic Scenario Generation
  A conditional flow matching model generates realistic safety-critical traffic scenarios by turning nominal scenes into dangerous rollouts using combined simulation and real data.
- Adaptive Forensic Feature Refinement via Intrinsic Importance Perception
  I2P adaptively selects the most discriminative layers from visual foundation models for synthetic image detection and constrains task updates to low-sensitivity parameter subspaces to improve specificity without harmi...
- Training-Free Object-Background Compositional T2I via Dynamic Spatial Guidance and Multi-Path Pruning
  A training-free method with time-dependent attention gating and trajectory pruning enhances object-background balance in diffusion-based image synthesis.
- ModelScope Text-to-Video Technical Report
  ModelScopeT2V is a 1.7-billion-parameter text-to-video model built on Stable Diffusion that adds temporal modeling and outperforms prior methods on three evaluation metrics.
Reference graph
Works this paper leans on
- [1] Avrahami, O., Lischinski, D., and Fried, O. Blended diffusion for text-driven editing of natural images. arXiv:2111.14818.
- [2] Bau, D., Andonian, A., Cui, A., Park, Y., Jahanian, A., Oliva, A., and Torralba, A. Paint by word. arXiv:2103.10951.
- [3] Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. arXiv:1809.11096.
- [4] Crowson, K. CLIP guided diffusion HQ 256x256. https://colab.research.google.com/drive/12a_Wrfi2_gwwAuN3VvMTwVMz9TfqctNj, 2021a. Crowson, K. CLIP guided diffusion 512x512, secondary model method. https://twitter.com/RiversHaveWings/status/1462859669454536711, 2021b. Dhariwal, P. and Nichol, A. Diffusion models beat GANs on image synthesis.
- [5] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929.
- [6] Gal, R., Patashnik, O., Maron, H., Chechik, G., and Cohen-Or, D. StyleGAN-NADA: CLIP-guided domain adaptation of image generators. arXiv:2108.00946.
- [7] Galatolo, F. A., Cimino, M. G. C. A., and Vaglini, G. Generating images from caption and vice versa via CLIP-guided generative latent space search. arXiv:2102.01645.
- [8] Gu, S., Chen, D., Bao, J., Wen, F., Zhang, B., Chen, D., Yuan, L., and Guo, B. Vector quantized diffusion model for text-to-image synthesis. arXiv:2111.14822.
- [9] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017).
- [10] Ho, J. and Salimans, T. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications. URL https://openreview.net/forum?id=qw8AKxfYbI.
- [11] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. arXiv:2006.11239.
- [12] Ho, J., Saharia, C., Chan, W., Fleet, D. J., Norouzi, M., and Salimans, T. Cascaded diffusion models for high fidelity image generation. arXiv:2106.15282.
- [13] Karras, T., Laine, S., and Aila, T. A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2019a. Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of StyleGAN. arXiv:1912.04958, 2019b. Kim, G. and Ye, J. C. DiffusionCLIP: Text-guided image ma...
- [14] Kynkäänniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. arXiv:1904.06991.
- [15] Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. SDEdit: Guided image synthesis and editing with stochastic differential equations. arXiv:2108.01073.
- [16] Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., and Wu, H. Mixed precision training. arXiv:1710.03740.
- [17] Murdock, R. The Big Sleep. https://twitter.com/advadnoun/status/1351038053033406468.
- [18] Nichol, A. and Dhariwal, P. Improved denoising diffusion probabilistic models. arXiv:2102.09672.
- [19] Patashnik, O., Wu, Z., Shechtman, E., Cohen-Or, D., and Lischinski, D. StyleCLIP: Text-driven manipulation of StyleGAN imagery. arXiv:2103.17249.
- [20] Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision. arXiv:2103.00020.
- [21] Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. arXiv:2102.12092.
- [22] Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446, 2019.
- [23] Saharia, C., Chan, W., Chang, H., Lee, C. A., Ho, J., Salimans, T., Fleet, D. J., and Norouzi, M. Palette: Image-to-image diffusion models. arXiv:2111.05826, 2021a. Saharia, C., Ho, J., Chan, W., Salimans, T., Fleet, D. J., and Norouzi, M. Image super-resolution via iterative refinement. arXiv:2104.07636, 2021b. Salimans, T., Goodfellow, I., Zarem...
- [24] Santurkar, S., Tsipras, D., Tran, B., Ilyas, A., Engstrom, L., and Madry, A. Image synthesis with a single (robust) classifier. arXiv:1906.09453.
- [25] Sohl-Dickstein, J., Weiss, E. A., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585.
- [26] Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv:2010.02502, 2020a. Song, Y. and Ermon, S. Improved techniques for training score-based generative models. arXiv:2006.09011, 2020a. Song, Y. and Ermon, S. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020b. Song, Y., Sohl-Dicks...
- [27] Tao, M., Tang, H., Wu, S., Sebe, N., Jing, X.-Y., Wu, F., and Bao, B. DF-GAN: Deep fusion generative adversarial networks for text-to-image synthesis. arXiv:2008.05865.
- [28] van den Oord, A., Vinyals, O., and Kavukcuoglu, K. Neural discrete representation learning. arXiv:1711.00937.
- [29] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. arXiv:1706.03762.
- [30] Xu, T., Zhang, P., Huang, Q., Zhang, H., Gan, Z., Huang, X., and He, X. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. arXiv:1711.10485.
- [31] Ye, H., Yang, X., Takac, M., Sunderraman, R., and Ji, S. Improving text-to-image synthesis using contrastive learning. arXiv:2107.02423.
- [32] Zhang, H., Koh, J. Y., Baldridge, J., Lee, H., and Yang, Y. Cross-modal contrastive learning for text-to-image generation. arXiv:2101.04702.
- [33] Zhou, S., Gordon, M. L., Krishna, R., Narcomey, A., Fei-Fei, L., and Bernstein, M. S. HYPE: A benchmark for human eye perceptual evaluation of generative models.
- [34] Zhou, Y., Zhang, R., Chen, C., Li, C., Tensmeyer, C., Yu, T., Gu, J., Xu, J., and Sun, T. LAFITE: Towards language-free training for text-to-image generation. arXiv:2111.13792.
- [35] Zhu, M., Pan, P., Chen, W., and Yang, Y. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis. arXiv:1904.01310.