pith. machine review for the scientific record.

arxiv: 2105.05233 · v4 · submitted 2021-05-11 · 💻 cs.LG · cs.AI · cs.CV · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal, Alex Nichol

Pith reviewed 2026-05-13 11:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · stat.ML
keywords diffusion models · image synthesis · GANs · classifier guidance · FID score · ImageNet · generative models · upsampling

The pith

Diffusion models achieve higher image sample quality than GANs on ImageNet through architecture improvements and classifier guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion models, which iteratively remove noise to form images, can surpass generative adversarial networks in sample quality for both unconditional and conditional image synthesis. Architectural refinements identified through systematic ablations boost unconditional performance, while classifier guidance steers sampling with classifier gradients to improve fidelity at the cost of some diversity. This combination sets new state-of-the-art FID scores of 2.97 on ImageNet 128x128, 4.59 on 256x256, and 7.72 on 512x512, and it matches top GANs with far fewer sampling steps while covering the data distribution more fully. The work positions diffusion models as a competitive or superior option for high-quality image generation.
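The iterative denoising described above can be sketched as a generic DDPM-style reverse loop. This is a minimal illustration under standard DDPM notation, not the paper's released code; `denoise_model` and the `betas` schedule are stand-ins for a trained network and its noise schedule.

```python
import numpy as np

def ddpm_sample(denoise_model, shape, betas, rng):
    """Minimal DDPM-style reverse loop: start from Gaussian noise and
    iteratively remove the noise the model predicts. Illustrative only."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_hat = denoise_model(x, t)  # model's predicted noise at step t
        # Posterior mean of x_{t-1} given x_t under the DDPM parameterization
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise for intermediate steps; the last step is deterministic
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x
```

The paper's contribution is not this loop itself, which is standard DDPM sampling, but the architecture and the guidance term applied inside it.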

Core claim

Diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. For unconditional synthesis a better architecture is found through a series of ablations. For conditional synthesis classifier guidance further improves quality by trading off diversity for fidelity using gradients from a classifier. The models reach FID scores of 2.97 on ImageNet 128x128, 4.59 on 256x256, and 7.72 on 512x512, match BigGAN-deep with as few as 25 forward passes, and maintain better distribution coverage. Classifier guidance also combines effectively with upsampling diffusion models to reach even lower FID values.

What carries the argument

Classifier guidance, a sampling technique that uses gradients from a pre-trained classifier to steer the reverse diffusion process toward higher-fidelity outputs.
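Per the paper, classifier guidance shifts the reverse-step mean along the gradient of the classifier's log-probability, scaled by the step variance and a guidance scale s. A minimal sketch of that single step follows; the inputs are assumed to come from a trained diffusion model and a classifier trained on noisy images.

```python
import numpy as np

def guided_mean(mean, variance, classifier_grad, scale):
    """Classifier-guided mean shift: mu_hat = mu + s * Sigma * grad_x log p(y | x_t).
    `mean` and `variance` are the diffusion model's reverse-step Gaussian
    parameters; `classifier_grad` is the classifier's log-prob gradient
    with respect to the noisy sample x_t. Sketch only."""
    return mean + scale * variance * classifier_grad
```

At scale s = 0 this reduces to unguided sampling; larger s sharpens class fidelity while narrowing diversity, which is exactly the fidelity-diversity trade-off described above.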

If this is right

  • Diffusion models match or exceed prior GAN performance while using only 25 sampling steps per image.
  • Classifier guidance enables an explicit, compute-efficient trade-off between sample fidelity and diversity.
  • Combining guidance with upsampling diffusion models yields further FID reductions to 3.94 on ImageNet 256x256.
  • The generated samples cover the target distribution more completely than the compared GAN baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same guidance approach may extend to other data types such as video or audio if suitable classifiers exist.
  • Performance could degrade on domains where high-accuracy classifiers are unavailable or expensive to train.
  • Subsequent work could test whether similar ablations applied to GANs would close the reported quality gap.
  • Wider use might encourage replacing adversarial objectives with iterative denoising in many generative pipelines.

Load-bearing premise

That the architecture improvements found by ablation and the classifier guidance method will generalize to other datasets and tasks without substantial extra tuning.

What would settle it

An experiment in which a new GAN variant records a lower FID than 2.97 on ImageNet 128x128 would show the claimed superiority does not hold.

read the original abstract

We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128$\times$128, 4.59 on ImageNet 256$\times$256, and 7.72 on ImageNet 512$\times$512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256$\times$256 and 3.85 on ImageNet 512$\times$512. We release our code at https://github.com/openai/guided-diffusion

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that diffusion models can achieve image sample quality superior to current state-of-the-art generative models such as BigGAN. This is demonstrated on unconditional ImageNet synthesis via a series of architecture ablations, and on conditional synthesis via the introduction of classifier guidance, which trades off diversity for fidelity using classifier gradients. Reported results include FID scores of 2.97 (128x128), 4.59 (256x256), and 7.72 (512x512), with matching or better performance than BigGAN-deep using as few as 25 sampling steps while maintaining superior coverage; further gains are shown when combining classifier guidance with upsampling diffusion models.

Significance. If the empirical results hold, the work is significant because it provides the first clear demonstration that diffusion models can outperform leading GANs on high-resolution image synthesis benchmarks, supported by extensive ablations, direct quantitative comparisons, and released code for reproducibility. Classifier guidance offers a simple, compute-efficient mechanism for controlling the fidelity-diversity tradeoff, and the findings suggest diffusion models as a strong alternative paradigm with better distribution coverage.

minor comments (3)
  1. [Section 3.2] The explanation of how classifier gradients are scaled and added during sampling would benefit from an explicit equation showing the modified mean prediction step.
  2. [Figure 5] The legend and axis labels on the coverage vs. FID scatter plots are slightly crowded; increasing font size or splitting into two panels would improve readability.
  3. [Table 2] Clarify whether the reported FID values for the 25-step regime use the same classifier guidance scale as the full 250-step results or a separately tuned value.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept. We appreciate the recognition of the work's significance in demonstrating that diffusion models can outperform leading GANs on high-resolution image synthesis, along with the value placed on the ablations, quantitative comparisons, classifier guidance mechanism, and code release.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claims rest on new empirical results: architecture ablations for unconditional diffusion models and classifier guidance for conditional synthesis, with direct FID reporting on ImageNet 128/256/512 and explicit comparisons to BigGAN. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or ansatz smuggled from prior work; the reported improvements are demonstrated through fresh experiments and released code rather than derived from the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests primarily on empirical validation and prior diffusion model foundations rather than new theoretical axioms or invented entities.

free parameters (1)
  • classifier guidance scale
    Hyperparameter tuned across experiments to trade fidelity against diversity; values are selected based on FID performance on validation splits.
axioms (1)
  • domain assumption: The forward diffusion process can be reversed by learning a denoising network.
    Invoked throughout as the core mechanism of diffusion models, drawn from prior literature.
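The premise above can be stated in the standard diffusion notation. This is a sketch following the usual DDPM convention, not equations copied from the paper under review:

```latex
% Forward (noising) process with fixed variance schedule beta_t:
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)
% Learned reverse (denoising) process:
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
```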

pith-pipeline@v0.9.0 · 5484 in / 1164 out tokens · 33054 ms · 2026-05-13T11:12:38.168916+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.

  2. Classifier-Free Diffusion Guidance

    cs.LG 2022-07 unverdicted novelty 8.0

    Classifier-free guidance trades off sample quality and diversity in conditional diffusion models by combining scores from jointly trained conditional and unconditional models.

  3. Tempered Guided Diffusion

    stat.ML 2026-05 unverdicted novelty 7.0

    Tempered Guided Diffusion uses annealed SMC to produce consistent particle approximations to the posterior for training-free conditional diffusion sampling, outperforming independent guided trajectories in experiments.

  4. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  5. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  6. Progressive Distillation for Fast Sampling of Diffusion Models

    cs.LG 2022-02 unverdicted novelty 7.0

    Progressive distillation halves sampling steps repeatedly in diffusion models, reaching 4 steps with FID 3.0 on CIFAR-10 from 8192-step samplers.

  7. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  8. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    cs.CV 2021-12 accept novelty 7.0

    A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.

  9. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    cs.CV 2021-08 conditional novelty 7.0

    SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training.

  10. A unified perspective on fine-tuning and sampling with diffusion and flow models

    stat.ML 2026-04 unverdicted novelty 6.0

    A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...

  11. DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

  12. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  13. DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...

  14. Deepfake Detection Generalization with Diffusion Noise

    cs.CV 2026-04 unverdicted novelty 6.0

    ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.

  15. U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

    cs.LG 2026-04 conditional novelty 6.0

    A standard U-Net with MAE pre-training followed by short CRPS fine-tuning via Monte Carlo Dropout matches or exceeds GenCast and IFS ENS probabilistic skill at 1.5° resolution while cutting training compute and infere...

  16. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  17. Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions

    stat.ML 2026-04 unverdicted novelty 6.0

    A measurement-aware forward process for score-based data assimilation yields an exact likelihood score for linear measurements by construction.

  18. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  19. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  20. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  21. On the Tradeoffs of On-Device Generative Models in Federated Predictive Maintenance Systems

    cs.LG 2026-05 unverdicted novelty 5.0

    Experiments on real industrial time series show that partial model sharing improves diffusion model performance in bandwidth-limited non-IID settings, while full sharing stabilizes GAN training but offers less robustn...

  22. Score-Based Matching with Target Guidance for Cryo-EM Denoising

    cs.CV 2026-04 unverdicted novelty 5.0

    Score-based denoising with reference-density guidance improves particle-background separability and downstream 3D reconstruction consistency on cryo-EM datasets.

  23. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  24. SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection

    cs.CV 2026-05 unverdicted novelty 4.0

    SPECTRA-Net fuses multi-view tensor representations from vision foundation models, spectral analysis, local anomaly detection, and statistical descriptors to achieve state-of-the-art cross-domain AI-generated image de...

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 23 Pith papers · 18 internal anchors

  1. [1]

    A learning algorithm for boltzmann machines

    David Ackley, Geoffrey Hinton, and Terrence Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147-169, 1985

  2. [2]

    The big sleep

    Adverb. The big sleep. https://twitter.com/advadnoun/status/1351038053033406468, 2021

  3. [3]

    A note on the Inception Score

    Shane Barratt and Rishi Sharma. A note on the inception score. arXiv:1801.01973, 2018

  4. [4]

    Neural photo editing with introspective adversarial networks

    Andrew Brock, Theodore Lim, J. M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv:1609.07093, 2016

  5. [5]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096, 2018

  6. [6]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  7. [7]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020

  8. [8]

    WaveGrad: Estimating Gradients for Waveform Generation

    Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv:2009.00713, 2020

  9. [9]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. arXiv:2011.10650, 2021

  10. [10]

    The helmholtz machine

    Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995

  11. [11]

    Modulating early visual processing by language

    Harm de Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron Courville. Modulating early visual processing by language. arXiv:1707.00683, 2017

  12. [12]

    Biggan-deep 128x128 on tensorflow hub

    DeepMind. Biggan-deep 128x128 on tensorflow hub. https://tfhub.dev/deepmind/biggan-deep-128/1, 2018

  13. [13]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv:2005.00341, 2020

  14. [14]

    Large scale adversarial representation learning

    Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv:1907.02544, 2019

  15. [15]

    Implicit generation and generalization in energy-based models

    Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv:1903.08689, 2019

  16. [16]

    A learned representation for artistic style

    Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. arXiv:1610.07629, 2017

  17. [17]

    Generating images from caption and vice versa via CLIP-guided generative latent space search

    Federico A. Galatolo, Mario G. C. A. Cimino, and Gigliola Vaglini. Generating images from caption and vice versa via clip-guided generative latent space search. arXiv:2102.01645, 2021

  18. [18]

    Learning energy-based models by diffusion recovery likelihood

    Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P. Kingma. Learning energy-based models by diffusion recovery likelihood. arXiv:2012.08125, 2020

  19. [19]

    Generative Adversarial Networks

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv:1406.2661, 2014

  20. [20]

    Cloud tpus

    Google. Cloud tpus. https://cloud.google.com/tpu/, 2018

  21. [21]

    Variational walkback: Learning a transition operator as a stochastic recurrent net

    Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. arXiv:1711.02282, 2017

  22. [22]

    Your classifier is secretly an energy based model and you should treat it like one

    Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv:1912.03263, 2019

  23. [23]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017) , 2017

  24. [24]

    Training products of experts by minimizing contrastive divergence

    Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002

  25. [25]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv:2006.11239, 2020

  26. [26]

    Adversarial score matching and improved sampling for image generation

    Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. arXiv:2009.05475, 2020

  27. [27]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2019

  28. [28]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. arXiv:1912.04958, 2019

  29. [29]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014

  30. [30]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv:2009.09761, 2020

  31. [31]

    CIFAR-10 (Canadian Institute for Advanced Research), 2009

    Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html

  32. [32]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. arXiv:1904.06991, 2019

  33. [33]

    Refinenet: Multi-path refinement networks for high-resolution semantic segmentation

    Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. arXiv:1611.06612, 2016

  34. [34]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV) , December 2015

  35. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017

  36. [36]

    High-fidelity image generation with fewer labels

    Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, and Sylvain Gelly. High-fidelity image generation with fewer labels. arXiv:1903.02271, 2019

  37. [37]

    Knowledge distillation in iterative generative models for improved sampling speed

    Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv:2101.02388, 2021

  38. [38]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. arXiv:1710.03740, 2017

  39. [39]

    Conditional Generative Adversarial Nets

    Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014

  40. [40]

    cgans with projection discriminator

    Takeru Miyato and Masanori Koyama. cgans with projection discriminator. arXiv:1802.05637, 2018

  41. [41]

    Spectral Normalization for Generative Adversarial Networks

    Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv:1802.05957, 2018

  42. [42]

    Generating Images with Sparse Representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W. Battaglia. Generating images with sparse representations. arXiv:2103.03841, 2021

  43. [43]

    Improved Denoising Diffusion Probabilistic Models

    Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv:2102.09672, 2021

  44. [44]

    Stylegan2

    NVIDIA. Stylegan2. https://github.com/NVlabs/stylegan2, 2019

  45. [45]

    On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv:2104.11222, 2021

  46. [46]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv:1912.01703, 2019

  47. [47]

    Styleclip: Text-driven manipulation of stylegan imagery

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv:2103.17249, 2021

  48. [48]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. arXiv:1709.07871, 2017

  49. [49]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021

  50. [50]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv:2102.12092, 2021

  51. [51]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446, 2019

  52. [52]

    ImageNet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014

  53. [53]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv:2104.07636, 2021

  54. [54]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. arXiv:1606.03498, 2016

  55. [55]

    Image synthesis with a single (robust) classifier

    Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image synthesis with a single (robust) classifier. arXiv:1906.09453, 2019

  56. [56]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585, 2015

  57. [57]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020

  58. [58]

    Improved techniques for training score-based generative models

    Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv:2006.09011, 2020

  59. [59]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020

  60. [60]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456, 2020

  61. [61]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013

  62. [62]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015

  63. [63]

    NVAE: A Deep Hierarchical Variational Autoencoder

    Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. arXiv:2007.03898, 2020

  64. [64]

    WaveNet: A Generative Model for Raw Audio

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv:1609.03499, 2016

  65. [65]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv:1711.00937, 2017

  66. [66]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv:1706.03762, 2017

  67. [67]

    Bayesian learning via stochastic gradient langevin dynamics

    Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11) , pages 681–688. Citeseer, 2011

  68. [68]

    Logan: Latent optimisation for generative adversarial networks

    Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, and Timothy Lillicrap. Logan: Latent optimisation for generative adversarial networks. arXiv:1912.00953, 2019

  69. [69]

    Group normalization

    Yuxin Wu and Kaiming He. Group normalization. arXiv:1803.08494, 2018

  70. [70]

    A theory of generative convnet

    Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A theory of generative convnet. arXiv:1602.03264, 2016

  71. [71]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015

  72. [72]

    Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks

    Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv:1612.03242, 2016

  73. [73]

    Ligeng Zhu. Thop. https://github.com/Lyken17/pytorch-OpCounter, 2018