pith. machine review for the scientific record.

arxiv: 2105.05233 · v4 · submitted 2021-05-11 · 💻 cs.LG · cs.AI · cs.CV · stat.ML

Recognition: 2 theorem links

· Lean Theorem

Diffusion Models Beat GANs on Image Synthesis

Prafulla Dhariwal, Alex Nichol

Pith reviewed 2026-05-13 11:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CV · stat.ML
keywords diffusion models · image synthesis · GANs · classifier guidance · FID score · ImageNet · generative models · upsampling

The pith

Diffusion models achieve higher image sample quality than GANs on ImageNet through architecture improvements and classifier guidance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion models, which iteratively remove noise to form images, can surpass generative adversarial networks in sample quality for both unconditional and conditional image synthesis. Architectural refinements identified through systematic ablations boost unconditional performance, while classifier guidance steers sampling with classifier gradients to improve fidelity at the cost of some diversity. This combination sets new state-of-the-art FID scores of 2.97 on ImageNet 128x128, 4.59 on 256x256, and 7.72 on 512x512, and it matches top GANs with far fewer sampling steps while covering the data distribution more fully. The work positions diffusion models as a competitive or superior option for high-quality image generation.
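The iterative denoising described above can be sketched as a generic DDPM-style reverse loop. This is a minimal illustration under standard DDPM notation, not the paper's released code; `denoise_model` and the `betas` schedule are stand-ins for a trained network and its noise schedule.

```python
import numpy as np

def ddpm_sample(denoise_model, shape, betas, rng):
    """Minimal DDPM-style reverse loop: start from Gaussian noise and
    iteratively remove the noise the model predicts. Illustrative only."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)  # x_T ~ N(0, I)
    for t in reversed(range(len(betas))):
        eps_hat = denoise_model(x, t)  # model's predicted noise at step t
        # Posterior mean of x_{t-1} given x_t under the DDPM parameterization
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # Add fresh noise for intermediate steps; the last step is deterministic
            x = mean + np.sqrt(betas[t]) * rng.standard_normal(shape)
        else:
            x = mean
    return x
```

The paper's contribution is not this loop itself, which is standard DDPM sampling, but the architecture and the guidance term applied inside it.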

Core claim

Diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. For unconditional synthesis a better architecture is found through a series of ablations. For conditional synthesis classifier guidance further improves quality by trading off diversity for fidelity using gradients from a classifier. The models reach FID scores of 2.97 on ImageNet 128x128, 4.59 on 256x256, and 7.72 on 512x512, match BigGAN-deep with as few as 25 forward passes, and maintain better distribution coverage. Classifier guidance also combines effectively with upsampling diffusion models to reach even lower FID values.

What carries the argument

Classifier guidance, a sampling technique that uses gradients from a pre-trained classifier to steer the reverse diffusion process toward higher-fidelity outputs.
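Per the paper, classifier guidance shifts the reverse-step mean along the gradient of the classifier's log-probability, scaled by the step variance and a guidance scale s. A minimal sketch of that single step follows; the inputs are assumed to come from a trained diffusion model and a classifier trained on noisy images.

```python
import numpy as np

def guided_mean(mean, variance, classifier_grad, scale):
    """Classifier-guided mean shift: mu_hat = mu + s * Sigma * grad_x log p(y | x_t).
    `mean` and `variance` are the diffusion model's reverse-step Gaussian
    parameters; `classifier_grad` is the classifier's log-prob gradient
    with respect to the noisy sample x_t. Sketch only."""
    return mean + scale * variance * classifier_grad
```

At scale s = 0 this reduces to unguided sampling; larger s sharpens class fidelity while narrowing diversity, which is exactly the fidelity-diversity trade-off described above.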

If this is right

  • Diffusion models match or exceed prior GAN performance while using only 25 sampling steps per image.
  • Classifier guidance enables an explicit, compute-efficient trade-off between sample fidelity and diversity.
  • Combining guidance with upsampling diffusion models yields further FID reductions to 3.94 on ImageNet 256x256.
  • The generated samples cover the target distribution more completely than the compared GAN baselines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same guidance approach may extend to other data types such as video or audio if suitable classifiers exist.
  • Performance could degrade on domains where high-accuracy classifiers are unavailable or expensive to train.
  • Subsequent work could test whether similar ablations applied to GANs would close the reported quality gap.
  • Wider use might encourage replacing adversarial objectives with iterative denoising in many generative pipelines.

Load-bearing premise

That the architecture improvements found by ablation and the classifier guidance method will generalize to other datasets and tasks without substantial extra tuning.

What would settle it

An experiment in which a new GAN variant records a lower FID than 2.97 on ImageNet 128x128 would show the claimed superiority does not hold.

read the original abstract

We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for fidelity using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128$\times$128, 4.59 on ImageNet 256$\times$256, and 7.72 on ImageNet 512$\times$512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.94 on ImageNet 256$\times$256 and 3.85 on ImageNet 512$\times$512. We release our code at https://github.com/openai/guided-diffusion

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper claims that diffusion models can achieve image sample quality superior to current state-of-the-art generative models such as BigGAN. This is demonstrated on unconditional ImageNet synthesis via a series of architecture ablations, and on conditional synthesis via the introduction of classifier guidance, which trades off diversity for fidelity using classifier gradients. Reported results include FID scores of 2.97 (128x128), 4.59 (256x256), and 7.72 (512x512), with matching or better performance than BigGAN-deep using as few as 25 sampling steps while maintaining superior coverage; further gains are shown when combining classifier guidance with upsampling diffusion models.

Significance. If the empirical results hold, the work is significant because it provides the first clear demonstration that diffusion models can outperform leading GANs on high-resolution image synthesis benchmarks, supported by extensive ablations, direct quantitative comparisons, and released code for reproducibility. Classifier guidance offers a simple, compute-efficient mechanism for controlling the fidelity-diversity tradeoff, and the findings suggest diffusion models as a strong alternative paradigm with better distribution coverage.

minor comments (3)
  1. [Section 3.2] The explanation of how classifier gradients are scaled and added during sampling would benefit from an explicit equation showing the modified mean prediction step.
  2. [Figure 5] The legend and axis labels on the coverage vs. FID scatter plots are slightly crowded; increasing font size or splitting into two panels would improve readability.
  3. [Table 2] Clarify whether the reported FID values for the 25-step regime use the same classifier guidance scale as the full 250-step results or a separately tuned value.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept. We appreciate the recognition of the work's significance in demonstrating that diffusion models can outperform leading GANs on high-resolution image synthesis, along with the value placed on the ablations, quantitative comparisons, classifier guidance mechanism, and code release.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claims rest on new empirical results: architecture ablations for unconditional diffusion models and classifier guidance for conditional synthesis, with direct FID reporting on ImageNet 128/256/512 and explicit comparisons to BigGAN. No load-bearing step reduces by construction to a fitted parameter, self-citation chain, or ansatz smuggled from prior work; the reported improvements are demonstrated through fresh experiments and released code rather than derived from the paper's own inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests primarily on empirical validation and prior diffusion model foundations rather than new theoretical axioms or invented entities.

free parameters (1)
  • classifier guidance scale
    Hyperparameter tuned across experiments to trade fidelity against diversity; values are selected based on FID performance on validation splits.
axioms (1)
  • domain assumption: The forward diffusion process can be reversed by learning a denoising network.
    Invoked throughout as the core mechanism of diffusion models, drawn from prior literature.
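The premise above can be stated in the standard diffusion notation. This is a sketch following the usual DDPM convention, not equations copied from the paper under review:

```latex
% Forward (noising) process with fixed variance schedule beta_t:
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big)
% Learned reverse (denoising) process:
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big)
```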

pith-pipeline@v0.9.0 · 5484 in / 1164 out tokens · 33054 ms · 2026-05-13T11:12:38.168916+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Autoregressive Learning in Joint KL: Sharp Oracle Bounds and Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    Joint KL yields horizon-free approximation but an information-theoretic lower bound of order Omega(H) for estimation error in autoregressive learning, with matching computationally efficient upper bounds.

  2. Classifier-Free Diffusion Guidance

    cs.LG 2022-07 unverdicted novelty 8.0

    Classifier-free guidance trades off sample quality and diversity in conditional diffusion models by combining scores from jointly trained conditional and unconditional models.

  3. Tempered Guided Diffusion

    stat.ML 2026-05 unverdicted novelty 7.0

    Tempered Guided Diffusion uses annealed SMC to produce consistent particle approximations to the posterior for training-free conditional diffusion sampling, outperforming independent guided trajectories in experiments.

  4. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 7.0

    FVD applies Fleming-Viot population dynamics to diffusion model sampling at inference time to reduce diversity collapse while improving reward alignment and FID scores.

  5. Hierarchical Text-Conditional Image Generation with CLIP Latents

    cs.CV 2022-04 accept novelty 7.0

    A hierarchical prior-decoder model using CLIP latents generates more diverse text-conditional images than direct methods while preserving photorealism and caption fidelity.

  6. Progressive Distillation for Fast Sampling of Diffusion Models

    cs.LG 2022-02 unverdicted novelty 7.0

    Progressive distillation halves sampling steps repeatedly in diffusion models, reaching 4 steps with FID 3.0 on CIFAR-10 from 8192-step samplers.

  7. High-Resolution Image Synthesis with Latent Diffusion Models

    cs.CV 2021-12 conditional novelty 7.0

    Latent diffusion models achieve state-of-the-art inpainting and competitive results on unconditional generation, scene synthesis, and super-resolution by performing the diffusion process in the latent space of pretrai...

  8. GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    cs.CV 2021-12 accept novelty 7.0

    A 3.5-billion-parameter diffusion model with classifier-free guidance generates images preferred over DALL-E by human raters and can be fine-tuned for text-guided inpainting.

  9. SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    cs.CV 2021-08 conditional novelty 7.0

    SDEdit performs guided image synthesis and editing by adding noise to inputs and refining them via denoising with a diffusion model's SDE prior, outperforming GAN methods in human studies without task-specific training.

  10. A unified perspective on fine-tuning and sampling with diffusion and flow models

    stat.ML 2026-04 unverdicted novelty 6.0

    A unified framework for exponential tilting in diffusion and flow models that includes bias-variance decompositions showing finite gradient variance for some methods, norm bounds on adjoint ODEs, and adapted losses wi...

  11. DDA-Thinker: Decoupled Dual-Atomic Reinforcement Learning for Reasoning-Driven Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    DDA-Thinker decouples planning from generation and applies dual-atomic RL with checklist-based rewards to boost reasoning in image editing, yielding competitive results on RISE-Bench and KRIS-Bench.

  12. Normalizing Flows with Iterative Denoising

    cs.CV 2026-04 unverdicted novelty 6.0

    iTARFlow augments normalizing flows with diffusion-style iterative denoising during sampling while preserving end-to-end likelihood training, reaching competitive results on ImageNet 64/128/256.

  13. DVAR: Adversarial Multi-Agent Debate for Video Authenticity Detection

    cs.CV 2026-04 unverdicted novelty 6.0

    DVAR turns video authenticity detection into an iterative debate between a generative hypothesis agent and a natural mechanism agent, resolved via minimum description length and a knowledge base for better generalizat...

  14. Deepfake Detection Generalization with Diffusion Noise

    cs.CV 2026-04 unverdicted novelty 6.0

    ANL uses diffusion noise prediction and attention to regularize deepfake detectors for better generalization to unseen synthesis methods without added inference cost.

  15. U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

    cs.LG 2026-04 conditional novelty 6.0

    A standard U-Net with MAE pre-training followed by short CRPS fine-tuning via Monte Carlo Dropout matches or exceeds GenCast and IFS ENS probabilistic skill at 1.5° resolution while cutting training compute and infere...

  16. VASR: Variance-Aware Systematic Resampling for Reward-Guided Diffusion

    cs.AI 2026-04 unverdicted novelty 6.0

    VASR separates continuation and residual variance in reward-guided diffusion SMC, using optimal mass allocation and systematic resampling to achieve up to 26% better FID scores and faster runtimes than prior SMC and M...

  17. Rethinking Forward Processes for Score-Based Data Assimilation in High Dimensions

    stat.ML 2026-04 unverdicted novelty 6.0

    A measurement-aware forward process for score-based data assimilation yields an exact likelihood score for linear measurements by construction.

  18. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  19. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  20. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  21. On the Tradeoffs of On-Device Generative Models in Federated Predictive Maintenance Systems

    cs.LG 2026-05 unverdicted novelty 5.0

    Experiments on real industrial time series show that partial model sharing improves diffusion model performance in bandwidth-limited non-IID settings, while full sharing stabilizes GAN training but offers less robustn...

  22. Score-Based Matching with Target Guidance for Cryo-EM Denoising

    cs.CV 2026-04 unverdicted novelty 5.0

    Score-based denoising with reference-density guidance improves particle-background separability and downstream 3D reconstruction consistency on cryo-EM datasets.

  23. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  24. SPECTRA-Net: Scalable Pipeline for Explainable Cross-domain Tensor Representations for AI-generated Images Detection

    cs.CV 2026-05 unverdicted novelty 4.0

    SPECTRA-Net fuses multi-view tensor representations from vision foundation models, spectral analysis, local anomaly detection, and statistical descriptors to achieve state-of-the-art cross-domain AI-generated image de...

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · cited by 23 Pith papers · 18 internal anchors

  1. [1]

    A learning algorithm for boltzmann machines

    David Ackley, Geoffrey Hinton, and Terrence Sejnowski. A learning algorithm for boltzmann machines. Cognitive science, 9(1):147-169, 1985

  2. [2]

    The big sleep

    Adverb. The big sleep. https://twitter.com/advadnoun/status/1351038053033406468, 2021

  3. [3]

    A note on the Inception Score

    Shane Barratt and Rishi Sharma. A note on the inception score. arXiv:1801.01973, 2018

  4. [4]

    Neural photo editing with introspective adversarial networks

    Andrew Brock, Theodore Lim, J. M. Ritchie, and Nick Weston. Neural photo editing with introspective adversarial networks. arXiv:1609.07093, 2016

  5. [5]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv:1809.11096, 2018

  6. [6]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  7. [7]

    Generative pretraining from pixels

    Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International Conference on Machine Learning, pages 1691–1703. PMLR, 2020

  8. [8]

    WaveGrad: Estimating Gradients for Waveform Generation

    Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, and William Chan. Wavegrad: Estimating gradients for waveform generation. arXiv:2009.00713, 2020

  9. [9]

    Very deep vaes generalize autoregressive models and can outperform them on images

    Rewon Child. Very deep vaes generalize autoregressive models and can outperform them on images. arXiv:2011.10650, 2021

  10. [10]

    The helmholtz machine

    Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The helmholtz machine. Neural computation, 7(5):889–904, 1995

  11. [11]

    Modulating early visual processing by language

    Harm de Vries, Florian Strub, Jérémie Mary, Hugo Larochelle, Olivier Pietquin, and Aaron Courville. Modulating early visual processing by language. arXiv:1707.00683, 2017

  12. [12]

    Biggan-deep 128x128 on tensorflow hub

    DeepMind. Biggan-deep 128x128 on tensorflow hub. https://tfhub.dev/deepmind/biggan-deep-128/1, 2018

  13. [13]

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever. Jukebox: A generative model for music. arXiv:2005.00341, 2020

  14. [14]

    Large scale adversarial representation learning

    Jeff Donahue and Karen Simonyan. Large scale adversarial representation learning. arXiv:1907.02544, 2019

  15. [15]

    Implicit generation and generalization in energy-based models

    Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv:1903.08689, 2019

  16. [16]

    A learned representation for artistic style

    Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style. arXiv:1610.07629, 2017

  17. [17]

    Generating images from caption and vice versa via CLIP-guided generative latent space search

    Federico A. Galatolo, Mario G. C. A. Cimino, and Gigliola Vaglini. Generating images from caption and vice versa via clip-guided generative latent space search. arXiv:2102.01645, 2021

  18. [18]

    Learning energy-based models by diffusion recovery likelihood

    Ruiqi Gao, Yang Song, Ben Poole, Ying Nian Wu, and Diederik P. Kingma. Learning energy-based models by diffusion recovery likelihood. arXiv:2012.08125, 2020

  19. [19]

    Generative Adversarial Networks

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. arXiv:1406.2661, 2014

  20. [20]

    Cloud tpus

    Google. Cloud tpus. https://cloud.google.com/tpu/, 2018

  21. [21]

    Variational walkback: Learning a transition operator as a stochastic recurrent net

    Anirudh Goyal, Nan Rosemary Ke, Surya Ganguli, and Yoshua Bengio. Variational walkback: Learning a transition operator as a stochastic recurrent net. arXiv:1711.02282, 2017

  22. [22]

    Your classifier is secretly an energy based model and you should treat it like one

    Will Grathwohl, Kuan-Chieh Wang, Jörn-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. arXiv:1912.03263, 2019

  23. [23]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in Neural Information Processing Systems 30 (NIPS 2017) , 2017

  24. [24]

    Training products of experts by minimizing contrastive divergence

    Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8):1771–1800, 2002

  25. [25]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. arXiv:2006.11239, 2020

  26. [26]

    Adversarial score matching and improved sampling for image generation

    Alexia Jolicoeur-Martineau, Rémi Piché-Taillefer, Rémi Tachet des Combes, and Ioannis Mitliagkas. Adversarial score matching and improved sampling for image generation. arXiv:2009.05475, 2020

  27. [27]

    A style-based generator architecture for generative adversarial networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. arXiv:1812.04948, 2019

  28. [28]

    Analyzing and improving the image quality of stylegan

    Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. arXiv:1912.04958, 2019

  29. [29]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980, 2014

  30. [30]

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis. arXiv:2009.09761, 2020

  31. [31]

    CIFAR-10 (Canadian Institute for Advanced Research), 2009

    Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research), 2009. URL http://www.cs.toronto.edu/~kriz/cifar.html

  32. [32]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. arXiv:1904.06991, 2019

  33. [33]

    Refinenet: Multi-path refinement networks for high-resolution semantic segmentation

    Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. Refinenet: Multi-path refinement networks for high-resolution semantic segmentation. arXiv:1611.06612, 2016

  34. [34]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV) , December 2015

  35. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv:1711.05101, 2017

  36. [36]

    High-fidelity image generation with fewer labels

    Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, and Sylvain Gelly. High-fidelity image generation with fewer labels. arXiv:1903.02271, 2019

  37. [37]

    Knowledge distillation in iterative generative models for improved sampling speed

    Eric Luhman and Troy Luhman. Knowledge distillation in iterative generative models for improved sampling speed. arXiv:2101.02388, 2021

  38. [38]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training. arXiv:1710.03740, 2017

  39. [39]

    Conditional Generative Adversarial Nets

    Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv:1411.1784, 2014

  40. [40]

    cgans with projection discriminator

    Takeru Miyato and Masanori Koyama. cgans with projection discriminator. arXiv:1802.05637, 2018

  41. [41]

    Spectral Normalization for Generative Adversarial Networks

    Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv:1802.05957, 2018

  42. [42]

    Generating Images with Sparse Representations

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W. Battaglia. Generating images with sparse representations. arXiv:2103.03841, 2021

  43. [43]

    Improved Denoising Diffusion Probabilistic Models

    Alex Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. arXiv:2102.09672, 2021

  44. [44]

    Stylegan2

    NVIDIA. Stylegan2. https://github.com/NVlabs/stylegan2, 2019

  45. [45]

    On Buggy Resizing Libraries and Surprising Subtleties in FID Calculation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On buggy resizing libraries and surprising subtleties in fid calculation. arXiv:2104.11222, 2021

  46. [46]

    PyTorch: An Imperative Style, High-Performance Deep Learning Library

    Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. arXiv:1912.01703, 2019

  47. [47]

    Styleclip: Text-driven manipulation of stylegan imagery

    Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. Styleclip: Text-driven manipulation of stylegan imagery. arXiv:2103.17249, 2021

  48. [48]

    Film: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual reasoning with a general conditioning layer. arXiv:1709.07871, 2017

  49. [49]

    Learning Transferable Visual Models From Natural Language Supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. arXiv:2103.00020, 2021

  50. [50]

    Zero-Shot Text-to-Image Generation

    Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. arXiv:2102.12092, 2021

  51. [51]

    Generating diverse high-fidelity images with VQ-VAE-2

    Ali Razavi, Aaron van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with VQ-VAE-2. arXiv:1906.00446, 2019

  52. [52]

    ImageNet large scale visual recognition challenge

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. arXiv:1409.0575, 2014

  53. [53]

    Image super-resolution via iterative refinement

    Chitwan Saharia, Jonathan Ho, William Chan, Tim Salimans, David J. Fleet, and Mohammad Norouzi. Image super-resolution via iterative refinement. arXiv:2104.07636, 2021

  54. [54]

    Improved techniques for training gans

    Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training gans. arXiv:1606.03498, 2016

  55. [55]

    Image synthesis with a single (robust) classifier

    Shibani Santurkar, Dimitris Tsipras, Brandon Tran, Andrew Ilyas, Logan Engstrom, and Aleksander Madry. Image synthesis with a single (robust) classifier. arXiv:1906.09453, 2019

  56. [56]

    Deep Unsupervised Learning using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. arXiv:1503.03585, 2015

  57. [57]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv:2010.02502, 2020

  58. [58]

    Improved techniques for training score-based generative models

    Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. arXiv:2006.09011, 2020

  59. [59]

    Generative modeling by estimating gradients of the data distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. arXiv:1907.05600, 2020

  60. [60]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv:2011.13456, 2020

  61. [61]

    Intriguing properties of neural networks

    Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv:1312.6199, 2013

  62. [62]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. arXiv:1512.00567, 2015

  63. [63]

    NVAE: A Deep Hierarchical Variational Autoencoder

    Arash Vahdat and Jan Kautz. Nvae: A deep hierarchical variational autoencoder. arXiv:2007.03898, 2020

  64. [64]

    WaveNet: A Generative Model for Raw Audio

    Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv:1609.03499, 2016

  65. [65]

    Neural Discrete Representation Learning

    Aaron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. arXiv:1711.00937, 2017

  66. [66]

    Attention Is All You Need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv:1706.03762, 2017

  67. [67]

    Bayesian learning via stochastic gradient langevin dynamics

    Max Welling and Yee W Teh. Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11) , pages 681–688. Citeseer, 2011

  68. [68]

    Logan: Latent optimisation for generative adversarial networks

    Yan Wu, Jeff Donahue, David Balduzzi, Karen Simonyan, and Timothy Lillicrap. Logan: Latent optimisation for generative adversarial networks. arXiv:1912.00953, 2019

  69. [69]

    Group normalization

    Yuxin Wu and Kaiming He. Group normalization. arXiv:1803.08494, 2018

  70. [70]

    A theory of generative convnet

    Jianwen Xie, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. A theory of generative convnet. arXiv:1602.03264, 2016

  71. [71]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv:1506.03365, 2015

  72. [72]

    Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks

    Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv:1612.03242, 2016

  73. [73]

    Ligeng Zhu. Thop. https://github.com/Lyken17/pytorch-OpCounter, 2018