Recognition: 1 theorem link · Lean Theorem
Elucidating the Design Space of Diffusion-Based Generative Models
Pith reviewed 2026-05-14 17:42 UTC · model grok-4.3
The pith
Separating concrete design choices in diffusion models produces new state-of-the-art FID scores with far fewer sampling steps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By enumerating and isolating the concrete design choices in diffusion models, the authors identify specific modifications to the sampling process, the training objective, and the preconditioning applied to score networks. These modifications together produce an FID of 1.79 on class-conditional CIFAR-10 and 1.97 on the unconditional version while requiring only 35 network evaluations per image. The same changes raise the quality of an existing pre-trained ImageNet-64 model from 2.07 to 1.55 FID and, after re-training, to 1.36 FID.
What carries the argument
An explicit design space that factors diffusion models into independent axes (preconditioning, loss weighting, sampler, and network scaling) so each axis can be altered without affecting the others.
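The preconditioning axis can be made concrete with a short sketch: the trained network F is wrapped in noise-dependent scalings so the effective denoiser sees roughly unit-variance inputs and predicts roughly unit-variance targets at every noise level. The scalings below follow the expressions reported in the paper's Table 1, with sigma_data = 0.5 as in its CIFAR-10 setup; the zero network is purely a placeholder.

```python
import numpy as np

def preconditioned_denoiser(F, x, sigma, sigma_data=0.5):
    """Wrap a raw network F so the effective denoiser D has
    well-scaled inputs and outputs across all noise levels
    (scalings as listed in the paper's Table 1)."""
    c_skip = sigma_data ** 2 / (sigma ** 2 + sigma_data ** 2)
    c_out = sigma * sigma_data / np.sqrt(sigma ** 2 + sigma_data ** 2)
    c_in = 1.0 / np.sqrt(sigma ** 2 + sigma_data ** 2)
    c_noise = 0.25 * np.log(sigma)
    return c_skip * x + c_out * F(c_in * x, c_noise)

# Placeholder network, only to exercise the wrapper.
zero_net = lambda z, t: np.zeros_like(z)
```

At low noise the skip connection dominates and D(x) is close to x; at high noise the skip term vanishes and the network output carries the prediction, which is what lets this axis be altered without touching the sampler or loss.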
If this is right
- A single pre-trained score network can be upgraded to near-SOTA quality by swapping only the sampler and loss weighting.
- Sampling time drops to 35 network evaluations while still surpassing earlier diffusion results that used hundreds of steps.
- Class-conditional and unconditional training both benefit from the same set of axis-level changes.
- The design space makes it possible to recombine improvements across papers without retraining the entire model.
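The 35-evaluation figure follows directly from the deterministic second-order sampler: with 18 noise levels, every step except the final one to sigma = 0 costs two denoiser calls. A minimal sketch, using the paper's reported schedule defaults (sigma_min = 0.002, sigma_max = 80, rho = 7) and a trivial zero denoiser purely to exercise the step count:

```python
import numpy as np

def edm_time_steps(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise-level schedule: power interpolation from sigma_max
    down to sigma_min (the paper's reported defaults)."""
    i = np.arange(n)
    return (sigma_max ** (1 / rho)
            + i / (n - 1) * (sigma_min ** (1 / rho) - sigma_max ** (1 / rho))) ** rho

def heun_sampler(denoise, x, sigmas):
    """Deterministic 2nd-order (Heun) sampler: an Euler step plus
    a correction, with the correction skipped on the final step to
    sigma = 0.  With n levels this costs 2*(n-1) - 1 denoiser
    calls, so 18 levels -> 35 evaluations."""
    nfe = 0
    levels = np.append(sigmas, 0.0)
    for t, t_next in zip(levels[:-1], levels[1:]):
        d = (x - denoise(x, t)) / t              # ODE slope dx/dsigma
        nfe += 1
        x_next = x + (t_next - t) * d            # Euler step
        if t_next > 0:                           # 2nd-order correction
            d_next = (x_next - denoise(x_next, t_next)) / t_next
            nfe += 1
            x_next = x + (t_next - t) * 0.5 * (d + d_next)
        x = x_next
    return x, nfe
```

With the zero denoiser each step contracts the sample by t_next / t, so the trajectory collapses to zero at sigma = 0 while the evaluation counter confirms the 2n - 1 cost.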
Where Pith is reading between the lines
- The same factorization could be applied to diffusion models for audio or 3-D data to test whether the axes remain independent outside images.
- If the axes truly are modular, future work could optimize each axis separately rather than searching over full model configurations.
- The reported gains on small-resolution benchmarks suggest a practical route to faster high-resolution generation once the axes are validated at larger scales.
Load-bearing premise
That the listed design axes cover the important decisions, and that the improvements seen on CIFAR-10 and ImageNet-64 will carry over to other datasets and resolutions without further retuning.
What would settle it
Reproduce the reported training and sampling procedure on a held-out dataset such as ImageNet-256 or LSUN. The claim would fail if the FID does not improve, or if matching prior results requires substantially more than 35 network evaluations.
read the original abstract
We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after re-training with our proposed improvements to a new SOTA of 1.36.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that the theory and practice of diffusion-based generative models are unnecessarily convoluted and introduces a design space that separates concrete choices in network preconditioning, loss weighting, sampling procedures, and training objectives. Through systematic ablations, the authors identify a combination of changes that yields new state-of-the-art FID scores of 1.79 (class-conditional) and 1.97 (unconditional) on CIFAR-10 with only 35 network evaluations per image; they further show that the same modular changes improve pre-trained networks, raising an ImageNet-64 model from 2.07 to 1.55 FID and, after re-training, to 1.36 FID.
Significance. If the empirical results hold, the work supplies a clear, reusable framework that demystifies diffusion-model design and delivers both higher sample quality and substantially faster sampling. The successful adaptation of prior pre-trained networks without full retraining from scratch is a notable practical strength, as is the consistent reporting of standard FID metrics across from-scratch and fine-tuning regimes on multiple datasets.
minor comments (3)
- [§3.2] The preconditioning formulation would be easier to compare with prior work if the key equations were presented side by side with the corresponding expressions from EDM and DDPM.
- [Table 2] Several ablation rows report single-run FID values; adding standard deviations across at least three independent runs would strengthen the claim that the observed gains are robust.
- [Figure 4] The legend for the sampler comparison is small, and the curves for the proposed method overlap with a baseline; increasing line thickness or using distinct markers would improve readability.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and the recommendation to accept. We appreciate the recognition of the design space framework, the empirical improvements on CIFAR-10 and ImageNet-64, and the practical value of the modular changes for both training from scratch and adapting pre-trained networks.
Circularity Check
No significant circularity identified
full rationale
The paper enumerates concrete, independently testable design axes (preconditioning, loss weighting, sampler) and validates each via direct ablations against external benchmarks (FID on CIFAR-10 and ImageNet-64). No central quantity is defined in terms of a fitted parameter that is then re-labeled as a prediction, nor does any derivation reduce to a self-citation chain or ansatz smuggled from prior work by the same authors. The reported improvements (FID 1.79/1.97, 35 evaluations) are obtained by measuring the enumerated choices on held-out data; the design space itself is presented as an organizational tool rather than a derived result. Self-citations, if present, support peripheral background and are not load-bearing for the empirical claims.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] The forward diffusion process is a fixed Markov chain that gradually adds Gaussian noise.
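This axiom is easy to check numerically in the continuous-time view: adding Gaussian noise of scale sigma to data of scale sigma_data yields a marginal with standard deviation sqrt(sigma_data^2 + sigma^2). A quick sketch with illustrative values:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_data, sigma = 0.5, 2.0                    # illustrative scales
y = rng.normal(0.0, sigma_data, size=100_000)   # stand-in "data"
x = y + rng.normal(0.0, sigma, size=y.shape)    # one forward-noising step
# Marginal std is sqrt(sigma_data**2 + sigma**2) = sqrt(4.25), about 2.06
```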
Forward citations
Cited by 24 Pith papers
-
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion
Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.
-
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.
-
Covariance-aware sampling for Diffusion Models
A covariance-aware extension of DDIM sampling for pixel-space diffusion models that uses Tweedie's formula and Fourier decomposition to model reverse-process covariance and improves sample quality at low NFE.
-
Discrete Stochastic Localization for Non-autoregressive Generation
Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.
-
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
-
Tempered Guided Diffusion
Tempered Guided Diffusion uses annealed SMC to produce consistent particle approximations to the posterior for training-free conditional diffusion sampling, outperforming independent guided trajectories in experiments.
-
$Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models
Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...
-
Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns
A diffusion generative inverse model conditioned on temperature targets produces diverse, physically plausible urban vegetation patterns that achieve specified regional temperature shifts.
-
GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow
GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS r...
-
Imagen Video: High Definition Video Generation with Diffusion Models
Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.
-
Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models
SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.
-
U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster
A standard U-Net with MAE pre-training followed by short CRPS fine-tuning via Monte Carlo Dropout matches or exceeds GenCast and IFS ENS probabilistic skill at 1.5° resolution while cutting training compute and infere...
-
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis
Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.
-
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...
-
SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis
SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...
-
CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation
CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...
-
Consistency Regularised Gradient Flows for Inverse Problems
A consistency-regularized Euclidean-Wasserstein-2 gradient flow performs joint posterior sampling and prompt optimization in latent space for efficient low-NFE inverse problem solving with diffusion models.
-
Lightning Unified Video Editing via In-Context Sparse Attention
ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...
-
The Physical Limit of Neural Hypoxia Detection in the Black Sea from Satellite Observations
Neural networks can detect 38% of summer hypoxic events shelf-wide from satellites with 47% precision, but only within the homogeneous mixed layer.
-
The Physical Limit of Neural Hypoxia Detection in the Black Sea from Satellite Observations
A neural network trained on model data detects 38% of summer hypoxic events shelf-wide from satellite observations with 47% precision, but only within the homogeneous surface mixing layer.
-
Score-Based Matching with Target Guidance for Cryo-EM Denoising
Score-based denoising with reference-density guidance improves particle-background separability and downstream 3D reconstruction consistency on cryo-EM datasets.
-
Rethinking the Diffusion Model from a Langevin Perspective
Diffusion models are reorganized under a Langevin perspective that unifies ODE and SDE formulations and shows flow matching is equivalent to denoising under maximum likelihood.
-
Downscaling weather forecasts from Low- to High-Resolution with Diffusion Models
A conditional diffusion model downscales global atmospheric forecasts from 100 km to 30 km resolution while improving probabilistic skill, matching power spectra, and preserving physical relationships.
-
A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models
Diffusion, score-based, and flow matching models are unified as instances of learning time-dependent vector fields inducing marginal distributions governed by continuity and Fokker-Planck equations.
Reference graph
Works this paper leans on
-
[1]
B. D. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982
work page 1982
-
[2]
U. M. Ascher and L. R. Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. Society for Industrial and Applied Mathematics, 1998
work page 1998
-
[3]
F. Bao, C. Li, J. Zhu, and B. Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In Proc. ICLR, 2022
work page 2022
-
[4]
D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, and A. Babenko. Label-efficient semantic segmentation with diffusion models. In Proc. ICLR, 2022
work page 2022
-
[5]
C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, USA, 1995
work page 1995
-
[6]
J. Choi, J. Lee, C. Shin, S. Kim, H. Kim, and S. Yoon. Perception prioritized training of diffusion models. In Proc. CVPR, 2022
work page 2022
-
[7]
Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proc. CVPR, 2020
work page 2020
-
[8]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009
work page 2009
-
[9]
P. Dhariwal and A. Q. Nichol. Diffusion models beat GANs on image synthesis. In Proc. NeurIPS, 2021
work page 2021
-
[10]
T. Dockhorn, A. Vahdat, and K. Kreis. Score-based generative modeling with critically-damped Langevin diffusion. In Proc. ICLR, 2022
work page 2022
-
[11]
J. R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of computational and applied mathematics, 6(1):19–26, 1980
work page 1980
-
[12]
J. B. J. Fourier, G. Darboux, et al. Théorie analytique de la chaleur , volume 504. Didot Paris, 1822
-
[13]
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In Proc. NIPS, 2014
work page 2014
-
[14]
U. Grenander and M. I. Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological) , 56(4):549–581, 1994
work page 1994
- [15]
-
[16]
J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020
work page 2020
-
[17]
J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23, 2022
work page 2022
- [18]
-
[19]
J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In Proc. ICLR Workshop on Deep Generative Models for Highly Structured Data , 2022
work page 2022
- [20]
- [21]
- [22]
-
[23]
B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola. Subspace diffusion generative models. In Proc. ECCV, 2022
work page 2022
-
[24]
A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas. Gotta go fast when generating data with score-based models. CoRR, abs/2105.14080, 2021
- [25]
- [26]
- [27]
-
[28]
Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In Proc. ICLR, 2021
work page 2021
-
[29]
A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009
work page 2009
-
[30]
J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise: Learning image restoration without clean data. In Proc. ICML, 2018
work page 2018
-
[31]
L. Liu, Y. Ren, Z. Lin, and Z. Zhao. Pseudo numerical methods for diffusion models on manifolds. In Proc. ICLR, 2022
work page 2022
-
[32]
C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proc. NeurIPS, 2022
work page 2022
-
[33]
E. Luhman and T. Luhman. Knowledge distillation in iterative generative models for improved sampling speed. CoRR, abs/2101.02388, 2021
-
[34]
P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 preview – risks and limitations. OpenAI, 2022
work page 2022
-
[35]
E. Nachmani and S. Dovrat. Zero-shot translation using diffusion models. CoRR, abs/2111.01471, 2021
- [36]
-
[37]
A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In Proc. ICML, volume 139, pages 8162–8171, 2021
work page 2021
- [38]
-
[39]
K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proc. CVPR, 2022
work page 2022
- [40]
-
[41]
A. J. Roberts. Modify the improved Euler scheme to integrate stochastic differential equations. CoRR, abs/1210.0933, 2012
work page 2012
-
[42]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022
work page 2022
-
[43]
C. Saharia, W. Chan, H. Chang, C. A. Lee, J. Ho, T. Salimans, D. J. Fleet, and M. Norouzi. Palette: Image-to-image diffusion models. In Proc. SIGGRAPH, 2022
work page 2022
-
[44]
T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. In Proc. ICLR, 2022
work page 2022
- [45]
-
[46]
J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML, pages 2256–2265, 2015
work page 2015
-
[47]
J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021
work page 2021
-
[48]
Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Proc. NeurIPS, 2019
work page 2019
-
[49]
Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In Proc. ICLR, 2021
work page 2021
-
[50]
E. Süli and D. F. Mayers. An Introduction to Numerical Analysis. Cambridge University Press, 2003
work page 2003
-
[51]
C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proc. CVPR, 2016
work page 2016
- [52]
- [53]
-
[54]
P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011
work page 2011
- [55]
- [56]
- [57]
- [58]
-
[59]
Q. Zhang and Y. Chen. Fast sampling of diffusion models with exponential integrator. CoRR, abs/2204.13902, 2022
Figure 6 presents generated images for class-conditional ImageNet-64 [8] using the pre-trained ADM model by Dhariwal and Nichol [9]. The original DDIM [47] and iDDPM [37] samplers are compared to ours in bot...
-
[60]
We obtain the overall training loss by taking a weighted expectation of $\mathcal{L}(D_\theta; \sigma)$ over the noise levels:
$$\mathcal{L}(D_\theta) = \mathbb{E}_{\sigma \sim p_\mathrm{train}}\big[\lambda(\sigma)\, \mathcal{L}(D_\theta; \sigma)\big] \tag{105}$$
$$= \mathbb{E}_{\sigma \sim p_\mathrm{train}}\big[\lambda(\sigma)\, \mathbb{E}_{y \sim p_\mathrm{data}} \mathbb{E}_{n \sim \mathcal{N}(0, \sigma^2 I)} \left\|D_\theta(y + n; \sigma) - y\right\|_2^2\big] \tag{106}$$
$$= \mathbb{E}_{\sigma \sim p_\mathrm{train}} \mathbb{E}_{y \sim p_\mathrm{data}} \mathbb{E}_{n \sim \mathcal{N}(0, \sigma^2 I)}\big[\lambda(\sigma) \left\|D_\theta(y + n; \sigma) - y\right\|_2^2\big] \tag{107}$$
$$= \mathbb{E}_{\sigma, y, n}\big[\lambda(\sigma) \left\|D_\theta(y + n; \sigma) - y\right\|_2^2\big], \tag{108}$$
where the noise levels ...
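The weighted loss in Eq. (108) can be sketched as a Monte-Carlo estimate. The default lambda(sigma) below is the EDM-style choice 1/c_out(sigma)^2 with sigma_data = 0.5, included as an illustrative assumption rather than something stated in this excerpt:

```python
import numpy as np

def weighted_denoising_loss(D, y, sigmas, rng, lam=None):
    """Monte-Carlo estimate of
    L(D) = E_{sigma,y,n}[ lambda(sigma) * ||D(y + n; sigma) - y||^2 ].
    Default lambda is 1/c_out(sigma)^2 with sigma_data = 0.5
    (an assumption for illustration)."""
    sigma_data = 0.5
    if lam is None:
        lam = lambda s: (s ** 2 + sigma_data ** 2) / (s * sigma_data) ** 2
    per_level = []
    for s in sigmas:
        n = rng.normal(0.0, s, size=np.shape(y))   # n ~ N(0, s^2 I)
        per_level.append(lam(s) * np.sum((D(y + n, s) - y) ** 2))
    return float(np.mean(per_level))
```

An oracle denoiser that returns the clean signal achieves zero loss at every noise level, while any imperfect denoiser pays a weighted squared-error penalty.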
-
[61]
In the context of DDIM, we must choose how to resample $\{u_j\}$ to yield $\{t_i\}$ for $N \neq M$. Song et al. [47] employ a simple resampling scheme where $t_i = u_{k \cdot i}$ for resampling factor $k \in \mathbb{Z}^+$. This scheme, however, requires that $1000 \equiv 0 \pmod{N}$, which limits the possible choices for $N$ considerably. Nichol and Dhariwal [37], on the other hand, employ a more flexible schem...
-
[62]
In the context of our time step discretization (Eq. 5), we must ensure that $\sigma_i \in \{u_j\}$. We accomplish this by rounding each $\sigma_i$ to its nearest supported counterpart, i.e., $\sigma_i \leftarrow u_{\arg\min_j |u_j - \sigma_i|}$, and setting $\sigma_\mathrm{min} = 0.0064 \approx u_{N-1}$. This is sufficient, because Algorithm 1 only evaluates $D_\theta(\cdot; \sigma)$ with $\sigma \in \{\sigma_{i<N}\}$
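The nearest-neighbor rounding described here is a one-liner in practice; a vectorized sketch (the function name is ours, not the paper's):

```python
import numpy as np

def round_to_supported(sigma_targets, u):
    """sigma_i <- u[argmin_j |u_j - sigma_i|]: snap each target
    noise level to the nearest level the pre-trained discrete-time
    model actually supports."""
    u = np.asarray(u, dtype=float)
    targets = np.asarray(sigma_targets, dtype=float)
    idx = np.abs(u[None, :] - targets[:, None]).argmin(axis=1)
    return u[idx]
```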
-
[63]
In the context of our stochastic sampler, we must ensure that $\hat{t}_i \in \{u_j\}$. We accomplish this by replacing line 5 of Algorithm 2 with $\hat{t}_i \leftarrow u_{\arg\min_j |u_j - (t_i + \gamma_i t_i)|}$. With these changes, we are able to import the pre-trained model directly as $F_\theta(\cdot)$ and run Algorithms 1 and 2 using the definitions in Table 1. Note that the model outputs both $\epsilon_\theta(\cdot)$ and $\Sigma_\theta(\cdot)$, as d...
work page 2021
-
[64]
We saved a snapshot of the model every 2.5 million images and reported results for the snapshot that achieved the lowest FID according to our deterministic sampler with NFE = 35 or NFE = 79, depending on the resolution. In config B, we re-adjust the basic hyperparameters to enable faster training and obtain a more meaningful point of comparison. Specifically...
discussion (0)