pith. machine review for the scientific record.

arxiv: 2206.00364 · v2 · submitted 2022-06-01 · 💻 cs.CV · cs.AI · cs.LG · cs.NE · stat.ML

Recognition: 1 theorem link · Lean Theorem

Elucidating the Design Space of Diffusion-Based Generative Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:42 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG · cs.NE · stat.ML
keywords diffusion models · generative models · image synthesis · FID score · sampling efficiency · design space · preconditioning · score networks

The pith

Separating concrete design choices in diffusion models produces new state-of-the-art FID scores with far fewer sampling steps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors argue that current diffusion-based generative models hide their key decisions in a convoluted mix of training, sampling, and network scaling. They respond by laying out an explicit design space that isolates those decisions into independent axes such as preconditioning, loss weighting, and the choice of sampler. Once isolated, targeted adjustments to each axis improve both sample quality and speed on CIFAR-10 and ImageNet-64. The same modular changes also raise the performance of previously trained networks without retraining them from scratch. A reader who accepts the separation of concerns can therefore treat future model improvements as combinations of a small number of reusable building blocks rather than opaque end-to-end redesigns.

Core claim

By enumerating and isolating the concrete design choices in diffusion models, the authors identify specific modifications to the sampling process, the training objective, and the preconditioning applied to score networks. These modifications together produce an FID of 1.79 on class-conditional CIFAR-10 and 1.97 on the unconditional version while requiring only 35 network evaluations per image. The same changes raise the quality of an existing pre-trained ImageNet-64 model from 2.07 to 1.55 FID and, after re-training, to 1.36 FID.

What carries the argument

An explicit design space that factors diffusion models into independent axes (preconditioning, loss weighting, sampler, and network scaling) so each axis can be altered without affecting the others.
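As a concrete illustration of the preconditioning axis, here is a minimal PyTorch-style sketch of the σ-dependent wrapper the paper places around a raw network F_θ. The scaling expressions and the value σ_data = 0.5 follow our reading of the paper's formulation; the input/output conventions of F_theta are an assumption for illustration, not a fixed interface.

    import torch

    SIGMA_DATA = 0.5  # data standard deviation assumed by the preconditioning

    def denoiser(F_theta, x, sigma):
        # Preconditioned denoiser D_theta built from a raw network F_theta.
        # x: noisy images of shape (N, C, H, W); sigma: noise level, shape (N, 1, 1, 1).
        c_skip  = SIGMA_DATA**2 / (sigma**2 + SIGMA_DATA**2)
        c_out   = sigma * SIGMA_DATA / (sigma**2 + SIGMA_DATA**2).sqrt()
        c_in    = 1.0 / (sigma**2 + SIGMA_DATA**2).sqrt()
        c_noise = 0.25 * sigma.log().reshape(-1)   # noise-level conditioning input
        return c_skip * x + c_out * F_theta(c_in * x, c_noise)

The scalings keep the network's input and its effective training target at roughly unit variance across noise levels, which is the rationale the paper gives for treating preconditioning as an axis of its own.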

If this is right

  • A single pre-trained score network can be upgraded to near-SOTA quality by swapping only the sampling procedure and its noise-level schedule, without any retraining.
  • Sampling cost drops to 35 network evaluations per image while still surpassing earlier diffusion results that used hundreds of steps (see the sampler sketch after this list).
  • Class-conditional and unconditional training both benefit from the same set of axis-level changes.
  • The design space makes it possible to recombine improvements across papers without retraining the entire model.
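Where the 35 comes from, as a sketch: with the ρ-warped noise-level schedule and a Heun-style second-order deterministic sampler as described in the paper, 18 steps cost two denoiser calls each except the final step, i.e. 2·18 − 1 = 35 evaluations. The defaults below (σ_min = 0.002, σ_max = 80, ρ = 7, 18 steps) are the paper's CIFAR-10 settings as we read them; denoise(x, sigma) stands for a preconditioned D_θ like the one sketched above.

    import torch

    def heun_sample(denoise, x_init, num_steps=18,
                    sigma_min=0.002, sigma_max=80.0, rho=7.0):
        # Deterministic 2nd-order sampler sketch; 2*num_steps - 1 denoiser calls.
        i = torch.arange(num_steps)
        t = (sigma_max**(1/rho)
             + i / (num_steps - 1) * (sigma_min**(1/rho) - sigma_max**(1/rho)))**rho
        t = torch.cat([t, torch.zeros(1)])      # final noise level is exactly 0

        x = x_init * t[0]                       # start from N(0, sigma_max^2 I)
        for n in range(num_steps):
            d = (x - denoise(x, t[n])) / t[n]   # Euler step (1st evaluation)
            x_next = x + (t[n + 1] - t[n]) * d
            if t[n + 1] > 0:                    # Heun correction (2nd evaluation)
                d_next = (x_next - denoise(x_next, t[n + 1])) / t[n + 1]
                x_next = x + (t[n + 1] - t[n]) * (d + d_next) / 2
            x = x_next
        return x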

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same factorization could be applied to diffusion models for audio or 3-D data to test whether the axes remain independent outside images.
  • If the axes truly are modular, future work could optimize each axis separately rather than searching over full model configurations.
  • The reported gains on small-resolution benchmarks suggest a practical route to faster high-resolution generation once the axes are validated at larger scales.

Load-bearing premise

The listed design axes cover the decisions that matter, and the improvements seen on CIFAR-10 and ImageNet-64 will carry over to other datasets and resolutions without further retuning.

What would settle it

Reproduce the reported training and sampling procedure on a held-out dataset such as ImageNet-256 or LSUN; the premise fails if the FID does not improve, or if matching prior results requires substantially more than 35 network evaluations.
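One way to run the FID side of that check, sketched with the torchmetrics implementation (assuming torchmetrics with its image extras is installed; the batching and sample budget are our assumptions, not the paper's protocol):

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    def fid_between(real_batches, fake_batches, device="cuda"):
        # Both iterables should yield uint8 image tensors of shape (N, 3, H, W).
        fid = FrechetInceptionDistance(feature=2048).to(device)
        for imgs in real_batches:
            fid.update(imgs.to(device), real=True)
        for imgs in fake_batches:
            fid.update(imgs.to(device), real=False)
        return fid.compute().item()

Matching the paper's evaluation protocol (number of generated samples, reference statistics) matters as much as the metric itself, since FID is sensitive to both.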

read the original abstract

We argue that the theory and practice of diffusion-based generative models are currently unnecessarily convoluted and seek to remedy the situation by presenting a design space that clearly separates the concrete design choices. This lets us identify several changes to both the sampling and training processes, as well as preconditioning of the score networks. Together, our improvements yield new state-of-the-art FID of 1.79 for CIFAR-10 in a class-conditional setting and 1.97 in an unconditional setting, with much faster sampling (35 network evaluations per image) than prior designs. To further demonstrate their modular nature, we show that our design changes dramatically improve both the efficiency and quality obtainable with pre-trained score networks from previous work, including improving the FID of a previously trained ImageNet-64 model from 2.07 to near-SOTA 1.55, and after re-training with our proposed improvements to a new SOTA of 1.36.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript argues that the theory and practice of diffusion-based generative models are unnecessarily convoluted and introduces a design space that separates concrete choices in network preconditioning, loss weighting, sampling procedures, and training objectives. Through systematic ablations, the authors identify a combination of changes that yields new state-of-the-art FID scores of 1.79 (class-conditional) and 1.97 (unconditional) on CIFAR-10 with only 35 network evaluations per image; they further show that the same modular changes improve pre-trained networks, raising an ImageNet-64 model from 2.07 to 1.55 FID and, after re-training, to 1.36 FID.

Significance. If the empirical results hold, the work supplies a clear, reusable framework that demystifies diffusion-model design and delivers both higher sample quality and substantially faster sampling. The successful adaptation of prior pre-trained networks without retraining from scratch is a notable practical strength, as is the consistent reporting of standard FID metrics across from-scratch and pre-trained settings on multiple datasets.

minor comments (3)
  1. [§3.2] The preconditioning formulation would be easier to compare with prior work if the key equations were presented side-by-side with the corresponding expressions from earlier formulations such as DDPM.
  2. [Table 2] Several ablation rows report single-run FID values; adding standard deviations across at least three independent runs would strengthen the claim that the observed gains are robust.
  3. [Figure 4] The legend for the sampler comparison is small and the curves for the proposed method overlap with a baseline; increasing line thickness or using distinct markers would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and the recommendation to accept. We appreciate the recognition of the design space framework, the empirical improvements on CIFAR-10 and ImageNet-64, and the practical value of the modular changes for both training from scratch and adapting pre-trained networks.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper enumerates concrete, independently testable design axes (preconditioning, loss weighting, sampler) and validates each via direct ablations against external benchmarks (FID on CIFAR-10 and ImageNet-64). No central quantity is defined in terms of a fitted parameter that is then re-labeled as a prediction, nor does any derivation reduce to a self-citation chain or ansatz smuggled from prior work by the same authors. The reported improvements (FID 1.79/1.97, 35 evaluations) are obtained by measuring the enumerated choices on held-out data; the design space itself is presented as an organizational tool rather than a derived result. Self-citations, if present, support peripheral background and are not load-bearing for the empirical claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work relies on standard assumptions of score-based generative modeling (existence of a well-behaved score function, Markovian forward process) but introduces no new free parameters or invented entities beyond the enumerated design choices.

axioms (1)
  • [standard math] The forward diffusion process is a fixed Markov chain that gradually adds Gaussian noise.
    Invoked in the background setup of diffusion models; standard in the field and not derived in the paper.
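In the paper's notation this axiom amounts to perturbing clean data with Gaussian noise at sampled levels σ and training the denoiser against a weighted reconstruction loss; the weighting λ(σ) and the level distribution p_train are exactly the enumerated design choices the ledger refers to. A sketch, matching the loss written out in the paper's appendix:

    x = y + n, \qquad y \sim p_{\mathrm{data}}, \quad n \sim \mathcal{N}(0, \sigma^2 \mathbf{I})

    \mathcal{L}(D_\theta) = \mathbb{E}_{\sigma \sim p_{\mathrm{train}}}\,
        \mathbb{E}_{y \sim p_{\mathrm{data}}}\,
        \mathbb{E}_{n \sim \mathcal{N}(0,\sigma^2 \mathbf{I})}
        \bigl[\, \lambda(\sigma)\, \lVert D_\theta(y + n;\sigma) - y \rVert_2^2 \,\bigr]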

pith-pipeline@v0.9.0 · 5479 in / 1139 out tokens · 35980 ms · 2026-05-14T17:42:19.232977+00:00 · methodology

discussion (0)


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    cs.RO 2023-03 accept novelty 8.0

    Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.

  2. Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    cs.LG 2022-09 unverdicted novelty 8.0

    Rectified flow learns straight-path neural ODEs for distribution transport, yielding efficient generative models and domain transfers that work well even with a single simulation step.

  3. Covariance-aware sampling for Diffusion Models

    stat.ML 2026-05 conditional novelty 7.0

    A covariance-aware extension of DDIM sampling for pixel-space diffusion models that uses Tweedie's formula and Fourier decomposition to model reverse-process covariance and improves sample quality at low NFE.

  4. Discrete Stochastic Localization for Non-autoregressive Generation

    cs.LG 2026-05 unverdicted novelty 7.0

    Discrete Stochastic Localization provides a continuous-state framework with SNR-invariant denoisers on unit-sphere embeddings, enabling one network to support multiple per-token noise paths and improving MAUVE on OpenWebText.

  5. Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs

    cs.CV 2026-05 unverdicted novelty 7.0

    PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.

  6. Tempered Guided Diffusion

    stat.ML 2026-05 unverdicted novelty 7.0

    Tempered Guided Diffusion uses annealed SMC to produce consistent particle approximations to the posterior for training-free conditional diffusion sampling, outperforming independent guided trajectories in experiments.

  7. $Z^2$-Sampling: Zero-Cost Zigzag Trajectories for Semantic Alignment in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Z²-Sampling implicitly realizes zero-cost zigzag trajectories for curvature-aware semantic alignment in diffusion models by reducing multi-step paths via operator dualities and temporal caching while synthesizing a di...

  8. Conflated Inverse Modeling to Generate Diverse and Temperature-Change Inducing Urban Vegetation Patterns

    cs.CV 2026-04 unverdicted novelty 7.0

    A diffusion generative inverse model conditioned on temperature targets produces diverse, physically plausible urban vegetation patterns that achieve specified regional temperature shifts.

  9. GVCC: Zero-Shot Video Compression via Codebook-Driven Stochastic Rectified Flow

    cs.CV 2026-03 unverdicted novelty 7.0

    GVCC achieves the lowest LPIPS on UVG at bitrates down to 0.003 bpp by encoding stochastic innovations in a marginal-preserving stochastic process derived from a pretrained rectified-flow video model, with 65% LPIPS r...

  10. Imagen Video: High Definition Video Generation with Diffusion Models

    cs.CV 2022-10 unverdicted novelty 7.0

    Imagen Video generates high-definition text-conditional videos via a cascade of base and super-resolution diffusion models, achieving high fidelity and controllability.

  11. Delta Score Matters! Spatial Adaptive Multi Guidance in Diffusion Models

    cs.CV 2026-04 unverdicted novelty 6.0

    SAMG uses spatially adaptive guidance scales derived from a geometric analysis of classifier-free guidance to resolve the detail-artifact dilemma in diffusion-based image and video generation.

  12. U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

    cs.LG 2026-04 conditional novelty 6.0

    A standard U-Net with MAE pre-training followed by short CRPS fine-tuning via Monte Carlo Dropout matches or exceeds GenCast and IFS ENS probabilistic skill at 1.5° resolution while cutting training compute and infere...

  13. Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

    cs.CV 2024-03 conditional novelty 6.0

    Biased noise sampling for rectified flows combined with a bidirectional text-image transformer architecture yields state-of-the-art high-resolution text-to-image results that scale predictably with model size.

  14. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  15. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  16. CaloArt: Large-Patch x-Prediction Diffusion Transformers for High-Granularity Calorimeter Shower Generation

    physics.ins-det 2026-05 unverdicted novelty 5.0

    CaloArt achieves top FPD, high-level, and classifier metrics on CaloChallenge datasets 2 and 3 while keeping single-GPU generation at 9-11 ms per shower by combining large-patch tokenization, x-prediction, and conditi...

  17. Consistency Regularised Gradient Flows for Inverse Problems

    stat.ML 2026-05 unverdicted novelty 5.0

    A consistency-regularized Euclidean-Wasserstein-2 gradient flow performs joint posterior sampling and prompt optimization in latent space for efficient low-NFE inverse problem solving with diffusion models.

  18. Lightning Unified Video Editing via In-Context Sparse Attention

    cs.CV 2026-05 unverdicted novelty 5.0

    ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...

  19. The Physical Limit of Neural Hypoxia Detection in the Black Sea from Satellite Observations

    physics.ao-ph 2026-04 unverdicted novelty 5.0

    Neural networks can detect 38% of summer hypoxic events shelf-wide from satellites with 47% precision, but only within the homogeneous mixed layer.

  20. The Physical Limit of Neural Hypoxia Detection in the Black Sea from Satellite Observations

    physics.ao-ph 2026-04 unverdicted novelty 5.0

    A neural network trained on model data detects 38% of summer hypoxic events shelf-wide from satellite observations with 47% precision, but only within the homogeneous surface mixing layer.

  21. Score-Based Matching with Target Guidance for Cryo-EM Denoising

    cs.CV 2026-04 unverdicted novelty 5.0

    Score-based denoising with reference-density guidance improves particle-background separability and downstream 3D reconstruction consistency on cryo-EM datasets.

  22. Rethinking the Diffusion Model from a Langevin Perspective

    cs.LG 2026-04 unverdicted novelty 5.0

    Diffusion models are reorganized under a Langevin perspective that unifies ODE and SDE formulations and shows flow matching is equivalent to denoising under maximum likelihood.

  23. Downscaling weather forecasts from Low- to High-Resolution with Diffusion Models

    physics.ao-ph 2026-03 unverdicted novelty 5.0

    A conditional diffusion model downscales global atmospheric forecasts from 100 km to 30 km resolution while improving probabilistic skill, matching power spectra, and preserving physical relationships.

  24. A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models

    cs.LG 2026-05 unverdicted novelty 4.0

    Diffusion, score-based, and flow matching models are unified as instances of learning time-dependent vector fields inducing marginal distributions governed by continuity and Fokker-Planck equations.

Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · cited by 23 Pith papers · 1 internal anchor

  1. [1]

    B. D. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982

  2. [2]

    U. M. Ascher and L. R. Petzold. Computer Methods for Ordinary Differential Equations and Differential-Algebraic Equations. Society for Industrial and Applied Mathematics, 1998

  3. [3]

    F. Bao, C. Li, J. Zhu, and B. Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In Proc. ICLR, 2022

  4. [4]

    D. Baranchuk, A. Voynov, I. Rubachev, V. Khrulkov, and A. Babenko. Label-efficient semantic segmentation with diffusion models. In Proc. ICLR, 2022

  5. [5]

    C. M. Bishop. Neural networks for pattern recognition. Oxford University Press, USA, 1995

  6. [6]

    J. Choi, J. Lee, C. Shin, S. Kim, H. Kim, and S. Yoon. Perception prioritized training of diffusion models. In Proc. CVPR, 2022

  7. [7]

    Y. Choi, Y. Uh, J. Yoo, and J.-W. Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proc. CVPR, 2020

  8. [8]

    J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proc. CVPR, 2009

  9. [9]

    P. Dhariwal and A. Q. Nichol. Diffusion models beat GANs on image synthesis. In Proc. NeurIPS, 2021

  10. [10]

    T. Dockhorn, A. Vahdat, and K. Kreis. Score-based generative modeling with critically-damped Langevin diffusion. In Proc. ICLR, 2022

  11. [11]

    J. R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of computational and applied mathematics, 6(1):19–26, 1980

  12. [12]

    J. B. J. Fourier, G. Darboux, et al. Théorie analytique de la chaleur, volume 504. Didot Paris, 1822

  13. [13]

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial networks. In Proc. NIPS, 2014

  14. [14]

    U. Grenander and M. I. Miller. Representations of knowledge in complex systems. Journal of the Royal Statistical Society: Series B (Methodological), 56(4):549–581, 1994

  15. [15]

    M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Proc. NIPS, 2017

  16. [16]

    J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In Proc. NeurIPS, 2020

  17. [17]

    J. Ho, C. Saharia, W. Chan, D. J. Fleet, M. Norouzi, and T. Salimans. Cascaded diffusion models for high fidelity image generation. Journal of Machine Learning Research, 23, 2022

  18. [18]

    J. Ho and T. Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021

  19. [19]

    J. Ho, T. Salimans, A. A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet. Video diffusion models. In Proc. ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022

  20. [20]

    C.-W. Huang, J. H. Lim, and A. C. Courville. A variational perspective on diffusion-based generative models and score matching. In Proc. NeurIPS, 2021

  21. [21]

    L. Huang, J. Qin, Y. Zhou, F. Zhu, L. Liu, and L. Shao. Normalization techniques in training DNNs: Methodology, analysis and application. CoRR, abs/2009.12836, 2020

  22. [22]

    A. Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(24):695–709, 2005

  23. [23]

    B. Jing, G. Corso, R. Berlinghieri, and T. Jaakkola. Subspace diffusion generative models. In Proc. ECCV, 2022

  24. [24]

    A. Jolicoeur-Martineau, K. Li, R. Piché-Taillefer, T. Kachman, and I. Mitliagkas. Gotta go fast when generating data with score-based models. CoRR, abs/2105.14080, 2021

  25. [25]

    T. Karras, M. Aittala, J. Hellsten, S. Laine, J. Lehtinen, and T. Aila. Training generative adversarial networks with limited data. In Proc. NeurIPS, 2020

  26. [26]

    T. Karras, M. Aittala, S. Laine, E. Härkönen, J. Hellsten, J. Lehtinen, and T. Aila. Alias-free generative adversarial networks. In Proc. NeurIPS, 2021

  27. [27]

    T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In Proc. CVPR, 2018

  28. [28]

    Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro. DiffWave: A versatile diffusion model for audio synthesis. In Proc. ICLR, 2021

  29. [29]

    A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  30. [30]

    J. Lehtinen, J. Munkberg, J. Hasselgren, S. Laine, T. Karras, M. Aittala, and T. Aila. Noise2Noise: Learning image restoration without clean data. In Proc. ICML, 2018

  31. [31]

    L. Liu, Y. Ren, Z. Lin, and Z. Zhao. Pseudo numerical methods for diffusion models on manifolds. In Proc. ICLR, 2022

  32. [32]

    C. Lu, Y. Zhou, F. Bao, J. Chen, C. Li, and J. Zhu. DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. In Proc. NeurIPS, 2022

  33. [33]

    E. Luhman and T. Luhman. Knowledge distillation in iterative generative models for improved sampling speed. CoRR, abs/2101.02388, 2021

  34. [34]

    P. Mishkin, L. Ahmad, M. Brundage, G. Krueger, and G. Sastry. DALL·E 2 preview – risks and limitations. OpenAI, 2022

  35. [35]

    E. Nachmani and S. Dovrat. Zero-shot translation using diffusion models. CoRR, abs/2111.01471, 2021

  36. [36]

    A. Nichol, P. Dhariwal, A. Ramesh, P. Shyam, P. Mishkin, B. McGrew, I. Sutskever, and M. Chen. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proc. ICML, 2022

  37. [37]

    A. Q. Nichol and P. Dhariwal. Improved denoising diffusion probabilistic models. In Proc. ICML, volume 139, pages 8162–8171, 2021

  38. [38]

    V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov. Grad-TTS: A diffusion probabilistic model for text-to-speech. In Proc. ICML, volume 139, pages 8599–8608, 2021

  39. [39]

    K. Preechakul, N. Chatthee, S. Wizadwongsa, and S. Suwajanakorn. Diffusion autoencoders: Toward a meaningful and decodable representation. In Proc. CVPR, 2022

  40. [40]

    A. Ramesh, P. Dhariwal, A. Nichol, C. Chu, and M. Chen. Hierarchical text-conditional image generation with CLIP latents. Technical report, OpenAI, 2022

  41. [41]

    A. J. Roberts. Modify the improved Euler scheme to integrate stochastic differential equations. CoRR, abs/1210.0933, 2012

  42. [42]

    R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer. High-resolution image synthesis with latent diffusion models. In Proc. CVPR, 2022

  43. [43]

    C. Saharia, W. Chan, H. Chang, C. A. Lee, J. Ho, T. Salimans, D. J. Fleet, and M. Norouzi. Palette: Image-to-image diffusion models. In Proc. SIGGRAPH, 2022

  44. [44]

    T. Salimans and J. Ho. Progressive distillation for fast sampling of diffusion models. In Proc. ICLR, 2022

  45. [45]

    A. Sauer, K. Schwarz, and A. Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In Proc. SIGGRAPH, 2022

  46. [46]

    J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In Proc. ICML, pages 2256–2265, 2015

  47. [47]

    J. Song, C. Meng, and S. Ermon. Denoising diffusion implicit models. In Proc. ICLR, 2021

  48. [48]

    Y. Song and S. Ermon. Generative modeling by estimating gradients of the data distribution. In Proc. NeurIPS, 2019

  49. [49]

    Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole. Score-based generative modeling through stochastic differential equations. In Proc. ICLR, 2021

  50. [50]

    E. Süli and D. F. Mayers. An Introduction to Numerical Analysis. Cambridge University Press, 2003

  51. [51]

    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proc. CVPR, 2016

  52. [52]

    M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Proc. NeurIPS, 2020

  53. [53]

    A. Vahdat, K. Kreis, and J. Kautz. Score-based generative modeling in latent space. In Proc. NeurIPS, 2021

  54. [54]

    P. Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011

  55. [55]

    D. Watson, W. Chan, J. Ho, and M. Norouzi. Learning fast samplers for diffusion models by differentiating through sample quality. In Proc. ICLR, 2022

  56. [56]

    D. Watson, J. Ho, M. Norouzi, and W. Chan. Learning to efficiently sample from diffusion probabilistic models. CoRR, abs/2106.03802, 2021

  57. [57]

    J. Wolleb, R. Sandkühler, F. Bieder, P. Valmaggia, and P. C. Cattin. Diffusion models for implicit image segmentation ensembles. In Medical Imaging with Deep Learning, 2022

  58. [58]

    Q. Zhang and Y. Chen. Diffusion normalizing flow. In Proc. NeurIPS, 2021

  59. [59]

    Q. Zhang and Y. Chen. Fast sampling of diffusion models with exponential integrator. CoRR, abs/2204.13902, 2022

  60.–64. [60]–[64]

    Internal anchors into the paper's own appendix rather than external works: the training schedule and loss weighting, time-step resampling for the DDIM comparison, noise-level rounding for the time-step discretization, practical considerations for the stochastic sampler, and the network and preconditioning configurations.