pith. machine review for the scientific record. sign in

arxiv: 2303.01469 · v2 · submitted 2023-03-02 · 💻 cs.LG · cs.CV· stat.ML

Recognition: 2 theorem links

· Lean Theorem

Consistency Models

Ilya Sutskever, Mark Chen, Prafulla Dhariwal, Yang Song

Pith reviewed 2026-05-13 15:41 UTC · model grok-4.3

classification 💻 cs.LG cs.CVstat.ML
keywords consistency modelsdiffusion modelsgenerative modelsone-step generationimage synthesismodel distillationzero-shot editing
0
0 comments X

The pith

Consistency models generate high-quality samples by directly mapping noise to data in one step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces consistency models as a new family of generative models that produce high quality samples by directly mapping noise to data. This overcomes the slow iterative sampling process required by diffusion models. The models support fast one-step generation by design while still allowing multistep sampling to trade compute for better quality. They also enable zero-shot data editing tasks such as inpainting and super-resolution without explicit training on those tasks. The approach matters because it achieves new state-of-the-art one-step FID scores on CIFAR-10 and ImageNet 64x64 and can be trained either by distilling existing diffusion models or from scratch as standalone models.

Core claim

We propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether, outperforming existing distillation techniques and one-step non-adversarial generative models on standard benchmarks.

What carries the argument

The consistency function, which maps any noisy input at any noise level to the identical clean data output, enforcing consistency along the entire noise trajectory.

If this is right

  • One-step sampling from consistency models achieves FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64.
  • Multistep sampling can be applied to trade additional compute for higher sample quality.
  • Zero-shot editing capabilities such as inpainting, colorization, and super-resolution are available without dedicated training.
  • Standalone consistency models outperform prior one-step non-adversarial generative models on CIFAR-10, ImageNet 64x64, and LSUN 256x256.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These models could lower the barrier to real-time image synthesis in applications where multiple denoising steps are currently too slow.
  • The consistency principle might extend naturally to other iterative generative processes such as those used in audio or video synthesis.
  • Further gains could come from hybridizing consistency training with small amounts of adversarial fine-tuning.
  • Scaling experiments on higher-resolution datasets would test whether the direct noise-to-data mapping remains stable without additional regularization.

Load-bearing premise

A single learned consistency function can map noise at any level to the same clean output and generalize to zero-shot editing tasks without task-specific supervision.

What would settle it

If one-step samples produced by a trained consistency model show FID scores no better than existing one-step baselines on CIFAR-10 or ImageNet 64x64, or if zero-shot inpainting results contain visible inconsistencies not present in supervised editing methods.

read the original abstract

Diffusion models have significantly advanced the fields of image, audio, and video generation, but they depend on an iterative sampling process that causes slow generation. To overcome this limitation, we propose consistency models, a new family of models that generate high quality samples by directly mapping noise to data. They support fast one-step generation by design, while still allowing multistep sampling to trade compute for sample quality. They also support zero-shot data editing, such as image inpainting, colorization, and super-resolution, without requiring explicit training on these tasks. Consistency models can be trained either by distilling pre-trained diffusion models, or as standalone generative models altogether. Through extensive experiments, we demonstrate that they outperform existing distillation techniques for diffusion models in one- and few-step sampling, achieving the new state-of-the-art FID of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step generation. When trained in isolation, consistency models become a new family of generative models that can outperform existing one-step, non-adversarial generative models on standard benchmarks such as CIFAR-10, ImageNet 64x64 and LSUN 256x256.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces consistency models as a new family of generative models that directly map noise to data, enabling high-quality one-step generation while supporting multi-step refinement and zero-shot editing tasks such as inpainting, colorization, and super-resolution. Models can be trained either by distilling from pre-trained diffusion models or independently from scratch. Extensive experiments report new state-of-the-art FID scores of 3.55 on CIFAR-10 and 6.20 on ImageNet 64x64 for one-step sampling, outperforming prior distillation techniques, with additional results on LSUN 256x256.

Significance. If the central claims hold, this work offers a meaningful advance toward computationally efficient sampling in generative modeling by largely eliminating iterative processes while preserving sample quality. The reported FID improvements on standard benchmarks are substantial, and the zero-shot editing results without task-specific supervision add practical value. The dual training options (distillation and standalone) broaden applicability. Strengths include the scale of empirical validation across datasets and the introduction of a distinct model family that competes with existing one-step non-adversarial generators.

major comments (1)
  1. [§3.2] §3.2: The consistency loss minimizes ||f_θ(x_t, t) - f_θ(x_s, s)|| only over randomly sampled discrete pairs (t, s). This does not enforce exact invariance of f_θ(·, t) to the same x_0 along the full continuous trajectory, which is required for the one-step generation claim (t=1 to t=0) and for reliable zero-shot editing. Residual inconsistencies on unseen times could degrade performance; the manuscript should provide either theoretical bounds on trajectory consistency error or empirical measurements of invariance across dense time grids.
minor comments (2)
  1. [§4.1] §4.1 and Table 1: The one-step FID numbers are presented without reported standard deviations or number of independent runs; adding these would allow readers to assess the statistical reliability of the claimed improvements over baselines.
  2. [Figure 3] Figure 3: The caption and axis labels for the multi-step sampling curves could more explicitly indicate the compute-quality trade-off relative to the one-step baseline.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review of our manuscript. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [§3.2] §3.2: The consistency loss minimizes ||f_θ(x_t, t) - f_θ(x_s, s)|| only over randomly sampled discrete pairs (t, s). This does not enforce exact invariance of f_θ(·, t) to the same x_0 along the full continuous trajectory, which is required for the one-step generation claim (t=1 to t=0) and for reliable zero-shot editing. Residual inconsistencies on unseen times could degrade performance; the manuscript should provide either theoretical bounds on trajectory consistency error or empirical measurements of invariance across dense time grids.

    Authors: We appreciate the referee's observation on the formulation of the consistency loss. The loss is indeed defined over discrete pairs (t, s) drawn from the continuous time distribution, rather than enforcing exact invariance at every point along the trajectory. While the repeated sampling of such pairs during training, together with the self-consistency objective, is intended to promote approximate invariance in practice, we acknowledge that this does not constitute a strict guarantee for all unseen times. To address the concern directly, we will add to the revised manuscript a new set of empirical measurements: consistency error evaluated on a dense grid of time points (e.g., 100 uniformly spaced values) not encountered during training, along with plots of ||f_θ(x_t, t) - x_0|| for fixed x_0 across the trajectory. These additions will provide quantitative support for the reliability of one-step generation and zero-shot editing results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results are empirical and self-contained

full rationale

The paper defines consistency models via a loss that enforces pairwise agreement on randomly sampled (t,s) pairs drawn from diffusion trajectories, then evaluates one-step and multi-step sampling performance via FID on held-out benchmarks (CIFAR-10, ImageNet 64x64). This training objective is an approximation to the desired trajectory invariance and does not presuppose the final sample quality or editing behavior. No load-bearing step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or an ansatz smuggled from prior work by the same authors. The reported SOTA numbers rest on external benchmark comparison rather than internal redefinition of inputs. The derivation chain therefore remains independent of its measured outputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the learnability of a consistency function that produces identical outputs for all noise levels along a trajectory; this is treated as a domain assumption without independent proof in the abstract.

free parameters (1)
  • sampling step count
    Variable number of steps is used to trade compute for quality, but no specific fitted values are stated in the abstract.
axioms (1)
  • domain assumption A consistency function exists that maps any point on a diffusion trajectory to the same clean data point
    This is the defining property invoked to enable one-step generation.

pith-pipeline@v0.9.0 · 5511 in / 1197 out tokens · 108446 ms · 2026-05-13T15:41:20.707568+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • DAlembert.Inevitability bilinear_family_forced contradicts

    Training minimizes ||f_θ(x_t, t) - f_θ(x_s, s)|| for randomly drawn pairs t,s (typically via the consistency loss in §3.2). This objective only penalizes inconsistency on the sampled pairs and does not constrain the function to be exactly constant along the entire continuous trajectory.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Query Lower Bounds for Diffusion Sampling

    cs.LG 2026-04 unverdicted novelty 8.0

    Diffusion sampling from d-dimensional distributions requires at least ~sqrt(d) adaptive score queries when score estimates have polynomial accuracy.

  2. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion

    cs.RO 2023-03 accept novelty 8.0

    Diffusion Policy models robot actions as a conditional diffusion process, outperforming prior state-of-the-art methods by 46.9% on average across 12 manipulation tasks from four benchmarks.

  3. ExpoCM: Exposure-Aware One-Step Generative Single-Image HDR Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    ExpoCM enables fast one-step single-image HDR reconstruction via exposure-dependent perturbations and region-conditioned consistency trajectories derived from a probability flow ODE.

  4. How to Guide Your Flow: Few-Step Alignment via Flow Map Reward Guidance

    cs.LG 2026-04 unverdicted novelty 7.0

    FMRG is a training-free, single-trajectory guidance method for flow models derived from optimal control that achieves strong reward alignment with only 3 NFEs.

  5. Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

    cs.RO 2026-04 unverdicted novelty 7.0

    VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with margin...

  6. From Competition to Coopetition: Coopetitive Training-Free Image Editing Based on Text Guidance

    cs.CV 2026-04 unverdicted novelty 7.0

    CoEdit is a zero-shot coopetitive framework for text-guided image editing that uses dual-entropy attention manipulation and entropic latent refinement to improve editing harmony and structural preservation.

  7. Isokinetic Flow Matching for Pathwise Straightening of Generative Flows

    cs.LG 2026-04 unverdicted novelty 7.0

    Isokinetic Flow Matching adds a lightweight regularization term to flow matching that penalizes acceleration along paths via self-guided finite differences, yielding straighter trajectories and large gains in few-step...

  8. VOSR: A Vision-Only Generative Model for Image Super-Resolution

    cs.CV 2026-04 conditional novelty 7.0

    VOSR shows that competitive generative image super-resolution with faithful structures can be achieved by training a diffusion-style model from scratch on visual data alone, using a vision encoder for guidance and a r...

  9. Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    cs.CV 2023-10 unverdicted novelty 7.0

    Latent Consistency Models enable high-fidelity text-to-image generation in 2-4 steps by directly predicting solutions to the probability flow ODE in latent space, distilled from pre-trained LDMs.

  10. Gradient-Free Noise Optimization for Reward Alignment in Generative Models

    cs.LG 2026-05 unverdicted novelty 6.0

    ZeNO formulates noise optimization for reward alignment as a path-integral control problem solvable via zeroth-order reward evaluations alone, connecting to Langevin dynamics under an Ornstein-Uhlenbeck process.

  11. dFlowGRPO: Rate-Aware Policy Optimization for Discrete Flow Models

    cs.LG 2026-05 unverdicted novelty 6.0

    dFlowGRPO is a new rate-aware RL method for discrete flow models that outperforms prior GRPO approaches on image generation and matches continuous flow models while supporting broad probability paths.

  12. Tyche: One Step Flow for Efficient Probabilistic Weather Forecasting

    cs.LG 2026-05 unverdicted novelty 6.0

    Tyche achieves competitive probabilistic weather forecasting skill and calibration using a single-step flow model with JVP-regularized training and rollout finetuning.

  13. GCCM: Enhancing Generative Graph Prediction via Contrastive Consistency Model

    cs.AI 2026-05 unverdicted novelty 6.0

    GCCM prevents shortcut collapse in consistency models for graph prediction by using contrastive negative pairs and input feature perturbation, leading to better performance than deterministic baselines.

  14. MetaSR: Content-Adaptive Metadata Orchestration for Generative Super-Resolution

    cs.CV 2026-04 unverdicted novelty 6.0

    MetaSR adaptively orchestrates metadata in a DiT-based generative SR model to deliver up to 1 dB PSNR gains and 50% bitrate savings across diverse content and degradations.

  15. Pairing Regularization for Mitigating Many-to-One Collapse in GANs

    cs.LG 2026-04 unverdicted novelty 6.0

    Pairing regularization mitigates intra-mode collapse in GANs by penalizing redundant latent-to-sample mappings, improving recall under collapse-prone conditions or precision under stabilized training.

  16. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  17. Unified Video Action Model

    cs.RO 2025-02 unverdicted novelty 6.0

    UVA learns a joint video-action latent representation with decoupled diffusion decoding heads, enabling a single model to perform accurate fast policy learning, forward/inverse dynamics, and video generation without p...

  18. Teacher-Feature Drifting: One-Step Diffusion Distillation with Pretrained Diffusion Representations

    cs.CV 2026-05 unverdicted novelty 5.0

    A simplified one-step diffusion distillation uses pretrained teacher features directly for drifting loss plus a mode coverage term, achieving FID 1.58 on ImageNet-64 and 18.4 on SDXL.

  19. Lightning Unified Video Editing via In-Context Sparse Attention

    cs.CV 2026-05 unverdicted novelty 5.0

    ISA prunes low-saliency context tokens and routes queries by sharpness to either full or 0-th order Taylor sparse attention, enabling LIVEditor to cut attention latency ~60% while beating prior video editing methods o...

  20. Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation

    cs.SD 2026-05 unverdicted novelty 5.0

    A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with c...

  21. A Unified Measure-Theoretic View of Diffusion, Score-Based, and Flow Matching Generative Models

    cs.LG 2026-05 unverdicted novelty 4.0

    Diffusion, score-based, and flow matching models are unified as instances of learning time-dependent vector fields inducing marginal distributions governed by continuity and Fokker-Planck equations.

  22. OmniVLA-RL: A Vision-Language-Action Model with Spatial Understanding and Online RL

    cs.RO 2026-04 unverdicted novelty 4.0

    OmniVLA-RL uses a mix-of-transformers architecture and flow-matching reformulated as SDE with group segmented policy optimization to surpass prior VLA models on LIBERO benchmarks.

  23. Discrete Meanflow Training Curriculum

    cs.LG 2026-04 unverdicted novelty 4.0

    A DMF curriculum initialized from pretrained flow models achieves one-step FID 3.36 on CIFAR-10 after only 2000 epochs by exploiting a discretized consistency property in the Meanflow objective.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 23 Pith papers · 10 internal anchors

  1. [1]

    ediff-i: Text-to-image diffusion models with ensemble of expert denoisers

    Balaji, Y., Nah, S., Huang, X., Vahdat, A., Song, J., Kreis, K., Aittala, M., Aila, T., Laine, S., Catanzaro, B., Karras, T., and Liu, M.-Y. ediff-i: Text-to-image diffusion models with ensemble of expert denoisers. arXiv preprint arXiv:2211.01324, 2022

  2. [2]

    S., Januschowski, T., and G \"u nnemann, S

    Bilo s , M., Sommer, J., Rangapuram, S. S., Januschowski, T., and G \"u nnemann, S. Neural flows: Efficient alternative to neural odes. Advances in Neural Information Processing Systems, 34: 0 21325--21337, 2021

  3. [3]

    Large scale GAN training for high fidelity natural image synthesis

    Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1xsqj09Fm

  4. [4]

    J., Norouzi, M., and Chan, W

    Chen, N., Zhang, Y., Zen, H., Weiss, R. J., Norouzi, M., and Chan, W. Wavegrad: Estimating gradients for waveform generation. In International Conference on Learning Representations (ICLR), 2021

  5. [5]

    T., Rubanova, Y., Bettencourt, J., and Duvenaud, D

    Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. Neural O rdinary D ifferential E quations. In Advances in neural information processing systems, pp.\ 6571--6583, 2018

  6. [6]

    T., Behrmann, J., Duvenaud, D

    Chen, R. T., Behrmann, J., Duvenaud, D. K., and Jacobsen, J.-H. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems, pp.\ 9916--9926, 2019

  7. [7]

    T., Klasky, M

    Chung, H., Kim, J., Mccann, M. T., Klasky, M. L., and Ye, J. C. Diffusion posterior sampling for general noisy inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=OnD9zGAGT0k

  8. [8]

    Imagenet: A large-scale hierarchical image database

    Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.\ 248--255. Ieee, 2009

  9. [9]

    and Nichol, A

    Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in Neural Information Processing Systems (NeurIPS), 2021

  10. [10]

    NICE : Non-linear independent components estimation

    Dinh, L., Krueger, D., and Bengio, Y. NICE : Non-linear independent components estimation. International Conference in Learning Representations Workshop Track, 2015

  11. [11]

    Density estimation using real NVP

    Dinh, L., Sohl - Dickstein, J., and Bengio, S. Density estimation using real NVP . In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings . OpenReview.net, 2017. URL https://openreview.net/forum?id=HkpbnH9lx

  12. [12]

    Genie: Higher-order denoising diffusion solvers

    Dockhorn, T., Vahdat, A., and Kreis, K. Genie: Higher-order denoising diffusion solvers. arXiv preprint arXiv:2210.05475, 2022

  13. [13]

    Autogan: Neural architecture search for generative adversarial networks

    Gong, X., Chang, S., Jiang, Y., and Wang, Z. Autogan: Neural architecture search for generative adversarial networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 3224--3234, 2019

  14. [14]

    Generative adversarial nets

    Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp.\ 2672--2680, 2014

  15. [15]

    Densely connected normalizing flows

    Grci \'c , M., Grubi s i \'c , I., and S egvi \'c , S. Densely connected normalizing flows. Advances in Neural Information Processing Systems, 34: 0 23968--23982, 2021

  16. [16]

    Bootstrap your own latent-a new approach to self-supervised learning

    Grill, J.-B., Strub, F., Altch \'e , F., Tallec, C., Richemond, P., Buchatskaya, E., Doersch, C., Avila Pires, B., Guo, Z., Gheshlaghi Azar, M., et al. Bootstrap your own latent-a new approach to self-supervised learning. Advances in neural information processing systems, 33: 0 21271--21284, 2020

  17. [17]

    Momentum contrast for unsupervised visual representation learning

    He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.\ 9729--9738, 2020

  18. [18]

    GANs trained by a two time-scale update rule converge to a local Nash equilibrium

    Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp.\ 6626--6637, 2017

  19. [19]

    Denoising D iffusion P robabilistic M odels

    Ho, J., Jain, A., and Abbeel, P. Denoising D iffusion P robabilistic M odels. Advances in Neural Information Processing Systems, 33, 2020

  20. [20]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Ho, J., Chan, W., Saharia, C., Whang, J., Gao, R., Gritsenko, A., Kingma, D. P., Poole, B., Norouzi, M., Fleet, D. J., et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303, 2022 a

  21. [21]

    A., Chan, W., Norouzi, M., and Fleet, D

    Ho, J., Salimans, T., Gritsenko, A. A., Chan, W., Norouzi, M., and Fleet, D. J. Video diffusion models. In ICLR Workshop on Deep Generative Models for Highly Structured Data, 2022 b . URL https://openreview.net/forum?id=BBelR2NdDZ5

  22. [22]

    and Dayan, P

    Hyv \"a rinen, A. and Dayan, P. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research (JMLR), 6 0 (4), 2005

  23. [23]

    Transgan: Two pure transformers can make one strong gan, and that can scale up

    Jiang, Y., Chang, S., and Wang, Z. Transgan: Two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems, 34: 0 14745--14758, 2021

  24. [24]

    Progressive growing of GAN s for improved quality, stability, and variation

    Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GAN s for improved quality, stability, and variation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=Hk99zCeAb

  25. [25]

    Analyzing and improving the image quality of stylegan

    Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., and Aila, T. Analyzing and improving the image quality of stylegan. 2020

  26. [26]

    Elucidating the design space of diffusion-based generative models

    Karras, T., Aittala, M., Aila, T., and Laine, S. Elucidating the design space of diffusion-based generative models. In Proc. NeurIPS, 2022

  27. [27]

    Snips: Solving noisy inverse problems stochastically

    Kawar, B., Vaksman, G., and Elad, M. Snips: Solving noisy inverse problems stochastically. arXiv preprint arXiv:2105.14951, 2021

  28. [28]

    Denoising diffusion restoration models

    Kawar, B., Elad, M., Ermon, S., and Song, J. Denoising diffusion restoration models. In Advances in Neural Information Processing Systems, 2022

  29. [29]

    Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp.\ 10215--10224. 2018

  30. [30]

    Kingma, D. P. and Welling, M. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014

  31. [31]

    Diffwave: A versatile diffusion model for audio synthesis.arXiv preprint arXiv:2009.09761, 2020

    Kong, Z., Ping, W., Huang, J., Zhao, K., and Catanzaro, B. Diff W ave: A V ersatile D iffusion M odel for A udio S ynthesis. arXiv preprint arXiv:2009.09761, 2020

  32. [32]

    Learning multiple layers of features from tiny images

    Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009

  33. [33]

    Improved precision and recall metric for assessing generative models

    Kynk \"a \"a nniemi, T., Karras, T., Laine, S., Lehtinen, J., and Aila, T. Improved precision and recall metric for assessing generative models. Advances in Neural Information Processing Systems, 32, 2019

  34. [34]

    Vitgan: Training gans with vision transformers

    Lee, K., Chang, H., Jiang, L., Zhang, H., Tu, Z., and Liu, C. Vitgan: Training gans with vision transformers. arXiv preprint arXiv:2107.04589, 2021

  35. [35]

    Continuous control with deep reinforcement learning

    Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015

  36. [36]

    On the variance of the adaptive learning rate and beyond

    Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019

  37. [37]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Liu, X., Gong, C., and Liu, Q. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  38. [38]

    Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps

    Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., and Zhu, J. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps. arXiv preprint arXiv:2206.00927, 2022

  39. [39]

    Luhman and T

    Luhman, E. and Luhman, T. Knowledge distillation in iterative generative models for improved sampling speed. arXiv preprint arXiv:2101.02388, 2021

  40. [40]

    SDEdit: Guided Image Synthesis and Editing with Stochastic Differential Equations

    Meng, C., Song, Y., Song, J., Wu, J., Zhu, J.-Y., and Ermon, S. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073, 2021

  41. [41]

    P., Ermon, S., Ho, J., and Salimans, T

    Meng, C., Gao, R., Kingma, D. P., Ermon, S., Ho, J., and Salimans, T. On distillation of guided diffusion models. arXiv preprint arXiv:2210.03142, 2022

  42. [42]

    Playing Atari with Deep Reinforcement Learning

    Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013

  43. [43]

    A., Veness, J., Bellemare, M

    Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. nature, 518 0 (7540): 0 529--533, 2015

  44. [44]

    GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models

    Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021

  45. [45]

    Dual contradistinctive generative autoencoder

    Parmar, G., Li, D., Lee, K., and Tu, Z. Dual contradistinctive generative autoencoder. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 823--832, 2021

  46. [46]

    Grad- TTS : A diffusion probabilistic model for text-to-speech

    Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., and Kudinov, M. Grad- TTS : A diffusion probabilistic model for text-to-speech. arXiv preprint arXiv:2105.06337, 2021

  47. [47]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022

  48. [48]

    J., Mohamed, S., and Wierstra, D

    Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pp.\ 1278--1286, 2014

  49. [49]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 10684--10695, 2022

  50. [50]

    Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., Ghasemipour, S. K. S., Ayan, B. K., Mahdavi, S. S., Lopes, R. G., et al. Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487, 2022

  51. [51]

    and Ho, J

    Salimans, T. and Ho, J. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI

  52. [52]

    Improved techniques for training gans

    Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. In Advances in neural information processing systems, pp.\ 2234--2242, 2016

  53. [53]

    Stylegan-xl: Scaling stylegan to large diverse datasets

    Sauer, A., Schwarz, K., and Geiger, A. Stylegan-xl: Scaling stylegan to large diverse datasets. In ACM SIGGRAPH 2022 conference proceedings, pp.\ 1--10, 2022

  54. [54]

    Deep U nsupervised L earning U sing N onequilibrium T hermodynamics

    Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep U nsupervised L earning U sing N onequilibrium T hermodynamics. In International Conference on Machine Learning, pp.\ 2256--2265, 2015

  55. [55]

    Denoising Diffusion Implicit Models

    Song, J., Meng, C., and Ermon, S. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020

  56. [56]

    Pseudoinverse-guided diffusion models for inverse problems

    Song, J., Vahdat, A., Mardani, M., and Kautz, J. Pseudoinverse-guided diffusion models for inverse problems. In International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9_gsMA8MRKQ

  57. [57]

    and Ermon, S

    Song, Y. and Ermon, S. Generative M odeling by E stimating G radients of the D ata D istribution. In Advances in Neural Information Processing Systems, pp.\ 11918--11930, 2019

  58. [58]

    and Ermon, S

    Song, Y. and Ermon, S. Improved T echniques for T raining S core- B ased G enerative M odels. Advances in Neural Information Processing Systems, 33, 2020

  59. [59]

    Sliced score matching: A scalable approach to density and score estimation

    Song, Y., Garg, S., Shi, J., and Ermon, S. Sliced score matching: A scalable approach to density and score estimation. In Proceedings of the Thirty-Fifth Conference on Uncertainty in Artificial Intelligence, UAI 2019, Tel Aviv, Israel, July 22-25, 2019 , pp.\ 204, 2019

  60. [60]

    P., Kumar, A., Ermon, S., and Poole, B

    Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=PxTIG12RRHS

  61. [61]

    Solving inverse problems in medical imaging with score-based generative models

    Song, Y., Shen, L., Xing, L., and Ermon, S. Solving inverse problems in medical imaging with score-based generative models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=vaRCHVj0uGI

  62. [62]

    and Mayers, D

    S \"u li, E. and Mayers, D. F. An introduction to numerical analysis. Cambridge university press, 2003

  63. [63]

    Off-policy reinforcement learning for efficient and effective gan architecture search

    Tian, Y., Wang, Q., Huang, Z., Li, W., Dai, D., Yang, M., Wang, J., and Fink, O. Off-policy reinforcement learning for efficient and effective gan architecture search. In Computer Vision--ECCV 2020: 16th European Conference, Glasgow, UK, August 23--28, 2020, Proceedings, Part VII 16, pp.\ 175--192. Springer, 2020

  64. [64]

    Score-based generative modeling in latent space

    Vahdat, A., Kreis, K., and Kautz, J. Score-based generative modeling in latent space. Advances in Neural Information Processing Systems, 34: 0 11287--11302, 2021

  65. [65]

    A C onnection B etween S core M atching and D enoising A utoencoders

    Vincent, P. A C onnection B etween S core M atching and D enoising A utoencoders. Neural Computation, 23 0 (7): 0 1661--1674, 2011

  66. [66]

    P., and Gool, L

    Wu, J., Huang, Z., Acharya, D., Li, W., Thoma, J., Paudel, D. P., and Gool, L. V. Sliced wasserstein generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 3713--3722, 2019

  67. [67]

    Generative latent flow

    Xiao, Z., Yan, Q., and Amit, Y. Generative latent flow. arXiv preprint arXiv:1905.10485, 2019

  68. [68]

    Tackling the generative learning trilemma with denoising diffusion GAN s

    Xiao, Z., Kreis, K., and Vahdat, A. Tackling the generative learning trilemma with denoising diffusion GAN s. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JprM0p-q0Co

  69. [69]

    Xu, Y., Liu, Z., Tegmark, M., and Jaakkola, T. S. Poisson flow generative models. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=voV_TRqcWh

  70. [70]

    LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop

    Yu, F., Seff, A., Zhang, Y., Song, S., Funkhouser, T., and Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015

  71. [71]

    arXiv preprint arXiv:2204.13902 , year=

    Zhang, Q. and Chen, Y. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902, 2022

  72. [72]

    A., Shechtman, E., and Wang, O

    Zhang, R., Isola, P., Efros, A. A., Shechtman, E., and Wang, O. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018

  73. [73]

    Fast sampling of diffusion models via operator learning

    Zheng, H., Nie, W., Vahdat, A., Azizzadenesheli, K., and Anandkumar, A. Fast sampling of diffusion models via operator learning. arXiv preprint arXiv:2211.13449, 2022

  74. [74]

    Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders

    Zheng, H., He, P., Chen, W., and Zhou, M. Truncated diffusion probabilistic models and diffusion-based adversarial auto-encoders. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=HDxgaKk956l