
arXiv: 2506.13763 · v2 · submitted 2025-06-16 · cs.LG · cs.AI · cs.CV · stat.ML

Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value

Pith reviewed 2026-05-19 09:02 UTC · model grok-4.3

keywords: diffusion models · optimal loss · training diagnosis · scaling laws · generative modeling · loss estimation · unified formulation · power law

The pith

The optimal loss value for diffusion models can be derived in closed form and estimated to diagnose training quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion models succeed in generation but their training loss does not reach zero and its minimum value has been unknown, making it hard to judge whether a model is well trained or simply facing a high baseline. The paper derives this optimal loss exactly under a single mathematical description that covers common diffusion variants and supplies practical estimators, one of which is stochastic and works on large data. With the optimum in hand, users can measure how close actual training has come to the limit, adjust the schedule for better results, and see a cleaner power-law relationship between model size and performance once the baseline is removed.

Core claim

Under a unified formulation of diffusion models, the optimal loss value can be derived in closed form. Practical estimators, including a stochastic variant scalable to large datasets, allow diagnosis of training quality for mainstream variants. These estimators also support a more performant training schedule, and subtracting the optimal loss from the observed loss makes power-law scaling clearer for models with 120M to 1.5B parameters.

What carries the argument

Closed-form optimal loss derived from the unified diffusion formulation; it acts as a fixed baseline that normalizes observed training loss to separate training quality from inherent task difficulty.
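
As a rough orientation, and not the paper's own derivation: for any weighted squared-error objective, the minimiser is a conditional mean, so the loss floor is an expected conditional variance and the excess over that floor is a squared distance to the optimal predictor. The notation below (regression target y, noisy input x_t, noise level t, non-negative weight w(t), network f_θ) is assumed for illustration; the paper's unified formulation may parameterise these quantities differently.

```latex
J(\theta) = \mathbb{E}_{t}\Big[ w(t)\, \mathbb{E}_{x_t, y}\, \big\| f_\theta(x_t, t) - y \big\|^2 \Big],
\qquad
f^{*}(x_t, t) = \mathbb{E}\big[\, y \mid x_t, t \,\big],

J^{*} = J(f^{*}) = \mathbb{E}_{t}\Big[ w(t)\, \mathbb{E}_{x_t}\, \operatorname{tr} \operatorname{Cov}\big( y \mid x_t, t \big) \Big],

J(\theta) - J^{*} = \mathbb{E}_{t}\Big[ w(t)\, \mathbb{E}_{x_t}\, \big\| f_\theta(x_t, t) - f^{*}(x_t, t) \big\|^2 \Big] \;\ge\; 0 .
```

The last line is why subtracting the floor isolates training quality: the remainder depends only on how far the network is from the optimal predictor, not on how noisy the target is.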

If this is right

  • Training quality of mainstream diffusion variants can be diagnosed by the gap between achieved loss and the estimated optimum.
  • A training schedule constructed using the optimal loss estimate yields better performance than standard schedules.
  • Power-law scaling with model size becomes more evident when loss is measured as the excess over the optimal value.
  • The estimators apply to the range of diffusion models covered by the unified formulation, including those with 120M to 1.5B parameters.
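
The gap diagnostic in the first bullet can be sketched as follows. This is a minimal illustration, not the paper's procedure: the helper name `excess_loss` is hypothetical, and the per-noise-level loss values are made-up placeholders rather than results from the paper.

```python
import numpy as np

def excess_loss(observed, optimal):
    """Absolute and relative gap between observed loss and the estimated optimum."""
    observed = np.asarray(observed, dtype=float)
    optimal = np.asarray(optimal, dtype=float)
    gap = observed - optimal
    rel = gap / np.maximum(optimal, 1e-12)  # guard against a zero floor
    return gap, rel

# Hypothetical per-noise-level losses (placeholders, not taken from the paper).
sigmas = np.array([0.1, 0.5, 1.0, 5.0])
observed = np.array([0.020, 0.110, 0.260, 0.610])
optimal = np.array([0.015, 0.100, 0.250, 0.600])

gap, rel = excess_loss(observed, optimal)
worst = sigmas[np.argmax(rel)]  # noise level with the largest relative gap

for s, g, r in zip(sigmas, gap, rel):
    print(f"sigma={s:>4}: gap={g:.3f}  relative gap={r:.1%}")
print("largest relative gap at sigma =", worst)
```

In this toy setup the raw loss is largest at the highest noise level, yet the relative gap is largest at the lowest one, which is the kind of distinction a known floor makes visible.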

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same subtraction of optimal loss could be tested on other generative families, such as flow-matching or score-based models, to see whether scaling relations sharpen there as well.
  • During long training runs, monitoring the gap to the estimated optimum could serve as a practical signal for early stopping or model scaling decisions.
  • If the estimator generalizes, it might help compare architectures trained under different noise schedules on equal footing.
  • The closed-form expression could guide the design of new diffusion objectives that explicitly minimize the excess loss rather than the raw loss.

Load-bearing premise

A single unified mathematical description accurately represents the forward and reverse processes of the diffusion models used in practice and permits an exact closed-form solution for the optimal loss.

What would settle it

On a small dataset whose data distribution is fully known, compute the closed-form optimal loss and train several diffusion models to convergence; if any model attains a loss below that value by more than the estimation error, the derivation is incorrect.
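
A toy version of this test can be sketched under stated assumptions: the data distribution is uniform over a small point set, corruption is x_t = x0 + σ·ε, and the network predicts x0. In that setting the optimal denoiser is the softmax-weighted posterior mean over the data points. Trained models are replaced here by hand-built rival denoisers, and all function names are hypothetical; this is not the paper's estimator.

```python
import numpy as np

def posterior_mean(x_t, data, sigma):
    """Exact E[x0 | x_t] when x0 is uniform over `data` and x_t = x0 + sigma * eps."""
    # Squared distances between each noisy point and each data point: shape (M, N).
    d2 = ((x_t[:, None, :] - data[None, :, :]) ** 2).sum(-1)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ data

def mc_loss(denoiser, data, sigma, n_samples, rng):
    """Monte Carlo estimate of E || denoiser(x_t) - x0 ||^2."""
    idx = rng.integers(0, len(data), size=n_samples)
    x0 = data[idx]
    x_t = x0 + sigma * rng.standard_normal(x0.shape)
    return float(((denoiser(x_t) - x0) ** 2).sum(-1).mean())

data = np.random.default_rng(0).standard_normal((64, 2))  # toy dataset: 64 points in 2-D
sigma = 0.5

# Re-seeding gives every denoiser the same noisy samples (common random numbers).
optimal = mc_loss(lambda x: posterior_mean(x, data, sigma),
                  data, sigma, 20000, np.random.default_rng(1))

# Rival denoisers: blends of the optimal one with the identity map (a = 1 is pure identity).
rivals = [
    mc_loss(lambda x, a=a: (1 - a) * posterior_mean(x, data, sigma) + a * x,
            data, sigma, 20000, np.random.default_rng(1))
    for a in (0.3, 0.6, 1.0)
]
print("optimal:", optimal, "rivals:", rivals)
```

The identity denoiser has expected loss σ²·d = 0.5 here, which gives an independent check on the Monte Carlo machinery; a rival that scored below `optimal` by more than sampling noise would contradict the conditional-mean argument.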

Figures

Figures reproduced from arXiv: 2506.13763 by Chang Liu, Di He, Liwei Wang, Shengjie Luo, Yixian Xu.

Figure 1: Estimation results of optimal loss value.
Figure 2: Actual stepwise training loss across noise scales by various diffusion models on CIFAR-10, ...
Figure 3: Scaling law study using optimal loss on ImageNet-64. Training curves at ...
Figure 4: Convergence of our estimator with respect to the number of subsets ...
Figure 5: Convergence of our estimator with respect to ...
Figure 6: Scaling law fitting results using the modified power law in Eq. ...
Figure 7: Scaling law study on ImageNet-64. Each row corresponds to a different noise scale. The ...
Figure 8: Scaling law study on ImageNet-512. Each row corresponds to a different noise scale. The ...

original abstract

Diffusion models have achieved remarkable success in generative modeling. Despite more stable training, the loss of diffusion models is not indicative of absolute data-fitting quality, since its optimal value is typically not zero but unknown, leading to confusion between large optimal loss and insufficient model capacity. In this work, we advocate the need to estimate the optimal loss value for diagnosing and improving diffusion models. We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript derives a closed-form expression for the optimal loss value under a unified formulation of diffusion models that covers mainstream variants. It introduces practical estimators for this optimal loss, including a scalable stochastic variant with claimed bias/variance control, and uses them to diagnose training quality, design an improved training schedule, and show that power-law scaling becomes clearer after subtracting the estimated optimal loss from observed training losses, demonstrated on models with 120M to 1.5B parameters.

Significance. If the closed-form derivation is exact and the estimators recover the optimal loss reliably, the work supplies a principled diagnostic that distinguishes inherent loss floors from insufficient model capacity or training issues. The resulting training schedule improvements and the observation that power laws are more evident post-subtraction could inform more accurate scaling studies and better diffusion training practices. The provision of a closed-form result together with practical, scalable estimators is a concrete strength.

major comments (2)
  1. [§3.2] §3.2 (stochastic estimator): the bias and variance control for the Monte-Carlo stochastic estimator is asserted via analysis but is not empirically verified on tractable cases (e.g., isotropic Gaussian data) where the true optimal loss can be computed exactly in closed form. Without such a sanity check, residual bias that scales with dataset size or noise schedule would undermine the reliability of the diagnostic tool and the reported training-schedule and scaling-law improvements.
  2. [§4] §4 (scaling experiments): the claim that power-law scaling is 'better demonstrated' after optimal-loss subtraction is supported only by visual inspection of plots; quantitative metrics (e.g., change in R² or scaling exponent with confidence intervals) comparing raw versus subtracted loss are needed to establish that the improvement is statistically meaningful rather than cosmetic.
minor comments (2)
  1. [§2] Notation for the unified forward/reverse process parameters should be introduced once in §2 and used consistently thereafter to prevent readers from having to cross-reference multiple definitions.
  2. [Figures in §4] Figure captions for the scaling plots should explicitly state the number of runs and whether error bars represent standard deviation or standard error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will revise the paper to address the concerns about empirical validation of the stochastic estimator and the need for quantitative metrics in the scaling analysis. Our point-by-point responses follow.

point-by-point responses
  1. Referee: [§3.2] §3.2 (stochastic estimator): the bias and variance control for the Monte-Carlo stochastic estimator is asserted via analysis but is not empirically verified on tractable cases (e.g., isotropic Gaussian data) where the true optimal loss can be computed exactly in closed form. Without such a sanity check, residual bias that scales with dataset size or noise schedule would undermine the reliability of the diagnostic tool and the reported training-schedule and scaling-law improvements.

    Authors: We appreciate the referee's suggestion to strengthen the validation of the stochastic estimator. While §3.2 provides an analytical derivation of the bias and variance control, we agree that an empirical sanity check on a tractable setting, such as isotropic Gaussian data where the optimal loss admits an exact closed-form expression, would offer useful corroboration and help rule out any residual bias that might depend on dataset size or the noise schedule. In the revised manuscript, we will add these experiments, comparing the stochastic estimator outputs against the known ground-truth optimal loss to empirically confirm the claimed bias and variance properties.
    revision: yes
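
A sanity check of the kind discussed in this exchange could look like the sketch below. It is not the paper's stochastic estimator: it assumes x0 ~ N(0, s²·I_d), corruption x_t = x0 + σ·ε and x0-prediction, for which the exact optimum is d·s²σ²/(s²+σ²), and it uses a simple stand-in estimator whose posterior mean is computed from a random reference subset. Function names and settings are hypothetical.

```python
import numpy as np

def gaussian_optimal_loss(d, s, sigma):
    """Exact optimal x0-prediction loss for x0 ~ N(0, s^2 I_d), x_t = x0 + sigma * eps."""
    return d * (s ** 2) * (sigma ** 2) / (s ** 2 + sigma ** 2)

def subset_estimate(d, s, sigma, subset_size, n_samples, rng):
    """Loss of a posterior mean computed from a fresh random reference subset per sample."""
    x0 = s * rng.standard_normal((n_samples, d))
    x_t = x0 + sigma * rng.standard_normal((n_samples, d))
    # Reference points drawn from the same Gaussian, standing in for a subset of a large dataset.
    ref = s * rng.standard_normal((n_samples, subset_size, d))
    d2 = ((x_t[:, None, :] - ref) ** 2).sum(-1)
    logits = -d2 / (2.0 * sigma ** 2)
    logits -= logits.max(axis=1, keepdims=True)  # stabilise the softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    denoised = (w[:, :, None] * ref).sum(axis=1)
    return float(((denoised - x0) ** 2).sum(-1).mean())

d, s, sigma = 2, 1.0, 1.0
truth = gaussian_optimal_loss(d, s, sigma)
rng = np.random.default_rng(0)
estimates = {b: subset_estimate(d, s, sigma, b, 4000, rng) for b in (8, 64, 512)}
bias = {b: est - truth for b, est in estimates.items()}
print("ground truth:", truth)
print("bias by subset size:", bias)
```

Because the subset-based denoiser is suboptimal, this stand-in overestimates the floor, and the overestimate shrinks as the subset grows. Plotting `bias` against subset size, dataset size, and noise level on such a tractable case is one way to expose the residual bias the referee is concerned about.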

  2. Referee: [§4] §4 (scaling experiments): the claim that power-law scaling is 'better demonstrated' after optimal-loss subtraction is supported only by visual inspection of plots; quantitative metrics (e.g., change in R² or scaling exponent with confidence intervals) comparing raw versus subtracted loss are needed to establish that the improvement is statistically meaningful rather than cosmetic.

    Authors: We thank the referee for this observation. The current manuscript relies on visual comparison of the plots to illustrate that power-law scaling appears clearer after subtracting the estimated optimal loss. We acknowledge that this is insufficient to rigorously establish statistical improvement. In the revision, we will augment §4 with quantitative metrics: specifically, we will report R² values for the power-law fits, the fitted scaling exponents, and associated confidence intervals, computed separately for the raw training losses and for the losses after optimal-loss subtraction. These additions will provide a statistical basis for the claim.
    revision: yes
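
The quantitative comparison described here can be sketched on synthetic numbers. Everything below is an assumption for illustration: the losses are generated noise-free from L(N) = L* + a·N^(-b) with made-up L*, a and b, the model sizes merely span the 120M to 1.5B range mentioned in the paper, and `loglog_fit` is a hypothetical helper. A confidence interval for the exponent would follow as slope ± t-quantile × standard error.

```python
import numpy as np

def loglog_fit(n_params, loss):
    """Least-squares line through (log N, log loss); returns slope, R^2, slope std. error."""
    x, y = np.log(n_params), np.log(loss)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    ss_res = float((resid ** 2).sum())
    ss_tot = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_res / ss_tot
    stderr = np.sqrt(ss_res / (len(x) - 2) / ((x - x.mean()) ** 2).sum())
    return float(slope), r2, float(stderr)

# Synthetic, noise-free losses (placeholders, not results from the paper).
sizes = np.array([120e6, 250e6, 500e6, 1.0e9, 1.5e9])
optimal, a, b = 0.30, 50.0, 0.35
loss = optimal + a * sizes ** (-b)

slope_raw, r2_raw, se_raw = loglog_fit(sizes, loss)
slope_excess, r2_excess, se_excess = loglog_fit(sizes, loss - optimal)

print(f"raw loss:    slope={slope_raw:.4f}  R^2={r2_raw:.6f}  stderr={se_raw:.2e}")
print(f"excess loss: slope={slope_excess:.4f}  R^2={r2_excess:.6f}  stderr={se_excess:.2e}")
```

On this synthetic example the excess-loss fit recovers the exponent -b, while the raw-loss fit reports a much shallower slope because the constant floor dominates the loss. With real, noisy measurements the same three numbers per fit would make the raw-versus-subtracted comparison testable rather than visual.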

Circularity Check

0 steps flagged

No significant circularity; optimal loss derived independently from model equations

full rationale

The paper derives the optimal loss in closed form directly from the unified formulation of the forward and reverse diffusion processes and the associated loss function. This mathematical derivation produces an expression for the theoretical minimum loss value without reference to any empirical training losses, fitted parameters, or observed data statistics from the experiments. The subsequent stochastic and deterministic estimators are constructed to approximate this independently derived quantity, and the diagnostic and scaling-law applications consist of subtracting the estimated optimum from measured losses. No step in the chain reduces the claimed result to its own inputs by construction, self-definition, or load-bearing self-citation; the central claim remains a first-principles consequence of the model specification.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on a unified mathematical formulation of diffusion processes that permits an exact closed-form solution for the optimal loss; no explicit free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: A single unified formulation accurately captures the forward and reverse processes of mainstream diffusion model variants.
    Invoked to derive the closed-form optimal loss that applies across variants.

pith-pipeline@v0.9.0 · 5713 in / 1312 out tokens · 38940 ms · 2026-05-19T09:02:20.357445+00:00 · methodology

