Diagnosing and Improving Diffusion Models by Estimating the Optimal Loss Value
Pith reviewed 2026-05-19 09:02 UTC · model grok-4.3
The pith
The optimal loss value for diffusion models can be derived in closed form and estimated to diagnose training quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a unified formulation of diffusion models, the optimal loss value can be derived in closed form. Practical estimators, including a stochastic variant scalable to large datasets, allow diagnosis of training quality for mainstream variants. These estimators also support a more performant training schedule, and subtracting the optimal loss from the observed loss makes power-law scaling clearer for models with 120M to 1.5B parameters.
What carries the argument
Closed-form optimal loss derived from the unified diffusion formulation; it acts as a fixed baseline that normalizes observed training loss to separate training quality from inherent task difficulty.
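A short sketch of why such a closed form exists, assuming clean-data ($x_0$) prediction with squared error at a fixed time $t$, as in the Theorem 1 excerpt quoted further down this page. The symbol $f$ for a generic denoiser is an assumption introduced here for illustration. The minimizer of a squared error is the posterior mean, and expanding the square with the tower property collapses the cross term:

```latex
\begin{aligned}
J_t^{(x_0)*}
  &= \min_f \, \mathbb{E}_{p(x_0, x_t)} \| x_0 - f(x_t) \|^2
   = \mathbb{E}_{p(x_0, x_t)} \big\| x_0 - \mathbb{E}_{p(x_0 \mid x_t)}[x_0] \big\|^2 \\
  &= \mathbb{E}_{p(x_0)} \|x_0\|^2
     - 2\, \mathbb{E}_{p(x_t)} \big\| \mathbb{E}_{p(x_0 \mid x_t)}[x_0] \big\|^2
     + \mathbb{E}_{p(x_t)} \big\| \mathbb{E}_{p(x_0 \mid x_t)}[x_0] \big\|^2 \\
  &= \mathbb{E}_{p(x_0)} \|x_0\|^2
     - \mathbb{E}_{p(x_t)} \big\| \mathbb{E}_{p(x_0 \mid x_t)}[x_0] \big\|^2 .
\end{aligned}
```

The value is zero only when $x_0$ is a deterministic function of $x_t$, which is why the loss floor is generally positive.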
If this is right
- Training quality of mainstream diffusion variants can be diagnosed by the gap between achieved loss and the estimated optimum.
- A training schedule constructed using the optimal loss estimate yields better performance than standard schedules.
- Power-law scaling with model size becomes more evident when loss is measured as the excess over the optimal value.
- The estimators apply to the range of diffusion models covered by the unified formulation, including those with 120M to 1.5B parameters.
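The scaling point in the list above can be illustrated with a small sketch. It fits a straight line in log-log space to raw loss and to excess loss (loss minus the optimal value) and reports the fitted exponent and R² for each. All numbers are synthetic and hypothetical: only the 120M and 1.5B endpoints come from the text, while the intermediate sizes, `optimal_loss`, the prefactor and `true_exponent` are assumptions chosen for illustration and are not results from the paper.

```python
import numpy as np

def log_log_fit(sizes, losses):
    """Least-squares line in log-log space; returns (exponent, R^2)."""
    x, y = np.log(sizes), np.log(losses)
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    r2 = 1.0 - np.sum(resid**2) / np.sum((y - y.mean())**2)
    return float(slope), float(r2)

# Synthetic, illustrative data (NOT from the paper): a pure power law
# sitting on top of a constant loss floor.
sizes = np.array([1.2e8, 2.5e8, 5.0e8, 1.0e9, 1.5e9])  # parameter counts
optimal_loss = 0.30        # hypothetical loss floor
true_exponent = -0.25      # hypothetical scaling exponent
losses = optimal_loss + 40.0 * sizes**true_exponent

raw_slope, raw_r2 = log_log_fit(sizes, losses)
excess_slope, excess_r2 = log_log_fit(sizes, losses - optimal_loss)

print(f"raw loss:    exponent={raw_slope:.3f}  R^2={raw_r2:.5f}")
print(f"excess loss: exponent={excess_slope:.3f}  R^2={excess_r2:.5f}")
```

In this toy setting the floor bends the raw curve in log-log space, so the raw fit underestimates the magnitude of the exponent, while the excess-loss fit recovers it. Confidence intervals are left out of the sketch; a bootstrap over runs would be one way to add them.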
Where Pith is reading between the lines
- The same subtraction of optimal loss could be tested on other generative families, such as flow-matching or score-based models, to see whether scaling relations sharpen there as well.
- During long training runs, monitoring the gap to the estimated optimum could serve as a practical signal for early stopping or model scaling decisions.
- If the estimator generalizes, it might help compare architectures trained under different noise schedules on equal footing.
- The closed-form expression could guide the design of new diffusion objectives that explicitly minimize the excess loss rather than the raw loss.
Load-bearing premise
A single unified mathematical description accurately represents the forward and reverse processes of the diffusion models used in practice and permits an exact closed-form solution for the optimal loss.
What would settle it
On a small dataset with fully known data distribution, compute the closed-form optimal loss and train multiple diffusion models until convergence; if any model achieves a lower loss than the estimate, the derivation is incorrect.
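A minimal sketch of such a check, assuming the simplest fully known case: isotropic Gaussian data $x_0 \sim \mathcal{N}(0, s^2 I_d)$ and a single noise level with $x_t = \alpha x_0 + \sigma \epsilon$. Under these assumptions the posterior mean is linear in $x_t$ and the optimal clean-data prediction loss is $d\, s^2 \sigma^2 / (\alpha^2 s^2 + \sigma^2)$. The function names and parameter values are hypothetical, and the code does not reproduce the paper's estimators.

```python
import numpy as np

def gaussian_optimal_loss(d, s, alpha, sigma):
    """Closed-form optimal x0-prediction loss for x0 ~ N(0, s^2 I_d)."""
    return d * s**2 * sigma**2 / (alpha**2 * s**2 + sigma**2)

def monte_carlo_loss(d, s, alpha, sigma, n=200_000, seed=0):
    """Monte Carlo loss of the exact posterior-mean denoiser."""
    rng = np.random.default_rng(seed)
    x0 = s * rng.standard_normal((n, d))
    eps = rng.standard_normal((n, d))
    xt = alpha * x0 + sigma * eps
    # For Gaussian data, E[x0 | xt] = gain * xt.
    gain = alpha * s**2 / (alpha**2 * s**2 + sigma**2)
    residual = x0 - gain * xt
    return float(np.mean(np.sum(residual**2, axis=1)))

# Hypothetical settings for illustration.
d, s, alpha, sigma = 4, 1.5, 0.8, 0.6
exact = gaussian_optimal_loss(d, s, alpha, sigma)
estimate = monte_carlo_loss(d, s, alpha, sigma)
print(f"closed form: {exact:.4f}  Monte Carlo: {estimate:.4f}")
```

Any trained denoiser evaluated on the same data and noise level should not fall below `exact` beyond Monte Carlo noise; a systematic violation would point to an error in the derivation or the evaluation.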
Original abstract
Diffusion models have achieved remarkable success in generative modeling. Despite more stable training, the loss of diffusion models is not indicative of absolute data-fitting quality, since its optimal value is typically not zero but unknown, leading to confusion between large optimal loss and insufficient model capacity. In this work, we advocate the need to estimate the optimal loss value for diagnosing and improving diffusion models. We first derive the optimal loss in closed form under a unified formulation of diffusion models, and develop effective estimators for it, including a stochastic variant scalable to large datasets with proper control of variance and bias. With this tool, we unlock the inherent metric for diagnosing the training quality of mainstream diffusion model variants, and develop a more performant training schedule based on the optimal loss. Moreover, using models with 120M to 1.5B parameters, we find that the power law is better demonstrated after subtracting the optimal loss from the actual training loss, suggesting a more principled setting for investigating the scaling law for diffusion models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives a closed-form expression for the optimal loss value under a unified formulation of diffusion models that covers mainstream variants. It introduces practical estimators for this optimal loss, including a scalable stochastic variant with claimed bias/variance control, and uses them to diagnose training quality, design an improved training schedule, and show that power-law scaling becomes clearer after subtracting the estimated optimal loss from observed training losses, demonstrated on models with 120M to 1.5B parameters.
Significance. If the closed-form derivation is exact and the estimators recover the optimal loss reliably, the work supplies a principled diagnostic that distinguishes inherent loss floors from insufficient model capacity or training issues. The resulting training schedule improvements and the observation that power laws are more evident post-subtraction could inform more accurate scaling studies and better diffusion training practices. The provision of a closed-form result together with practical, scalable estimators is a concrete strength.
major comments (2)
- [§3.2] §3.2 (stochastic estimator): the bias and variance control for the Monte-Carlo stochastic estimator is asserted via analysis but is not empirically verified on tractable cases (e.g., isotropic Gaussian data) where the true optimal loss can be computed exactly in closed form. Without such a sanity check, residual bias that scales with dataset size or noise schedule would undermine the reliability of the diagnostic tool and the reported training-schedule and scaling-law improvements.
- [§4] §4 (scaling experiments): the claim that power-law scaling is 'better demonstrated' after optimal-loss subtraction is supported only by visual inspection of plots; quantitative metrics (e.g., change in R² or scaling exponent with confidence intervals) comparing raw versus subtracted loss are needed to establish that the improvement is statistically meaningful rather than cosmetic.
minor comments (2)
- [§2] Notation for the unified forward/reverse process parameters should be introduced once in §2 and used consistently thereafter to prevent readers from having to cross-reference multiple definitions.
- [Figures in §4] Figure captions for the scaling plots should explicitly state the number of runs and whether error bars represent standard deviation or standard error.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have carefully reviewed the major comments and will revise the paper to address the concerns about empirical validation of the stochastic estimator and the need for quantitative metrics in the scaling analysis. Our point-by-point responses follow.
-
Referee: [§3.2] §3.2 (stochastic estimator): the bias and variance control for the Monte-Carlo stochastic estimator is asserted via analysis but is not empirically verified on tractable cases (e.g., isotropic Gaussian data) where the true optimal loss can be computed exactly in closed form. Without such a sanity check, residual bias that scales with dataset size or noise schedule would undermine the reliability of the diagnostic tool and the reported training-schedule and scaling-law improvements.
Authors: We appreciate the referee's suggestion to strengthen the validation of the stochastic estimator. While §3.2 provides an analytical derivation of the bias and variance control, we agree that an empirical sanity check on a tractable setting—such as isotropic Gaussian data, where the optimal loss admits an exact closed-form expression—would offer useful corroboration and help rule out any residual bias that might depend on dataset size or the noise schedule. In the revised manuscript, we will add these experiments, comparing the stochastic estimator outputs against the known ground-truth optimal loss to empirically confirm the claimed bias and variance properties. revision: yes
-
Referee: [§4] §4 (scaling experiments): the claim that power-law scaling is 'better demonstrated' after optimal-loss subtraction is supported only by visual inspection of plots; quantitative metrics (e.g., change in R² or scaling exponent with confidence intervals) comparing raw versus subtracted loss are needed to establish that the improvement is statistically meaningful rather than cosmetic.
Authors: We thank the referee for this observation. The current manuscript relies on visual comparison of the plots to illustrate that power-law scaling appears clearer after subtracting the estimated optimal loss. We acknowledge that this is insufficient to rigorously establish statistical improvement. In the revision, we will augment §4 with quantitative metrics: specifically, we will report R² values for the power-law fits, the fitted scaling exponents, and associated confidence intervals, computed separately for the raw training losses and for the losses after optimal-loss subtraction. These additions will provide a statistical basis for the claim. revision: yes
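The Gaussian sanity check discussed in the first response can be sketched as follows. This is not the paper's stochastic estimator: it is a plain full-dataset evaluation of the Theorem 1 expression with $p(x_0)$ replaced by the empirical distribution of a finite sample, where the posterior over data points is a softmax of Gaussian log-likelihoods. The function name, dataset size and noise level are assumptions for illustration.

```python
import numpy as np

def empirical_optimal_loss(data, alpha, sigma, n_noisy=2000, seed=0):
    """Optimal x0-prediction loss when p(x0) is the empirical distribution of `data`."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    idx = rng.integers(0, n, size=n_noisy)
    xt = alpha * data[idx] + sigma * rng.standard_normal((n_noisy, d))
    # Squared distances ||xt - alpha * x0_i||^2, shape (n_noisy, n).
    sq = (np.sum(xt**2, axis=1)[:, None]
          - 2.0 * alpha * xt @ data.T
          + alpha**2 * np.sum(data**2, axis=1)[None, :])
    # Posterior weights over data points: softmax of Gaussian log-likelihoods.
    logw = -sq / (2.0 * sigma**2)
    logw -= logw.max(axis=1, keepdims=True)
    w = np.exp(logw)
    w /= w.sum(axis=1, keepdims=True)
    post_mean = w @ data  # E[x0 | xt] under the empirical distribution
    second_moment = np.mean(np.sum(data**2, axis=1))
    return float(second_moment - np.mean(np.sum(post_mean**2, axis=1)))

# Hypothetical tractable case: isotropic Gaussian samples in 2D.
rng = np.random.default_rng(1)
d, s, alpha, sigma = 2, 1.0, 1.0, 1.0
data = s * rng.standard_normal((1000, d))
empirical = empirical_optimal_loss(data, alpha, sigma)
population = d * s**2 * sigma**2 / (alpha**2 * s**2 + sigma**2)
print(f"empirical: {empirical:.3f}  Gaussian closed form: {population:.3f}")
```

In low dimension with a moderate noise level the two values should be close. At small noise levels or in high dimension the empirical value is expected to drop below the population value because the finite dataset can be memorized, which is the kind of dataset-size dependence the referee asks to be checked.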
Circularity Check
No significant circularity; optimal loss derived independently from model equations
full rationale
The paper derives the optimal loss in closed form directly from the unified formulation of the forward and reverse diffusion processes and the associated loss function. This mathematical derivation produces an expression for the theoretical minimum loss value without reference to any empirical training losses, fitted parameters, or observed data statistics from the experiments. The subsequent stochastic and deterministic estimators are constructed to approximate this independently derived quantity, and the diagnostic and scaling-law applications consist of subtracting the estimated optimum from measured losses. No step in the chain reduces the claimed result to its own inputs by construction, self-definition, or load-bearing self-citation; the central claim remains a first-principles consequence of the model specification.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A single unified formulation accurately captures the forward and reverse processes of mainstream diffusion model variants.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear · Relation between the paper passage and the cited Recognition theorem.
Theorem 1. The optimal loss value for clean-data prediction … $J_t^{(x_0)*} = \mathbb{E}_{p(x_0)}\|x_0\|^2 - \mathbb{E}_{p(x_t)}\big\|\mathbb{E}_{p(x_0 \mid x_t)}[x_0]\big\|^2$
-
IndisputableMonolith/Foundation/Atomicity.lean · sequential_preserves_conservation · unclear
unclear · Relation between the paper passage and the cited Recognition theorem.
We develop effective estimators … stochastic variant scalable to large datasets with proper control of variance and bias
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
What regularized auto-encoders learn from the data-generating distribution
Guillaume Alain and Yoshua Bengio. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014
work page 2014
-
[2]
Reverse-time diffusion equation models
Brian DO Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982
work page 1982
-
[3]
Estimating the optimal covariance with imperfect mean in diffusion probabilistic models
Fan Bao, Chongxuan Li, Jiacheng Sun, Jun Zhu, and Bo Zhang. Estimating the optimal covariance with imperfect mean in diffusion probabilistic models. In International Conference on Machine Learning, pages 1555–1584. PMLR, 2022
work page 2022
-
[4]
Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models
Fan Bao, Chongxuan Li, Jun Zhu, and Bo Zhang. Analytic-DPM: an analytic estimate of the optimal reverse variance in diffusion probabilistic models. In International Conference on Learning Representations, 2022
work page 2022
-
[5]
Ambient diffusion: Learning clean distributions from corrupted data
Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alex Dimakis, and Adam Klivans. Ambient diffusion: Learning clean distributions from corrupted data. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=wBJBLy9kBY
work page 2023
-
[6]
Diffusion schrödinger bridge with applications to score-based generative modeling
Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion schrödinger bridge with applications to score-based generative modeling. Advances in Neural Information Processing Systems, 34:17695–17709, 2021
work page 2021
-
[7]
Scaling vision transformers to 22 billion parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International Conference on Machine Learning, pages 7480–7512. PMLR, 2023
work page 2023
-
[8]
Diffusion models beat GANs on image synthesis
Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021
work page 2021
-
[9]
Probability: theory and examples, volume 49
Rick Durrett. Probability: theory and examples, volume 49. Cambridge university press, 2019
work page 2019
-
[10]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024
work page 2024
-
[11]
Ruiqi Gao, Emiel Hoogeboom, Jonathan Heek, Valentin De Bortoli, Kevin P. Murphy, and Tim Salimans. Diffusion meets flow matching: Two sides of the same coin. 2024. URL https://diffusionflow.github.io/
work page 2024
-
[12]
Masked diffusion transformer is a strong image synthesizer
Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 23164–23173, 2023
work page 2023
-
[13]
Mdtv2: Masked diffusion transformer is a strong image synthesizer
Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer. arXiv preprint arXiv:2303.14389, 2023
-
[14]
On memorization in diffusion models
Xiangming Gu, Chao Du, Tianyu Pang, Chongxuan Li, Min Lin, and Ye Wang. On memorization in diffusion models. arXiv preprint arXiv:2310.02664, 2023
-
[15]
Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B. Brown, Prafulla Dhariwal, Scott Gray, Chris Hallacy, Benjamin Mann, Alec Radford, Aditya Ramesh, Nick Ryder, Daniel M. Ziegler, John Schulman, Dario Amodei, and Sam McCandlish. Scaling laws for autoregressive generative modeling, 2020
work page 2020
-
[16]
Gans trained by a two time-scale update rule converge to a local nash equilibrium
Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017
work page 2017
-
[17]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, volume 33, pages 6840–6851, 2020.
work page 2020
-
[18]
An empirical analysis of compute-optimal large language model training
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack William Rae, and Laur...
work page 2022
-
[19]
simple diffusion: End-to-end diffusion for high resolution images
Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. In International Conference on Machine Learning, pages 13213–13232. PMLR, 2023
work page 2023
-
[20]
Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion
Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion. arXiv preprint arXiv:2410.19324, 2024
-
[21]
Scalable adaptive computation for iterative generation
Allan Jabri, David Fleet, and Ting Chen. Scalable adaptive computation for iterative generation. arXiv preprint arXiv:2212.11972, 2022
-
[22]
Scaling laws for neural language models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020
work page 2020
-
[23]
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras. A style-based generator architecture for generative adversarial networks. arXiv preprint arXiv:1812.04948, 2019
work page 2019
-
[24]
Elucidating the design space of diffusion-based generative models
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. Advances in neural information processing systems, 35: 26565–26577, 2022
work page 2022
-
[25]
Analyzing and improving the training dynamics of diffusion models
Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024
work page 2024
-
[26]
Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models. Advances in neural information processing systems, 34:21696–21707, 2021
work page 2021
-
[27]
Understanding diffusion objectives as the ELBO with simple data augmentation
Diederik P Kingma and Ruiqi Gao. Understanding diffusion objectives as the ELBO with simple data augmentation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=NnMEadcdyD
work page 2023
-
[28]
Learning multiple layers of features from tiny images
Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009
work page 2009
-
[29]
Imagenet classification with deep convolutional neural networks
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012
work page 2012
-
[30]
Monte Carlo methods
Dirk P Kroese and Reuven Y Rubinstein. Monte Carlo methods. Wiley Interdisciplinary Reviews: Computational Statistics, 4(1):48–58, 2012
work page 2012
-
[31]
On the scalability of diffusion-based text-to-image generation
Hao Li, Yang Zou, Ying Wang, Orchid Majumder, Yusheng Xie, R Manmatha, Ashwin Swaminathan, Zhuowen Tu, Stefano Ermon, and Stefano Soatto. On the scalability of diffusion-based text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9400–9409, 2024
work page 2024
-
[32]
Scaling laws for diffusion transformers
Zhengyang Liang, Hao He, Ceyuan Yang, and Bo Dai. Scaling laws for diffusion transformers. arXiv preprint arXiv:2410.08184, 2024
-
[33]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=PqvMRDCJT9t.
work page 2023
-
[34]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=XVjTT1nw5z
work page 2023
-
[35]
Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models
Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024
work page 2024
-
[36]
Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers
Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024
work page 2024
-
[37]
Kangfu Mei, Zhengzhong Tu, Mauricio Delbracio, Hossein Talebi, Vishal M. Patel, and Peyman Milanfar. Bigger is not always better: Scaling properties of latent diffusion models. Transactions on Machine Learning Research, 2025. ISSN 2835-8856. URL https://openreview.net/forum?id=0u7pWfjri5
work page 2025
-
[38]
Improved denoising diffusion probabilistic models
Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International Conference on Machine Learning, pages 8162–8171. PMLR, 2021
work page 2021
-
[39]
Nearest neighbour score estimators for diffusion generative models
Matthew Niedoba, Dylan Green, Saeid Naderiparizi, Vasileios Lioutas, Jonathan Wilder Lavington, Xiaoxuan Liang, Yunpeng Liu, Ke Zhang, Setareh Dabiri, Adam Ścibior, et al. Nearest neighbour score estimators for diffusion generative models. arXiv preprint arXiv:2402.08018, 2024
-
[40]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023
work page 2023
-
[41]
Monte Carlo statistical methods, volume 2
Christian P Robert and George Casella. Monte Carlo statistical methods, volume 2. Springer, 1999
work page 1999
-
[42]
Progressive distillation for fast sampling of diffusion models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=TIdIXIpzhoI
work page 2022
-
[43]
A first course in monte carlo methods
Daniel Sanz-Alonso and Omar Al-Ghattas. A first course in monte carlo methods. arXiv preprint arXiv:2405.16359, 2024
-
[44]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015
work page 2015
-
[45]
Denoising diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021
work page 2021
-
[46]
Generative modeling by estimating gradients of the data distribution
Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, volume 32, 2019
work page 2019
-
[47]
Score-based generative modeling through stochastic differential equations
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021
work page 2021
-
[48]
A connection between score matching and denoising autoencoders
Pascal Vincent. A connection between score matching and denoising autoencoders. Neural computation, 23(7):1661–1674, 2011
work page 2011
-
[49]
Extracting and composing robust features with denoising autoencoders
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th international conference on Machine learning, pages 1096–1103, 2008
work page 2008
-
[50]
Stable target field for reduced variance score estimation in diffusion models
Yilun Xu, Shangyuan Tong, and Tommi Jaakkola. Stable target field for reduced variance score estimation in diffusion models. arXiv preprint arXiv:2302.00670, 2023
-
[51]
Fasterdit: Towards faster diffusion transformers training without architecture modification
Jingfeng Yao, Cheng Wang, Wenyu Liu, and Xinggang Wang. Fasterdit: Towards faster diffusion transformers training without architecture modification. Advances in Neural Information Processing Systems, 37:56166–56189, 2024.
work page 2024
-
[52]
Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
work page 2025
-
[53]
Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think
Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024
work page 2024
-
[54]
Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
Appendix
The supplementary material is organized as follows. In Appx. A, we provide a detailed derivation of the optimal solution for diffusion models in various formulations. In Appx. B, we gi...
-
[55]
NCSN (VE), ϵ prediction target
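shows that when $\sigma_t = \sin(\frac{\pi}{2} t)$, the EDM loss function is equivalent to v-pred of VP process. NCSN (VE), ϵ prediction target. The loss function in NCSN (VE) formulation [47] is given by $J^{(\epsilon)}_{\mathrm{NCSN}} = \mathbb{E}_{p_{\mathrm{NCSN}}(\sigma)} \mathbb{E}_{p(x_0), p(\epsilon)} \big\| \epsilon_\theta\big(x_0 + \sigma\epsilon, \ln(\tfrac{\sigma}{2})\big) + \epsilon \big\|^2$, where $p_{\mathrm{NCSN}}(\sigma) = \exp_{\#}\, \mathcal{U}(\ln\sigma_{\min}, \ln\sigma_{\max})$, i.e. $\ln\sigma \sim \mathcal{U}(\ln\sigma_{\min}, \ln\sigma_{\max})$. Then we can convert it to...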
-
[56]
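shows that $m = 0$, $s = 1$ consistently achieves good performance. Let $p_{\mathrm{SD3}}(t) = p_{\mathrm{ln}}(t; 0, 1)$. Then the SD3 loss function is given by $J^{(v)}_{\mathrm{SD3}}(\theta) = \mathbb{E}_{p_{\mathrm{SD3}}(t)} \mathbb{E}_{p(x_0), p(\epsilon)} \| v_\theta(\alpha_t x_0 + \sigma_t \epsilon, t) - (\epsilon - x_0) \|^2$. It's obvious that the SD3 objective has the same precondition with FM, i.e. $c^{\mathrm{SD3}}_{\mathrm{skip}}(\hat\sigma) = \frac{1}{1 + \hat\sigma}$, $c^{\mathrm{SD3}}_{\mathrm{out}}(\hat\sigma) = -\frac{\hat\sigma}{1 + \hat\sigma}$, $c^{\mathrm{SD3}}_{\mathrm{in}}(\hat\sigma) =$ 1/(1 +...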
-
[57]
$\mathbb{E}\|\hat I_{\mathrm{SNIS}} - I\|^2 \le \frac{4}{N} \frac{\mathbb{E}_{q(x)}[\hat w(x)^2]}{(\mathbb{E}_{q(x)}[\hat w(x)])^2}$
-
[58]
So we can conclude that the SNIS estimator is asymptotically unbiased
$\|\mathbb{E}[\hat I_{\mathrm{SNIS}} - I]\| \le \frac{2}{N} \frac{\mathbb{E}_{q(x)}[\hat w(x)^2]}{(\mathbb{E}_{q(x)}[\hat w(x)])^2}$. So we can conclude that the SNIS estimator is asymptotically unbiased. For completeness, we give the proof of the proposition. The proof is modified from Sanz-Alonso and Al-Ghattas [43]. Proof. To simplify our notation, let $\hat J_N = \sum_{i=1}^{N} f(x_i)\, \hat w(x_i)$, $\hat P_N = \sum_{i=1}^{N} \hat w(x_i)$, $x_i \overset{\text{i.i.d.}}{\sim} q(x)$. Then ˆISN...
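The self-normalized importance sampling (SNIS) bounds excerpted above can be illustrated with a generic sketch. It is not the paper's estimator for the optimal loss: the target, proposal and test function below are hypothetical choices, picked so that the true value is known ($I = 0.5$) and the test function is bounded by 1, which is the usual condition for bounds of this form. The weights only need to be known up to a constant because the estimator divides by their sum.

```python
import numpy as np

def snis(f, log_w_hat, x):
    """Self-normalized importance sampling estimate and weight ratio."""
    log_w = log_w_hat(x)
    w = np.exp(log_w - np.max(log_w))  # rescaling cancels after normalization
    estimate = np.sum(w * f(x)) / np.sum(w)
    # Empirical version of E_q[w^2] / (E_q[w])^2, which is scale invariant.
    ratio = np.mean(w**2) / np.mean(w)**2
    return float(estimate), float(ratio)

# Hypothetical example: target p = N(1, 1), proposal q = N(0, 2^2).
rng = np.random.default_rng(0)
n = 50_000
x = 2.0 * rng.standard_normal(n)                          # x_i ~ q

f = lambda x: (x > 1.0).astype(float)                      # I = P_p(x > 1) = 0.5
log_w_hat = lambda x: -0.5 * (x - 1.0)**2 + x**2 / 8.0     # log p~(x) - log q(x) + const

estimate, ratio = snis(f, log_w_hat, x)
mse_bound = 4.0 / n * ratio   # plug-in version of the quoted MSE bound
print(f"SNIS estimate: {estimate:.4f}  weight ratio: {ratio:.3f}  MSE bound: {mse_bound:.2e}")
```

The weight ratio is at least 1 and grows as the proposal moves away from the target, so the bound makes the cost of a poor proposal explicit.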
-
[59]
Checkpoints are saved every 2.5 million images, and we report results based on the checkpoint with the lowest FID. We adopt the DDPM++ network architecture used in EDM, with our primary modifications being the incorporation of our loss weighting scheme and adaptive noise distribution. All models are trained on 8 NVIDIA A100 GPUs. For sampling, we employ t...