pith. machine review for the scientific record.

arxiv: 2605.10790 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links


Elucidating Representation Degradation Problem in Diffusion Model Training

Dazhou Li, Durude Mahee, Fan Zhu, Rui Yu, Wenbin Zhang, Xinwei He, Yeying Jin, Zhipeng Yao, Zitong Zhang


Pith reviewed 2026-05-12 04:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords diffusion models · representation degradation · training stability · neural tangent kernel · optimization · generative models · convergence · noise schedules

The pith

Representation degradation in diffusion models arises from mismatched recoverability at high noise levels and is corrected by dynamically reallocating optimization effort in a plug-and-play framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies Representation Degradation as the core inefficiency in diffusion model training: as noise increases, model outputs grow progressively distorted in structure. The analysis ties this to mismatched target recoverability, which is associated with weakening of the neural tangent kernel spectrum and effective low-rank behavior. The proposed Elucidated Representation Diffusion framework responds by shifting optimization effort toward more recoverable signals at each noise level. This adjustment stabilizes learning without any external labels or supervision. As a result, training converges more quickly and delivers stronger generation quality across multiple diffusion architectures.

Core claim

The central claim is that training instability in diffusion models stems from mismatched target recoverability, which manifests as Neural Tangent Kernel spectral weakening and effective low-rank behavior; Elucidated Representation Diffusion corrects this by dynamically reallocating optimization effort according to each sample's effective recoverability, thereby stabilizing representation learning, accelerating convergence, and improving performance across backbones without external supervision.

What carries the argument

Elucidated Representation Diffusion (ERD), a plug-and-play optimizer that reallocates training effort according to effective recoverability at each noise level.
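The review does not spell out the reallocation rule, but the mechanism it describes amounts to a per-sample loss reweighting by recoverability. A minimal sketch of that idea, using a hypothetical sigmoid-of-log-SNR proxy in place of the paper's unspecified ω_y(λ) (the proxy, the `tau` parameter, and both function names are illustrative assumptions, not the authors' method):

```python
import numpy as np

def recoverability_weights(log_snr, tau=2.0):
    """Hypothetical recoverability proxy: a sigmoid of log-SNR, so samples at
    high noise (low log-SNR) receive less optimization effort. The paper's
    actual omega_y(lambda) is not specified in this summary."""
    w = 1.0 / (1.0 + np.exp(-np.asarray(log_snr) / tau))
    return w / w.mean()  # normalize so the average loss scale is preserved

def reallocated_loss(pred, target, log_snr):
    """Per-sample squared error, reweighted by effective recoverability."""
    per_sample = ((pred - target) ** 2).mean(axis=1)
    return float((recoverability_weights(log_snr) * per_sample).mean())
```

Because the weights are normalized to mean one, the rule only redistributes effort across noise levels; it does not change the overall loss scale, which is what makes it a drop-in replacement for the plain objective.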

If this is right

  • Training reaches stable representations faster because effort concentrates on recoverable signals.
  • Generation quality improves across diffusion backbones without added supervision or architectural changes.
  • The same reallocation rule can be inserted into existing training pipelines as a drop-in module.
  • Convergence acceleration holds when the framework is applied to varied noise schedules and model sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar recoverability mismatches may appear in other score-based or flow-matching generative models that use noise schedules.
  • The approach could reduce the total compute needed for large-scale diffusion pre-training by shortening the unstable early phase.
  • Monitoring NTK spectrum or rank during training might serve as a diagnostic for when reallocation is required.
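The rank diagnostic suggested in the last bullet is cheap to compute from an NTK Gram matrix. A sketch of the Shannon-entropy effective rank that the paper's figures refer to, under the standard definition exp(H) of the normalized eigenvalue spectrum (the function name is ours; the paper's exact estimator is not given in this summary):

```python
import numpy as np

def effective_rank(gram):
    """Shannon-entropy effective rank of a PSD Gram matrix: normalize the
    eigenvalue spectrum into a distribution p, then return exp(H(p)).
    Values near 1 indicate a near-singular (collapsed) kernel."""
    eig = np.linalg.eigvalsh(gram)          # symmetric eigendecomposition
    eig = np.clip(eig, 0.0, None)           # guard tiny negative round-off
    p = eig / eig.sum()
    p = p[p > 0]                            # 0 * log 0 treated as 0
    return float(np.exp(-(p * np.log(p)).sum()))
```

An identity Gram matrix gives effective rank equal to its dimension, while a rank-one (fully collapsed) matrix gives 1, so tracking this scalar over training steps would flag the collapse regime the paper associates with high noise levels.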

Load-bearing premise

The observed instability is caused by mismatched recoverability between the model and its training targets at different noise levels.

What would settle it

Train identical diffusion backbones with and without the ERD reallocation rule, then compare the rate of structural distortion in outputs and the number of steps needed to reach target FID at high noise levels.

Figures

Figures reproduced from arXiv: 2605.10790 by Dazhou Li, Durude Mahee, Fan Zhu, Rui Yu, Wenbin Zhang, Xinwei He, Yeying Jin, Zhipeng Yao, Zitong Zhang.

Figure 1. We visualized pre-trained model predictions as forward diffusion progressively corrupts the … [image: figures/full_fig_p002_1.png]
Figure 2. Empirical MSE versus the Bayes optimal error (Bayes floor) across diffusion time. [image: figures/full_fig_p004_2.png]
Figure 3. Signal–noise decomposition across diffusion time. [image: figures/full_fig_p005_3.png]
Figure 4. NTK analysis across diffusion noise levels. [image: figures/full_fig_p006_4.png]
Figure 5. Selected 256×256 samples. We use a CFG scale of 4.0 and 50 EDM Heun steps. Under the standard continuous-time configuration, where log-SNR is sampled uniformly and schedule-dependent factors are absorbed into the base measure, the effective allocation M(λ) is controlled by the loss weight w(λ). ERD sets w*_y(λ) ∝ ω_y(λ) and normalizes it to preserve the average loss scale. For non-uniform base allocations, …
Figure 6. Comparing different loss weighting designs by predicting … [image: figures/full_fig_p008_6.png]
Figure 7. (Top rows) reveals a striking loss-domination phenomenon in the globally trained model: the overall optimization process is hijacked by regions exhibiting the largest absolute loss. Counterintuitively, the excess gap is minimal where the Bayes floor is large, but massive where the floor is inherently low (e.g., t → 1 for ϵθ). Crucially, partitioning the trajectory into independent piecewise bins (bottom r…
Figure 8. Phase-space trajectories of gradient contamination. The samples universally migrate from a healthy signal-dominated regime to a severely degraded noise-dominated regime as t increases. The diagonal dashed line marks the 50% contamination boundary. …
Figure 9. Representation collapse in global shared models. The deep hidden representations h_θ,m(x_t, t) are tracked across four increasing diffusion times. Projected onto the PCA basis fitted at t = 0.1, the representations catastrophically collapse from cleanly separable data clusters into an entangled, unstructured mass as t → 1. [image: figures/full_fig_p030_9.png]
Figure 10. Eigenvalue spectral decay. The aggressive attenuation of the top-3 eigenvalues is a universal bottleneck, empirically supporting the bounds in Theorem D.4. Effective rank collapse: to further assess the global structural integrity of the Neural Tangent Kernel spectrum, we evaluate the effective rank based on Shannon entropy. While tracking individual top eigenvalues reveals absolute magnitude decay, the …
Figure 11. Effective rank collapse. The Shannon-entropy-based effective rank plummets to near-singularity limits as t → 1. [image: figures/full_fig_p031_11.png]
Figure 12. NTK Gram matrix evolution. The structural collapse from cleanly separated data clusters to an entangled homogeneous state occurs across all parameterizations. …
Figure 13. Convergence curves on ImageNet 256×256. Intermediate FID values of DiT are approximated from the training curves [40, 20] for visualization only. We also provide our U-ViT-H/2 at 4M iterations with classifier-free guidance at different guidance scales, and the results with the guidance interval [30]. [image: figures/full_fig_p035_13.png]
Figure 15. We use classifier-free guidance with w = 4.0. Class label = “macaw” (88). [image: figures/full_fig_p036_15.png]
Figure 16. We use classifier-free guidance with w = 4.0. Class label = “sulphur-crested cockatoo” (89). [image: figures/full_fig_p036_16.png]
Figure 18. We use classifier-free guidance with w = 4.0. Class label = “husky” (250). [image: figures/full_fig_p037_18.png]
Figure 20. We use classifier-free guidance with w = 4.0. Class label = “arctic fox” (279). [image: figures/full_fig_p037_20.png]
Figure 22. We use classifier-free guidance with w = 4.0. Class label = “otter” (360). [image: figures/full_fig_p038_22.png]
Figure 25. We use classifier-free guidance with w = 4.0. Class label = “acoustic guitar” (402). [image: figures/full_fig_p038_25.png]
Figure 26. We use classifier-free guidance with w = 4.0. Class label = “balloon” (417). [image: figures/full_fig_p039_26.png]
Figure 28. We use classifier-free guidance with w = 4.0. Class label = “dog sled” (537). [image: figures/full_fig_p039_28.png]
Figure 30. We use classifier-free guidance with w = 4.0. Class label = “laptop” (620). [image: figures/full_fig_p040_30.png]
Figure 32. We use classifier-free guidance with w = 4.0. Class label = “ice cream” (928). [image: figures/full_fig_p040_32.png]
Figure 35. We use classifier-free guidance with w = 4.0. Class label = “coral reef” (973). [image: figures/full_fig_p041_35.png]
Figure 36. We use classifier-free guidance with w = 4.0. Class label = “lake shore” (975). [image: figures/full_fig_p041_36.png]
Original abstract

Diffusion models have achieved remarkable success, yet their training remains inefficient due to a severe optimization bottleneck, which we term Representation Degradation. As noise levels increase, the outputs of the trained model exhibit progressive structural distortion, which can destabilize training and impair generation quality. Our analysis suggests that this instability is driven by mismatched target recoverability, which is associated with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. To address this, we propose Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability. By stabilizing representation learning without external supervision, ERD accelerates convergence and achieves strong empirical performance across diffusion backbones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a Representation Degradation problem in diffusion model training, where model outputs exhibit progressive structural distortion as noise levels increase, destabilizing training and impairing generation quality. The authors suggest this stems from mismatched target recoverability, which they associate with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. They introduce Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability to stabilize representation learning without external supervision, accelerate convergence, and deliver strong empirical performance across diffusion backbones.

Significance. If the causal link between mismatched recoverability, NTK weakening, and degradation is rigorously demonstrated, and ERD is shown via controlled experiments to specifically counteract this mechanism rather than provide generic stabilization, the work could meaningfully advance efficient training of diffusion models central to generative AI. The plug-and-play design is a practical strength, but the current lack of supporting analysis limits the assessed impact.

major comments (2)
  1. [Abstract] The claim that instability is 'driven by' mismatched target recoverability 'associated with' NTK spectral weakening and low-rank behavior is presented without any derivations, spectral analysis, equations, or ablation studies isolating this from confounders such as gradient variance growth, noise scheduling, or batch statistics. This association is load-bearing for the ERD construction, yet remains correlational based on the provided text.
  2. [Abstract] The assertions of accelerated convergence and 'strong empirical performance across diffusion backbones' are made without quantitative results, baselines, error bars, tables, figures, or experimental details, preventing assessment of effect sizes, statistical significance, or reproducibility.
minor comments (1)
  1. The abstract relies on suggestive phrasing ('suggests', 'associated with') that should be replaced with precise statements once the full analysis is presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying the relationship between the abstract and the supporting analysis in the full paper. We will revise the abstract to better link claims to the detailed evidence provided in the body of the work.

Point-by-point responses
  1. Referee: [Abstract] The claim that instability is 'driven by' mismatched target recoverability 'associated with' NTK spectral weakening and low-rank behavior is presented without any derivations, spectral analysis, equations, or ablation studies isolating this from confounders such as gradient variance growth, noise scheduling, or batch statistics. This association is load-bearing for the ERD construction, yet remains correlational based on the provided text.

    Authors: We appreciate this observation regarding the abstract. The abstract is a concise summary; the full manuscript contains the requested derivations, NTK spectral analysis, equations, and ablation studies that isolate the recoverability mismatch from the listed confounders (detailed in Sections 3.2–3.4 and 4.1–4.2). These sections demonstrate the association through both theoretical analysis and controlled experiments. To address the concern that the abstract does not sufficiently indicate this support, we will revise the abstract to include brief references to the relevant sections and to more precisely characterize the nature of the association as supported by our analysis rather than purely correlational. revision: yes

  2. Referee: [Abstract] The assertions of accelerated convergence and 'strong empirical performance across diffusion backbones' are made without quantitative results, baselines, error bars, tables, figures, or experimental details, preventing assessment of effect sizes, statistical significance, or reproducibility.

    Authors: Thank you for noting this. The abstract summarizes the empirical outcomes, while the full manuscript reports the quantitative results, including baselines, error bars, tables, figures, effect sizes, and full experimental details with reproducibility information across multiple diffusion backbones (presented in Section 5, with additional controls in the appendix). We agree the abstract could better convey the strength of these results. We will revise the abstract to incorporate key quantitative highlights (e.g., convergence speedups and performance metrics) or explicit pointers to Section 5. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The abstract and visible text present the core claims as empirical observations ('analysis suggests', 'associated with') rather than a mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear. The ERD framework is introduced as a plug-and-play reallocation method grounded in the observed degradation pattern, without reducing the central premise to its own inputs by construction. The paper is therefore self-contained against external benchmarks, with no identifiable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. ERD is described at the level of a high-level framework without mathematical specification.

pith-pipeline@v0.9.0 · 5436 in / 1134 out tokens · 25892 ms · 2026-05-12T04:04:46.421736+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 10 internal anchors

  1. [1]

    Intriguing properties of quantization at scale.Advances in Neural Information Processing Systems, 36:34278–34294, 2023

    Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, and Sara Hooker. Intriguing properties of quantization at scale.Advances in Neural Information Processing Systems, 36:34278–34294, 2023. 8

  2. [2]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023. 1, 7, 8, 9, 33, 35

  3. [3]

    Perception prioritized training of diffusion models

    Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11472–11481, 2022. 2, 6, 8, 9

  4. [4]

    Usp: Unified self-supervised pretraining for image generation and understanding.arXiv preprint arXiv:2503.06132, 2025

    Xiangxiang Chu, Renda Li, and Yong Wang. Usp: Unified self-supervised pretraining for image generation and understanding.arXiv preprint arXiv:2503.06132, 2025. 9

  5. [5]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InIEEE Conference on Computer Vision and Pattern Recognition, 2009. 7

  6. [6]

    Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021. 7, 9, 33

  7. [7]

    Representation degeneration problem in training natural language generation models.arXiv preprint arXiv:1907.12009, 2019

    Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Representation degeneration problem in training natural language generation models.arXiv preprint arXiv:1907.12009, 2019. 1

  8. [8]

    Masked diffusion transformer is a strong image synthesizer

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023. 9

  9. [9]

    Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization

    Yinbin Han, Meisam Razaviyayn, and Renyuan Xu. Neural network-based score estimation in diffusion models: Optimization and generalization.arXiv preprint arXiv:2401.15604, 2024. 9

  10. [10]

    Improved noise schedule for diffusion training

    Tiankai Hang and Shuyang Gu. Improved noise schedule for diffusion training.arXiv preprint arXiv:2407.03297, 2024. 1, 9

  11. [11]

    Efficient diffusion training via min-snr weighting strategy

    Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7441–7451, 2023. 1, 6, 8, 9

  12. [12]

    Diffit: Diffusion vision transformers for image generation

    Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation. InEuropean Conference on Computer Vision, pages 37–55. Springer,

  13. [13]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017. 7, 33

  14. [14]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 7

  15. [15]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 7, 28

  16. [16]

    Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23 (47):1–33, 2022

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23 (47):1–33, 2022. 9

  17. [17]

    Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022. 1 10

  18. [18]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. InInternational Conference on Machine Learning, pages 13213–13232. PMLR, 2023. 9

  19. [19]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018. 1, 3, 9

  20. [20]

    No other representation component is needed: Diffusion transformers can provide representation guidance by themselves.arXiv preprint arXiv:2505.02831, 2025

    Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, and Jingdong Wang. No other representation component is needed: Diffusion transformers can provide representation guidance by themselves.arXiv preprint arXiv:2505.02831, 2025. 9, 35

  21. [21]

    Understanding dimensional collapse in contrastive self-supervised learning.arXiv preprint arXiv:2110.09348, 2021

    Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning.arXiv preprint arXiv:2110.09348, 2021. 2

  22. [22]

    Understanding dimensional collapse in contrastive self-supervised learning

    Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. InInternational Conference on Learning Representations, 2022. 9

  23. [23]

    Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022. 1, 7, 33

  24. [24]

    Analyzing and improving the training dynamics of diffusion models

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024. 9, 33

  25. [25]

    Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023

    Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023. 1, 2, 9, 28

  26. [26]

    Variational diffusion models.Advances in neural information processing systems, 34:21696–21707, 2021

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.Advances in neural information processing systems, 34:21696–21707, 2021. 2, 9, 28

  27. [27]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. 7, 8, 33

  28. [28]

    Diffwave: A versatile diffusion model for audio synthesis.arXiv preprint arXiv:2009.09761, 2020

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis.arXiv preprint arXiv:2009.09761, 2020. 1

  29. [29]

    Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32,

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32,

  30. [30]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models.arXiv preprint arXiv:2404.07724, 2024. 9, 35

  31. [31]

    Wide neural networks of any depth evolve as linear models under gradient descent

    Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019. 9

  32. [32]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers, 2025

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483,

  33. [33]

    Not all steps are equal: Efficient generation with progressive diffusion models.arXiv preprint arXiv:2312.13307, 2023

    Wenhao Li, Xiu Su, Shan You, Tao Huang, Fei Wang, Chen Qian, and Chang Xu. Not all steps are equal: Efficient generation with progressive diffusion models.arXiv preprint arXiv:2312.13307, 2023. 9

  34. [34]

    Diffusion-lm improves controllable text generation.Advances in Neural Information Processing Systems, 35:4328–4343,

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation.Advances in Neural Information Processing Systems, 35:4328–4343,

  35. [35]

    Understanding repre- sentation dynamics of diffusion models via low-dimensional modeling.arXiv preprint arXiv:2502.05743,

    Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, and Qing Qu. Understanding repre- sentation dynamics of diffusion models via low-dimensional modeling.arXiv preprint arXiv:2502.05743,

  36. [36]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 4, 28

  37. [37]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015. 7 11

  38. [38]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 7, 33

  39. [39]

    org/abs/2208.11970

    Calvin Luo. Understanding diffusion models: A unified perspective.arXiv preprint arXiv:2208.11970,

  40. [40]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024. 9, 28, 34, 35

  41. [41]

    Battaglia

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations.arXiv preprint arXiv:2103.03841, 2021. 7, 33

  42. [42]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021. 8, 9, 33

  43. [43]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 1, 7, 8, 9, 33, 34, 35

  44. [44]

    On the spectral bias of neural networks

    Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. InInternational conference on machine learning, pages 5301–5310. PMLR, 2019. 2, 9

  45. [45]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 7, 8, 9, 33

  46. [46]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 7

  47. [47]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 4, 8, 28

  48. [48] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

  49. [49] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025.

  50. [50] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.

  51. [51] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

  52. [52] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  53. [53] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

  54. [54] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.

  55. [55] Kai Wang, Yukun Zhou, Mingjia Shi, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Hanwang Zhang, and Yang You. A closer look at time steps is worthy of triple speed-up for diffusion model training. arXiv preprint arXiv:2405.17403, 2024.

  56. [56] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025.

  57. [57] Zike Wu, Pan Zhou, Kenji Kawaguchi, and Hanwang Zhang. Fast diffusion model, 2023.

  58. [58] Tianshuo Xu, Peng Mi, Ruilin Wang, and Yingcong Chen. Towards faster training of diffusion models: An inspiration of a consistency phenomenon. arXiv preprint arXiv:2404.07946, 2024.

  59. [59] Jingfeng Yao, Wang Cheng, Wenyu Liu, and Xinggang Wang. FasterDiT: Towards faster diffusion transformers training without architecture modification. arXiv preprint arXiv:2410.10356, 2024.

  60. [60] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.

  61. [61] Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, and Feng Zhao. Debias the training of diffusion models. arXiv preprint arXiv:2310.08442, 2023.

  62. [62] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

  63. [63] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

  64. [64] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.

  65. [65] Tianyi Zheng, Cong Geng, Peng-Tao Jiang, Ben Wan, Hao Zhang, Jinwei Chen, Jia Wang, and Bo Li. Non-uniform timestep sampling: Towards faster diffusion model training. In ACM Multimedia 2024, 2024.

  66. [66] Tianyi Zheng, Peng-Tao Jiang, Ben Wan, Hao Zhang, Jinwei Chen, Jia Wang, and Bo Li. Beta-tuned timestep diffusion model. In European Conference on Computer Vision, 2024.

  67. [67] Linqi Zhou, Yilun Du, and Jiajun Wu. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.

  68. [68] Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, and Chang Wen Chen. SD-DiT: Unleashing the power of self-supervised discrimination in diffusion transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8435–8445, 2024.