pith. machine review for the scientific record.

arxiv: 2605.10790 · v1 · submitted 2026-05-11 · 💻 cs.LG

Recognition: 2 theorem links


Elucidating Representation Degradation Problem in Diffusion Model Training

Dazhou Li, Durude Mahee, Fan Zhu, Rui Yu, Wenbin Zhang, Xinwei He, Yeying Jin, Zhipeng Yao, Zitong Zhang


Pith reviewed 2026-05-12 04:04 UTC · model grok-4.3

classification 💻 cs.LG
keywords diffusion models · representation degradation · training stability · neural tangent kernel · optimization · generative models · convergence · noise schedules

The pith

Representation degradation in diffusion models arises from mismatched recoverability at high noise levels and is corrected by dynamically reallocating optimization effort in a plug-and-play framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies Representation Degradation as the core inefficiency in diffusion model training: as noise increases, model outputs grow progressively distorted in structure. The analysis ties this to mismatched target recoverability, which is associated with weakening of the neural tangent kernel spectrum and effective low-rank behavior. The proposed Elucidated Representation Diffusion framework responds by shifting optimization effort toward more recoverable signals at each noise level. This adjustment stabilizes learning without any external labels or supervision. As a result, training converges more quickly and delivers stronger generation quality across multiple diffusion architectures.

Core claim

The central claim is that training instability in diffusion models stems from mismatched target recoverability, which manifests as Neural Tangent Kernel spectral weakening and effective low-rank behavior; Elucidated Representation Diffusion corrects this by dynamically reallocating optimization effort according to each sample's effective recoverability, thereby stabilizing representation learning, accelerating convergence, and improving performance across backbones without external supervision.

What carries the argument

Elucidated Representation Diffusion (ERD), a plug-and-play optimizer that reallocates training effort according to effective recoverability at each noise level.
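The review does not spell out the reallocation rule, but the mechanism it describes amounts to a per-sample loss reweighting by recoverability. A minimal sketch of that idea, using a hypothetical sigmoid-of-log-SNR proxy in place of the paper's unspecified ω_y(λ) (the proxy, the `tau` parameter, and both function names are illustrative assumptions, not the authors' method):

```python
import numpy as np

def recoverability_weights(log_snr, tau=2.0):
    """Hypothetical recoverability proxy: a sigmoid of log-SNR, so samples at
    high noise (low log-SNR) receive less optimization effort. The paper's
    actual omega_y(lambda) is not specified in this summary."""
    w = 1.0 / (1.0 + np.exp(-np.asarray(log_snr) / tau))
    return w / w.mean()  # normalize so the average loss scale is preserved

def reallocated_loss(pred, target, log_snr):
    """Per-sample squared error, reweighted by effective recoverability."""
    per_sample = ((pred - target) ** 2).mean(axis=1)
    return float((recoverability_weights(log_snr) * per_sample).mean())
```

Because the weights are normalized to mean one, the rule only redistributes effort across noise levels; it does not change the overall loss scale, which is what makes it a drop-in replacement for the plain objective.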

If this is right

  • Training reaches stable representations faster because effort concentrates on recoverable signals.
  • Generation quality improves across diffusion backbones without added supervision or architectural changes.
  • The same reallocation rule can be inserted into existing training pipelines as a drop-in module.
  • Convergence acceleration holds when the framework is applied to varied noise schedules and model sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar recoverability mismatches may appear in other score-based or flow-matching generative models that use noise schedules.
  • The approach could reduce the total compute needed for large-scale diffusion pre-training by shortening the unstable early phase.
  • Monitoring NTK spectrum or rank during training might serve as a diagnostic for when reallocation is required.
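The rank diagnostic suggested in the last bullet is cheap to compute from an NTK Gram matrix. A sketch of the Shannon-entropy effective rank that the paper's figures refer to, under the standard definition exp(H) of the normalized eigenvalue spectrum (the function name is ours; the paper's exact estimator is not given in this summary):

```python
import numpy as np

def effective_rank(gram):
    """Shannon-entropy effective rank of a PSD Gram matrix: normalize the
    eigenvalue spectrum into a distribution p, then return exp(H(p)).
    Values near 1 indicate a near-singular (collapsed) kernel."""
    eig = np.linalg.eigvalsh(gram)          # symmetric eigendecomposition
    eig = np.clip(eig, 0.0, None)           # guard tiny negative round-off
    p = eig / eig.sum()
    p = p[p > 0]                            # 0 * log 0 treated as 0
    return float(np.exp(-(p * np.log(p)).sum()))
```

An identity Gram matrix gives effective rank equal to its dimension, while a rank-one (fully collapsed) matrix gives 1, so tracking this scalar over training steps would flag the collapse regime the paper associates with high noise levels.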

Load-bearing premise

The observed instability is caused by mismatched recoverability between the model and its training targets at different noise levels.

What would settle it

Train identical diffusion backbones with and without the ERD reallocation rule, then compare the rate of structural distortion in outputs and the number of steps needed to reach target FID at high noise levels.

Figures

Figures reproduced from arXiv: 2605.10790 by Dazhou Li, Durude Mahee, Fan Zhu, Rui Yu, Wenbin Zhang, Xinwei He, Yeying Jin, Zhipeng Yao, Zitong Zhang.

Figure 1. We visualized pre-trained model predictions as forward diffusion progressively corrupts the … [image: figures/full_fig_p002_1.png]
Figure 2. Empirical MSE versus the Bayes optimal error (Bayes floor) across diffusion time. [image: figures/full_fig_p004_2.png]
Figure 3. Signal–noise decomposition across diffusion time. [image: figures/full_fig_p005_3.png]
Figure 4. NTK analysis across diffusion noise levels. [image: figures/full_fig_p006_4.png]
Figure 5. Selected 256×256 samples. We use a CFG scale of 4.0 and 50 EDM Heun steps. Under the standard continuous-time configuration, where log-SNR is sampled uniformly and schedule-dependent factors are absorbed into the base measure, the effective allocation M(λ) is controlled by the loss weight w(λ). ERD sets w*_y(λ) ∝ ω_y(λ) and normalizes it to preserve the average loss scale. For non-uniform base allocations, …
Figure 6. Comparing different loss weighting designs by predicting … [image: figures/full_fig_p008_6.png]
Figure 7. (Top rows) reveals a striking loss-domination phenomenon in the globally trained model: the overall optimization process is hijacked by regions exhibiting the largest absolute loss. Counterintuitively, the excess gap is minimal where the Bayes floor is large, but massive where the floor is inherently low (e.g., t → 1 for ϵθ). Crucially, partitioning the trajectory into independent piecewise bins (bottom r…
Figure 8. Phase-space trajectories of gradient contamination. The samples universally migrate from a healthy signal-dominated regime to a severely degraded noise-dominated regime as t increases. The diagonal dashed line marks the 50% contamination boundary. …
Figure 9. Representation collapse in global shared models. The deep hidden representations h_θ,m(x_t, t) are tracked across four increasing diffusion times. Projected onto the PCA basis fitted at t = 0.1, the representations catastrophically collapse from cleanly separable data clusters into an entangled, unstructured mass as t → 1. [image: figures/full_fig_p030_9.png]
Figure 10. Eigenvalue spectral decay. The aggressive attenuation of the top-3 eigenvalues is a universal bottleneck, empirically supporting the bounds in Theorem D.4. Effective rank collapse: to further assess the global structural integrity of the Neural Tangent Kernel spectrum, we evaluate the effective rank based on Shannon entropy. While tracking individual top eigenvalues reveals absolute magnitude decay, the …
Figure 11. Effective rank collapse. The Shannon-entropy-based effective rank plummets to near-singularity limits as t → 1. [image: figures/full_fig_p031_11.png]
Figure 12. NTK Gram matrix evolution. The structural collapse from cleanly separated data clusters to an entangled homogeneous state occurs across all parameterizations. …
Figure 13. Convergence curves on ImageNet 256×256. Intermediate FID values of DiT are approximated from the training curves [40, 20] for visualization only. We also provide our U-ViT-H/2 at 4M iterations with classifier-free guidance at different guidance scales, and the results with the guidance interval [30]. [image: figures/full_fig_p035_13.png]
Figure 15. We use classifier-free guidance with w = 4.0. Class label = “macaw” (88). [image: figures/full_fig_p036_15.png]
Figure 16. We use classifier-free guidance with w = 4.0. Class label = “sulphur-crested cockatoo” (89). [image: figures/full_fig_p036_16.png]
Figure 18. We use classifier-free guidance with w = 4.0. Class label = “husky” (250). [image: figures/full_fig_p037_18.png]
Figure 20. We use classifier-free guidance with w = 4.0. Class label = “arctic fox” (279). [image: figures/full_fig_p037_20.png]
Figure 22. We use classifier-free guidance with w = 4.0. Class label = “otter” (360). [image: figures/full_fig_p038_22.png]
Figure 25. We use classifier-free guidance with w = 4.0. Class label = “acoustic guitar” (402). [image: figures/full_fig_p038_25.png]
Figure 26. We use classifier-free guidance with w = 4.0. Class label = “balloon” (417). [image: figures/full_fig_p039_26.png]
Figure 28. We use classifier-free guidance with w = 4.0. Class label = “dog sled” (537). [image: figures/full_fig_p039_28.png]
Figure 30. We use classifier-free guidance with w = 4.0. Class label = “laptop” (620). [image: figures/full_fig_p040_30.png]
Figure 32. We use classifier-free guidance with w = 4.0. Class label = “ice cream” (928). [image: figures/full_fig_p040_32.png]
Figure 35. We use classifier-free guidance with w = 4.0. Class label = “coral reef” (973). [image: figures/full_fig_p041_35.png]
Figure 36. We use classifier-free guidance with w = 4.0. Class label = “lake shore” (975). [image: figures/full_fig_p041_36.png]
Original abstract

Diffusion models have achieved remarkable success, yet their training remains inefficient due to a severe optimization bottleneck, which we term Representation Degradation. As noise levels increase, the outputs of the trained model exhibit progressive structural distortion, which can destabilize training and impair generation quality. Our analysis suggests that this instability is driven by mismatched target recoverability, which is associated with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. To address this, we propose Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability. By stabilizing representation learning without external supervision, ERD accelerates convergence and achieves strong empirical performance across diffusion backbones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript identifies a Representation Degradation problem in diffusion model training, where model outputs exhibit progressive structural distortion as noise levels increase, destabilizing training and impairing generation quality. The authors suggest this stems from mismatched target recoverability, which they associate with Neural Tangent Kernel (NTK) spectral weakening and effective low-rank behavior. They introduce Elucidated Representation Diffusion (ERD), a plug-and-play framework that dynamically reallocates optimization effort according to effective recoverability to stabilize representation learning without external supervision, accelerate convergence, and deliver strong empirical performance across diffusion backbones.

Significance. If the causal link between mismatched recoverability, NTK weakening, and degradation is rigorously demonstrated, and ERD is shown via controlled experiments to specifically counteract this mechanism rather than provide generic stabilization, the work could meaningfully advance efficient training of diffusion models central to generative AI. The plug-and-play design is a practical strength, but the current lack of supporting analysis limits the assessed impact.

major comments (2)
  1. [Abstract] The claim that instability is 'driven by' mismatched target recoverability 'associated with' NTK spectral weakening and low-rank behavior is presented without any derivations, spectral analysis, equations, or ablation studies isolating this from confounders such as gradient variance growth, noise scheduling, or batch statistics. This association is load-bearing for the ERD construction, yet remains correlational based on the provided text.
  2. [Abstract] The assertions of accelerated convergence and 'strong empirical performance across diffusion backbones' are made without quantitative results, baselines, error bars, tables, figures, or experimental details, preventing assessment of effect sizes, statistical significance, or reproducibility.
minor comments (1)
  1. The abstract relies on suggestive phrasing ('suggests', 'associated with') that should be replaced with precise statements once the full analysis is presented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying the relationship between the abstract and the supporting analysis in the full paper. We will revise the abstract to better link claims to the detailed evidence provided in the body of the work.

Point-by-point responses
  1. Referee: [Abstract] The claim that instability is 'driven by' mismatched target recoverability 'associated with' NTK spectral weakening and low-rank behavior is presented without any derivations, spectral analysis, equations, or ablation studies isolating this from confounders such as gradient variance growth, noise scheduling, or batch statistics. This association is load-bearing for the ERD construction, yet remains correlational based on the provided text.

    Authors: We appreciate this observation regarding the abstract. The abstract is a concise summary; the full manuscript contains the requested derivations, NTK spectral analysis, equations, and ablation studies that isolate the recoverability mismatch from the listed confounders (detailed in Sections 3.2–3.4 and 4.1–4.2). These sections demonstrate the association through both theoretical analysis and controlled experiments. To address the concern that the abstract does not sufficiently indicate this support, we will revise the abstract to include brief references to the relevant sections and to more precisely characterize the nature of the association as supported by our analysis rather than purely correlational. revision: yes

  2. Referee: [Abstract] The assertions of accelerated convergence and 'strong empirical performance across diffusion backbones' are made without quantitative results, baselines, error bars, tables, figures, or experimental details, preventing assessment of effect sizes, statistical significance, or reproducibility.

    Authors: Thank you for noting this. The abstract summarizes the empirical outcomes, while the full manuscript reports the quantitative results, including baselines, error bars, tables, figures, effect sizes, and full experimental details with reproducibility information across multiple diffusion backbones (presented in Section 5, with additional controls in the appendix). We agree the abstract could better convey the strength of these results. We will revise the abstract to incorporate key quantitative highlights (e.g., convergence speedups and performance metrics) or explicit pointers to Section 5. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The abstract and visible text present the core claims as empirical observations ('analysis suggests', 'associated with') rather than a mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear. The ERD framework is introduced as a plug-and-play reallocation method grounded in the observed degradation pattern, without reducing the central premise to its own inputs by construction. The paper is therefore self-contained against external benchmarks, with no identifiable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. ERD is described at the level of a high-level framework without mathematical specification.

pith-pipeline@v0.9.0 · 5436 in / 1134 out tokens · 25892 ms · 2026-05-12T04:04:46.421736+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 10 internal anchors

  1. [1]

    Intriguing properties of quantization at scale.Advances in Neural Information Processing Systems, 36:34278–34294, 2023

    Arash Ahmadian, Saurabh Dash, Hongyu Chen, Bharat Venkitesh, Zhen Stephen Gou, Phil Blunsom, Ahmet Üstün, and Sara Hooker. Intriguing properties of quantization at scale.Advances in Neural Information Processing Systems, 36:34278–34294, 2023. 8

  2. [2]

    All are worth words: A vit backbone for diffusion models

    Fan Bao, Shen Nie, Kaiwen Xue, Yue Cao, Chongxuan Li, Hang Su, and Jun Zhu. All are worth words: A vit backbone for diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22669–22679, 2023. 1, 7, 8, 9, 33, 35

  3. [3]

    Perception prioritized training of diffusion models

    Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11472–11481, 2022. 2, 6, 8, 9

  4. [4]

    Usp: Unified self-supervised pretraining for image generation and understanding.arXiv preprint arXiv:2503.06132, 2025

    Xiangxiang Chu, Renda Li, and Yong Wang. Usp: Unified self-supervised pretraining for image generation and understanding.arXiv preprint arXiv:2503.06132, 2025. 9

  5. [5]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. InIEEE Conference on Computer Vision and Pattern Recognition, 2009. 7

  6. [6]

    Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis.Advances in neural information processing systems, 34:8780–8794, 2021. 7, 9, 33

  7. [7]

    Representation degeneration problem in training natural language generation models.arXiv preprint arXiv:1907.12009, 2019

    Jun Gao, Di He, Xu Tan, Tao Qin, Liwei Wang, and Tie-Yan Liu. Representation degeneration problem in training natural language generation models.arXiv preprint arXiv:1907.12009, 2019. 1

  8. [8]

    Masked diffusion transformer is a strong image synthesizer

    Shanghua Gao, Pan Zhou, Ming-Ming Cheng, and Shuicheng Yan. Mdtv2: Masked diffusion transformer is a strong image synthesizer.arXiv preprint arXiv:2303.14389, 2023. 9

  9. [9]

    Neural Network-Based Score Estimation in Diffusion Models: Optimization and Generalization

    Yinbin Han, Meisam Razaviyayn, and Renyuan Xu. Neural network-based score estimation in diffusion models: Optimization and generalization.arXiv preprint arXiv:2401.15604, 2024. 9

  10. [10]

    Improved noise schedule for diffusion training

    Tiankai Hang and Shuyang Gu. Improved noise schedule for diffusion training.arXiv preprint arXiv:2407.03297, 2024. 1, 9

  11. [11]

    Efficient diffusion training via min-snr weighting strategy

    Tiankai Hang, Shuyang Gu, Chen Li, Jianmin Bao, Dong Chen, Han Hu, Xin Geng, and Baining Guo. Efficient diffusion training via min-snr weighting strategy. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7441–7451, 2023. 1, 6, 8, 9

  12. [12]

    Diffit: Diffusion vision transformers for image generation

    Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation. InEuropean Conference on Computer Vision, pages 37–55. Springer,

  13. [13]

    Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium.Advances in neural information processing systems, 30, 2017. 7, 33

  14. [14]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022. 7

  15. [15]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 7, 28

  16. [16]

    Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23 (47):1–33, 2022

    Jonathan Ho, Chitwan Saharia, William Chan, David J Fleet, Mohammad Norouzi, and Tim Salimans. Cascaded diffusion models for high fidelity image generation.Journal of Machine Learning Research, 23 (47):1–33, 2022. 9

  17. [17]

    Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models.Advances in Neural Information Processing Systems, 35:8633–8646, 2022. 1 10

  18. [18]

    simple diffusion: End-to-end diffusion for high resolution images

    Emiel Hoogeboom, Jonathan Heek, and Tim Salimans. simple diffusion: End-to-end diffusion for high resolution images. InInternational Conference on Machine Learning, pages 13213–13232. PMLR, 2023. 9

  19. [19]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in neural information processing systems, 31, 2018. 1, 3, 9

  20. [20]

    No other representation component is needed: Diffusion transformers can provide representation guidance by themselves.arXiv preprint arXiv:2505.02831, 2025

    Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, and Jingdong Wang. No other representation component is needed: Diffusion transformers can provide representation guidance by themselves.arXiv preprint arXiv:2505.02831, 2025. 9, 35

  21. [21]

    Understanding dimensional collapse in contrastive self-supervised learning.arXiv preprint arXiv:2110.09348, 2021

    Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning.arXiv preprint arXiv:2110.09348, 2021. 2

  22. [22]

    Understanding dimensional collapse in contrastive self-supervised learning

    Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in contrastive self-supervised learning. InInternational Conference on Learning Representations, 2022. 9

  23. [23]

    Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models.Advances in neural information processing systems, 35:26565–26577, 2022. 1, 7, 33

  24. [24]

    Analyzing and improving the training dynamics of diffusion models

    Tero Karras, Miika Aittala, Jaakko Lehtinen, Janne Hellsten, Timo Aila, and Samuli Laine. Analyzing and improving the training dynamics of diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24174–24184, 2024. 9, 33

  25. [25]

    Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023

    Diederik Kingma and Ruiqi Gao. Understanding diffusion objectives as the elbo with simple data augmentation.Advances in Neural Information Processing Systems, 36:65484–65516, 2023. 1, 2, 9, 28

  26. [26]

    Variational diffusion models.Advances in neural information processing systems, 34:21696–21707, 2021

    Diederik Kingma, Tim Salimans, Ben Poole, and Jonathan Ho. Variational diffusion models.Advances in neural information processing systems, 34:21696–21707, 2021. 2, 9, 28

  27. [27]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014. 7, 8, 33

  28. [28]

    Diffwave: A versatile diffusion model for audio synthesis.arXiv preprint arXiv:2009.09761, 2020

    Zhifeng Kong, Wei Ping, Jiaji Huang, Kexin Zhao, and Bryan Catanzaro. Diffwave: A versatile diffusion model for audio synthesis.arXiv preprint arXiv:2009.09761, 2020. 1

  29. [29]

    Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32,

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models.Advances in neural information processing systems, 32,

  30. [30]

    Applying guidance in a limited interval improves sample and distribution quality in diffusion models

    Tuomas Kynkäänniemi, Miika Aittala, Tero Karras, Samuli Laine, Timo Aila, and Jaakko Lehtinen. Applying guidance in a limited interval improves sample and distribution quality in diffusion models.arXiv preprint arXiv:2404.07724, 2024. 9, 35

  31. [31]

    Wide neural networks of any depth evolve as linear models under gradient descent

    Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019. 9

  32. [32]

    Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers, 2025

    Xingjian Leng, Jaskirat Singh, Yunzhong Hou, Zhenchang Xing, Saining Xie, and Liang Zheng. Repa-e: Unlocking vae for end-to-end tuning with latent diffusion transformers.arXiv preprint arXiv:2504.10483,

  33. [33]

    Not all steps are equal: Efficient generation with progressive diffusion models.arXiv preprint arXiv:2312.13307, 2023

    Wenhao Li, Xiu Su, Shan You, Tao Huang, Fei Wang, Chen Qian, and Chang Xu. Not all steps are equal: Efficient generation with progressive diffusion models.arXiv preprint arXiv:2312.13307, 2023. 9

  34. [34]

    Diffusion-lm improves controllable text generation.Advances in Neural Information Processing Systems, 35:4328–4343,

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation.Advances in Neural Information Processing Systems, 35:4328–4343,

  35. [35]

    Understanding repre- sentation dynamics of diffusion models via low-dimensional modeling.arXiv preprint arXiv:2502.05743,

    Xiao Li, Zekai Zhang, Xiang Li, Siyi Chen, Zhihui Zhu, Peng Wang, and Qing Qu. Understanding repre- sentation dynamics of diffusion models via low-dimensional modeling.arXiv preprint arXiv:2502.05743,

  36. [36]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747, 2022. 4, 28

  37. [37]

    Deep learning face attributes in the wild

    Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE international conference on computer vision, pages 3730–3738, 2015. 7 11

  38. [38]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 7, 33

  39. [39]

    org/abs/2208.11970

    Calvin Luo. Understanding diffusion models: A unified perspective.arXiv preprint arXiv:2208.11970,

  40. [40]

    Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers

    Nanye Ma, Mark Goldstein, Michael S Albergo, Nicholas M Boffi, Eric Vanden-Eijnden, and Saining Xie. Sit: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In European Conference on Computer Vision, pages 23–40. Springer, 2024. 9, 28, 34, 35

  41. [41]

    Battaglia

    Charlie Nash, Jacob Menick, Sander Dieleman, and Peter W Battaglia. Generating images with sparse representations.arXiv preprint arXiv:2103.03841, 2021. 7, 33

  42. [42]

    Improved denoising diffusion probabilistic models

    Alexander Quinn Nichol and Prafulla Dhariwal. Improved denoising diffusion probabilistic models. In International conference on machine learning, pages 8162–8171. PMLR, 2021. 8, 9, 33

  43. [43]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023. 1, 7, 8, 9, 33, 34, 35

  44. [44]

    On the spectral bias of neural networks

    Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks. InInternational conference on machine learning, pages 5301–5310. PMLR, 2019. 2, 9

  45. [45]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 7, 8, 9, 33

  46. [46]

    U-net: Convolutional networks for biomedical image segmentation

    Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 7

  47. [47]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 4, 8, 28

  48. [48] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.

  49. [49] Minglei Shi, Haolin Wang, Wenzhao Zheng, Ziyang Yuan, Xiaoshi Wu, Xintao Wang, Pengfei Wan, Jie Zhou, and Jiwen Lu. Latent diffusion model without variational autoencoder. arXiv preprint arXiv:2510.15301, 2025.

  50. [50] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.

  51. [51] Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.

  52. [52] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  53. [53] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

  54. [54] Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.

  55. [55] Kai Wang, Yukun Zhou, Mingjia Shi, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Hanwang Zhang, and Yang You. A closer look at time steps is worthy of triple speed-up for diffusion model training. arXiv preprint arXiv:2405.17403, 2024.

  56. [56] Ge Wu, Shen Zhang, Ruijing Shi, Shanghua Gao, Zhenyuan Chen, Lei Wang, Zhaowei Chen, Hongcheng Gao, Yao Tang, Jian Yang, et al. Representation entanglement for generation: Training diffusion transformers is much easier than you think. arXiv preprint arXiv:2507.01467, 2025.

  57. [57] Zike Wu, Pan Zhou, Kenji Kawaguchi, and Hanwang Zhang. Fast diffusion model, 2023.

  58. [58] Tianshuo Xu, Peng Mi, Ruilin Wang, and Yingcong Chen. Towards faster training of diffusion models: An inspiration of a consistency phenomenon. arXiv preprint arXiv:2404.07946, 2024.

  59. [59] Jingfeng Yao, Wang Cheng, Wenyu Liu, and Xinggang Wang. FasterDiT: Towards faster diffusion transformers training without architecture modification. arXiv preprint arXiv:2410.10356, 2024.

  60. [60] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15703–15712, 2025.

  61. [61] Hu Yu, Li Shen, Jie Huang, Man Zhou, Hongsheng Li, and Feng Zhao. Debias the training of diffusion models. arXiv preprint arXiv:2310.08442, 2023.

  62. [62] Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

  63. [63] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

  64. [64] Hongkai Zheng, Weili Nie, Arash Vahdat, and Anima Anandkumar. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.

  65. [65] Tianyi Zheng, Cong Geng, Peng-Tao Jiang, Ben Wan, Hao Zhang, Jinwei Chen, Jia Wang, and Bo Li. Non-uniform timestep sampling: Towards faster diffusion model training. In ACM Multimedia 2024, 2024.

  66. [66] Tianyi Zheng, Peng-Tao Jiang, Ben Wan, Hao Zhang, Jinwei Chen, Jia Wang, and Bo Li. Beta-tuned timestep diffusion model. In European Conference on Computer Vision, 2024.

  67. [67] Linqi Zhou, Yilun Du, and Jiajun Wu. 3D shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826–5835, 2021.

  68. [68] Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, and Chang Wen Chen. SD-DiT: Unleashing the power of self-supervised discrimination in diffusion transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8435–8445, 2024.