A Quantitative Approximation Framework for Flow Distillation in Diffusion Models

Hanfei Zhou; Lei Shi; Ming Li; Weiguo Gao

arXiv: 2606.03820 · v1 · pith:Q4KOS5N5new · submitted 2026-06-02 · stat.ML · cs.LG

Pith reviewed 2026-06-28 07:57 UTC · model grok-4.3

keywords diffusion distillation · probability flow ODE · residual networks · Lipschitz stability · non-uniform time grid · Gaussian mixture model · Ornstein-Uhlenbeck process · score approximation

The pith

Residual compositions approximate long-horizon transport in diffusion flows with global error controlled by the stability amplification factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a quantitative framework that treats few-step sampling in diffusion models as error propagation through compositions of learned flow maps for the probability-flow ODE. It separates the task of approximating the time-dependent score field from the task of controlling dynamical amplification that arises when the underlying dynamics become stiff in low-noise multimodal regimes. Within an analytically tractable Gaussian-mixture Ornstein-Uhlenbeck process, explicit L^p guarantees show that ReLU-ReQU networks approximate the score with polylogarithmic dependence on accuracy, while an explicit bound L(t) on the spatial Lipschitz constant converts into a flow-map stability estimate governed by the time integral of L(u). These estimates establish that deep residual compositions efficiently approximate long-horizon transport and that a Lipschitz-mismatch regime renders one-step distillation structurally unfavorable, yielding a non-uniform time grid obtained by uniform partitioning in the cumulative stability coordinate.

Core claim

In an analytically tractable Gaussian-mixture Ornstein-Uhlenbeck setting, deep residual compositions efficiently approximate the long-horizon transport, with global error controlled by the stability amplification factor, and a Lipschitz-mismatch regime makes one-step distillation structurally unfavorable; the resulting theory yields a stability-balanced non-uniform time grid obtained by uniform partitioning in the cumulative stability coordinate.

What carries the argument

The stability amplification factor obtained from the time integral of the spatial Lipschitz constant L(t) of the probability-flow velocity; it governs error propagation across compositions of flow maps.
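
As a sketch of how this factor enters, the standard Grönwall argument gives the two estimates below. The notation is chosen here for illustration and is not taken from the paper: \Phi_{s\to t} is the exact flow map, \Psi_k is a learned map for the segment [t_{k-1}, t_k] of a grid t_0 < t_1 < \dots < t_K, and \varepsilon_k = \sup_x \|\Psi_k(x) - \Phi_{t_{k-1}\to t_k}(x)\| is a sup-norm local error (the paper works with L^p errors, so this is a simplification).

```latex
% Flow-map stability, assuming the velocity is L(t)-Lipschitz in space:
\|\Phi_{s\to t}(x) - \Phi_{s\to t}(y)\|
  \le \exp\!\Big(\int_s^t L(u)\,du\Big)\,\|x - y\| .

% Telescoping over segments: each local error is transported to the
% final time by the exact flow and amplified by the remaining factor.
\|\Psi_K \circ \cdots \circ \Psi_1(x) - \Phi_{t_0\to t_K}(x)\|
  \le \sum_{k=1}^{K} \exp\!\Big(\int_{t_k}^{t_K} L(u)\,du\Big)\,\varepsilon_k .
```

Errors committed on early segments are multiplied by the largest exponentials, which is why the integral of L(u), rather than elapsed time, is the natural quantity to balance across segments.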

If this is right

Global error in residual compositions remains controlled by the stability amplification factor instead of accumulating local errors.
One-step distillation is structurally unfavorable whenever the Lipschitz constant grows substantially at late times.
Uniform partitioning in the cumulative stability coordinate produces a non-uniform time grid that improves few-step sampling.
ReLU-ReQU networks achieve score approximation with depth and width scaling polylogarithmically in target accuracy and mixture geometry.
Experiments support the predicted advantage of the stability-balanced grid, reducing end-to-end relative MSE by up to 51.9 percent with eight segments compared with uniform grids.
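
The grid construction named in the list above (uniform partitioning in the cumulative stability coordinate) can be sketched numerically as follows. This is a minimal illustration, not the authors' implementation: the function name `stability_balanced_grid`, the quadrature resolution, and the example profile `example_lipschitz` are all hypothetical, and the paper's explicit bound L(t) is not reproduced here.

```python
import numpy as np


def stability_balanced_grid(lipschitz, t_start, t_end, num_segments, num_quad=2001):
    """Split [t_start, t_end] so each segment holds an equal share of
    the integral of lipschitz(u) du (the cumulative stability coordinate)."""
    u = np.linspace(t_start, t_end, num_quad)
    values = np.array([lipschitz(ui) for ui in u], dtype=float)
    # Cumulative trapezoid rule: cumulative[i] approximates the integral up to u[i].
    increments = 0.5 * (values[1:] + values[:-1]) * np.diff(u)
    cumulative = np.concatenate(([0.0], np.cumsum(increments)))
    # Uniform levels in the stability coordinate, mapped back to time by
    # linear interpolation of the monotone map (assumes lipschitz > 0).
    levels = np.linspace(0.0, cumulative[-1], num_segments + 1)
    return np.interp(levels, cumulative, u)


# Hypothetical Lipschitz profile that grows toward the end of the interval
# (an assumption for illustration, not the paper's L(t)).
def example_lipschitz(t):
    return 1.0 / (1.05 - t)


grid = stability_balanced_grid(example_lipschitz, 0.0, 1.0, num_segments=8)
print(np.round(grid, 4))
```

With a profile that grows late, the resulting segments shrink toward the end of the interval, so the stiff region receives more steps than a uniform grid would give it.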

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The separation of approximation error from stability amplification could extend to other stiff ODE-based generative models.
The stability coordinate might guide adaptive step-size selection in sampling algorithms outside the diffusion setting.
Direct numerical verification of the explicit L(t) bound on non-Gaussian multimodal data would test the reach of the analysis.
The approach connects to classical numerical methods for integrating stiff dynamical systems.

Load-bearing premise

The Gaussian-mixture Ornstein-Uhlenbeck process is treated as representative of the multimodal low-noise regime where stability amplification occurs in diffusion models.

What would settle it

If the proposed stability-balanced non-uniform time grid fails to lower end-to-end relative MSE compared with a uniform grid on the Gaussian-mixture Ornstein-Uhlenbeck diffusion model, the central prediction on grid optimality would be falsified.

Original abstract

We develop a quantitative approximation framework for diffusion distillation, viewing few-step sampling as error propagation under compositions of learned flow maps. Focusing on trajectory distillation for the probability-flow ODE, we show that local approximation errors can be strongly amplified in low-noise multimodal regimes, where the underlying dynamics become stiff. In an analytically tractable Gaussian-mixture Ornstein-Uhlenbeck setting, we separate two core difficulties: approximating the time-dependent score field and controlling the dynamical amplification governed by the time-integrated Jacobian bound of the probability-flow ODE. On the approximation side, we prove constructive L^p(p_t) guarantees showing that ReLU-ReQU networks approximate the Gaussian-mixture score uniformly over time, with depth and width scaling polylogarithmically in the target accuracy and explicitly with the mixture geometry. On the stability side, we derive an explicit bound L(t) for the spatial Lipschitz constant of the probability-flow velocity and convert it into a flow map stability estimate governed by ∫_s^t L(u) du, making late-time amplification in stiff regimes computable. Building on these estimates, we prove that deep residual compositions efficiently approximate the long-horizon transport, with global error controlled by the stability amplification factor, and identify a Lipschitz-mismatch regime in which one-step distillation is structurally unfavorable. The resulting theory yields a stability-balanced non-uniform time grid obtained by uniform partitioning in the cumulative stability coordinate. Experiments support the prediction and reduce end-to-end relative MSE by up to 51.9% with 8 segments compared with uniform grids.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper works out explicit polylog approximation bounds and a stability-based non-uniform time grid for distillation inside a Gaussian-mixture OU process, but supplies no transfer argument to general multimodal diffusions.

The core contribution is a clean separation of score approximation from flow-map stability in the probability-flow ODE, done inside a Gaussian-mixture Ornstein-Uhlenbeck model. They give constructive L^p guarantees for ReLU-ReQU networks that scale polylog in accuracy and depend explicitly on mixture parameters, then integrate an explicit spatial Lipschitz bound L(t) on the velocity to control error amplification over long horizons. From that they build a cumulative-stability coordinate and a non-uniform grid that partitions evenly in the integrated L(u) du, and they show one-step distillation is structurally bad when the Lipschitz mismatch is large. Experiments in the same setting report up to 51.9% relative MSE drop with eight segments versus uniform spacing.

That separation and the explicit conversion of the time-integrated Jacobian into a usable stability factor are new and useful inside the model they study. The residual-composition error bound controlled by the amplification factor follows directly once L(t) is in hand, and the grid construction is a direct consequence rather than an ad-hoc choice.

The limitation is exactly the one flagged in the stress-test note: every quantitative claim—the network-size bounds, the L(t) derivation, the global error control, the mismatch regime, and the grid itself—is obtained only for the Gaussian-mixture OU dynamics. No extension, robustness check, or counter-example is given for scores or drifts that deviate from this mixture structure, so it is not yet clear whether the same separation or the same grid rule survives in the multimodal low-noise regimes that actually matter for large diffusion models.

The work is aimed at people who want quantitative schedules for few-step samplers rather than purely empirical tuning. A reader already working on distillation theory could extract the stability-coordinate idea and test it elsewhere. It deserves a serious referee because the derivations are explicit, the experiments match the stated predictions inside the model, and the scope limitation is stated plainly rather than hidden.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a quantitative approximation framework for flow distillation in diffusion models, viewing few-step sampling as error propagation under compositions of learned flow maps for the probability-flow ODE. In an analytically tractable Gaussian-mixture Ornstein-Uhlenbeck process, it proves constructive L^p(p_t) guarantees for ReLU-ReQU networks approximating the time-dependent score field (with depth/width scaling polylogarithmically in accuracy and explicitly with mixture geometry), derives an explicit spatial Lipschitz bound L(t) on the probability-flow velocity, converts it to a flow-map stability estimate governed by ∫_s^t L(u) du, proves that deep residual compositions approximate long-horizon transport with global error controlled by the stability amplification factor, identifies a Lipschitz-mismatch regime in which one-step distillation is structurally unfavorable, and constructs a stability-balanced non-uniform time grid via uniform partitioning in the cumulative stability coordinate. Experiments report up to 51.9% reduction in end-to-end relative MSE with 8 segments versus uniform grids.

Significance. If the separation of approximation versus stability difficulties, the explicit bounds, and the resulting non-uniform grid construction hold and transfer, the work supplies a rigorous, constructive theoretical basis for understanding amplification in stiff multimodal regimes and for designing better distillation schedules. The polylogarithmic network-size guarantees and parameter-free stability integral are particular strengths that could guide practical choices beyond the specific setting analyzed.

major comments (2)

[Abstract and main theoretical sections] Abstract and theoretical development (all quantitative results on L^p guarantees, L(t), ∫ L(u) du stability, residual-composition error, Lipschitz-mismatch regime, and non-uniform grid): these are obtained exclusively inside the Gaussian-mixture Ornstein-Uhlenbeck process and presented as representative of the multimodal low-noise regime of interest, yet no extension argument, robustness check, or counter-example analysis is supplied showing that the separation of approximation and stability difficulties survives when the score field or dynamics deviate from this mixture structure. This is load-bearing for the applicability claim to general diffusion models.
[Abstract] Abstract: the claims that proofs exist for the network approximation and stability bound are stated, but the manuscript does not include the full derivations in a form that permits verification of whether the L^p(p_t) guarantees hold uniformly over time or whether the Lipschitz-mismatch regime is correctly identified; this directly affects soundness of the central quantitative claims.

minor comments (2)

Notation for the cumulative stability coordinate and the precise definition of the non-uniform grid construction could be clarified with an explicit equation or algorithm box for reproducibility.
The experimental section would benefit from reporting the precise mixture parameters and noise schedule used in the Gaussian-mixture OU simulations to allow direct comparison with the theoretical L(t) bound.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the scope of our results while proposing targeted revisions to improve clarity and verifiability.

Referee: [Abstract and main theoretical sections] Abstract and theoretical development (all quantitative results on L^p guarantees, L(t), ∫ L(u) du stability, residual-composition error, Lipschitz-mismatch regime, and non-uniform grid): these are obtained exclusively inside the Gaussian-mixture Ornstein-Uhlenbeck process and presented as representative of the multimodal low-noise regime of interest, yet no extension argument, robustness check, or counter-example analysis is supplied showing that the separation of approximation and stability difficulties survives when the score field or dynamics deviate from this mixture structure. This is load-bearing for the applicability claim to general diffusion models.

Authors: The Gaussian-mixture OU process is deliberately selected for analytical tractability to derive explicit, constructive bounds that separate approximation error from dynamical stability amplification. The manuscript frames the contribution as a quantitative case study revealing the Lipschitz-mismatch phenomenon and the utility of stability-balanced discretization, rather than a universal theorem for arbitrary score fields. We will add a dedicated limitations paragraph in the revised manuscript that explicitly states the setting-specific nature of the proofs and discusses how the identified mismatch regime may inform schedule design in broader multimodal regimes, without claiming automatic transfer. revision: partial
Referee: [Abstract] Abstract: the claims that proofs exist for the network approximation and stability bound are stated, but the manuscript does not include the full derivations in a form that permits verification of whether the L^p(p_t) guarantees hold uniformly over time or whether the Lipschitz-mismatch regime is correctly identified; this directly affects soundness of the central quantitative claims.

Authors: The complete proofs appear in the appendix. To improve accessibility and allow direct verification of time-uniformity and the mismatch identification, we will insert concise proof sketches (including key intermediate steps for the L^p bounds and the stability integral) into the main theoretical sections of the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit derivations of bounds and guarantees within the model

The paper performs constructive mathematical derivations inside the Gaussian-mixture Ornstein-Uhlenbeck process: it derives an explicit spatial Lipschitz bound L(t) on the probability-flow velocity, converts it to a stability estimate via the integral of L(u) du, proves L^p approximation guarantees with polylog network scaling, controls residual composition error by the stability factor, and obtains the non-uniform grid by uniform partitioning in the cumulative stability coordinate. None of these steps reduce to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation chain; each is obtained by direct analysis of the model dynamics and score field. The limitation to this analytically tractable setting is a question of scope and transfer, not a circular reduction of the claimed results to their inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the choice of the Gaussian-mixture Ornstein-Uhlenbeck process as the analytically tractable model in which both approximation and stability can be controlled explicitly; no additional free parameters or invented entities are introduced beyond standard neural-network approximation theory.

axioms (2)

domain assumption: Data distribution is a finite Gaussian mixture evolving under an Ornstein-Uhlenbeck process
Invoked to obtain an analytically tractable setting where the score field and the Jacobian of the probability-flow ODE can be written in closed form.
domain assumption: ReLU-ReQU networks are used for score approximation
The constructive L^p(p_t) guarantees are proved specifically for this network class.
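
To make the first assumption concrete, the closed-form score of a one-dimensional Gaussian mixture under an Ornstein-Uhlenbeck process can be sketched as below. Several choices here are assumptions for illustration and may differ from the paper: the forward convention dX = -X dt + sqrt(2) dW, the restriction to one dimension, and all function names. The finite-difference slope at the end is only a numerical probe of stiffness, not the paper's explicit bound L(t).

```python
import numpy as np


def ou_components(t, means, variances):
    """Component means and variances at time t under dX = -X dt + sqrt(2) dW."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    decay = np.exp(-t)
    return decay * means, decay**2 * variances + (1.0 - decay**2)


def mixture_score(x, t, weights, means, variances):
    """Closed-form d/dx log p_t(x) for a 1-D Gaussian mixture."""
    x = np.atleast_1d(np.asarray(x, dtype=float))[:, None]
    weights = np.asarray(weights, dtype=float)
    m_t, v_t = ou_components(t, means, variances)
    # Log of weighted component densities, stabilised before exponentiating.
    log_w = np.log(weights) - 0.5 * np.log(2.0 * np.pi * v_t) - 0.5 * (x - m_t) ** 2 / v_t
    log_w -= log_w.max(axis=1, keepdims=True)
    resp = np.exp(log_w)
    resp /= resp.sum(axis=1, keepdims=True)  # posterior component weights
    return np.sum(resp * (-(x - m_t) / v_t), axis=1)


def pf_velocity(x, t, weights, means, variances):
    """Probability-flow velocity dx/dt = -x - score for this OU convention."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    return -x - mixture_score(x, t, weights, means, variances)


def velocity_slope_estimate(t, weights, means, variances, xs):
    """Largest finite-difference slope of the velocity on the grid xs."""
    v = pf_velocity(xs, t, weights, means, variances)
    return np.max(np.abs(np.diff(v) / np.diff(xs)))


# Hypothetical two-mode mixture with narrow components.
weights = np.array([0.5, 0.5])
means = np.array([-2.0, 2.0])
variances = np.array([0.01, 0.01])
xs = np.linspace(-6.0, 6.0, 4001)

early = velocity_slope_estimate(0.05, weights, means, variances, xs)  # low noise
late = velocity_slope_estimate(3.0, weights, means, variances, xs)    # high noise
print(early, late)
```

In this toy example the spatial slope of the velocity is far larger at low noise than at high noise, which is the stiffness that the stability analysis is meant to quantify.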

Reference graph

Works this paper leans on

58 extracted references · 12 canonical work pages · 1 internal anchor

[1]

Universal approximation bounds for superpositions of a sigmoidal function.IEEE Transactions on Information theory, 39(3):930–945, 2002

Andrew R Barron. Universal approximation bounds for superpositions of a sigmoidal function.IEEE Transactions on Information theory, 39(3):930–945, 2002

2002
[2]

Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations.Neural Networks, 161:242–253, 2023

Denis Belomestny, Alexey Naumov, Nikita Puchkin, and Sergey Samsonov. Simultaneous approximation of a smooth function and its derivatives by deep neural networks with piecewise-polynomial activations.Neural Networks, 161:242–253, 2023

2023
[3]

On the edge of memorization in diffusion models

Sam Buchanan, Druv Pai, Yi Ma, and Valentin De Bortoli. On the edge of memorization in diffusion models. InAdvances in Neural Information Processing Systems, 2025

2025
[4]

Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data

Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data. In International Conference on Machine Learning, 2023

2023
[5]

Sam- pling is as easy as learning the score: Theory for diffusion models with minimal data assumptions

Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru R Zhang. Sam- pling is as easy as learning the score: Theory for diffusion models with minimal data assumptions. InInternational Conference on Learning Representations, 2023

2023
[6]

Lipschitz-Guided Design of Interpolation Schedules in Generative Models

Yifan Chen, Eric Vanden-Eijnden, and Jiawei Xu. Lipschitz-guided design of interpola- tion schedules in generative models.arXiv preprint arXiv:2509.01629, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

What does guidance do? a fine-grained analysis in a simple setting

Muthu Chidambaram, Khashayar Gatmiry, Sitan Chen, Holden Lee, and Jianfeng Lu. What does guidance do? a fine-grained analysis in a simple setting. InAdvances in Neural Information Processing Systems, 2024

2024
[8]

Analysis of learning a flow-based generative model from limited sample complexity

Hugo Cui, Florent Krzakala, Eric Vanden-Eijnden, and Lenka Zdeborova. Analysis of learning a flow-based generative model from limited sample complexity. InInternational Conference on Learning Representations, 2023. 35

2023
[9]

Convergence of denoising diffusion models under the manifold hy- pothesis.Transactions on Machine Learning Research, 2022

Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hy- pothesis.Transactions on Machine Learning Research, 2022

2022
[10]

Neural network approximation

Ronald DeVore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica, 30:327–444, 2021

2021
[11]

Characteristic learning for provable one step generation.arXiv preprint arXiv:2405.05512, 2024

Zhao Ding, Chenguang Duan, Yuling Jiao, Ruoxuan Li, Jerry Zhijian Yang, and Ping- wen Zhang. Characteristic learning for provable one step generation.arXiv preprint arXiv:2405.05512, 2024

work page arXiv 2024
[12]

Overparameterization of deep ResNet: Zero loss and mean-field analysis.Journal of Machine Learning Research, 23 (48):1–65, 2022

Zhiyan Ding, Shi Chen, Qin Li, and Stephen J Wright. Overparameterization of deep ResNet: Zero loss and mean-field analysis.Journal of Machine Learning Research, 23 (48):1–65, 2022

2022
[13]

One step diffusion via shortcut models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. InInternational Conference on Learning Representations, 2025

2025
[14]

How do flow matching models memorize and generalize in sample data subspaces?arXiv preprint arXiv:2410.23594, 2024

Weiguo Gao and Ming Li. How do flow matching models memorize and generalize in sample data subspaces?arXiv preprint arXiv:2410.23594, 2024

work page arXiv 2024
[15]

Toward theoretical insights into diffusion trajectory distillation via operator merging.Neural Networks, 202:109023, 2026

Weiguo Gao and Ming Li. Toward theoretical insights into diffusion trajectory distillation via operator merging.Neural Networks, 202:109023, 2026

2026
[16]

Terminally constrained flow-based generative models from an optimal control perspective.arXiv preprint arXiv:2601.09474, 2026

Weiguo Gao, Ming Li, and Qianxiao Li. Terminally constrained flow-based generative models from an optimal control perspective.arXiv preprint arXiv:2601.09474, 2026

work page arXiv 2026
[17]

Learning mixtures of Gaussians using diffusion models.arXiv preprint arXiv:2404.18869, 2024

Khashayar Gatmiry, Jonathan Kelner, and Holden Lee. Learning mixtures of Gaussians using diffusion models.arXiv preprint arXiv:2404.18869, 2024

work page arXiv 2024
[18]

Mean flows for one-step generative modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. InAdvances in Neural Information Processing Systems, 2025

2025
[19]

BOOT: Data-free distillation of denoising diffusion models with bootstrapping

Jiatao Gu, Shuangfei Zhai, Yizhe Zhang, Lingjie Liu, and Joshua M Susskind. BOOT: Data-free distillation of denoising diffusion models with bootstrapping. InICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling, 2023

2023
[20]

Gaussian mixture solvers for diffusion models

Hanzhong Guo, Cheng Lu, Fan Bao, Tianyu Pang, Shuicheng Yan, Chao Du, and Chongxuan Li. Gaussian mixture solvers for diffusion models. InAdvances in Neural Information Processing Systems, 2023

2023
[21]

Neural network-based score esti- mation in diffusion models: Optimization and generalization

Yinbin Han, Meisam Razaviyayn, and Renyuan Xu. Neural network-based score esti- mation in diffusion models: Optimization and generalization. InAdvances in Neural Information Processing Systems, 2024

2024
[22]

Zhang, Shaoqing Ren, and Jian Sun

Kaiming He, X. Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for im- age recognition.2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2015

2016
[23]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020

2020
[24]

Structured diffusion models with mixture of Gaussians as prior distribution.arXiv preprint arXiv:2410.19149, 2024

Nanshan Jia, Tingyu Zhu, Haoyu Liu, and Zeyu Zheng. Structured diffusion models with mixture of Gaussians as prior distribution.arXiv preprint arXiv:2410.19149, 2024. 36

work page arXiv 2024
[25]

Elucidating the design space of diffusion-based generative models

Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. InAdvances in Neural Information Processing Systems, 2022

2022
[26]

Convergence for score-based generative model- ing with polynomial complexity

Holden Lee, Jianfeng Lu, and Yixin Tan. Convergence for score-based generative model- ing with polynomial complexity. InAdvances in Neural Information Processing Systems, 2022

2022
[27]

Better approximations of high dimensional smooth functions by deep neural networks with rectified power units.Communications in Computational Physics, 2019

Bo Li, Shanshan Tang, and Haijun Yu. Better approximations of high dimensional smooth functions by deep neural networks with rectified power units.Communications in Computational Physics, 2019

2019
[28]

Faster diffusion models via higher- order approximation.arXiv preprint arXiv:2506.24042, 2025

Gen Li, Yuchen Zhou, Yuting Wei, and Yuxin Chen. Faster diffusion models via higher- order approximation.arXiv preprint arXiv:2506.24042, 2025

work page arXiv 2025
[29]

Critical windows: Non-asymptotic theory for feature emer- gence in diffusion models

Marvin Li and Sitan Chen. Critical windows: Non-asymptotic theory for feature emer- gence in diffusion models. InInternational Conference on Machine Learning, 2024

2024
[30]

Un- raveling the smoothness properties of diffusion models: A Gaussian mixture perspective

Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan, and Yufa Zhou. Un- raveling the smoothness properties of diffusion models: A Gaussian mixture perspective. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025

2025
[31]

DPM- Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. DPM- Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. InAdvances in Neural Information Processing Systems, 2022

2022
[32]

Resolving memorization in empirical diffusion model for manifold data in high-dimensional spaces.arXiv preprint arXiv:2505.02508, 2025

Yang Lyu, Tan Minh Nguyen, Yuchun Qian, and Xin T Tong. Resolving memorization in empirical diffusion model for manifold data in high-dimensional spaces.arXiv preprint arXiv:2505.02508, 2025

work page arXiv 2025
[33]

Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit

Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit. InConference on Learning Theory, 2019

2019
[34]

Neural networks for optimal approximation of smooth and ana- lytic functions.Neural Computation, 8(1):164–177, 1996

Hrushikesh N Mhaskar. Neural networks for optimal approximation of smooth and ana- lytic functions.Neural Computation, 8(1):164–177, 1996

1996
[35]

Diffusion models are minimax optimal distribution estimators

Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. InInternational Conference on Machine Learning, 2023

2023
[36]

Progressive distillation for fast sampling of diffusion models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. InInternational Conference on Learning Representations, 2022

2022
[37]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, 2024

2024
[38]

Learning mixtures of Gaussians using the DDPM objective

Kulin Shah, Sitan Chen, and Adam Klivans. Learning mixtures of Gaussians using the DDPM objective. InAdvances in Neural Information Processing Systems, 2023

2023
[39]

Mean field analysis of neural networks: A law of large numbers.SIAM Journal on Applied Mathematics, 80(2):725–752, 2020

Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers.SIAM Journal on Applied Mathematics, 80(2):725–752, 2020

2020
[40]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InInternational Conference on Learning Representations, 2020. 37

2020
[41]

Score-based generative modeling through stochastic differential equations

Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. InInternational Conference on Learning Representations, 2020

2020
[42]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. In International Conference on Machine Learning, 2023

2023
[43]

Adaptivity of diffusion models to manifold structures

Rong Tang and Yun Yang. Adaptivity of diffusion models to manifold structures. In International Conference on Artificial Intelligence and Statistics, 2024

2024
[44]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, 2017

2017
[45]

Are we really learning the score function? reinterpreting diffusion models through Wasserstein gradient flow matching

An B Vuong, Michael T McCann, Javier E Santos, and Yen Ting Lin. Are we really learning the score function? reinterpreting diffusion models through Wasserstein gradient flow matching. InNeurIPS Workshop on Structured Probabilistic Inference, 2025

2025
[46]

Diffusion mod- els learn low-dimensional distributions via subspace clustering

Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma, and Qing Qu. Diffusion mod- els learn low-dimensional distributions via subspace clustering. InInternational Confer- ence on Learning Representations 2025 Workshop on Deep Generative Model in Machine Learning: Theory, Principle and Efficacy, 2024

2025
[47]

Error estimates of a training-free diffusion model for high-dimensional sampling

Pengjun Wang, Zezhong Zhang, Minglei Yang, Feng Bao, Yanzhao Cao, and Guannan Zhang. Error estimates of a training-free diffusion model for high-dimensional sampling. arXiv preprint arXiv:2601.19740, 2026

work page arXiv 2026
[48]

Simultaneous approximation of the score func- tion and its derivatives by deep neural networks.arXiv preprint arXiv:2512.23643, 2025

Konstantin Yakovlev and Nikita Puchkin. Simultaneous approximation of the score func- tion and its derivatives by deep neural networks.arXiv preprint arXiv:2512.23643, 2025

work page arXiv 2025
[49]

Nearly optimal VC-dimension and pseudo- dimension bounds for deep neural network derivatives

Yahong Yang, Haizhao Yang, and Yang Xiang. Nearly optimal VC-dimension and pseudo- dimension bounds for deep neural network derivatives. InAdvances in Neural Information Processing Systems, 2023

2023
[50]

Lipschitz singularities in diffusion models

Zhantao Yang, Ruili Feng, Han Zhang, Yujun Shen, Kai Zhu, Lianghua Huang, Yifei Zhang, Yu Liu, Deli Zhao, Jingren Zhou, et al. Lipschitz singularities in diffusion models. InInternational Conference on Learning Representations, 2023

2023
[51]

Improved distribution matching distillation for fast image syn- thesis

Tianwei Yin, Micha¨ el Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Du- rand, and Bill Freeman. Improved distribution matching distillation for fast image syn- thesis. InAdvances in Neural Information Processing Systems, 2024

2024
[52]

One-step diffusion with distribution matching distillation

Tianwei Yin, Micha¨ el Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024

2024
[53]

Exact diffusion inversion via bidirectional integration approximation

Guoqiang Zhang, Jonathan P Lewis, and W Bastiaan Kleijn. Exact diffusion inversion via bidirectional integration approximation. InEuropean Conference on Computer Vision, 2024. 38

2024
[54] Weitong Zhang, Chengqi Zang, Liu Li, Sarah Cechnicka, Cheng Ouyang, and Bernhard Kainz. Stability and generalizability in SDE diffusion models with measure-preserving dynamics. In Advances in Neural Information Processing Systems, 2024
[55] Wenliang Zhao, Lujia Bai, Yongming Rao, Jie Zhou, and Jiwen Lu. UniPC: A unified predictor-corrector framework for fast sampling of diffusion models. In Advances in Neural Information Processing Systems, 2023
[56] Hanfei Zhou and Lei Shi. Expressive power of deep networks on manifolds: Simultaneous approximation. arXiv preprint arXiv:2509.09362, 2025
[57] Xinyu Zhou, Jiawei Zhang, and Stephen J Wright. Smoothing the score function for generalization in diffusion models: An optimization-based explanation framework. arXiv preprint arXiv:2601.19285, 2026
[58] Huaisheng Zhu, Teng Xiao, Shijie Zhou, Zhimeng Guo, Hangfan Zhang, Siyuan Xu, and Vasant G Honavar. Simple distillation for one-step diffusion models. In Advances in Neural Information Processing Systems, 2025
