pith. machine review for the scientific record.

arxiv: 2604.25289 · v1 · submitted 2026-04-28 · 💻 cs.LG · cs.CV

Recognition: unknown

Exploring Time Conditioning in Diffusion Generative Models from Disjoint Noisy Data Manifolds

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:45 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV
keywords diffusion models · DDIM · flow matching · time conditioning · noisy data manifolds · generative models · class conditioning · manifold geometry

The pith

DDIM can generate high-quality samples without time conditioning once its forward process aligns noisy manifolds with flow-matching trajectories.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the role of time conditioning in diffusion models from a geometric viewpoint, focusing on why it appears essential in DDIM yet dispensable in flow matching. Noisy data distributions under the forward process form low-dimensional hyper-cylinder-like manifolds in high-dimensional spaces, and generation succeeds when these manifolds remain disentangled. By adjusting DDIM's forward process so that the manifolds evolve in the same manner as under flow matching, the authors establish that explicit time conditioning is no longer required for high-quality output. They further show that class-conditioned results can be obtained from a single unconditional denoising network by mapping each class to its own distinct time space. Experiments on standard image datasets show that the modified process preserves sample quality.
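
For concreteness, the two forward processes at issue can be written side by side. The sketch below uses the standard textbook forms; the abstract does not spell out the paper's exact modification to DDIM, so alpha_bar and the linear interpolant stand in for the general shapes being compared, not for the authors' construction.

    import numpy as np

    def ddim_forward(x0, t, alpha_bar):
        # Standard VP/DDIM forward marginal:
        #   x_t = sqrt(abar(t)) * x0 + sqrt(1 - abar(t)) * eps,  eps ~ N(0, I).
        # The clean component shrinks while the noise shell grows, so noisy
        # manifolds at different t can land at similar radii and mix.
        eps = np.random.randn(*x0.shape)
        return np.sqrt(alpha_bar(t)) * x0 + np.sqrt(1.0 - alpha_bar(t)) * eps

    def flow_matching_forward(x0, t):
        # Linear (rectified-flow) interpolant: x_t = (1 - t) * x0 + t * eps.
        # On this reading of the abstract, the paper's modification makes
        # DDIM's noisy manifolds evolve like this interpolant's.
        eps = np.random.randn(*x0.shape)
        return (1.0 - t) * x0 + t * eps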

Core claim

Successful generation in deterministic samplers arises from the disentanglement of disjoint noisy data manifolds in high-dimensional space. Modifying the forward process of DDIM to make the noisy manifold evolve according to the flow-matching method enables high-quality generation without time conditioning. Class-conditioned synthesis is possible with a class-unconditional denoising model by decoupling classes into distinct time spaces.
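
The class-decoupling idea admits a simple reading: give each class its own disjoint slice of a global time axis, so that position in time doubles as a class label. The mapping below is a hypothetical illustration only, since the abstract says classes are "decoupled into distinct time spaces" without giving the construction; class_time, margin, and the uniform slicing are our assumptions.

    def class_time(t, k, num_classes, margin=0.0):
        # Map per-class time t in [0, 1] into class k's disjoint slice of a
        # global time axis. Hypothetical construction, not the paper's
        # equations: slices are uniform, separated by an optional margin.
        width = (1.0 - margin * (num_classes - 1)) / num_classes
        start = k * (width + margin)
        return start + t * width

    # Sampling class k then means running the time-unconditional sampler along
    # global times tau = class_time(t, k, K); because each class's noisy
    # manifold occupies its own slice, the trajectory never has to
    # disambiguate classes explicitly.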

What carries the argument

The geometric concentration of noisy data on hyper-cylinder-like manifolds, combined with the alignment of their evolution under a modified DDIM forward process to match flow-matching trajectories.
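
Why that carries weight: for Gaussian noise in dimension d, the noise norm is nearly deterministic, which pins each noisy marginal to a thin shell. In LaTeX, the standard concentration gloss (ours, not a formula quoted from the paper) reads:

    % For \varepsilon \sim \mathcal{N}(0, I_d): \|\varepsilon\| = \sqrt{d}\,(1 + O(d^{-1/2})) with high probability, so
    x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon
    \quad\Longrightarrow\quad
    \|x_t - \sqrt{\bar\alpha_t}\, x_0\| \approx \sqrt{(1-\bar\alpha_t)\, d}.

Sweeping t thus traces a thin tube of radius sqrt((1 - abar_t) d) around the scaled data manifold sqrt(abar_t) M: the hyper-cylinder the review refers to. The argument needs these tubes to stay disjoint across times and classes.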

If this is right

  • DDIM achieves high-quality deterministic sampling without time embeddings after the manifold alignment modification.
  • Class-conditioned outputs can be produced from an unconditional model by assigning separate time intervals to each class.
  • The primary function of time conditioning is to resolve overlaps between noisy manifolds rather than to steer the denoising trajectory.
  • High-dimensional geometry explains performance gaps between standard DDIM and flow matching.
  • Manifold disentanglement becomes the decisive factor for quality in deterministic generative sampling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Model architectures could drop time-embedding layers entirely if the forward process is adjusted accordingly.
  • The same manifold-alignment idea might simplify conditioning requirements in other deterministic generative methods.
  • Experiments on higher-resolution or multimodal data would test whether the hyper-cylinder structure generalizes.
  • Hybrid samplers could be designed that inherit the efficiency of both DDIM and flow matching without added conditioning overhead.

Load-bearing premise

That a simple change to DDIM's forward process can make its noisy data manifolds evolve exactly like those in flow matching, and that this change alone is enough to allow successful generation without time conditioning.

What would settle it

Implement the modified DDIM forward process, train on CIFAR-10 or ImageNet, and measure FID or sample quality against both standard time-conditioned DDIM and flow matching; substantially worse results would indicate that manifold alignment does not suffice for conditioning-free generation.
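
As a sketch of that comparison, the harness below scores one sampler variant against real data with FID via torchmetrics. Here sample_fn is a hypothetical stand-in for whichever of the three samplers (time-conditioned DDIM, modified time-unconditional DDIM, flow matching) is under test; the batch size and sample count are illustrative defaults.

    import torch
    from torchmetrics.image.fid import FrechetInceptionDistance

    @torch.no_grad()
    def fid_score(real_loader, sample_fn, n_samples=50_000, device="cuda"):
        # real_loader yields uint8 image batches of shape (B, 3, H, W);
        # sample_fn(batch_size) returns generated batches in the same format.
        fid = FrechetInceptionDistance(feature=2048).to(device)
        for imgs in real_loader:
            fid.update(imgs.to(device), real=True)
        done = 0
        while done < n_samples:
            batch = sample_fn(min(256, n_samples - done))
            fid.update(batch.to(device), real=False)
            done += batch.shape[0]
        return fid.compute().item()

    # Run once per variant on CIFAR-10 (or ImageNet) and compare: a large gap
    # for the modified unconditional DDIM would argue against sufficiency.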

Figures

Figures reproduced from arXiv: 2604.25289 by Dengyang Jiang, Guang Dai, Jingdong Wang, Liuzhuozheng Li, Mengmeng Wang, Shuhong Liu, Zanyi Wang, Zhiyuan Zhan.

Figure 1: Generated images from AFHQ-Cat [1], CelebA [2], CIFAR10 [3], and ImageNet [4] using the DDIM sampler. (a) Failed generation without time conditioning: the images are of poor quality and oversaturated. (b), (c), and (d) show generation using the time-conditioning, disjoint-data-manifold, and time-space-orthogonality methods, respectively. Data are gradually perturbed toward Gaussian …
Figure 2: Visualization of the noisy data manifold in …
Figure 3: Distribution mixing under the geometric view. (a) The conventional VP schedule compresses noisy manifolds …
Figure 4: Swiss-roll toy experiments without timestep embedding. Rows vary the ambient dimension …
Figure 5: Orthogonal time-space disentanglement. (a) Time is encoded as a geometric direction …
Figure 6: Qualitative DDIM comparison on the small-image benchmarks. Columns from left to right correspond to …
Figure 7: Qualitative class-conditional CIFAR10 results …
Figure 8: Qualitative ImageNet result with DiT-B/2 …
Figure 9: Ablation on the manifold spacing parameter …
Original abstract

Practically, training diffusion models typically requires explicit time conditioning to guide the network through the denoising sampling process. Especially in deterministic methods like DDIM, the absence of time conditioning leads to significant performance degradation. However, other deterministic sampling approaches, such as flow matching, can generate high-quality content without this conditioning, raising the question of its necessity. In this work, we revisit the role of time conditioning from a geometric perspective. We analyze the evolution of noisy data distributions under the forward diffusion process and demonstrate that, in high-dimensional spaces, these distributions concentrate on low-dimensional hyper-cylinder-like manifolds embedded within the input space. Successful generation, we argue, stems from the disentanglement of these manifolds in high-dimensional space. Based on this insight, we modify the forward process of DDIM to align the noisy data manifold with the flow-matching approach, proving that DDIM can generate high-quality content without time conditioning, provided the noisy manifold evolves according to the flow-matching method. Additionally, we extend our framework to class-conditioned generation by decoupling classes into distinct time spaces, enabling class-conditioned synthesis with a class-unconditional denoising model. Extensive experiments validate our theoretical analysis and show that high-quality generation is achievable without explicit conditional embeddings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that noisy data distributions under the forward diffusion process concentrate on low-dimensional hyper-cylinder manifolds in high dimensions, and that successful generation requires disentanglement of these manifolds. By modifying the DDIM forward process to align the noisy manifold evolution with flow-matching trajectories, the authors prove that DDIM can achieve high-quality generation without explicit time conditioning. They further extend the framework to class-conditional synthesis by decoupling classes into distinct time spaces, allowing a class-unconditional denoiser to perform conditional generation. The claims are supported by geometric analysis and extensive experiments.

Significance. If the geometric alignment argument holds and the modification enables time-unconditional DDIM without implicit recovery of noise levels, the result would clarify the necessity of time embeddings in deterministic samplers and could simplify model architectures by removing conditioning inputs. The class-decoupling extension offers a novel way to achieve conditional generation with unconditional networks. The work builds on the contrast between DDIM and flow matching but would benefit from stronger separation between the alignment construction and the claimed performance gains.

major comments (3)
  1. [theoretical analysis of noisy manifold evolution] The central sufficiency claim—that manifold alignment with flow-matching trajectories is enough for a time-independent network to learn the reverse process—lacks a quantitative bound on the rate of concentration onto hyper-cylinders in finite dimensions. Without such a bound (e.g., in the geometric analysis section), it remains possible that the denoiser still requires implicit time information recovered from the data distribution itself.
  2. [DDIM forward-process modification] The modification to the DDIM forward process is defined precisely by the requirement that the noisy manifold follows flow-matching trajectories. This construction makes it difficult to assess whether the reported performance gains are independent of the flow-matching prior or whether the alignment simply transfers the conditioning burden; a clearer separation between the geometric condition and the learned vector field is needed.
  3. [class-conditioned synthesis framework] The extension to class-conditional generation via distinct time spaces for each class assumes that the class manifolds remain sufficiently separated under the modified dynamics. No analysis is provided on the required separation margin or on whether the unconditional denoiser can reliably disambiguate classes without explicit class embeddings during sampling.
minor comments (2)
  1. [abstract and experimental validation] The abstract states that 'extensive experiments validate our theoretical analysis' but does not specify the datasets, baselines, or controls used to isolate the effect of removing time conditioning; adding these details would strengthen the experimental section.
  2. [method] Notation for the modified forward process and the resulting reverse vector field should be introduced with explicit equations early in the method section to avoid ambiguity when comparing to standard DDIM and flow matching.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our paper. We have carefully considered each major comment and provide point-by-point responses below, along with indications of revisions to the manuscript.

Point-by-point responses
  1. Referee: The central sufficiency claim—that manifold alignment with flow-matching trajectories is enough for a time-independent network to learn the reverse process—lacks a quantitative bound on the rate of concentration onto hyper-cylinders in finite dimensions. Without such a bound (e.g., in the geometric analysis section), it remains possible that the denoiser still requires implicit time information recovered from the data distribution itself.

    Authors: We thank the referee for this observation. Our geometric analysis establishes that noisy data distributions concentrate on low-dimensional hyper-cylinder manifolds in high dimensions, with the DDIM modification ensuring alignment to flow-matching trajectories. This alignment enables the time-independent network to learn the reverse process, as supported by our theoretical construction and ablation studies demonstrating performance degradation without alignment. While an explicit quantitative bound on concentration rates for finite dimensions is not derived, the argument relies on the asymptotic high-dimensional regime, which is standard for such analyses. We have added a clarifying discussion in the revised geometric analysis section and a note in the conclusions identifying finite-dimensional bounds as future work. The experiments indicate that implicit time recovery is not occurring, as the unconditional model matches conditioned baselines only under the aligned dynamics. revision: partial

  2. Referee: The modification to the DDIM forward process is defined precisely by the requirement that the noisy manifold follows flow-matching trajectories. This construction makes it difficult to assess whether the reported performance gains are independent of the flow-matching prior or whether the alignment simply transfers the conditioning burden; a clearer separation between the geometric condition and the learned vector field is needed.

    Authors: The DDIM modification is introduced precisely to enforce the noisy manifold to evolve along flow-matching trajectories, thereby removing the necessity for explicit time conditioning in the denoiser. The performance improvements arise directly from this geometric alignment rather than from inheriting flow-matching properties wholesale. To clarify the separation, we have revised the relevant sections to distinguish the geometric alignment condition (which dictates the forward process) from the learned vector field (produced by the time-independent network). Additional ablations in the experiments section compare the aligned DDIM against unmodified flow matching and standard DDIM, confirming that the gains stem from enabling unconditional training under the modified dynamics and are not a mere transfer of conditioning burden. revision: yes

  3. Referee: The extension to class-conditional generation via distinct time spaces for each class assumes that the class manifolds remain sufficiently separated under the modified dynamics. No analysis is provided on the required separation margin or on whether the unconditional denoiser can reliably disambiguate classes without explicit class embeddings during sampling.

    Authors: By mapping each class to a distinct time space under the modified forward process, the class manifolds evolve separately, allowing the class-unconditional denoiser to perform conditional generation by leveraging the time parameter as an implicit class indicator during sampling. We validate this through extensive class-conditional experiments showing high-quality synthesis without class embeddings. We agree that a formal analysis of the separation margin would strengthen the theoretical foundation. In the revised manuscript, we have expanded the discussion of the class-conditional framework to include empirical observations of manifold separation in the learned representations and have noted the derivation of explicit margins as an avenue for future work. revision: partial

Circularity Check

2 steps flagged

Modification aligning DDIM forward process to flow-matching makes time-unconditional generation hold by construction

specific steps
  1. self definitional [Abstract]
    "we modify the forward process of DDIM to align the noisy data manifold with the flow-matching approach, proving that DDIM can generate high-quality content without time conditioning, provided the noisy manifold evolves according to the flow-matching method"

    The 'proof' is obtained by redefining the forward dynamics to satisfy the flow-matching evolution condition; the absence of time conditioning then holds tautologically because flow-matching trajectories are already known to support unconditional generation. The result is equivalent to the input modification rather than an independent prediction from the hyper-cylinder manifold analysis.

  2. fitted input called prediction [Abstract]
    "Additionally, we extend our framework to class-conditioned generation by decoupling classes into distinct time spaces, enabling class-conditioned synthesis with a class-unconditional denoising model"

    Class separation is achieved by assigning distinct time spaces, which directly encodes the conditioning information into the manifold evolution; the unconditional denoiser then 'works' because the time-space decoupling has already injected the class signal by construction.

full rationale

The paper's central derivation modifies the DDIM forward process explicitly to force alignment between noisy data manifolds and flow-matching trajectories, then concludes that time conditioning becomes unnecessary under this alignment. This reduces the 'proof' to a definitional equivalence: the claimed performance without time conditioning follows directly from importing the flow-matching property via the modification, rather than deriving sufficiency from independent geometric analysis or bounds. The class-decoupling extension similarly relies on reparameterizing time spaces to enforce separation. No external verification or quantitative tightness result is supplied to break the construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review performed on abstract only; full derivations, assumptions, and experimental details unavailable.

axioms (2)
  • domain assumption Noisy data distributions concentrate on low-dimensional hyper-cylinder-like manifolds embedded in high-dimensional input space.
    Stated as the result of the geometric analysis in the abstract.
  • domain assumption Successful generation stems from the disentanglement of these manifolds.
    Presented as the key insight enabling the modification.

pith-pipeline@v0.9.0 · 5542 in / 1367 out tokens · 34455 ms · 2026-05-07T16:45:17.050756+00:00 · methodology


Reference graph

Works this paper leans on

85 extracted references · 25 canonical work pages · 10 internal anchors

  1. Yunjey Choi, Youngjung Uh, Jaejun Yoo, and Jung-Woo Ha. StarGAN v2: Diverse image synthesis for multiple domains. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8188–8197, 2020.
  2. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, pages 3730–3738, 2015.
  3. Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
  4. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  5. Masashi Sugiyama, Takafumi Kanamori, Taiji Suzuki, Marthinus Christoffel du Plessis, Song Liu, and Ichiro Takeuchi. Density-difference estimation. Neural Computation, 25(10):2734–2775, 2013.
  6. Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
  7. Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  8. Chengzu Li, Zanyi Wang, Jiaang Li, Yi Xu, Han Zhou, Huanyu Zhang, Ruichuan An, Dengyang Jiang, Zhaochong An, Ivan Vulić, et al. Thinking in frames: How visual context and test-time scaling empower video reasoning.
  10. James Foley, Andries van Dam, Steven Feiner, and John Hughes. Computer Graphics: Principles and Practice in C. Addison-Wesley Professional, 2nd edition, 1995.
  11. Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, and Jingdong Wang. No other representation component is needed: Diffusion transformers can provide representation guidance by themselves, 2025. URL https://arxiv.org/abs/2505.02831.
  12. Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  13. Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30, 2017.
  14. Christopher P Burgess, Irina Higgins, Arka Pal, Loic Matthey, Nick Watters, Guillaume Desjardins, and Alexander Lerchner. Understanding disentangling in β-VAE, 2018. URL https://arxiv.org/abs/1804.03599.
  15. Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems, 30, 2017.
  16. Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models, 2020. URL https://arxiv.org/abs/2010.02502.
  17. Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling, 2022. URL https://arxiv.org/abs/2210.02747.
  18. Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim, and Anru Zhang. Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions. In The Eleventh International Conference on Learning Representations, 2023.
  19. J. R. Dormand and P. J. Prince. A family of embedded Runge-Kutta formulae. Journal of Computational and Applied Mathematics, 6(1):19–26, 1980.
  20. Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. In Advances in Neural Information Processing Systems, 2021.
  21. Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.
  22. Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
  23. Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, and Helge Ritter. Video diffusion models: A survey, 2024. URL https://arxiv.org/abs/2405.03150.
  24. Flavio Schneider. Archisound: Audio generation with diffusion, 2023. URL https://arxiv.org/abs/2301.13267.
  25. Liuzhuozheng Li, Yue Gong, Shanyuan Liu, Bo Cheng, Yuhang Ma, Liebucha Wu, Dengyang Jiang, Zanyi Wang, Dawei Leng, and Yuhui Yin. RefVTON: Person-to-person try on with additional unpaired visual reference, 2025. URL https://arxiv.org/abs/2511.00956.
  26. Jian Zhu, Shanyuan Liu, Liuzhuozheng Li, Yue Gong, He Wang, Bo Cheng, Yuhang Ma, Liebucha Wu, Xiaoyu Wu, Dawei Leng, et al. FLUX-Makeup: High-fidelity, identity-consistent, and robust makeup transfer via diffusion transformer, 2025. URL https://arxiv.org/abs/2508.05069.
  27. Hanqun Cao, Cheng Tan, Zhangyang Gao, Yilun Xu, Guangyong Chen, Pheng-Ann Heng, and Stan Z Li. A survey on generative diffusion models. IEEE Transactions on Knowledge and Data Engineering, 2024.
  28. Zanyi Wang, Dengyang Jiang, Liuzhuozheng Li, Sizhe Dang, Chengzu Li, Harry Yang, Guang Dai, Mengmeng Wang, and Jingdong Wang. Deforming videos to masks: Flow matching for referring video segmentation, 2025. URL https://arxiv.org/abs/2510.06139.
  29. Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  30. John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
  31. Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6:695–709, 2005.
  32. Bernt Oksendal. Stochastic Differential Equations: An Introduction with Applications. Springer, 2013.
  33. Brian D. O. Anderson. Reverse-time diffusion equation models. Stochastic Processes and their Applications, 12(3):313–326, 1982.
  34. Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015, pages 234–241. Springer, 2015.
  35. William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  36. Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.
  37. Yang Song, Sahaj Garg, Jiaxin Shi, and Stefano Ermon. Sliced score matching: A scalable approach to density and score estimation. In Uncertainty in Artificial Intelligence, pages 574–584. PMLR, 2020.
  38. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.
  39. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale, 2020. URL https://arxiv.org/abs/2010.11929.
  40. Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  41. Hyojun Go, Yunsung Lee, Seunghyun Lee, Shinhyeok Oh, Hyeongdon Moon, and Seungtaek Choi. Addressing negative transfer in diffusion models. Advances in Neural Information Processing Systems, 36:27199–27222, 2023.
  42. Qianli Ma, Xuefei Ning, Dongrui Liu, Li Niu, and Linfeng Zhang. Decouple-then-merge: Towards better training for diffusion models, 2024. URL https://arxiv.org/abs/2410.06664.
  43. Qiao Sun, Zhicheng Jiang, Hanhong Zhao, and Kaiming He. Is noise conditioning necessary for denoising generative models?, 2025. URL https://arxiv.org/abs/2502.13129.
  44. Image Team, Huanqia Cai, Sihan Cao, Ruoyi Du, Peng Gao, Steven Hoi, Zhaohui Hou, Shijie Huang, Dengyang Jiang, Xin Jin, Liangchen Li, Zhen Li, Zhong-Yu Li, David Liu, Dongyang Liu, Junhan Shi, Qilong Wu, Feng Yu, Chi Zhang, Shifeng Zhang, and Shilin Zhou. Z-Image: An efficient image generation foundation model with single-stream diffusion transformer, 202…
  45. Bingda Tang, Boyang Zheng, Xichen Pan, Sayak Paul, and Saining Xie. Exploring the deep fusion of large language models and diffusion transformers for text-to-image synthesis, 2025. URL https://arxiv.org/abs/2505.10046.
  46. Hila Chefer, Patrick Esser, Dominik Lorenz, Dustin Podell, Vikash Raja, Vinh Tong, Antonio Torralba, and Robin Rombach. Self-supervised flow matching for scalable multi-modal synthesis, 2026. URL https://arxiv.org/abs/2603.06507.
  47. Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models, 2025. URL https://arxiv.org/abs/2503.20314.
  48. Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-First International Conference on Machine Learning, 2024.
  49. Yang Song and Stefano Ermon. Improved techniques for training score-based generative models. Advances in Neural Information Processing Systems, 33:12438–12448, 2020.
  50. Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Deng-Ping Fan, Radu Timofte, and Luc Van Gool. Practical blind image denoising via Swin-Conv-UNet and data synthesis. Machine Intelligence Research, 20(6):822–836, 2023.
  51. Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.
  52. Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  53. Hariharan Narayanan and Sanjoy Mitter. Sample complexity of testing the manifold hypothesis. Advances in Neural Information Processing Systems, 23, 2010.
  54. Bradley CA Brown, Anthony L. Caterini, Brendan Leigh Ross, Jesse C Cresswell, and Gabriel Loaiza-Ganem. Verifying the union of manifolds hypothesis for image data. In The Eleventh International Conference on Learning Representations, 2023.
  55. Cheongjae Jang, Yonghyeon Lee, Yung-Kyun Noh, and Frank C. Park. Geometrically regularized autoencoders for non-Euclidean data. In The Eleventh International Conference on Learning Representations, 2023.
  56. In Huh, Jae Myung Choe, Younggu Kim, Daesin Kim, et al. Isometric quotient variational auto-encoders for structure-preserving representation learning. Advances in Neural Information Processing Systems, 36:39075–39087, 2023.
  57. Michael M. Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges, 2021. URL https://arxiv.org/abs/2104.13478.
  58. Kazusato Oko, Shunta Akiyama, and Taiji Suzuki. Diffusion models are minimax optimal distribution estimators. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 26517–26582, 2023.
  59. Rong Tang and Yun Yang. Adaptivity of diffusion models to manifold structures. In International Conference on Artificial Intelligence and Statistics, pages 1648–1656. PMLR, 2024.
  60. Nicholas Matthew Boffi, Arthur Jacot, Stephen Tu, and Ingvar Ziemann. Shallow diffusion networks provably learn hidden low-dimensional structure. In The Thirteenth International Conference on Learning Representations, 2025.
  61. Gabriel Loaiza-Ganem, Brendan Leigh Ross, Jesse C. Cresswell, and Anthony L. Caterini. Diagnosing and fixing manifold overfitting in deep generative models. Transactions on Machine Learning Research, 2022.
  62. Gabriel Loaiza-Ganem, Brendan Leigh Ross, Rasa Hosseinzadeh, Anthony L. Caterini, and Jesse C. Cresswell. Deep generative models through the lens of the manifold hypothesis: A survey and new connections. Transactions on Machine Learning Research, 2024.
  63. Valentin De Bortoli. Convergence of denoising diffusion models under the manifold hypothesis. Transactions on Machine Learning Research, 2022.
  64. Hyungjin Chung, Jeongsol Kim, Geon Yeong Park, Hyelin Nam, and Jong Chul Ye. CFG++: Manifold-constrained classifier free guidance for diffusion models, 2024. URL https://arxiv.org/abs/2406.08070.
  65. Hyungjin Chung, Jeongsol Kim, Michael T Mccann, Marc L Klasky, and Jong Chul Ye. Diffusion posterior sampling for general noisy inverse problems, 2022. URL https://arxiv.org/abs/2209.14687.
  66. Hyungjin Chung, Byeongsu Sim, Dohoon Ryu, and Jong Chul Ye. Improving diffusion models for inverse problems using manifold constraints. Advances in Neural Information Processing Systems, 35:25683–25696, 2022.
  67. Yutong He, Naoki Murata, Chieh-Hsin Lai, Yuhta Takida, Toshimitsu Uesaka, Dongjun Kim, Wei-Hsiang Liao, Yuki Mitsufuji, J Zico Kolter, Ruslan Salakhutdinov, et al. Manifold preserving guided diffusion, 2023. URL https://arxiv.org/abs/2311.16424.
  68. Zhiyuan Zhan, Liuzhuozheng Li, and Masashi Sugiyama. Understanding guidance scale in diffusion models from a geometric perspective. Transactions on Machine Learning Research, 2026.
  69. Marina Meilă and Hanyu Zhang. Manifold learning: What, how, and why. Annual Review of Statistics and Its Application, 11(1):393–417, 2024.
  70. Yuzhe Yao, Jun Chen, Zeyi Huang, Haonan Lin, Mengmeng Wang, Guang Dai, and Jingdong Wang. Manifold constraint reduces exposure bias in accelerated diffusion sampling. In The Thirteenth International Conference on Learning Representations, 2025.
  71. Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance, 2022. URL https://arxiv.org/abs/2207.12598.
  72. Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34:8780–8794, 2021.
  73. Jan Stanczuk, Georgios Batzolis, Teo Deveney, and Carola-Bibiane Schönlieb. Your diffusion model secretly knows the dimension of the data manifold, 2022. URL https://arxiv.org/abs/2212.12611.
  74. Minshuo Chen, Kaixuan Huang, Tuo Zhao, and Mengdi Wang. Score approximation, estimation, and distribution recovery of diffusion models on low-dimensional data. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 4672–4712, 2023.
  75. Sinho Chewi. Log-concave sampling, 2023. URL https://chewisinho.github.io. Book draft.
  76. Rafal Karczewski, Markus Heinonen, Alison Pouplin, Søren Hauberg, and Vikas K Garg. The spacetime of diffusion models: An information geometry perspective. In The Fourteenth International Conference on Learning Representations, 2026.
  77. Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. Advances in Neural Information Processing Systems, 29, 2016.
  78. Jean-François Le Gall. Brownian Motion, Martingales, and Stochastic Calculus. Graduate Texts in Mathematics. Springer, 2016.
  79. Pascal Vincent. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Showing the first 79 of 85 extracted references.