pith. machine review for the scientific record.

arxiv: 2605.11755 · v1 · submitted 2026-05-12 · 💻 cs.LG · cs.CV · stat.ML

Recognition: 2 theorem links · Lean Theorem

One-Step Generative Modeling via Wasserstein Gradient Flows

Emmanuel J. Candès, Jiaqi Han, Puheng Li, Qiushan Guo, Renyuan Xu, Stefano Ermon

Pith reviewed 2026-05-13 07:41 UTC · model grok-4.3

classification 💻 cs.LG · cs.CV · stat.ML
keywords generative modeling · Wasserstein gradient flow · one-step generation · Sinkhorn divergence · optimal transport · diffusion models · ImageNet

The pith

W-Flow achieves one-step ImageNet 256×256 generation at 1.29 FID by training a neural network to compress a Wasserstein gradient flow.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces W-Flow, a framework for training a generator that maps simple reference samples to target data in a single forward pass. It first evolves the reference distribution toward the target by following a Wasserstein gradient flow that minimizes an energy functional given by the Sinkhorn divergence; a static neural network is then trained to compress the full continuous evolution into one step. This yields better mode coverage and domain transfer than prior one-step methods while sampling roughly 100 times faster than multi-step diffusion models of comparable quality. A reader would care because the approach replaces expensive iterative sampling with a principled, transport-based shortcut that still reaches high fidelity.
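
To make the two-stage recipe concrete, here is a minimal, hypothetical PyTorch sketch: stage one evolves a particle approximation of the reference distribution by explicit Euler steps on the Sinkhorn energy (a discrete-time surrogate of the Wasserstein gradient flow), and stage two regresses a static generator onto the transported particles. Names, step sizes, and the endpoint-regression loss are illustrative rather than the paper's implementation; sinkhorn_divergence stands for any differentiable Sinkhorn-divergence estimator (one is sketched under "What carries the argument" below).

```python
# A minimal, hypothetical sketch of the two-stage W-Flow recipe; not the paper's code.
import torch

def run_wasserstein_flow(x_ref, x_data, energy, n_steps=200, step_size=0.5):
    """Stage 1: discrete-time surrogate of the Wasserstein gradient flow.
    Particles descend the energy (e.g., a Sinkhorn divergence to a data batch)."""
    x = x_ref.clone().requires_grad_(True)
    for _ in range(n_steps):
        (grad,) = torch.autograd.grad(energy(x, x_data), x)
        with torch.no_grad():
            x -= step_size * grad  # explicit Euler step along the negative gradient
    return x.detach()

def generator_step(G, opt, x_ref, x_flow_end):
    """Stage 2: fit a static one-step generator so that G(x_ref) lands where the
    flow ends. Plain endpoint regression is a simplification of the paper's objective."""
    opt.zero_grad()
    loss = ((G(x_ref) - x_flow_end) ** 2).mean()
    loss.backward()
    opt.step()
    return loss.item()
```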

Core claim

W-Flow defines an evolution from reference to target distribution through a Wasserstein gradient flow minimizing the Sinkhorn divergence energy functional, then trains a static neural generator to realize this entire evolution in one step. The finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically the resulting model reaches 1.29 FID on one-step ImageNet 256×256 generation, improves mode coverage and domain transfer, and yields approximately 100 times faster sampling than multi-step diffusion models with similar FID scores.

What carries the argument

The Wasserstein gradient flow of the Sinkhorn divergence energy functional, compressed into a single forward pass by a static neural generator.
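
Since the Sinkhorn divergence is the load-bearing energy here, a compact reference implementation helps fix ideas. The following is a hedged, self-contained sketch of a debiased Sinkhorn divergence with log-domain iterations (in the spirit of Feydy et al. [14]); the paper's estimator, batching, and debiasing details may differ.

```python
import math
import torch

def entropic_ot(x, y, eps=0.05, n_iter=200):
    """Entropic OT cost OT_eps between uniform empirical measures on x and y,
    computed with numerically stable log-domain Sinkhorn iterations."""
    C = torch.cdist(x, y) ** 2  # squared-Euclidean ground cost
    n, m = C.shape
    log_a = torch.full((n,), -math.log(n))  # log of uniform weights 1/n
    log_b = torch.full((m,), -math.log(m))
    f, g = torch.zeros(n), torch.zeros(m)
    for _ in range(n_iter):
        f = -eps * torch.logsumexp(log_b[None, :] + (g[None, :] - C) / eps, dim=1)
        g = -eps * torch.logsumexp(log_a[:, None] + (f[:, None] - C) / eps, dim=0)
    # optimal entropic coupling and its transport cost
    log_pi = log_a[:, None] + log_b[None, :] + (f[:, None] + g[None, :] - C) / eps
    return (torch.exp(log_pi) * C).sum()

def sinkhorn_divergence(x, y, eps=0.05):
    """Debiased divergence S_eps(p, q) = OT_eps(p, q)
    - (OT_eps(p, p) + OT_eps(q, q)) / 2; nonnegative, zero iff p = q."""
    return entropic_ot(x, y, eps) - 0.5 * (entropic_ot(x, x, eps) + entropic_ot(y, y, eps))
```

Being differentiable end-to-end, this estimator can serve as the energy in the flow sketch above; its gradient with respect to the particles plays the role of the flow's velocity field.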

Load-bearing premise

Finite-sample training dynamics converge to the continuous-time Wasserstein gradient flow dynamics under suitable assumptions.

What would settle it

A direct comparison testing whether samples from the trained one-step generator match the distribution reached by running the full multi-step Wasserstein flow on the same reference inputs; a persistent deviation would break the compression claim.
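
A hedged sketch of that test, reusing the hypothetical helpers above: generate from the one-step network and from the full discrete flow on the same reference batch, then score the gap with a simple two-sample statistic such as the energy distance.

```python
import torch

def energy_distance(x, y):
    """Energy distance 2*E|X-Y| - E|X-X'| - E|Y-Y'|; zero for identical
    distributions (slightly biased here by the zero diagonal terms)."""
    return (2 * torch.cdist(x, y).mean()
            - torch.cdist(x, x).mean()
            - torch.cdist(y, y).mean())

def flow_vs_generator_gap(G, x_ref, x_data):
    """Run the multi-step flow and the one-step generator on the SAME
    reference inputs; a persistently large gap would falsify the claim
    that the generator compresses the flow faithfully."""
    x_flow = run_wasserstein_flow(x_ref, x_data, sinkhorn_divergence)
    with torch.no_grad():
        x_gen = G(x_ref)
    return energy_distance(x_flow, x_gen).item()
```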

Figures

Figures reproduced from arXiv: 2605.11755 by Emmanuel J. Candès, Jiaqi Han, Puheng Li, Qiushan Guo, Renyuan Xu, Stefano Ermon.

Figure 1
Figure 1. (Left) 1-NFE samples from W-Flow-L/2 trained from scratch on ImageNet 256×256. (Right) Sample quality (measured by FID) vs. effective sampling compute [39] (billion parameters × number of function evaluations during sampling), evaluated on ImageNet 256×256.
Figure 2
Figure 2. (a) Conceptual diagram of W-Flow. (b) Visualization of the training dynamics projected onto the Sinkhorn divergence landscape on 8 Gaussian mixtures, shown on a logarithmic scale.
Figure 3
Figure 3. Comparison between one-batch and two-batch estimators on learning a 2D Gaussian. (A minimal sketch of the two estimators follows this list.)
Figure 4
Figure 4. Classifier-free guidance. Left: FID and Inception Score curves when sweeping over CFG scales. Right: Image samples from W-Flow-L/2 with CFG increasing from 0.0 to 2.0.
Figure 5
Figure 5. (a) Oval-to-circle domain transfer. Source and target are constructed by sampling angles uniformly from [0, 2π), with parametric curves corrupted by Gaussian noise. (b) & (c) One-step facial age translation on FFHQ, mapping older faces to younger ones. (b) Histogram of the latent ℓ2 distance between 2,000 source images and their generated targets. (c) Visual comparison.
Figure 6
Figure 6. Evaluation of mode coverage under imbalanced target distributions. (a) Mode coverage on a 2D Gaussian mixtures dataset featuring six dominant modes and two distant minority modes. (b) PCA scatter plot of generated latent codes for an artificially imbalanced FFHQ target distribution (95% senior faces, 5% child faces). See Appendix F for generated samples showing the comparison of mode coverage.
Figure 7
Figure 7. Evaluation of self-transport estimators on a 2D Gaussian mixtures dataset featuring six …
Figure 8
Figure 8. Comparison of velocity guidance and distribution guidance for conditional generation on a …
Figure 9
Figure 9. Illustration of the difference in the velocity field computation between Drifting Model …
Figure 10
Figure 10. Uncurated samples generated by W-Flow-L/2 with CFG.
Figure 11
Figure 11. Uncurated samples generated by W-Flow-XL/2.
Figure 12
Figure 12. Uncurated samples generated by W-Flow-XL/2 with CFG.
Figure 13
Figure 13. Uncurated samples generated by Drifting Model in the mode coverage experiment.
Figure 14
Figure 14. Uncurated samples generated by W-Flow in the mode coverage experiment (Sec. …).
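
Figure 3 above contrasts one-batch and two-batch estimators of the self-transport term OTε(q, q) that appears inside the debiased Sinkhorn divergence. Estimated naively from a single batch, every particle can be matched to itself at zero cost, which biases the self-term toward zero. Below is a hedged sketch of the two estimators, reusing the hypothetical entropic_ot helper sketched earlier on this page; the paper's actual estimator may differ in detail.

```python
def self_ot_one_batch(x, eps=0.05):
    """One-batch estimate of OT_eps(q, q): the cost matrix has a zero
    diagonal, so self-matching pulls the estimate toward zero."""
    return entropic_ot(x, x, eps)

def self_ot_two_batch(x, eps=0.05):
    """Two-batch estimate: split the batch and transport one half onto the
    other, so no particle can trivially match itself at zero cost."""
    half = x.shape[0] // 2
    return entropic_ot(x[:half], x[half:], eps)
```
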
Original abstract

Diffusion models and flow-based methods have shown impressive generative capability, especially for images, but their sampling is expensive because it requires many iterative updates. We introduce W-Flow, a framework for training a generator that transforms samples from a simple reference distribution into samples from a target data distribution in a single step. This is achieved in two steps: we first define an evolution from the reference distribution to the target distribution through a Wasserstein gradient flow that minimizes an energy functional; second, we train a static neural generator to compress this evolution into one-step generation. We instantiate the energy functional with the Sinkhorn divergence, which yields an efficient optimal-transport-based update rule that captures global distributional discrepancy and improves coverage of the target distribution. We further prove that the finite-sample training dynamics converge to the continuous-time distributional dynamics under suitable assumptions. Empirically, W-Flow sets a new state of the art for one-step ImageNet 256×256 generation, achieving 1.29 FID, with improved mode coverage and domain transfer. Compared to multi-step diffusion models with similar FID scores, our method yields approximately 100× faster sampling. These results show that Wasserstein gradient flows provide a principled and effective foundation for fast and high-fidelity generative modeling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces W-Flow, a two-stage framework that first evolves samples from a reference distribution to a target data distribution via a Wasserstein gradient flow minimizing a Sinkhorn-divergence energy functional, then trains a static neural generator to compress this continuous evolution into a single forward pass. It asserts a convergence result for finite-sample training dynamics to the continuous-time flow under suitable assumptions, and reports new state-of-the-art one-step performance on ImageNet 256×256 (1.29 FID) together with improved mode coverage, domain transfer, and roughly 100× faster sampling than multi-step diffusion models of comparable FID.

Significance. If the convergence result can be made rigorous and the empirical gains hold under controlled ablations, the work would supply a principled optimal-transport route to high-fidelity one-step generation that improves upon both diffusion and existing one-step baselines in coverage and speed, with potential impact on downstream tasks requiring fast sampling.

major comments (1)
  1. [Abstract and convergence theorem] Abstract and theoretical development: the central claim that the trained one-step generator faithfully realizes the Wasserstein flow rests on a convergence statement for finite-sample dynamics that is conditioned on unspecified 'suitable assumptions.' Because the 1.29 FID result is presented as evidence that the discrete network compresses the continuous dynamics, the precise conditions (regularity of the energy functional, Lipschitz bounds on the velocity field, uniform convergence rates of empirical measures, or control of discretization error in 256×256 image space) must be stated explicitly and shown to be satisfied; without them the link between theory and the reported FID remains unverified.
minor comments (2)
  1. [Method section] The precise definition of the Sinkhorn-regularized energy functional and the architecture/hyper-parameters of the one-step generator should be moved from supplementary material into the main text to support reproducibility of the 1.29 FID number.
  2. [Experiments] Figure captions and experimental tables should explicitly report the number of function evaluations and wall-clock time per sample when claiming the 100× speedup relative to diffusion baselines.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The feedback on clarifying the convergence result is well-taken and will strengthen the manuscript. We respond point-by-point below.

Point-by-point responses
  1. Referee: [Abstract and convergence theorem] Abstract and theoretical development: the central claim that the trained one-step generator faithfully realizes the Wasserstein flow rests on a convergence statement for finite-sample dynamics that is conditioned on unspecified 'suitable assumptions.' Because the 1.29 FID result is presented as evidence that the discrete network compresses the continuous dynamics, the precise conditions (regularity of the energy functional, Lipschitz bounds on the velocity field, uniform convergence rates of empirical measures, or control of discretization error in 256×256 image space) must be stated explicitly and shown to be satisfied; without them the link between theory and the reported FID remains unverified.

    Authors: We agree that the assumptions require explicit statement to make the theoretical-empirical connection rigorous. In the revision we will expand the theorem (Section 3) to list them verbatim: (i) the Sinkhorn energy is λ-convex and C²-smooth w.r.t. the 2-Wasserstein metric for ε>0; (ii) the resulting velocity field is globally L-Lipschitz; (iii) the empirical measures satisfy a uniform Glivenko–Cantelli property with rate O(n^{-1/2} log n) under the covering numbers of the RKHS induced by the kernel; (iv) the Euler–Maruyama discretization error is O(Δt) uniformly on compact time intervals when the velocity is bounded. We will add a short verification paragraph showing that (i)–(iii) hold for the entropic Sinkhorn divergence on the image manifold (citing standard OT regularity results) and that (iv) is controlled by our chosen step-size schedule. The 1.29 FID remains an empirical illustration of practical performance; the revised theorem will now make the approximation guarantee precise rather than conditional on unspecified assumptions. revision: yes
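
To make the shape of the claimed guarantee concrete, the sketch below renders, in LaTeX, the kind of bound that assumptions (i)–(iv) would support. The paper's actual Theorem 3.1 is not reproduced on this page, so the constants and rates are illustrative, not verbatim.

```latex
% Illustrative shape of the guarantee under assumptions (i)-(iv);
% not the paper's verbatim Theorem 3.1.
\begin{theorem}[informal]
Assume (i) $\mathcal{E}_\varepsilon(\cdot) = S_\varepsilon(\cdot,\, p_{\mathrm{data}})$
is $\lambda$-convex and $C^2$-smooth in the $2$-Wasserstein metric;
(ii) the velocity field $v_t = -\nabla_{W_2}\mathcal{E}_\varepsilon(q_t)$ is
globally $L$-Lipschitz; (iii) empirical measures of $n$ samples converge
uniformly at rate $O(n^{-1/2}\log n)$; and (iv) the explicit Euler
discretization with step $\Delta t$ incurs $O(\Delta t)$ error on $[0,T]$.
Then the finite-sample, discrete-time dynamics $\hat{q}^{\,n,\Delta t}_t$
track the continuous-time flow $q_t$:
\[
  \sup_{t \in [0,T]} W_2\bigl(\hat{q}^{\,n,\Delta t}_t,\, q_t\bigr)
  \le C(T, L, \lambda)\bigl(n^{-1/2}\log n + \Delta t\bigr)
  \quad \text{with high probability.}
\]
\end{theorem}
```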

Circularity Check

0 steps flagged

No circularity detected in the derivation chain

Full rationale

The paper defines an evolution via Wasserstein gradient flow minimizing an energy functional instantiated with Sinkhorn divergence, then trains a neural generator to compress the flow into one step. This is a standard two-stage procedure using established optimal transport geometry and neural approximation; the claimed one-step generator is optimized against the flow rather than defined to equal it by construction. The convergence of finite-sample dynamics is asserted under suitable assumptions without any equation reducing the reported FID or sampling speed directly to a fitted internal parameter. No load-bearing self-citation, uniqueness theorem imported from prior author work, or ansatz smuggled via citation appears in the provided text. The ImageNet results are presented as empirical outcomes, not forced predictions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the existence of a Wasserstein gradient flow for the chosen energy and on the ability of a neural network to approximate its finite-time evolution; both are standard in the literature but invoked without new justification here.

free parameters (1)
  • Sinkhorn regularization strength
    Controls the approximation quality of the divergence and must be chosen or tuned for each dataset.
axioms (1)
  • Domain assumption: finite-sample training dynamics converge to continuous-time distributional dynamics under suitable assumptions.
    Invoked to justify that the trained generator faithfully follows the flow; the assumptions themselves are left unspecified in the abstract.

pith-pipeline@v0.9.0 · 5541 in / 1304 out tokens · 44302 ms · 2026-05-13T07:41:20.795142+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 18 internal anchors

  1. [1]

    Building Normalizing Flows with Stochastic Interpolants

    Michael S. Albergo and Eric Vanden-Eijnden. Building normalizing flows with stochastic interpolants. arXiv preprint arXiv:2209.15571, 2022.

  2. [2]

    LightSBB-M: Bridging Schrödinger and Bass for Generative Diffusion Modeling

    Alexandre Alouadi, Pierre Henry-Labordère, Grégoire Loeper, Othmane Mazhar, Huyên Pham, and Nizar Touzi. LightSBB-M: Bridging Schrödinger and Bass for generative diffusion modeling. arXiv preprint arXiv:2601.19312, 2026.

  3. [3]

    Wasserstein Generative Adversarial Networks

    Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.

  4. [4]

    How to Build a Consistency Model: Learning Flow Maps via Self-Distillation

    Nicholas M. Boffi, Michael S. Albergo, and Eric Vanden-Eijnden. How to build a consistency model: Learning flow maps via self-distillation. arXiv preprint arXiv:2505.18825, 2025.

  5. [5]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.

  6. [6]

    Gradient Flow Drifting: Generative Modeling via Wasserstein Gradient Flows of KDE-Approximated Divergences

    Jiarui Cao, Zixuan Wei, and Yuxin Liu. Gradient flow drifting: Generative modeling via Wasserstein gradient flows of KDE-approximated divergences. arXiv preprint arXiv:2603.10592, 2026.

  7. [7]

    MaskGIT: Masked Generative Image Transformer

    Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. MaskGIT: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11315–11325, 2022.

  8. [8]

    Scalable Wasserstein Gradient Flow for Generative Modeling through Unbalanced Optimal Transport

    Jaemoo Choi, Jaewoong Choi, and Myungjoo Kang. Scalable Wasserstein gradient flow for generative modeling through unbalanced optimal transport. In Proceedings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 8629–8650. PMLR, 2024.

  9. [9]

    Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling

    Valentin De Bortoli, James Thornton, Jeremy Heng, and Arnaud Doucet. Diffusion Schrödinger bridge with applications to score-based generative modeling. arXiv preprint arXiv:2106.01357, 2021.

  10. [10]

    ImageNet: A Large-Scale Hierarchical Image Database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009.

  11. [11]

    Generative Modeling via Drifting

    Mingyang Deng, He Li, Tianhong Li, Yilun Du, and Kaiming He. Generative modeling via drifting. arXiv preprint arXiv:2602.04770, 2026.

  12. [12]

    Diffusion Models Beat GANs on Image Synthesis

    Prafulla Dhariwal and Alexander Nichol. Diffusion models beat GANs on image synthesis. NeurIPS, 34:8780–8794, 2021.

  13. [13]

    Variational Wasserstein Gradient Flow

    Jiaojiao Fan, Qinsheng Zhang, Amirhossein Taghvaei, and Yongxin Chen. Variational Wasserstein gradient flow. arXiv preprint arXiv:2112.02424, 2021.

  14. [14]

    Interpolating between Optimal Transport and MMD Using Sinkhorn Divergences

    Jean Feydy, Thibault Séjourné, François-Xavier Vialard, Shun-ichi Amari, Alain Trouvé, and Gabriel Peyré. Interpolating between optimal transport and MMD using Sinkhorn divergences. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2681–2690. PMLR, 2019.

  15. [15]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557, 2024.

  16. [16]

    Drift-Based Policy Optimization: Native One-Step Policy Learning for Online Robot Control

    Yuxuan Gao, Yedong Shen, Shiqi Zhang, Wenhao Yu, Yifan Duan, Jiajia Wu, Jiajun Deng, Yanyong Zhang, et al. Drift-based policy optimization: Native one-step policy learning for online robot control. arXiv preprint arXiv:2604.03540, 2026.

  17. [17]

    Learning Generative Models with Sinkhorn Divergences

    Aude Genevay, Gabriel Peyré, and Marco Cuturi. Learning generative models with Sinkhorn divergences. In International Conference on Artificial Intelligence and Statistics, pages 1608–. PMLR, 2018.

  18. [18]

  19. [19]

    Mean Flows for One-Step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447, 2025.

  20. [20]

    Improved Mean Flows: On the Challenges of Fastforward Generative Models

    Zhengyang Geng, Yiyang Lu, Zongze Wu, Eli Shechtman, J. Zico Kolter, and Kaiming He. Improved mean flows: On the challenges of fastforward generative models. arXiv preprint arXiv:2512.02012, 2025.

  21. [21]

    Generative Adversarial Nets

    Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. NeurIPS, 2014.

  22. [22]

    Improved Training of Wasserstein GANs

    Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. Advances in Neural Information Processing Systems, 30, 2017.

  23. [23]

    The Wasserstein Gradient Flow of the Sinkhorn Divergence between Gaussian Distributions

    Mathis Hardion and Théo Lacombe. The Wasserstein gradient flow of the Sinkhorn divergence between Gaussian distributions. arXiv preprint arXiv:2602.10726, 2026.

  24. [24]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770–778, 2016.

  25. [25]

    Sinkhorn-Drifting Generative Models

    Ping He, Om Khangaonkar, Hamed Pirsiavash, Yikun Bai, and Soheil Kolouri. Sinkhorn-drifting generative models. arXiv preprint arXiv:2603.12366, 2026.

  26. [26]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. NeurIPS, 2017.

  27. [27]

    Denoising Diffusion Probabilistic Models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. NeurIPS, 33:6840–6851, 2020.

  28. [28]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  29. [29]

    The Variational Formulation of the Fokker–Planck Equation

    Richard Jordan, David Kinderlehrer, and Felix Otto. The variational formulation of the Fokker–Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.

  30. [30]

    Scaling Up GANs for Text-to-Image Synthesis

    Minguk Kang, Jun-Yan Zhu, Richard Zhang, Jaesik Park, Eli Shechtman, Sylvain Paris, and Taesung Park. Scaling up GANs for text-to-image synthesis. In CVPR, pages 10124–10134, 2023.

  31. [31]

    Marlowe: Stanford's GPU-Based Computational Instrument

    Craig Kapfer, Kurt Stine, Balasubramanian Narasimhan, Christopher Mentzel, and Emmanuel Candès. Marlowe: Stanford's GPU-based computational instrument, 2025.

  32. [32]

    A Style-Based Generator Architecture for Generative Adversarial Networks

    Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4401–4410, 2019.

  33. [33]

    A Unified View of Drifting and Score-Based Models

    Chieh-Hsin Lai, Bac Nguyen, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yuki Mitsufuji, Stefano Ermon, and Molei Tao. A unified view of drifting and score-based models. arXiv preprint arXiv:2603.07514, 2026.

  34. [34]

    Autoregressive Image Generation without Vector Quantization

    Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation without vector quantization. NeurIPS, 37:56424–56445, 2024.

  35. [35]

    Generative Moment Matching Networks

    Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, pages 1718–1727. PMLR, 2015.

  36. [36]

    Generative Drifting for Conditional Medical Image Generation

    Zirong Li, Siyuan Mei, Weiwen Wu, Andreas Maier, Lina Gölz, and Yan Xia. Generative drifting for conditional medical image generation. arXiv preprint arXiv:2604.19736, 2026.

  37. [37]

    Adversarial Flow Models

    Shanchuan Lin, Ceyuan Yang, Zhijie Lin, Hao Chen, and Haoqi Fan. Adversarial flow models. arXiv preprint arXiv:2511.22475, 2025.

  38. [38]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.

  39. [39]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.

  40. [40]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024.

  41. [41]

    Schrödinger Bridge for Generative AI: Soft-Constrained Formulation and Convergence Analysis

    Jin Ma, Ying Tan, and Renyuan Xu. Schrödinger bridge for generative AI: Soft-constrained formulation and convergence analysis. arXiv preprint arXiv:2510.11829, 2025.

  42. [42]

    SiT: Exploring Flow and Diffusion-Based Generative Models with Scalable Interpolant Transformers

    Nanye Ma, Mark Goldstein, Michael S. Albergo, Nicholas M. Boffi, Eric Vanden-Eijnden, and Saining Xie. SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In ECCV, pages 23–40. Springer, 2024.

  43. [43]

    Large-Scale Wasserstein Gradient Flows

    Petr Mokrov, Alexander Korotin, Lingxiao Li, Aude Genevay, Justin M. Solomon, and Evgeny Burnaev. Large-scale Wasserstein gradient flows. Advances in Neural Information Processing Systems, 34:15243–15256, 2021.

  44. [44]

    Entropic Optimal Transport: Convergence of Potentials

    Marcel Nutz and Johannes Wiesel. Entropic optimal transport: Convergence of potentials. Probability Theory and Related Fields, 184(1):401–424, 2022.

  45. [45]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In CVPR, pages 4195–4205, 2023.

  46. [46]

    Computational Optimal Transport: With Applications to Data Science

    Gabriel Peyré and Marco Cuturi. Computational Optimal Transport: With Applications to Data Science. Now Foundations and Trends, 2019.

  47. [47]

    Adversarial Latent Autoencoders

    Stanislav Pidhorskyi, Donald A. Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14104–14113, 2020.

  48. [48]

    Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

    Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.

  49. [49]

    High-Resolution Image Synthesis with Latent Diffusion Models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In CVPR, pages 10684–10695, 2022.

  50. [50]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512, 2022.

  51. [51]

    Multistep Distillation of Diffusion Models via Moment Matching

    Tim Salimans, Thomas Mensink, Jonathan Heek, and Emiel Hoogeboom. Multistep distillation of diffusion models via moment matching. NeurIPS, 37:36046–36070, 2024.

  52. [52]

    StyleGAN-XL: Scaling StyleGAN to Large Diverse Datasets

    Axel Sauer, Katja Schwarz, and Andreas Geiger. StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In SIGGRAPH, pages 1–10, 2022.

  53. [53]

    Concerning Nonnegative Matrices and Doubly Stochastic Matrices

    Richard Sinkhorn and Paul Knopp. Concerning nonnegative matrices and doubly stochastic matrices. Pacific Journal of Mathematics, 21(2):343–348, 1967.

  54. [54]

    Deep Unsupervised Learning Using Nonequilibrium Thermodynamics

    Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In ICML, pages 2256–2265. PMLR, 2015.

  55. [55]

    Improved Techniques for Training Consistency Models

    Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189, 2023.

  56. [56]

    Consistency Models

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. 2023.

  57. [57]

    Generative Modeling by Estimating Gradients of the Data Distribution

    Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32, 2019.

  58. [58]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.

  59. [59]

    Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. NeurIPS, 37:84839–84865, 2024.

  60. [60]

    Wasserstein Auto-Encoders

    Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.

  61. [61]

    Generative Drifting Is Secretly Score Matching: A Spectral and Variational Perspective

    Erkan Turan and Maks Ovsjanikov. Generative drifting is secretly score matching: a spectral and variational perspective. arXiv preprint arXiv:2603.09936, 2026.

  62. [62]

    ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

    Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu. ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems, 36:8406–8441, 2023.

  63. [63]

    Transition Models: Rethinking the Generative Learning Objective

    Zidong Wang, Yiyuan Zhang, Xiaoyu Yue, Xiangyu Yue, Yangguang Li, Wanli Ouyang, and Lei Bai. Transition models: Rethinking the generative learning objective. arXiv preprint arXiv:2509.04394, 2025.

  64. [64]

    Flow-Based Generative Models as Iterative Algorithms in Probability Space

    Yao Xie and Xiuyuan Cheng. Flow-based generative models as iterative algorithms in probability space. arXiv preprint arXiv:2502.13394, 2025.

  65. [65]

    Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In CVPR, pages 15703–15712, 2025.

  66. [66]

    Improved Distribution Matching Distillation for Fast Image Synthesis

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T. Freeman. Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems, 37:47455–47487, 2024.

  67. [67]

    One-Step Diffusion with Distribution Matching Distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In CVPR, pages 6613–6623, 2024.

  68. [68]

    Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

    Sihyun Yu, Sangkyung Kwak, Huiwon Jang, Jongheon Jeong, Jonathan Huang, Jinwoo Shin, and Saining Xie. Representation alignment for generation: Training diffusion transformers is easier than you think. arXiv preprint arXiv:2410.06940, 2024.

  69. [69]

    AlphaFlow: Understanding and Improving MeanFlow Models

    Huijie Zhang, Aliaksandr Siarohin, Willi Menapace, Michael Vasilkovsky, Sergey Tulyakov, Qing Qu, and Ivan Skorokhodov. AlphaFlow: Understanding and improving MeanFlow models. arXiv preprint arXiv:2510.20771, 2025.

  70. [70]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders. arXiv preprint arXiv:2510.11690, 2025.

  71. [71]

    Inductive Moment Matching

    Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching. arXiv preprint arXiv:2503.07565, 2025.

    Linqi Zhou, Stefano Ermon, and Jiaming Song. Inductive moment matching.arXiv preprint arXiv:2503.07565, 2025. 2, 3, 30 15 Appendix Table of Contents A Proofs 16 A.1 Complete statement and the proof of Theorem 3.1 . . . . . . . . . . . . . . . . 16 B Additional discussions 22 B.1 Wasserstein gradient flows of energy functionals . . . . . . . . . . . . . . ...