pith. machine review for the scientific record.

arxiv: 2602.10764 · v2 · submitted 2026-02-11 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Dual-End Consistency Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords: consistency models · diffusion models · one-step generation · ImageNet 256x256 · FID score · PF-ODE · noise-to-noisy mapping · flow matching

The pith

The Dual-End Consistency Model stabilizes training and enables one-step generation by selecting three critical sub-trajectories from the PF-ODE and using a noise-to-noisy mapping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the training instability and inflexible sampling in consistency models for diffusion and flow-based generative models. It identifies that instability stems from loss divergence in the self-supervised term and inflexibility from error accumulation during sampling. By decomposing the PF-ODE trajectory and targeting three key sub-trajectories, combined with flow matching regularization and a novel noise-to-noisy mapping, the method achieves reliable few-step distillation. This results in superior performance for fast image synthesis on large datasets. A sympathetic reader would care because it could make high-quality generative models practical for real-time applications without multiple iterative steps.

Core claim

The Dual-End Consistency Model decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. It leverages continuous-time consistency model objectives for few-step distillation and flow matching as a boundary regularizer to stabilize training, while introducing a noise-to-noisy mapping to map noise to any point and alleviate first-step error accumulation, achieving a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset.
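The three optimization targets named in the core claim can be made concrete with a short sketch. Everything here is assumed for illustration: the interfaces `f(x, t, s)`, the EMA target, the teacher velocity `v`, and the linear noising path are generic consistency-model and flow-matching conventions, not the paper's actual code or parameterization.

```python
import torch

def decm_loss_sketch(f, f_ema, v, x0):
    """Illustrative three-term objective in the spirit of DE-CM.

    Assumed (hypothetical) interfaces, not the paper's code:
      f(x, t, s)     -- student flow map: state x at time t -> prediction at s
      f_ema(x, t, s) -- EMA copy of f, used as a stop-gradient target
      v(x, t)        -- pretrained flow-matching velocity field (teacher)
    Uses a linear path x_t = (1 - t) * x0 + t * eps purely for illustration.
    """
    b = x0.shape[0]
    shape = (b,) + (1,) * (x0.dim() - 1)
    eps = torch.randn_like(x0)
    t = torch.rand(shape)            # interior time in (0, 1)
    s = t * torch.rand(shape)        # earlier time, 0 <= s < t
    xt = (1 - t) * x0 + t * eps
    xs = (1 - s) * x0 + s * eps
    zero, one = torch.zeros_like(t), torch.ones_like(t)

    # (1) Consistency-distillation sub-trajectory: map any time to data,
    #     matched against the EMA target evaluated from the earlier state.
    loss_cm = (f(xt, t, zero) - f_ema(xs, s, zero).detach()).pow(2).mean()

    # (2) Flow-matching boundary regularizer: for s near t the flow map
    #     should agree with an Euler step under the teacher velocity.
    loss_fm = (f(xt, t, s) - (xt + (s - t) * v(xt, t))).pow(2).mean()

    # (3) Noise-to-noisy (N2N) sub-trajectory: pure noise at t = 1 is mapped
    #     to an arbitrary intermediate point, easing first-step error.
    loss_n2n = (f(eps, one, s) - xs.detach()).pow(2).mean()

    return loss_cm + loss_fm + loss_n2n
```

The equal weighting of the three terms is another simplification; the paper's actual weighting functions are not reproduced here.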

What carries the argument

The Dual-End Consistency Model, which selects three critical sub-trajectories from the PF-ODE decomposition and applies a noise-to-noisy mapping to ensure stable training and flexible sampling.

If this is right

  • Training becomes stable by avoiding loss divergence through targeted sub-trajectory optimization.
  • Few-step distillation is enabled via continuous-time CM objectives.
  • First-step error accumulation is reduced by the noise-to-noisy mapping.
  • The model outperforms prior CM-based one-step methods on ImageNet 256x256 with FID 1.70.
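The flexible-sampling point above can be sketched as a generic flow-map sampler. Here `f(x, t, s)` is a hypothetical interface, not the paper's API: it transports a state from time `t` to time `s`.

```python
import torch

def sample_decm_sketch(f, z, nfe=1, T=1.0):
    """Hedged sketch of flow-map sampling with an N2N-style first step.

    f(x, t, s) is an assumed interface that transports x from time t to s.
    With nfe == 1 this is one-step generation (noise at T -> data at 0).
    With nfe > 1, the first call maps noise to an *intermediate* time rather
    than forcing a full jump -- the role the paper assigns to the N2N mapping.
    """
    times = [T * (1 - k / nfe) for k in range(nfe + 1)]  # T, ..., 0
    x = z
    for t, s in zip(times[:-1], times[1:]):
        x = f(x, t, s)
    return x
```

The uniform time grid is a placeholder; any schedule ending at 0 fits the same loop.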

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selection of specific sub-trajectories could be adapted to improve stability in other ODE-based generative approaches.
  • This might allow consistency models to scale to higher resolutions without additional regularization techniques.
  • Real-time image generation systems could incorporate this for lower latency in applications like video synthesis.

Load-bearing premise

The assumption that exactly three critical sub-trajectories from the PF-ODE decomposition will eliminate loss divergence and first-step error accumulation without introducing new instabilities.

What would settle it

A faithful implementation of the three sub-trajectories and the N2N mapping that fails to achieve an FID score below 2.0 in one-step generation on ImageNet 256x256 would falsify the effectiveness claim.

Figures

Figures reproduced from arXiv: 2602.10764 by Changqing Zou, Ge Bai, Linwei Dong, Ruoyu Guo, Yawei Luo, Zehuan Yuan.

Figure 1: Left: Selected samples from DE-CM trained on ImageNet at 256 × 256 resolution with 2 NFE. Right: Images generated by our text-to-image model across 2 to 50 NFE. The prompts used are provided in the Appendix.
Figure 2: Left (a): Comparison of FID scores across different models under various NFE settings, showing the superior performance of our method in both few-step and multi-step sampling. Right (b): Comparison of learning objectives: CMs Distillation [45] (s = T), Flow Matching [22] (s = t), MeanFlow [10] / AYF [39] (upper triangle), DE-CM (triangular boundary).
Figure 3: Left (a): Gradient stability comparison between supervised and self-supervised loss functions; the self-supervised loss exhibits more unstable gradients. Right (b): Visualization results of different NFE sampling using the CM sampler.
Figure 4: We select significant trajectories from the whole {(t, s) | t < s} space and treat these selected trajectories as our optimization targets. Specifically, we employ the continuous-time consistency distillation trajectory to optimize the mapping from arbitrary time points to data, thereby achieving few-step distillation. We leverage the proposed noise-to-noisy (N2N) mapping objective to eliminate the constrain…
Figure 5: Left (a): Comparison of gradient norm curves with and without flow matching boundary constraints. Right (b): Training efficiency comparison of different methods on 8 GPUs; our method achieves a more efficient convergence rate.
Figure 6: Selected samples from DE-CM trained on ImageNet at 256 × 256 resolution with 1 NFE.
Figure 7: Qualitative visualization reveals DE-CM's superiority in quality and text alignment across all NFE. Existing competing models exhibit significant step-quality trade-offs, failing at low NFE (SD3.5-Medium, FLUX.1-Dev) or producing oversaturated samples at high NFE (LCM, CTM, PCM and Hyper-SD). DE-CM maintains robust performance throughout the efficiency-quality spectrum.
Figure 8: Comparison of 1 NFE results under 8-GPU resources. DE-CM demonstrated exceptional convergence efficiency at 16 GPU hours compared to sCMs and MeanFlow.
Figure 9: Visual quality comparison between DE-CM and existing methods under 1 NFE. DE-CM achieves superior naturalness in detail rendering, outperforming existing methods. The prompts used are provided in the Appendix.
Original abstract

The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first conduct an analysis on these two limitations: training instability originates from loss divergence induced by unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on these insights and analysis, we propose the Dual-End Consistency Model (DE-CM) that selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CMs objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
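As background for the abstract's PF-ODE and consistency-function language, the standard relations from the CM literature can be written as follows (generic notation; the paper's exact weighting and parameterization may differ):

```latex
% Probability-flow ODE along which samples are transported (generic form):
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v(x_t, t), \qquad t \in [0, T].
% A consistency function maps every point of a trajectory to its endpoint,
% so any two points on the same trajectory share one image:
f(x_t, t) = x_0 \quad \Longrightarrow \quad f(x_t, t) = f(x_s, s).
% DE-CM's flow-map view generalizes this to two times, f(x_t, t, s) \approx x_s,
% with the three selected sub-trajectories corresponding to s = 0
% (distillation to data), s \to t (the flow-matching boundary), and
% t = T with arbitrary s (the noise-to-noisy mapping).
```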

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Dual-End Consistency Model (DE-CM) to overcome training instability and inflexible sampling in consistency models for diffusion/flow-based generative models. It analyzes instability as arising from loss divergence in the self-supervised term and sampling issues from first-step error accumulation. The method decomposes the PF-ODE trajectory, selects three critical sub-trajectories as optimization targets, employs continuous-time CM objectives with a flow-matching boundary regularizer, and introduces a noise-to-noisy (N2N) mapping to map noise to arbitrary points. Experiments claim a state-of-the-art one-step FID of 1.70 on ImageNet 256×256, outperforming prior CM-based approaches.

Significance. If the central claims hold under rigorous validation, the work would be significant for advancing practical one-step sampling in large-scale generative models by directly targeting trajectory-dependent instabilities. The explicit decomposition analysis and introduction of N2N mapping represent potentially useful technical contributions, but the absence of ablations, error bars, or independent verification of the three-trajectory choice reduces the strength of the significance assessment.

major comments (3)
  1. [Abstract / §3] Abstract and method description: The selection of exactly three critical sub-trajectories from the PF-ODE decomposition is presented as key to eliminating loss divergence and error accumulation, yet no explicit criterion for identifying 'critical' points, optimality condition, or ablation comparing 2/4/5 trajectories is provided; this choice appears load-bearing for the stability and FID claims but is unsupported by sensitivity analysis.
  2. [Experiments] Experimental results: The reported SOTA FID of 1.70 for one-step generation lacks error bars, multiple random seeds, or ablation tables isolating the contribution of the three sub-trajectories versus the N2N mapping and flow-matching regularizer; without these, the robustness of the central performance claim cannot be verified.
  3. [§4] §4 (method): The N2N mapping is introduced as a novel component to alleviate first-step error accumulation, but its effectiveness is demonstrated solely through the final FID score with no independent derivation, external benchmark, or controlled comparison decoupling it from the overall training procedure.
minor comments (2)
  1. [§3] Notation for PF-ODE decomposition and sub-trajectory selection could be clarified with an explicit equation or diagram showing how the three points are extracted.
  2. [Abstract] The abstract mentions 'extensive experimental results' but the provided details focus primarily on the final FID; additional tables or figures on training stability metrics (e.g., loss curves) would strengthen presentation.
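The referee's request for seed-averaged FID amounts to straightforward aggregation; the helper below and its inputs are illustrative placeholders, not results from the paper.

```python
import statistics

def report_fid(fid_runs):
    """Aggregate one-step FID scores from independent seeds into mean +/- std.
    The input values are placeholders, not measurements from the paper."""
    mean = statistics.mean(fid_runs)
    std = statistics.stdev(fid_runs) if len(fid_runs) > 1 else 0.0
    return f"FID {mean:.2f} +/- {std:.2f} (n={len(fid_runs)})"
```

For example, `report_fid([1.68, 1.70, 1.72])` yields `"FID 1.70 +/- 0.02 (n=3)"`, the form of reporting the referee asks for.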

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the analysis and experiments in the paper while committing to revisions that strengthen the presentation and robustness of our claims.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and method description: The selection of exactly three critical sub-trajectories from the PF-ODE decomposition is presented as key to eliminating loss divergence and error accumulation, yet no explicit criterion for identifying 'critical' points, optimality condition, or ablation comparing 2/4/5 trajectories is provided; this choice appears load-bearing for the stability and FID claims but is unsupported by sensitivity analysis.

    Authors: We appreciate the referee drawing attention to this aspect. Section 3 presents the PF-ODE trajectory decomposition analysis, which identifies the three critical sub-trajectories at the specific points where the self-supervised loss term begins to diverge and where first-step sampling errors accumulate most rapidly. These correspond to the initial high-noise regime, the intermediate transition phase, and the low-noise final stage, chosen to directly target the sources of instability identified in our preliminary study. While the original submission did not include an exhaustive sensitivity analysis across alternative numbers of trajectories (due to the high computational cost of ImageNet-scale training), we agree that such analysis would improve clarity. In the revision we will add an explicit statement of the selection criterion derived from the divergence analysis and include a sensitivity table comparing results with 2, 3, 4, and 5 sub-trajectories. revision: yes

  2. Referee: [Experiments] Experimental results: The reported SOTA FID of 1.70 for one-step generation lacks error bars, multiple random seeds, or ablation tables isolating the contribution of the three sub-trajectories versus the N2N mapping and flow-matching regularizer; without these, the robustness of the central performance claim cannot be verified.

    Authors: We acknowledge that reporting statistical variability strengthens confidence in the results. The FID of 1.70 was obtained using the standard ImageNet 256×256 evaluation protocol and official evaluation code employed by prior consistency-model works. To address the concern, we will rerun the one-step generation experiments with at least three independent random seeds and report mean FID scores together with standard deviations. We will also expand the ablation studies in Section 5 to include tables that isolate the individual contributions of the three-sub-trajectory selection, the N2N mapping, and the flow-matching boundary regularizer, thereby clarifying their respective impacts on final performance. revision: yes

  3. Referee: [§4] §4 (method): The N2N mapping is introduced as a novel component to alleviate first-step error accumulation, but its effectiveness is demonstrated solely through the final FID score with no independent derivation, external benchmark, or controlled comparison decoupling it from the overall training procedure.

    Authors: The N2N mapping is formally introduced in Section 4 as a continuous mapping from pure Gaussian noise to arbitrary noisy points along the PF-ODE trajectory, directly derived to counteract the first-step error accumulation identified in our analysis of sampling inflexibility. Its effectiveness is shown through both the end-to-end FID improvement and the ablation studies that compare variants with and without the mapping. To provide a more decoupled validation, we will add a controlled experiment in the revised manuscript that fixes all other components (including the three-sub-trajectory objectives and flow-matching regularizer) and varies only the presence of N2N, reporting both trajectory-level error metrics and one-step FID. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper conducts an analysis of CM limitations (loss divergence from self-supervised terms and error accumulation in sampling), then proposes DE-CM as a new architecture that decomposes PF-ODE trajectories, selects three sub-trajectories, adds flow-matching regularization, and introduces an N2N mapping. These are presented as design decisions motivated by the analysis rather than any closed-form derivation or prediction that reduces to the inputs by construction. The reported 1.70 FID is an empirical training outcome on ImageNet, not a quantity forced by re-using fitted parameters or self-referential definitions. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the text to bear load on the central claims. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the existence of a PF-ODE trajectory that can be meaningfully decomposed, the validity of continuous-time CM objectives, and the assumption that flow matching provides a stable boundary regularizer. No new physical constants or particles are introduced.

free parameters (1)
  • choice of three critical sub-trajectories
    The number and location of the selected sub-trajectories are chosen to stabilize training; their selection is not derived from first principles and must be tuned.
axioms (2)
  • domain assumption PF-ODE trajectory exists and can be decomposed into sub-trajectories whose separate optimization yields global consistency
    Invoked when the paper states it decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets.
  • domain assumption flow matching acts as an effective boundary regularizer without altering the target distribution
    Used to stabilize training; treated as a standard tool rather than proved for this setting.
invented entities (1)
  • noise-to-noisy (N2N) mapping no independent evidence
    purpose: Map noise directly to any intermediate point to avoid first-step error accumulation
    New component introduced to address sampling inflexibility; no independent evidence outside the training loop is provided.

pith-pipeline@v0.9.0 · 5558 in / 1527 out tokens · 61118 ms · 2026-05-16T05:36:59.904026+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 13 internal anchors

  1. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  2. Boffi, N.M., Albergo, M.S., Vanden-Eijnden, E.: Flow map matching. arXiv preprint arXiv:2406.07507 (2024)
  3. Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: MaskGIT: Masked generative image transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11315–11325 (2022)
  4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  5. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
  6. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)
  7. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12873–12883 (2021)
  8. Evans, Z., Carr, C., Taylor, J., Hawley, S.H., Pons, J.: Fast timing-conditioned latent audio diffusion. In: Forty-first International Conference on Machine Learning (2024)
  9. Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557 (2024)
  10. Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447 (2025)
  11. Geng, Z., Pokle, A., Luo, W., Lin, J., Kolter, J.Z.: Consistency models made easy. arXiv preprint arXiv:2406.14548 (2024)
  12. Guo, Y., Wang, W., Yuan, Z., Cao, R., Chen, K., Chen, Z., Huo, Y., Zhang, Y., Wang, Y., Liu, S., et al.: SplitMeanFlow: Interval splitting consistency in few-step generative modeling. arXiv preprint arXiv:2507.16884 (2025)
  13. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
  14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
  15. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR (2022)
  16. Jackyhate: Text-to-image-2M. https://huggingface.co/datasets/jackyhate/text-to-image-2M (2025). https://doi.org/10.57967/hf/3066
  17. Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.: Scaling up GANs for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10124–10134 (2023)
  18. Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, 26565–26577 (2022)
  19. Kim, D., Lai, C.H., Liao, W.H., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., Ermon, S.: Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. arXiv preprint arXiv:2310.02279 (2023)
  20. Black Forest Labs: FLUX.1-dev. https://huggingface.co/black-forest-labs/FLUX.1-dev (2024)
  21. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  22. Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
  23. Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503 (2023)
  24. Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3D object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9298–9309 (2023)
  25. Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
  26. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  27. Lu, C., Song, Y.: Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081 (2024)
  28. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)
  29. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, pp. 1–22 (2025)
  30. Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
  31. Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
  32. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)
  33. Peng, Y., Zhu, K., Liu, Y., Wu, P., Li, H., Sun, X., Wu, F.: Flow-anchored consistency models. arXiv preprint arXiv:2507.03738 (2025)
  34. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022)
  35. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  36. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
  37. Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., Xiao, X.: Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems 37, 117340–117362 (2024)
  38. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
  39. Sabour, A., Fidler, S., Kreis, K.: Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603 (2025)
  40. Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)
  41. Sauer, A., Schwarz, K., Geiger, A.: StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In: ACM SIGGRAPH 2022 Conference Proceedings. pp. 1–10 (2022)
  42. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning. pp. 2256–2265. PMLR (2015)
  43. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  44. Song, Y., Dhariwal, P.: Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189 (2023)
  45. Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models (2023)
  46. Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019)
  47. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  48. Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems 37, 84839–84865 (2024)
  49. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models
  50. Wang, F.Y., Huang, Z., Bergman, A., Shen, D., Gao, P., Lingelbach, M., Sun, K., Bian, W., Song, G., Liu, Y., et al.: Phased consistency models. Advances in Neural Information Processing Systems 37, 83951–84009 (2024)
  51. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems 36, 8406–8441 (2023)
  52. Wu, Z., Fan, X., Wu, H., Cao, L.: TraFlow: Trajectory distillation on pre-trained rectified flow. arXiv preprint arXiv:2502.16972 (2025)
  53. Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: ImageReward: Learning and evaluating human preferences for text-to-image generation. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. pp. 15903–15935 (2023)
  54. Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025)
  55. Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems 37, 47455–47487 (2024)
  56. Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6613–6623 (2024)
  57. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
  58. Zhou, L., Ermon, S., Song, J.: Inductive moment matching. arXiv preprint arXiv:2503.07565 (2025)