pith. machine review for the scientific record.

arxiv: 2602.10764 · v2 · submitted 2026-02-11 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

Dual-End Consistency Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords: consistency models · diffusion models · one-step generation · ImageNet 256x256 · FID score · PF-ODE · noise-to-noisy mapping · flow matching

The pith

The Dual-End Consistency Model stabilizes training and enables one-step generation by selecting three critical sub-trajectories from the PF-ODE and using a noise-to-noisy mapping.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome the training instability and inflexible sampling in consistency models for diffusion and flow-based generative models. It identifies that instability stems from loss divergence in the self-supervised term and inflexibility from error accumulation during sampling. By decomposing the PF-ODE trajectory and targeting three key sub-trajectories, combined with flow matching regularization and a novel noise-to-noisy mapping, the method achieves reliable few-step distillation. This results in superior performance for fast image synthesis on large datasets. A sympathetic reader would care because it could make high-quality generative models practical for real-time applications without multiple iterative steps.

Core claim

The Dual-End Consistency Model decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. It leverages continuous-time consistency model objectives for few-step distillation and flow matching as a boundary regularizer to stabilize training, while introducing a noise-to-noisy mapping to map noise to any point and alleviate first-step error accumulation, achieving a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset.
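The three optimization targets named in the core claim can be made concrete with a short sketch. Everything here is assumed for illustration: the interfaces `f(x, t, s)`, the EMA target, the teacher velocity `v`, and the linear noising path are generic consistency-model and flow-matching conventions, not the paper's actual code or parameterization.

```python
import torch

def decm_loss_sketch(f, f_ema, v, x0):
    """Illustrative three-term objective in the spirit of DE-CM.

    Assumed (hypothetical) interfaces, not the paper's code:
      f(x, t, s)     -- student flow map: state x at time t -> prediction at s
      f_ema(x, t, s) -- EMA copy of f, used as a stop-gradient target
      v(x, t)        -- pretrained flow-matching velocity field (teacher)
    Uses a linear path x_t = (1 - t) * x0 + t * eps purely for illustration.
    """
    b = x0.shape[0]
    shape = (b,) + (1,) * (x0.dim() - 1)
    eps = torch.randn_like(x0)
    t = torch.rand(shape)            # interior time in (0, 1)
    s = t * torch.rand(shape)        # earlier time, 0 <= s < t
    xt = (1 - t) * x0 + t * eps
    xs = (1 - s) * x0 + s * eps
    zero, one = torch.zeros_like(t), torch.ones_like(t)

    # (1) Consistency-distillation sub-trajectory: map any time to data,
    #     matched against the EMA target evaluated from the earlier state.
    loss_cm = (f(xt, t, zero) - f_ema(xs, s, zero).detach()).pow(2).mean()

    # (2) Flow-matching boundary regularizer: for s near t the flow map
    #     should agree with an Euler step under the teacher velocity.
    loss_fm = (f(xt, t, s) - (xt + (s - t) * v(xt, t))).pow(2).mean()

    # (3) Noise-to-noisy (N2N) sub-trajectory: pure noise at t = 1 is mapped
    #     to an arbitrary intermediate point, easing first-step error.
    loss_n2n = (f(eps, one, s) - xs.detach()).pow(2).mean()

    return loss_cm + loss_fm + loss_n2n
```

The equal weighting of the three terms is another simplification; the paper's actual weighting functions are not reproduced here.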

What carries the argument

The Dual-End Consistency Model, which selects three critical sub-trajectories from the PF-ODE decomposition and applies a noise-to-noisy mapping to ensure stable training and flexible sampling.

If this is right

  • Training becomes stable by avoiding loss divergence through targeted sub-trajectory optimization.
  • Few-step distillation is enabled via continuous-time CM objectives.
  • First-step error accumulation is reduced by the noise-to-noisy mapping.
  • The model outperforms prior CM-based one-step methods on ImageNet 256x256 with FID 1.70.
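The flexible-sampling point above can be sketched as a generic flow-map sampler. Here `f(x, t, s)` is a hypothetical interface, not the paper's API: it transports a state from time `t` to time `s`.

```python
import torch

def sample_decm_sketch(f, z, nfe=1, T=1.0):
    """Hedged sketch of flow-map sampling with an N2N-style first step.

    f(x, t, s) is an assumed interface that transports x from time t to s.
    With nfe == 1 this is one-step generation (noise at T -> data at 0).
    With nfe > 1, the first call maps noise to an *intermediate* time rather
    than forcing a full jump -- the role the paper assigns to the N2N mapping.
    """
    times = [T * (1 - k / nfe) for k in range(nfe + 1)]  # T, ..., 0
    x = z
    for t, s in zip(times[:-1], times[1:]):
        x = f(x, t, s)
    return x
```

The uniform time grid is a placeholder; any schedule ending at 0 fits the same loop.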

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selection of specific sub-trajectories could be adapted to improve stability in other ODE-based generative approaches.
  • This might allow consistency models to scale to higher resolutions without additional regularization techniques.
  • Real-time image generation systems could incorporate this for lower latency in applications like video synthesis.

Load-bearing premise

The assumption that exactly three critical sub-trajectories from the PF-ODE decomposition will eliminate loss divergence and first-step error accumulation without introducing new instabilities.

What would settle it

A faithful implementation of the three sub-trajectories and the N2N mapping that fails to achieve an FID score below 2.0 in one-step generation on ImageNet 256x256 would falsify the effectiveness claim.

Figures

Figures reproduced from arXiv: 2602.10764 by Changqing Zou, Ge Bai, Linwei Dong, Ruoyu Guo, Yawei Luo, Zehuan Yuan.

Figure 1: Left: Selected samples from DE-CM trained on ImageNet at 256 × 256 resolution with 2 NFE. Right: Images generated by our text-to-image model across 2 to 50 NFE. The prompts used are provided in the Appendix.
Figure 2: Left (a): Comparison of FID scores across different models under various NFE settings, showing the superior performance of our method in both few-step and multi-step sampling. Right (b): Comparison of learning objectives: CMs Distillation [45] (s = T), Flow Matching [22] (s = t), MeanFlow [10] / AYF [39] (upper triangle), DE-CM (triangular boundary).
Figure 3: Left (a): Gradient stability comparison between supervised and self-supervised loss functions; the self-supervised loss exhibits more unstable gradients. Right (b): Visualization results of different NFE sampling using the CM sampler.
Figure 4: We select significant trajectories from the whole {(t, s) | t < s} space and treat these selected trajectories as our optimization targets. Specifically, we employ the continuous-time consistency distillation trajectory to optimize the mapping from arbitrary time points to data, thereby achieving few-step distillation. We leverage the proposed noise-to-noisy (N2N) mapping objective to eliminate the constrain…
Figure 5: Left (a): Comparison of gradient norm curves with and without flow matching boundary constraints. Right (b): Training efficiency comparison of different methods on 8 GPUs; our method achieves a more efficient convergence rate.
Figure 6: Selected samples from DE-CM trained on ImageNet at 256 × 256 resolution with 1 NFE.
Figure 7: Qualitative visualization reveals DE-CM's superiority in quality and text alignment across all NFE. Existing competing models exhibit significant step-quality trade-offs, failing at low NFE (SD3.5-Medium, FLUX.1-Dev) or producing oversaturated samples at high NFE (LCM, CTM, PCM and Hyper-SD). DE-CM maintains robust performance throughout the efficiency-quality spectrum.
Figure 8: Comparison of 1 NFE results under 8-GPU resources. DE-CM demonstrated exceptional convergence efficiency at 16 GPU hours compared to sCMs and MeanFlow.
Figure 9: Visual quality comparison between DE-CM and existing methods under 1 NFE. DE-CM achieves superior naturalness in detail rendering, outperforming existing methods. The prompts used are provided in the Appendix.
Original abstract

The slow iterative sampling nature remains a major bottleneck for the practical deployment of diffusion and flow-based generative models. While consistency models (CMs) represent a state-of-the-art distillation-based approach for efficient generation, their large-scale application is still limited by two key issues: training instability and inflexible sampling. Existing methods seek to mitigate these problems through architectural adjustments or regularized objectives, yet overlook the critical reliance on trajectory selection. In this work, we first conduct an analysis on these two limitations: training instability originates from loss divergence induced by unstable self-supervised term, whereas sampling inflexibility arises from error accumulation. Based on these insights and analysis, we propose the Dual-End Consistency Model (DE-CM) that selects vital sub-trajectory clusters to achieve stable and effective training. DE-CM decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets. Specifically, our approach leverages continuous-time CMs objectives to achieve few-step distillation and utilizes flow matching as a boundary regularizer to stabilize the training process. Furthermore, we propose a novel noise-to-noisy (N2N) mapping that can map noise to any point, thereby alleviating the error accumulation in the first step. Extensive experimental results show the effectiveness of our method: it achieves a state-of-the-art FID score of 1.70 in one-step generation on the ImageNet 256x256 dataset, outperforming existing CM-based one-step approaches.
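As background for the abstract's PF-ODE and consistency-function language, the standard relations from the CM literature can be written as follows (generic notation; the paper's exact weighting and parameterization may differ):

```latex
% Probability-flow ODE along which samples are transported (generic form):
\frac{\mathrm{d}x_t}{\mathrm{d}t} = v(x_t, t), \qquad t \in [0, T].
% A consistency function maps every point of a trajectory to its endpoint,
% so any two points on the same trajectory share one image:
f(x_t, t) = x_0 \quad \Longrightarrow \quad f(x_t, t) = f(x_s, s).
% DE-CM's flow-map view generalizes this to two times, f(x_t, t, s) \approx x_s,
% with the three selected sub-trajectories corresponding to s = 0
% (distillation to data), s \to t (the flow-matching boundary), and
% t = T with arbitrary s (the noise-to-noisy mapping).
```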

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes the Dual-End Consistency Model (DE-CM) to overcome training instability and inflexible sampling in consistency models for diffusion/flow-based generative models. It analyzes instability as arising from loss divergence in the self-supervised term and sampling issues from first-step error accumulation. The method decomposes the PF-ODE trajectory, selects three critical sub-trajectories as optimization targets, employs continuous-time CM objectives with a flow-matching boundary regularizer, and introduces a noise-to-noisy (N2N) mapping to map noise to arbitrary points. Experiments claim a state-of-the-art one-step FID of 1.70 on ImageNet 256×256, outperforming prior CM-based approaches.

Significance. If the central claims hold under rigorous validation, the work would be significant for advancing practical one-step sampling in large-scale generative models by directly targeting trajectory-dependent instabilities. The explicit decomposition analysis and introduction of N2N mapping represent potentially useful technical contributions, but the absence of ablations, error bars, or independent verification of the three-trajectory choice reduces the strength of the significance assessment.

major comments (3)
  1. [Abstract / §3] Abstract and method description: The selection of exactly three critical sub-trajectories from the PF-ODE decomposition is presented as key to eliminating loss divergence and error accumulation, yet no explicit criterion for identifying 'critical' points, optimality condition, or ablation comparing 2/4/5 trajectories is provided; this choice appears load-bearing for the stability and FID claims but is unsupported by sensitivity analysis.
  2. [Experiments] Experimental results: The reported SOTA FID of 1.70 for one-step generation lacks error bars, multiple random seeds, or ablation tables isolating the contribution of the three sub-trajectories versus the N2N mapping and flow-matching regularizer; without these, the robustness of the central performance claim cannot be verified.
  3. [§4] §4 (method): The N2N mapping is introduced as a novel component to alleviate first-step error accumulation, but its effectiveness is demonstrated solely through the final FID score with no independent derivation, external benchmark, or controlled comparison decoupling it from the overall training procedure.
minor comments (2)
  1. [§3] Notation for PF-ODE decomposition and sub-trajectory selection could be clarified with an explicit equation or diagram showing how the three points are extracted.
  2. [Abstract] The abstract mentions 'extensive experimental results' but the provided details focus primarily on the final FID; additional tables or figures on training stability metrics (e.g., loss curves) would strengthen presentation.
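The referee's request for seed-averaged FID amounts to straightforward aggregation; the helper below and its inputs are illustrative placeholders, not results from the paper.

```python
import statistics

def report_fid(fid_runs):
    """Aggregate one-step FID scores from independent seeds into mean +/- std.
    The input values are placeholders, not measurements from the paper."""
    mean = statistics.mean(fid_runs)
    std = statistics.stdev(fid_runs) if len(fid_runs) > 1 else 0.0
    return f"FID {mean:.2f} +/- {std:.2f} (n={len(fid_runs)})"
```

For example, `report_fid([1.68, 1.70, 1.72])` yields `"FID 1.70 +/- 0.02 (n=3)"`, the form of reporting the referee asks for.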

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, providing clarifications based on the analysis and experiments in the paper while committing to revisions that strengthen the presentation and robustness of our claims.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and method description: The selection of exactly three critical sub-trajectories from the PF-ODE decomposition is presented as key to eliminating loss divergence and error accumulation, yet no explicit criterion for identifying 'critical' points, optimality condition, or ablation comparing 2/4/5 trajectories is provided; this choice appears load-bearing for the stability and FID claims but is unsupported by sensitivity analysis.

    Authors: We appreciate the referee drawing attention to this aspect. Section 3 presents the PF-ODE trajectory decomposition analysis, which identifies the three critical sub-trajectories at the specific points where the self-supervised loss term begins to diverge and where first-step sampling errors accumulate most rapidly. These correspond to the initial high-noise regime, the intermediate transition phase, and the low-noise final stage, chosen to directly target the sources of instability identified in our preliminary study. While the original submission did not include an exhaustive sensitivity analysis across alternative numbers of trajectories (due to the high computational cost of ImageNet-scale training), we agree that such analysis would improve clarity. In the revision we will add an explicit statement of the selection criterion derived from the divergence analysis and include a sensitivity table comparing results with 2, 3, 4, and 5 sub-trajectories. revision: yes

  2. Referee: [Experiments] Experimental results: The reported SOTA FID of 1.70 for one-step generation lacks error bars, multiple random seeds, or ablation tables isolating the contribution of the three sub-trajectories versus the N2N mapping and flow-matching regularizer; without these, the robustness of the central performance claim cannot be verified.

    Authors: We acknowledge that reporting statistical variability strengthens confidence in the results. The FID of 1.70 was obtained using the standard ImageNet 256×256 evaluation protocol and official evaluation code employed by prior consistency-model works. To address the concern, we will rerun the one-step generation experiments with at least three independent random seeds and report mean FID scores together with standard deviations. We will also expand the ablation studies in Section 5 to include tables that isolate the individual contributions of the three-sub-trajectory selection, the N2N mapping, and the flow-matching boundary regularizer, thereby clarifying their respective impacts on final performance. revision: yes

  3. Referee: [§4] §4 (method): The N2N mapping is introduced as a novel component to alleviate first-step error accumulation, but its effectiveness is demonstrated solely through the final FID score with no independent derivation, external benchmark, or controlled comparison decoupling it from the overall training procedure.

    Authors: The N2N mapping is formally introduced in Section 4 as a continuous mapping from pure Gaussian noise to arbitrary noisy points along the PF-ODE trajectory, directly derived to counteract the first-step error accumulation identified in our analysis of sampling inflexibility. Its effectiveness is shown through both the end-to-end FID improvement and the ablation studies that compare variants with and without the mapping. To provide a more decoupled validation, we will add a controlled experiment in the revised manuscript that fixes all other components (including the three-sub-trajectory objectives and flow-matching regularizer) and varies only the presence of N2N, reporting both trajectory-level error metrics and one-step FID. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

Full rationale

The paper conducts an analysis of CM limitations (loss divergence from self-supervised terms and error accumulation in sampling), then proposes DE-CM as a new architecture that decomposes PF-ODE trajectories, selects three sub-trajectories, adds flow-matching regularization, and introduces an N2N mapping. These are presented as design decisions motivated by the analysis rather than any closed-form derivation or prediction that reduces to the inputs by construction. The reported 1.70 FID is an empirical training outcome on ImageNet, not a quantity forced by re-using fitted parameters or self-referential definitions. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the text to bear load on the central claims. The method remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on the existence of a PF-ODE trajectory that can be meaningfully decomposed, the validity of continuous-time CM objectives, and the assumption that flow matching provides a stable boundary regularizer. No new physical constants or particles are introduced.

free parameters (1)
  • choice of three critical sub-trajectories
    The number and location of the selected sub-trajectories are chosen to stabilize training; their selection is not derived from first principles and must be tuned.
axioms (2)
  • domain assumption PF-ODE trajectory exists and can be decomposed into sub-trajectories whose separate optimization yields global consistency
    Invoked when the paper states it decomposes the PF-ODE trajectory and selects three critical sub-trajectories as optimization targets.
  • domain assumption flow matching acts as an effective boundary regularizer without altering the target distribution
    Used to stabilize training; treated as a standard tool rather than proved for this setting.
invented entities (1)
  • noise-to-noisy (N2N) mapping no independent evidence
    purpose: Map noise directly to any intermediate point to avoid first-step error accumulation
    New component introduced to address sampling inflexibility; no independent evidence outside the training loop is provided.

pith-pipeline@v0.9.0 · 5558 in / 1527 out tokens · 61118 ms · 2026-05-16T05:36:59.904026+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 13 internal anchors

  1. Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., Levi, Y., English, Z., Voleti, V., Letts, A., et al.: Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127 (2023)
  2. Boffi, N.M., Albergo, M.S., Vanden-Eijnden, E.: Flow map matching. arXiv preprint arXiv:2406.07507 (2024)
  3. Chang, H., Zhang, H., Jiang, L., Liu, C., Freeman, W.T.: MaskGIT: Masked generative image transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11315–11325 (2022)
  4. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255. IEEE (2009)
  5. Dhariwal, P., Nichol, A.: Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems 34, 8780–8794 (2021)
  6. Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first International Conference on Machine Learning (2024)
  7. Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12873–12883 (2021)
  8. Evans, Z., Carr, C., Taylor, J., Hawley, S.H., Pons, J.: Fast timing-conditioned latent audio diffusion. In: Forty-first International Conference on Machine Learning (2024)
  9. Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557 (2024)
  10. Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447 (2025)
  11. Geng, Z., Pokle, A., Luo, W., Lin, J., Kolter, J.Z.: Consistency models made easy. arXiv preprint arXiv:2406.14548 (2024)
  12. Guo, Y., Wang, W., Yuan, Z., Cao, R., Chen, K., Chen, Z., Huo, Y., Zhang, Y., Wang, Y., Liu, S., et al.: SplitMeanFlow: Interval splitting consistency in few-step generative modeling. arXiv preprint arXiv:2507.16884 (2025)
  13. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. Advances in Neural Information Processing Systems 30 (2017)
  14. Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
  15. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. ICLR (2022)
  16. Jackyhate: Text-to-image-2M. https://huggingface.co/datasets/jackyhate/text-to-image-2M (2025). https://doi.org/10.57967/hf/3066
  17. Kang, M., Zhu, J.Y., Zhang, R., Park, J., Shechtman, E., Paris, S., Park, T.: Scaling up GANs for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10124–10134 (2023)
  18. Karras, T., Aittala, M., Aila, T., Laine, S.: Elucidating the design space of diffusion-based generative models. Advances in Neural Information Processing Systems 35, 26565–26577 (2022)
  19. Kim, D., Lai, C.H., Liao, W.H., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., Ermon, S.: Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. arXiv preprint arXiv:2310.02279 (2023)
  20. Black Forest Labs: FLUX.1-dev. https://huggingface.co/black-forest-labs/FLUX.1-dev (2024)
  21. Li, J., Li, D., Xiong, C., Hoi, S.: BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
  22. Lipman, Y., Chen, R.T., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. arXiv preprint arXiv:2210.02747 (2022)
  23. Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., Plumbley, M.D.: AudioLDM: Text-to-audio generation with latent diffusion models. arXiv preprint arXiv:2301.12503 (2023)
  24. Liu, R., Wu, R., Van Hoorick, B., Tokmakov, P., Zakharov, S., Vondrick, C.: Zero-1-to-3: Zero-shot one image to 3D object. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9298–9309 (2023)
  25. Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
  26. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  27. Lu, C., Song, Y.: Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081 (2024)
  28. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps. Advances in Neural Information Processing Systems 35, 5775–5787 (2022)
  29. Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., Zhu, J.: DPM-Solver++: Fast solver for guided sampling of diffusion probabilistic models. Machine Intelligence Research, pp. 1–22 (2025)
  30. Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
  31. Ma, N., Goldstein, M., Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E., Xie, S.: SiT: Exploring flow and diffusion-based generative models with scalable interpolant transformers. In: European Conference on Computer Vision. pp. 23–40. Springer (2024)
  32. Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 4195–4205 (2023)
  33. Peng, Y., Zhu, K., Liu, Y., Wu, P., Li, H., Sun, X., Wu, F.: Flow-anchored consistency models. arXiv preprint arXiv:2507.03738 (2025)
  34. Poole, B., Jain, A., Barron, J.T., Mildenhall, B.: DreamFusion: Text-to-3D using 2D diffusion. arXiv preprint arXiv:2209.14988 (2022)
  35. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International Conference on Machine Learning. pp. 8748–8763. PMLR (2021)
  36. Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., Chen, M.: Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125 (2022)
  37. Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., Xiao, X.: Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. Advances in Neural Information Processing Systems 37, 117340–117362 (2024)
  38. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)
  39. Sabour, A., Fidler, S., Kreis, K.: Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603 (2025)
  40. Sauer, A., Lorenz, D., Blattmann, A., Rombach, R.: Adversarial diffusion distillation. In: European Conference on Computer Vision. pp. 87–103. Springer (2024)
  41. Sauer, A., Schwarz, K., Geiger, A.: StyleGAN-XL: Scaling StyleGAN to large diverse datasets. In: ACM SIGGRAPH 2022 Conference Proceedings. pp. 1–10 (2022)
  42. Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., Ganguli, S.: Deep unsupervised learning using nonequilibrium thermodynamics. In: International Conference on Machine Learning. pp. 2256–2265. PMLR (2015)
  43. Song, J., Meng, C., Ermon, S.: Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502 (2020)
  44. Song, Y., Dhariwal, P.: Improved techniques for training consistency models. arXiv preprint arXiv:2310.14189 (2023)
  45. Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models (2023)
  46. Song, Y., Ermon, S.: Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems 32 (2019)
  47. Song, Y., Sohl-Dickstein, J., Kingma, D.P., Kumar, A., Ermon, S.: Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456 (2020)
  48. Tian, K., Jiang, Y., Yuan, Z., Peng, B., Wang, L.: Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems 37, 84839–84865 (2024)
  49. Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models
  50. Wang, F.Y., Huang, Z., Bergman, A., Shen, D., Gao, P., Lingelbach, M., Sun, K., Bian, W., Song, G., Liu, Y., et al.: Phased consistency models. Advances in Neural Information Processing Systems 37, 83951–84009 (2024)
  51. Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems 36, 8406–8441 (2023)
  52. Wu, Z., Fan, X., Wu, H., Cao, L.: TraFlow: Trajectory distillation on pre-trained rectified flow. arXiv preprint arXiv:2502.16972 (2025)
  53. Xu, J., Liu, X., Wu, Y., Tong, Y., Li, Q., Ding, M., Tang, J., Dong, Y.: ImageReward: Learning and evaluating human preferences for text-to-image generation. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. pp. 15903–15935 (2023)
  54. Yao, J., Yang, B., Wang, X.: Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 15703–15712 (2025)
  55. Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems 37, 47455–47487 (2024)
  56. Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6613–6623 (2024)
  57. Zhang, L., Rao, A., Agrawala, M.: Adding conditional control to text-to-image diffusion models. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 3836–3847 (2023)
  58. Zhou, L., Ermon, S., Song, J.: Inductive moment matching. arXiv preprint arXiv:2503.07565 (2025)