pith. machine review for the scientific record.

arxiv: 2510.08431 · v3 · submitted 2025-10-09 · 💻 cs.CV · cs.LG

Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency

Pith reviewed 2026-05-18 08:42 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords diffusion distillation · consistency models · score regularization · large-scale generation · text-to-video · fast sampling · mode collapse

The pith

Score-regularized consistency models scale to 14B parameters and match leading distillation quality with higher diversity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that continuous-time consistency models can be made practical for large-scale text-to-image and video diffusion by addressing their fine-detail shortcomings. The authors first create a parallelism-compatible JVP kernel that enables training on models exceeding 10 billion parameters. They identify that pure sCM accumulates errors in details because its forward-divergence objective is mode-covering. They remedy this by introducing rCM, which adds score distillation as a long-skip regularizer to bring in complementary mode-seeking behavior. The result is high-fidelity generation in only 1 to 4 steps that matches DMD2 quality metrics while preserving diversity and avoiding mode collapse, all without GAN tuning.

Core claim

The central claim is that incorporating score distillation as a long-skip regularizer into continuous-time consistency training complements the forward-divergence objective of sCM with reverse divergence, thereby reducing error accumulation during fine-detail generation and producing models that generate high-quality samples in 1-4 steps at scales up to 14B parameters while maintaining diversity advantages over prior distillation methods.

What carries the argument

The score-regularized continuous-time consistency model (rCM), which augments the sCM objective with score distillation as a long-skip regularizer to balance mode-covering and mode-seeking divergences.
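
The mode-covering versus mode-seeking contrast can be illustrated with a toy example that is not taken from the paper. The sketch below fits a single Gaussian to an assumed bimodal 1-D target by grid search, once under the forward divergence KL(p || q) and once under the reverse divergence KL(q || p). The target mixture, the grid, and the names `normal_pdf`, `target_pdf`, `kl`, and `best_fit` are all hypothetical choices made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def target_pdf(x):
    # Assumed toy "data" distribution: two equal modes at -3 and +3.
    return 0.5 * normal_pdf(x, -3.0, 0.5) + 0.5 * normal_pdf(x, 3.0, 0.5)

DX = 0.05
GRID = [-10.0 + DX * i for i in range(401)]  # integration grid on [-10, 10]

def kl(a, b):
    """Riemann-sum estimate of KL(a || b) on GRID."""
    total = 0.0
    for x in GRID:
        ax, bx = a(x), b(x)
        if ax > 1e-300:  # skip points where a has no mass
            total += ax * math.log(ax / max(bx, 1e-300)) * DX
    return total

def best_fit(direction):
    """Grid-search the Gaussian q that minimizes the chosen divergence.

    direction == "forward" minimizes KL(target || q)  (mode-covering);
    anything else minimizes KL(q || target)           (mode-seeking).
    Returns (loss, mu, sigma).
    """
    best = None
    for i in range(-8, 9):
        mu = 0.5 * i
        for sigma in (0.5, 1.0, 2.0, 3.0):
            q = lambda x, mu=mu, sigma=sigma: normal_pdf(x, mu, sigma)
            if direction == "forward":
                loss = kl(target_pdf, q)
            else:
                loss = kl(q, target_pdf)
            if best is None or loss < best[0]:
                best = (loss, mu, sigma)
    return best

if __name__ == "__main__":
    print("forward KL fit:", best_fit("forward"))
    print("reverse KL fit:", best_fit("reverse"))
```

On this grid the forward fit settles between the two modes with the widest available variance, while the reverse fit locks onto a single mode with a narrow variance. This is only an analogy for why combining the two divergences might balance coverage against sharpness; it says nothing about how the actual rCM loss behaves at scale.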

Load-bearing premise

That adding score distillation regularization will reliably reduce fine-detail errors in sCM without introducing instabilities or diversity losses when scaled to 10B+ parameter models.

What would settle it

Side-by-side evaluation on a 14B-parameter model showing that rCM samples have measurably worse fine details or lower diversity than DMD2-distilled outputs would falsify the central claim.

Figures

Figures reproduced from arXiv: 2510.08431 by Huayu Chen, Jianfei Chen, Jintao Zhang, Jun Zhu, Kaiwen Zheng, Ming-Yu Liu, Qianli Ma, Qinsheng Zhang, Yogesh Balaji, Yuji Wang.

Figure 1: 5 random video samples from 4-step sCM, DMD2, and rCM on Wan2.1 1.3B. rCM resolves the quality issues of sCM while showing clear superiority to DMD2 in generation diversity.
Figure 2: High-level comparison of diffusion distillation methods. Despite the theoretical existence ...
Figure 3: 4-step generation results with pure sCM distillation.
Figure 4: Illustration of rCM. Left: the forward consistency objective of sCM propagates error from small to large times; Right: reverse-divergence minimization serves as a long-skip regularizer. As shown in ...
Figure 5: Few-step T2I samples compared to open-sourced models. rCM can render fine-grained ...
Figure 6: Comparison between different numbers of sampling steps.
Figure 7: Restructuring example for the RMSNorm layer: (left) original implementation, (right) ...
Figure 8: Comparison between sCM and sCTM for distillation. We implement sCTM by adding an ...
Figure 9: Relative L2 errors of the network output and JVP under BF16 precision. Empirically, JVP computation leads to substantially larger numerical errors compared to the network output.
Original abstract

Although continuous-time consistency models (e.g., sCM, MeanFlow) are theoretically principled and empirically powerful for fast academic-scale diffusion, its applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of evaluation benchmarks like FID. This work represents the first effort to scale up continuous-time consistency to general application-level image and video diffusion models, and to make JVP-based distillation effective at large scale. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM generally matches the state-of-the-art distillation method DMD2 on quality metrics while mitigating mode collapse and offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation. Code is available at https://github.com/NVlabs/rcm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces a scalable approach to continuous-time consistency models for large-scale text-to-image and video diffusion. It develops a FlashAttention-2 compatible JVP kernel to enable sCM training on models >10B parameters and high-dimensional video tasks. To address observed limitations in fine-detail generation attributed to error accumulation and the mode-covering forward-divergence objective, the authors propose score-regularized continuous-time consistency (rCM) by adding score distillation as a long-skip regularizer. This is claimed to complement sCM with mode-seeking reverse divergence, yielding visual quality on par with DMD2 while preserving diversity. Results are reported on Cosmos-Predict2 and Wan2.1 models up to 14B parameters for up to 5-second videos, with 1-4 step generation providing 15-50x acceleration over diffusion sampling. Code is released.

Significance. If the empirical claims hold with supporting ablations and metrics, the work would be significant for demonstrating the first practical scaling of continuous-time consistency distillation to application-level 10B+ parameter image and video models. The JVP kernel addresses a concrete infrastructure barrier, and the rCM formulation offers a non-GAN, theoretically motivated alternative to existing distillation methods with potential advantages in diversity and training stability. Reproducible code further strengthens the contribution for the field.

major comments (2)
  1. [Experiments] Experiments section: The central claim that score distillation as a long-skip regularizer reliably complements the sCM forward-divergence objective, reduces fine-detail error accumulation, and avoids new instabilities or diversity losses at 14B scale lacks supporting quantitative evidence. No ablation studies isolating the regularizer's contribution, no training-curve analysis of gradient norms or mode coverage, and no diversity metrics (e.g., recall, pairwise LPIPS) are reported to substantiate that the mode-seeking term does not trade off coverage.
  2. [§3] §3 (rCM formulation): The integration of score distillation is presented as addressing the 'mode-covering' limitation of sCM, yet the manuscript provides no derivation or analysis showing that the combined objective avoids introducing instabilities or error accumulation of its own at large scale; this assumption is load-bearing for the quality and diversity claims.
minor comments (2)
  1. [Abstract] The abstract states 'generally matches' DMD2 on quality metrics but does not specify the exact metrics, datasets, or numerical values; these should be stated explicitly with tables or figures for clarity.
  2. [Method] Clarify the precise form of the long-skip regularizer (e.g., weighting schedule, which score function is used) in the method section to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the work's potential significance and for the constructive comments on the empirical and theoretical support for our claims. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim that score distillation as a long-skip regularizer reliably complements the sCM forward-divergence objective, reduces fine-detail error accumulation, and avoids new instabilities or diversity losses at 14B scale lacks supporting quantitative evidence. No ablation studies isolating the regularizer's contribution, no training-curve analysis of gradient norms or mode coverage, and no diversity metrics (e.g., recall, pairwise LPIPS) are reported to substantiate that the mode-seeking term does not trade off coverage.

    Authors: We agree that the manuscript would be strengthened by additional quantitative ablations and diversity metrics to isolate the regularizer's contribution. The current results focus on end-to-end comparisons with DMD2 on large-scale models, showing comparable quality with advantages in diversity through qualitative inspection and avoidance of mode collapse. However, we acknowledge the lack of explicit metrics such as recall or pairwise LPIPS and isolated training-curve analyses. In the revised version, we will add ablation studies at smaller scales to quantify the regularizer's impact on fine details and mode coverage, along with reported diversity metrics where computationally feasible. Full ablations at 14B scale remain prohibitive due to resource constraints, which is why we prioritized scalable end-to-end validation.

    Revision: yes

  2. Referee: [§3] §3 (rCM formulation): The integration of score distillation is presented as addressing the 'mode-covering' limitation of sCM, yet the manuscript provides no derivation or analysis showing that the combined objective avoids introducing instabilities or error accumulation of its own at large scale; this assumption is load-bearing for the quality and diversity claims.

    Authors: The rCM objective in §3 is constructed by adding score distillation as a long-skip regularizer to the sCM loss, motivated by the complementary properties of forward (mode-covering) and reverse (mode-seeking) divergences as established in prior work on score-based distillation. While we do not include a new formal derivation proving absence of instabilities at arbitrary scale, the successful training and stable convergence on models up to 14B parameters, without observed new error accumulation or instabilities, provide empirical support for the approach. We will revise §3 to expand the discussion of the combined objective, including a clearer explanation of how the regularizer mitigates accumulation and references to related stability analyses in the literature.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: rCM is an explicit combination of prior terms with independent empirical validation

full rationale

The paper's core derivation introduces rCM by adding a score-distillation regularizer to the existing sCM objective to address observed error accumulation in fine details, motivated by the mode-covering vs. mode-seeking divergence properties. This is a design choice, not a self-definitional reduction or fitted parameter renamed as prediction. The JVP kernel development and large-scale training on Cosmos-Predict2/Wan2.1 models constitute independent engineering and validation steps outside any closed loop of the paper's own equations. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are used to force the central claim; results are presented as empirical outcomes rather than tautological consequences of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard diffusion model assumptions plus the new claim that the regularizer fixes mode-covering without side effects; no new entities are postulated.

axioms (1)
  • domain assumption: The FlashAttention-2 JVP kernel is numerically stable and correctly implements the required Jacobian-vector products at scale.
    Invoked to enable training on >10B parameter models and high-dimensional video.
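
For readers unfamiliar with the operation: a Jacobian-vector product (JVP) is forward-mode differentiation, in which a tangent is pushed through the computation alongside the value. The total time derivative of a network output F(x_t, t) along a trajectory is the JVP with tangent (dx_t/dt, 1). The sketch below shows this with dual numbers in plain Python. It is a toy under stated assumptions, not the paper's kernel: `Dual`, `toy_network`, and `total_time_derivative` are hypothetical names, and the closed-form toy function stands in for a real network.

```python
import math

class Dual:
    """Dual number val + tan * eps with eps**2 == 0; tan carries the tangent."""

    def __init__(self, val, tan=0.0):
        self.val = val
        self.tan = tan

    @staticmethod
    def lift(other):
        # Treat plain numbers as constants (zero tangent).
        return other if isinstance(other, Dual) else Dual(other, 0.0)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.tan + other.tan)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule applied to the tangent part.
        return Dual(self.val * other.val,
                    self.val * other.tan + self.tan * other.val)

    __rmul__ = __mul__

def dual_sin(d):
    return Dual(math.sin(d.val), math.cos(d.val) * d.tan)

def dual_cos(d):
    return Dual(math.cos(d.val), -math.sin(d.val) * d.tan)

def toy_network(x, t):
    # Hypothetical stand-in for F(x, t): sin(t) * x + cos(t) * x^2.
    return dual_sin(t) * x + dual_cos(t) * x * x

def total_time_derivative(x, dxdt, t):
    """Return (F(x, t), dF/dt along the path) from one forward pass.

    Seeding the tangent with (dx/dt, dt/dt = 1) yields
    dF/dt = (dF/dx) * dx/dt + (partial F / partial t).
    """
    out = toy_network(Dual(x, dxdt), Dual(t, 1.0))
    return out.val, out.tan

if __name__ == "__main__":
    value, jvp = total_time_derivative(x=0.7, dxdt=-0.3, t=0.4)
    print(value, jvp)
```

A single forward pass returns both the output and its directional derivative, which is why forward mode suits this use. A vector-Jacobian product, by contrast, is the reverse-mode operation used in ordinary backpropagation.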



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the 'mode-covering' nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer.

  • IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    sCM employs the TrigFlow noise schedule ... and the full derivative $\mathrm{d}F_{\theta^-}(x_t, t)/\mathrm{d}t$ can be computed using forward-mode automatic differentiation, Jacobian-vector product (JVP).

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 14 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

    cs.CV 2026-05 unverdicted novelty 8.0

    AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.

  2. HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

    cs.CV 2026-05 conditional novelty 7.0

    HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.

  3. Efficient Video Diffusion Models: Advancements and Challenges

    cs.CV 2026-04 unverdicted novelty 7.0

    A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

  4. 1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation

    cs.CV 2026-04 conditional novelty 7.0

    1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.

  5. D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

    cs.CV 2026-05 unverdicted novelty 6.0

    D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.

  6. Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.

  7. Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models

    cs.GR 2026-04 unverdicted novelty 6.0

    Alice v1 is an open video model that surpasses its teacher and closed-source systems like Veo3 and Sora2 in quality while running 7x faster through specialized distillation.

  8. Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation

    cs.CV 2026-04 unverdicted novelty 6.0

    By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.

  9. Continuous Adversarial Flow Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...

  10. Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation

    cs.CV 2026-04 conditional novelty 6.0

    Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.

  11. Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.

  12. Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length

    cs.CV 2025-12 conditional novelty 6.0

    Live Avatar enables 45 FPS real-time streaming infinite-length audio-driven avatar generation from a 14B diffusion model via distillation and timestep-forcing pipeline parallelism.

  13. Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...

  14. World Simulation with Video Foundation Models for Physical AI

    cs.CV 2025-10 unverdicted novelty 4.0

    Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 14 Pith papers · 16 internal anchors

  1. [1]

    Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models

    Fan Bao, Chendong Xiang, Gang Yue, Guande He, Hongzhou Zhu, Kaiwen Zheng, Min Zhao, Shilong Liu, Yaole Wang, and Jun Zhu. Vidu: a highly consistent, dynamic and skilled text-to-video generator with diffusion models. arXiv preprint arXiv:2405.04233,

  2. [2]

    Sana-sprint: One-step diffusion with continuous-time consistency distillation

    Junsong Chen, Shuchen Xue, Yuyang Zhao, Jincheng Yu, Sayak Paul, Junyu Chen, Han Cai, Song Han, and Enze Xie. Sana-sprint: One-step diffusion with continuous-time consistency distillation. arXiv preprint arXiv:2503.09641,

  3. [3]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691,

  4. [4]

    Seedance 1.0: Exploring the Boundaries of Video Generation Models

    Yu Gao, Haoyuan Guo, Tuyen Hoang, Weilin Huang, Lu Jiang, Fangyuan Kong, Huixia Li, Jiashi Li, Liang Li, Xiaojie Li, et al. Seedance 1.0: Exploring the boundaries of video generation models. arXiv preprint arXiv:2506.09113,

  5. [5]

    Consistency models made easy

    Zhengyang Geng, Ashwini Pokle, William Luo, Justin Lin, and J Zico Kolter. Consistency models made easy. arXiv preprint arXiv:2406.14548,

  6. [6]

    Mean Flows for One-step Generative Modeling

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447,

  7. [7]

    Multistep consistency models

    Jonathan Heek, Emiel Hoogeboom, and Tim Salimans. Multistep consistency models. arXiv preprint arXiv:2403.06807,

  8. [8]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598,

  9. [9]

    Imagen Video: High Definition Video Generation with Diffusion Models

    Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303,

  10. [10]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the training-inference gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009,

  11. [11]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509,

  12. [12]

    Consistency trajectory models: Learning probability flow ode trajectory of diffusion

    Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon. Consistency trajectory models: Learning probability flow ode trajectory of diffusion. arXiv preprint arXiv:2310.02279,

  13. [13]

    Truncated consistency models

    Sangyun Lee, Yilun Xu, Tomas Geffner, Giulia Fanti, Karsten Kreis, Arash Vahdat, and Weili Nie. Truncated consistency models. arXiv preprint arXiv:2410.14895,

  14. [14]

    T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback

    Jiachen Li, Weixi Feng, Tsu-Jui Fu, Xinyi Wang, Sugato Basu, Wenhu Chen, and William Yang Wang. T2v-turbo: Breaking the quality bottleneck of video consistency model with mixed reward feedback. Advances in neural information processing systems, 37:75692–75726, 2024a. Jiachen Li, Qian Long, Jian Zheng, Xiaofeng Gao, Robinson Piramuthu, Wenhu Chen, and Will...

  15. [15]

    Diffusion adversarial post-training for one-step video generation

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316, 2025a. Shanchuan Lin, Ceyuan Yang, Hao He, Jianwen Jiang, Yuxi Ren, Xin Xia, Yang Zhao, Xuefeng Xiao, and Lu Jiang. Autoregressive adversarial post-training for real-ti...

  16. [16]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003,

  17. [17]

    Simplifying, Stabilizing and Scaling Continuous-Time Consistency Models

    Cheng Lu and Yang Song. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081,

  18. [18]

    Knowledge Distillation in Iterative Generative Models for Improved Sampling Speed

    Cheng Lu, Kaiwen Zheng, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Maximum likelihood training for score-based diffusion odes by high order denoising score matching. In International conference on machine learning, pp. 14429–14460. PMLR, 2022a. Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for...

  19. [19]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023a. Weijian Luo, Tianyang Hu, Shifeng Zhang, Jiacheng Sun, Zhenguo Li, and Zhihua Zhang. Diff-instruct: A universal approach for transferring knowledge from pre-trained ...

  20. [20]

    Cosmos World Foundation Model Platform for Physical AI

    URL https://arxiv.org/abs/2501.03575. Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952,

  21. [21]

    Align your flow: Scaling continuous-time flow map distillation

    Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. Align your flow: Scaling continuous-time flow map distillation. arXiv preprint arXiv:2506.14603,

  22. [22]

    Progressive Distillation for Fast Sampling of Diffusion Models

    Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models. arXiv preprint arXiv:2202.00512,

  23. [23]

    Fast high-resolution image synthesis with latent adversarial diffusion distillation

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. Fast high-resolution image synthesis with latent adversarial diffusion distillation. In SIGGRAPH Asia 2024 Conference Papers, pp. 1–11, 2024a. Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation...

  24. [24]

    Score-Based Generative Modeling through Stochastic Differential Equations

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456,

  25. [25]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

  26. [26]

    Improved distribution matching distillation for fast image synthesis

    URL https://arxiv.org/abs/2501.18427. Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and Bill Freeman. Improved distribution matching distillation for fast image synthesis. Advances in neural information processing systems, 37:47455–47487, 2024a. Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo D...

  27. [27]

    Fast sampling of diffusion models with exponential integrator

    Qinsheng Zhang and Yongxin Chen. Fast sampling of diffusion models with exponential integrator. arXiv preprint arXiv:2204.13902,

  28. [28]

    PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

    Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. Pytorch fsdp: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277,

  29. [29]

    Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics

    Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Dpm-solver-v3: Improved diffusion ode solver with empirical model statistics. Advances in Neural Information Processing Systems, 36: 55502–55542, 2023a. Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. Improved techniques for maximum likelihood estimation for diffusion odes. In International Conference ...

  30. [30]

    However, CMs suffer from training instabilities and quality issues such as blur

    adapt CMs to diffusion bridges models. However, CMs suffer from training instabilities and quality issues such as blur. Subsequent efforts address these limitations by introducing dedicated annealing schedules (Song & Dhariwal, 2023; Geng et al., 2024), preconditioning strategies (Zheng et al., 2025b), or segmented consistency schemes (Wang et al., 2024; ...

  31. [31]

    Nonetheless, the applicability of sCM to large-scale, application-level image and video diffusion models remains unclear

    and AYF (Sabour et al., 2025), which directly combine sCM with CTM, have also drawn significant attention. Nonetheless, the applicability of sCM to large-scale, application-level image and video diffusion models remains unclear. SANA-Sprint (Chen et al.,

  32. [32]

    is an optimized attention algorithm that reduces memory usage and improves throughput by tiling the sequence into blocks and streaming intermediate results without materializing the full attention matrix. Given query, key, and value sequences $Q \in \mathbb{R}^{N_1 \times d}$ and $K, V \in \mathbb{R}^{N_2 \times d}$, where $N_1$ and $N_2$ denote sequence lengths and $d$ is the head dimension, the attention output ...

  33. [33]

    We maintain a smoothed version of the student parameters using the power EMA (Karras et al., 2024), and use the EMA model for evaluation. We use the AdamW optimizer with β1 = 0, β2 = 0.999 and weight decay of 0.01 for both student and fake score optimizers, while disabling gradient clipping, which we find crucial for maintaining training stability of rCM....
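
The tiling idea described in excerpt [32] above can be sketched in a few lines. The pure-Python example below processes keys and values block by block with a running maximum and running normalizer (the usual "online softmax" trick), so no full row of attention weights is ever stored, and compares the result with a naive reference. It is a simplified illustration under stated assumptions: the function names and the `block` parameter are hypothetical, only the key/value axis is tiled, and nothing here reflects the GPU kernel or its JVP extension.

```python
import math
import random

def naive_attention(Q, K, V):
    """Reference softmax(Q K^T / sqrt(d)) V that builds each full score row."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / z
                    for c in range(len(V[0]))])
    return out

def tiled_attention(Q, K, V, block=2):
    """Same result, streaming over key/value blocks of size `block`."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf   # largest score seen so far
        normalizer = 0.0          # sum of exp(score - running_max)
        acc = [0.0] * dv          # unnormalized weighted sum of values
        for start in range(0, len(K), block):
            k_blk = K[start:start + block]
            v_blk = V[start:start + block]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new reference maximum.
            scale = math.exp(running_max - new_max)
            normalizer *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, v_blk):
                w = math.exp(s - new_max)
                normalizer += w
                for c in range(dv):
                    acc[c] += w * v[c]
            running_max = new_max
        out.append([a / normalizer for a in acc])
    return out

if __name__ == "__main__":
    random.seed(0)
    Q = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
    K = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
    V = [[random.gauss(0.0, 1.0) for _ in range(2)] for _ in range(5)]
    print(naive_attention(Q, K, V)[0])
    print(tiled_attention(Q, K, V, block=2)[0])
```

The rescaling step is what lets each block be discarded after use: earlier partial sums are corrected whenever a larger score appears, so the final division by the normalizer reproduces the exact softmax.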