Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency
Pith reviewed 2026-05-18 08:42 UTC · model grok-4.3
The pith
Score-regularized consistency models scale to 14B parameters and match leading distillation quality with higher diversity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that adding score distillation as a long-skip regularizer to continuous-time consistency training complements sCM's forward-divergence objective with a reverse divergence. This is said to reduce error accumulation in fine-detail generation and to yield models that produce high-quality samples in 1-4 steps at scales up to 14B parameters, while keeping a diversity advantage over prior distillation methods.
What carries the argument
The score-regularized continuous-time consistency model (rCM), which augments the sCM objective with score distillation as a long-skip regularizer to balance mode-covering and mode-seeking divergences.
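The contrast between mode-covering and mode-seeking behaviour can be seen in a toy setting. The sketch below is illustrative only and is not taken from the paper: it assumes the divergences are KL divergences (the source only says "forward" and "reverse" divergence), uses a hypothetical 1D bimodal target, and fits a single Gaussian under each direction.

```python
import math

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def target_pdf(x):
    # Hypothetical bimodal "data" density: equal modes at -3 and +3.
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)

STEP = 0.05
GRID = [-12.0 + STEP * i for i in range(481)]  # integration grid on [-12, 12]

def forward_kl_fit():
    # For a Gaussian family, minimizing KL(p || q) reduces to moment matching.
    mean = sum(x * target_pdf(x) for x in GRID) * STEP
    var = sum((x - mean) ** 2 * target_pdf(x) for x in GRID) * STEP
    return mean, math.sqrt(var)

def reverse_kl(mean, std):
    # Numerical estimate of KL(q || p) for q = N(mean, std^2).
    total = 0.0
    for x in GRID:
        q = normal_pdf(x, mean, std)
        if q > 0.0:  # skip points where q underflows to zero
            total += q * (math.log(q) - math.log(target_pdf(x)))
    return total * STEP

def reverse_kl_fit():
    # Coarse grid search over mean in [-5, 5] and std in [0.5, 4].
    candidates = [(m / 4.0, s / 4.0) for m in range(-20, 21) for s in range(2, 17)]
    return min(candidates, key=lambda ms: reverse_kl(*ms))

fwd_mean, fwd_std = forward_kl_fit()
rev_mean, rev_std = reverse_kl_fit()
print("forward KL fit:", fwd_mean, fwd_std)
print("reverse KL fit:", rev_mean, rev_std)
```

Under these assumptions the forward fit sits between the two modes with a wide spread (moment matching gives mean 0 and variance 10), while the reverse fit locks onto a single mode. That is the trade-off the regularizer is claimed to balance.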
Load-bearing premise
That adding score distillation regularization will reliably reduce fine-detail errors in sCM without introducing instabilities or diversity losses when scaled to 10B+ parameter models.
What would settle it
Side-by-side evaluation on a 14B-parameter model showing that rCM samples have measurably worse fine details or lower diversity than DMD2-distilled outputs would falsify the central claim.
Original abstract
Although continuous-time consistency models (e.g., sCM, MeanFlow) are theoretically principled and empirically powerful for fast academic-scale diffusion, their applicability to large-scale text-to-image and video tasks remains unclear due to infrastructure challenges in Jacobian-vector product (JVP) computation and the limitations of evaluation benchmarks like FID. This work represents the first effort to scale up continuous-time consistency to general application-level image and video diffusion models, and to make JVP-based distillation effective at large scale. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the "mode-covering" nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer. This integration complements sCM with the "mode-seeking" reverse divergence, effectively improving visual quality while maintaining high generation diversity. Validated on large-scale models (Cosmos-Predict2, Wan2.1) up to 14B parameters and 5-second videos, rCM generally matches the state-of-the-art distillation method DMD2 on quality metrics while mitigating mode collapse and offering notable advantages in diversity, all without GAN tuning or extensive hyperparameter searches. The distilled models generate high-fidelity samples in only $1\sim4$ steps, accelerating diffusion sampling by $15\times\sim50\times$. These results position rCM as a practical and theoretically grounded framework for advancing large-scale diffusion distillation. Code is available at https://github.com/NVlabs/rcm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a scalable approach to continuous-time consistency models for large-scale text-to-image and video diffusion. It develops a parallelism-compatible FlashAttention-2 JVP kernel to enable sCM training on models with more than 10B parameters and on high-dimensional video tasks. To address observed limitations in fine-detail generation, attributed to error accumulation and the mode-covering forward-divergence objective, the authors propose the score-regularized continuous-time consistency model (rCM), which adds score distillation as a long-skip regularizer. This is claimed to complement sCM with a mode-seeking reverse divergence, yielding visual quality on par with DMD2 while preserving diversity. Results are reported on Cosmos-Predict2 and Wan2.1 models up to 14B parameters and on videos up to 5 seconds long, with 1-4 step generation providing a 15-50x acceleration over diffusion sampling. Code is released.
Significance. If the empirical claims hold with supporting ablations and metrics, the work would be significant for demonstrating the first practical scaling of continuous-time consistency distillation to application-level 10B+ parameter image and video models. The JVP kernel addresses a concrete infrastructure barrier, and the rCM formulation offers a non-GAN, theoretically motivated alternative to existing distillation methods with potential advantages in diversity and training stability. Reproducible code further strengthens the contribution for the field.
major comments (2)
- [Experiments] Experiments section: The central claim that score distillation as a long-skip regularizer reliably complements the sCM forward-divergence objective, reduces fine-detail error accumulation, and avoids new instabilities or diversity losses at 14B scale lacks supporting quantitative evidence. No ablation studies isolating the regularizer's contribution, no training-curve analysis of gradient norms or mode coverage, and no diversity metrics (e.g., recall, pairwise LPIPS) are reported to substantiate that the mode-seeking term does not trade off coverage.
- [§3] §3 (rCM formulation): The integration of score distillation is presented as addressing the 'mode-covering' limitation of sCM, yet the manuscript provides no derivation or analysis showing that the combined objective avoids introducing instabilities or error accumulation of its own at large scale; this assumption is load-bearing for the quality and diversity claims.
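The pairwise diversity metric requested in the first major comment has a simple structure, sketched below. This is a stand-in under stated assumptions: `mean_pairwise_distance` is a hypothetical helper, and plain Euclidean distance on precomputed feature vectors replaces LPIPS, which would require a learned perceptual network. It is not the evaluation protocol of the paper.

```python
import math
from itertools import combinations

def mean_pairwise_distance(features):
    """Average distance over all unordered pairs of sample feature vectors.

    `features` is a list of equal-length numeric vectors, one per generated
    sample for the same prompt. Higher values suggest more diverse samples.
    """
    pairs = list(combinations(features, 2))
    if not pairs:
        raise ValueError("need at least two samples to measure diversity")
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Hypothetical feature vectors: a collapsed set versus a spread-out set.
collapsed = [[1.0, 2.0]] * 4
spread = [[0.0, 0.0], [3.0, 4.0]]
print(mean_pairwise_distance(collapsed), mean_pairwise_distance(spread))
```

A mode-collapsed generator scores near zero on such a metric for a fixed prompt, which is why reporting it for rCM and DMD2 side by side would bear directly on the diversity claim.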
minor comments (2)
- [Abstract] The abstract states 'generally matches' DMD2 on quality metrics but does not specify the exact metrics, datasets, or numerical values; these should be stated explicitly with tables or figures for clarity.
- [Method] Clarify the precise form of the long-skip regularizer (e.g., weighting schedule, which score function is used) in the method section to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the work's potential significance and for the constructive comments on the empirical and theoretical support for our claims. We respond to each major comment below.
Point-by-point responses
-
Referee: [Experiments] Experiments section: The central claim that score distillation as a long-skip regularizer reliably complements the sCM forward-divergence objective, reduces fine-detail error accumulation, and avoids new instabilities or diversity losses at 14B scale lacks supporting quantitative evidence. No ablation studies isolating the regularizer's contribution, no training-curve analysis of gradient norms or mode coverage, and no diversity metrics (e.g., recall, pairwise LPIPS) are reported to substantiate that the mode-seeking term does not trade off coverage.
Authors: We agree that the manuscript would be strengthened by additional quantitative ablations and diversity metrics to isolate the regularizer's contribution. The current results focus on end-to-end comparisons with DMD2 on large-scale models, showing comparable quality with advantages in diversity through qualitative inspection and avoidance of mode collapse. However, we acknowledge the lack of explicit metrics such as recall or pairwise LPIPS and isolated training-curve analyses. In the revised version, we will add ablation studies at smaller scales to quantify the regularizer's impact on fine details and mode coverage, along with reported diversity metrics where computationally feasible. Full ablations at 14B scale remain prohibitive due to resource constraints, which is why we prioritized scalable end-to-end validation.
revision: yes
-
Referee: [§3] §3 (rCM formulation): The integration of score distillation is presented as addressing the 'mode-covering' limitation of sCM, yet the manuscript provides no derivation or analysis showing that the combined objective avoids introducing instabilities or error accumulation of its own at large scale; this assumption is load-bearing for the quality and diversity claims.
Authors: The rCM objective in §3 is constructed by adding score distillation as a long-skip regularizer to the sCM loss, motivated by the complementary properties of forward (mode-covering) and reverse (mode-seeking) divergences as established in prior work on score-based distillation. While we do not include a new formal derivation proving absence of instabilities at arbitrary scale, the successful training and stable convergence on models up to 14B parameters without observed new error accumulation or instabilities provides empirical support for the approach. We will revise §3 to expand the discussion of the combined objective, including a clearer explanation of how the regularizer mitigates accumulation and references to related stability analyses in the literature.
revision: partial
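The construction described in the response to the §3 comment can be written schematically as follows. The additive form matches the description of "adding" a regularizer to the sCM loss, but the weight $\lambda$ and the symbol names are assumptions introduced here; the source does not state the weighting or its schedule.

```latex
\mathcal{L}_{\mathrm{rCM}}(\theta)
  \;=\; \mathcal{L}_{\mathrm{sCM}}(\theta)
  \;+\; \lambda \, \mathcal{L}_{\mathrm{SD}}(\theta),
  \qquad \lambda \ge 0
```

Here $\mathcal{L}_{\mathrm{sCM}}$ stands for the continuous-time consistency loss and $\mathcal{L}_{\mathrm{SD}}$ for the score distillation term. Setting $\lambda = 0$ would recover plain sCM, which is the baseline an ablation of the regularizer would compare against.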
Circularity Check
No circularity: rCM is an explicit combination of prior terms with independent empirical validation
full rationale
The paper's core derivation introduces rCM by adding a score-distillation regularizer to the existing sCM objective to address observed error accumulation in fine details, motivated by the mode-covering vs. mode-seeking divergence properties. This is a design choice, not a self-definitional reduction or fitted parameter renamed as prediction. The JVP kernel development and large-scale training on Cosmos-Predict2/Wan2.1 models constitute independent engineering and validation steps outside any closed loop of the paper's own equations. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are used to force the central claim; results are presented as empirical outcomes rather than tautological consequences of the inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The FlashAttention-2 JVP kernel is numerically stable and correctly implements the required Jacobian-vector products at scale.
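One way such an assumption is commonly checked is to compare a kernel's output with a dense reference JVP and with finite differences. The sketch below shows only that reference check for single-query softmax attention in pure Python. It is not the paper's fused kernel, it does no tiling, and all function names are hypothetical.

```python
import math

def attention(q, K, V):
    """Dense softmax attention for one query vector q; returns (output, probs)."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    out = [sum(p * v[c] for p, v in zip(probs, V)) for c in range(len(V[0]))]
    return out, probs

def attention_jvp(q, K, V, tq, tK, tV):
    """Analytic JVP of attention in the tangent direction (tq, tK, tV)."""
    scale = 1.0 / math.sqrt(len(q))
    out, probs = attention(q, K, V)
    # Tangent of the scores: (tq . k + q . tk) / sqrt(d)
    d_scores = [scale * (sum(a * b for a, b in zip(tq, k)) +
                         sum(a * b for a, b in zip(q, tk)))
                for k, tk in zip(K, tK)]
    # Softmax tangent: p * (ds - sum_j p_j ds_j)
    mean_d = sum(p * ds for p, ds in zip(probs, d_scores))
    d_probs = [p * (ds - mean_d) for p, ds in zip(probs, d_scores)]
    d_out = [sum(dp * v[c] + p * tv[c]
                 for dp, p, v, tv in zip(d_probs, probs, V, tV))
             for c in range(len(V[0]))]
    return out, d_out

def finite_difference_jvp(q, K, V, tq, tK, tV, h=1e-5):
    """Central-difference estimate of the same directional derivative."""
    def shifted(sign):
        q2 = [a + sign * h * b for a, b in zip(q, tq)]
        K2 = [[a + sign * h * b for a, b in zip(k, tk)] for k, tk in zip(K, tK)]
        V2 = [[a + sign * h * b for a, b in zip(v, tv)] for v, tv in zip(V, tV)]
        return attention(q2, K2, V2)[0]
    plus, minus = shifted(1.0), shifted(-1.0)
    return [(a - b) / (2.0 * h) for a, b in zip(plus, minus)]

# Small hypothetical inputs: 4 keys/values of dimension 3.
q = [0.2, -0.4, 0.1]
K = [[0.5, 0.1, -0.3], [-0.2, 0.7, 0.4], [0.9, -0.6, 0.2], [0.0, 0.3, -0.8]]
V = [[1.0, 0.0, 0.5], [-0.5, 0.2, 0.1], [0.3, 0.8, -0.7], [0.6, -0.1, 0.4]]
tq = [0.1, 0.3, -0.2]
tK = [[0.2, -0.1, 0.0], [0.1, 0.1, 0.1], [-0.3, 0.2, 0.4], [0.0, -0.2, 0.1]]
tV = [[0.0, 0.1, -0.1], [0.2, 0.0, 0.3], [-0.1, -0.1, 0.2], [0.4, 0.0, 0.0]]

_, analytic = attention_jvp(q, K, V, tq, tK, tV)
numeric = finite_difference_jvp(q, K, V, tq, tK, tV)
max_err = max(abs(a - b) for a, b in zip(analytic, numeric))
print("max abs error:", max_err)
```

A production check would run the same comparison against the fused kernel at realistic sequence lengths and in the training precision, since agreement on a small float64 example says little about stability at scale.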
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Our investigation reveals fundamental quality limitations of sCM in fine-detail generation, which we attribute to error accumulation and the 'mode-covering' nature of its forward-divergence objective. To remedy this, we propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer.
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
sCM employs the TrigFlow noise schedule ... and the full derivative $\mathrm{d}F_{\theta^-}(x_t, t)/\mathrm{d}t$ can be computed using forward-mode automatic differentiation, Jacobian-vector product (JVP).
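The quoted passage about computing the full time derivative with a JVP can be made concrete with a minimal forward-mode sketch using dual numbers. Several things here are assumptions for illustration: the TrigFlow interpolation is taken as x_t = cos(t) x_0 + sin(t) z, `toy_F` is a hypothetical scalar stand-in for the network, and the tangent for x_t is the derivative along a fixed (x_0, z) path. In distillation the tangent would instead come from the teacher's ODE velocity, and the paper's exact setup may differ.

```python
import math

class Dual:
    """A value paired with a tangent, propagated by forward-mode rules."""
    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.tan + other.tan)
    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule for the tangent component.
        return Dual(self.val * other.val,
                    self.tan * other.val + self.val * other.tan)
    __rmul__ = __mul__

def d_sin(a):
    return Dual(math.sin(a.val), math.cos(a.val) * a.tan)

def d_cos(a):
    return Dual(math.cos(a.val), -math.sin(a.val) * a.tan)

def d_tanh(a):
    y = math.tanh(a.val)
    return Dual(y, (1.0 - y * y) * a.tan)

def toy_F(x, t):
    # Hypothetical stand-in for F(x_t, t); not the paper's network.
    return d_tanh(0.7 * x) * d_cos(t) + 0.3 * d_sin(t) * x

def total_time_derivative(x0, z, t):
    """Return (F(x_t, t), dF/dt) with one forward pass.

    The JVP uses the tangent direction (dx_t/dt, dt/dt = 1).
    """
    x_t = math.cos(t) * x0 + math.sin(t) * z
    dx_dt = -math.sin(t) * x0 + math.cos(t) * z
    out = toy_F(Dual(x_t, dx_dt), Dual(t, 1.0))
    return out.val, out.tan

value, dF_dt = total_time_derivative(0.8, -0.5, 0.6)
print(value, dF_dt)
```

The point of the forward-mode formulation is that the derivative arrives in the same pass as the value, without building a backward graph. That is why the attention layers of a large model need their own JVP implementation.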
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 14 Pith papers
-
AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
AnyFlow enables any-step video diffusion by distilling flow-map transitions over arbitrary time intervals with on-policy backward simulation.
-
HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation
HorizonDrive enables stable long-horizon autoregressive driving simulation via anti-drifting teacher training with scheduled rollout recovery and teacher rollout distillation.
-
Efficient Video Diffusion Models: Advancements and Challenges
A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.
-
1.x-Distill: Breaking the Diversity, Quality, and Efficiency Barrier in Distribution Matching Distillation
1.x-Distill achieves better quality and diversity than prior few-step distillation methods at 1.67 and 1.74 effective NFEs on SD3 models with up to 33x speedup.
-
D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models
D-OPSD enables continuous supervised fine-tuning of few-step diffusion models via on-policy self-distillation where the model acts as both teacher (multimodal context) and student (text-only context) on its own roll-outs.
-
Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation
Mutual Forcing trains a single native autoregressive audio-video model with mutually reinforcing few-step and multi-step modes via self-distillation to match 50-step baselines at 4-8 steps.
-
Alice v1: Distillation-Enhanced Video Generation Surpassing Closed-Source Models
Alice v1 is an open video model that surpasses its teacher and closed-source systems like Veo3 and Sora2 in quality while running 7x faster through specialized distillation.
-
Extending One-Step Image Generation from Class Labels to Text via Discriminative Text Representation
By requiring and using highly discriminative LLM text features, the work enables the first effective one-step text-conditioned image generation with MeanFlow.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
-
Long-Horizon Streaming Video Generation via Hybrid Attention with Decoupled Distillation
Hybrid Forcing combines linear temporal attention for long-range retention, block-sparse attention for efficiency, and decoupled distillation to achieve real-time unbounded 832x480 streaming video generation at 29.5 FPS.
-
Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Salt improves low-step video generation quality by adding endpoint-consistent regularization to distribution matching distillation and using cache-conditioned feature alignment for autoregressive models.
-
Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length
Live Avatar enables 45 FPS real-time streaming infinite-length audio-driven avatar generation from a 14B diffusion model via distillation and timestep-forcing pipeline parallelism.
-
Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
-
World Simulation with Video Foundation Models for Physical AI
Cosmos-Predict2.5 unifies text-to-world, image-to-world, and video-to-world generation in one model trained on 200M clips with RL post-training, delivering improved quality and control for physical AI.