SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation
Pith reviewed 2026-06-29 08:14 UTC · model grok-4.3
The pith
SGMD distills video diffusion models to 4 steps with roughly 3 times faster training and better motion dynamics by directly matching score gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately ~3 imes training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency.
What carries the argument
Dual potentials (negative-residual outer-loop correction and residual-contraction inner-loop tracking) that realize score-gradient matching to a fixed teacher Fisher objective.
If this is right
- 4-step distilled video models exhibit substantially improved motion dynamics.
- Training cost drops by a factor of approximately three relative to DMD2.
- Temporal consistency remains comparable to prior distilled models.
- Human preference shifts toward SGMD on motion quality and overall preference while visual quality and text alignment stay similar.
Where Pith is reading between the lines
- The same dual-potential construction could be tested on image diffusion distillation to check whether similar speed-ups appear outside video.
- Stronger motion preservation may make few-step video models more usable for downstream tasks such as video editing or short-clip synthesis.
- If the stop-gradient Fisher objective proves robust, it might replace reverse-KL terms in other score-based distillation pipelines.
Load-bearing premise
The gradient analysis under ideal tracking holds and the teacher stop-gradient Fisher supplies a stable distribution-matching objective that the dual potentials can implement reliably in practice.
What would settle it
A training run in which the fake score deviates from close tracking of the teacher yet SGMD still reports the claimed 3 imes speedup and motion gains would falsify the necessity of the ideal-tracking premise.
Figures
read the original abstract
Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose \textbf{Score Gradient Matching Distillation (SGMD)}. SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately $\sim 3\times$ training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Score Gradient Matching Distillation (SGMD) to address challenges in DMD-style distillation for few-step video diffusion models. It replaces reverse-KL matching with direct optimization of the fake score toward the teacher via a stop-gradient Fisher objective, motivated by a gradient analysis under ideal tracking. Dual potentials (negative-residual NR for outer-loop correction and residual-contraction RC for inner-loop tracking) are introduced to implement this objective. Empirical claims include an approximately 3× training speedup over DMD2, improved motion dynamics in 4-step models while preserving temporal consistency, and human-study preference for motion quality and overall preference.
Significance. If the gradient analysis and dual-potential implementation prove stable, SGMD could meaningfully reduce the training cost of few-step video diffusion while improving motion fidelity, a practical bottleneck in current distillation pipelines. The public code release is a clear strength for reproducibility.
major comments (2)
- [Gradient analysis (§3)] Gradient analysis (abstract and §3): the motivating analysis is derived under the assumption of ideal tracking; the manuscript provides no explicit derivation or contraction argument showing that the stop-gradient Fisher remains a stable distribution-matching objective once the generator evolves rapidly in high-dimensional video settings.
- [Dual potentials (§4)] Dual potentials (abstract and §4): no derivation is given establishing that the NR outer-loop correction and RC inner-loop tracking together enforce the intended teacher stop-gradient Fisher objective without additional assumptions on residual boundedness or contraction; the empirical gains could therefore stem from implementation details rather than the claimed objective.
minor comments (2)
- The human-study protocol and exact quantitative motion metrics (beyond the ~3× speedup claim) should be reported with error bars and baseline comparisons in the main text or appendix.
- Notation for the fake-score and Fisher terms is introduced without a consolidated table of symbols; this would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the gradient analysis and dual potentials. We address each major comment below and will revise the manuscript to improve clarity on assumptions and limitations.
read point-by-point responses
-
Referee: [Gradient analysis (§3)] Gradient analysis (abstract and §3): the motivating analysis is derived under the assumption of ideal tracking; the manuscript provides no explicit derivation or contraction argument showing that the stop-gradient Fisher remains a stable distribution-matching objective once the generator evolves rapidly in high-dimensional video settings.
Authors: We agree that the gradient analysis in §3 is presented under the ideal tracking assumption, as explicitly noted in the manuscript. This assumption is used to motivate the stop-gradient Fisher objective via a simplified gradient perspective. The manuscript does not include an explicit contraction argument or stability proof for the case where the generator evolves rapidly in high-dimensional video spaces. In the revision, we will expand §3 to more prominently state this assumption and limitation, and we will add a brief discussion of practical stability supported by the observed training dynamics and ablation results. A full theoretical guarantee for non-ideal high-dimensional settings is beyond the current scope. revision: partial
-
Referee: [Dual potentials (§4)] Dual potentials (abstract and §4): no derivation is given establishing that the NR outer-loop correction and RC inner-loop tracking together enforce the intended teacher stop-gradient Fisher objective without additional assumptions on residual boundedness or contraction; the empirical gains could therefore stem from implementation details rather than the claimed objective.
Authors: We acknowledge that the manuscript introduces the NR and RC dual potentials as a practical implementation of the stop-gradient Fisher objective without providing a formal derivation that they enforce it exactly in the absence of assumptions on residual boundedness or contraction. The design is motivated by the gradient analysis and implemented to separate outer-loop correction (NR) from inner-loop tracking (RC). The empirical results, including the reported ~3× training speedup and improved motion dynamics, serve as validation. In the revision, we will add further details in §4 and an appendix on the design rationale, any implicit assumptions, and additional ablations to better separate the contribution of the objective from implementation choices. revision: yes
- Providing an explicit derivation or contraction argument showing stability of the stop-gradient Fisher objective when the generator evolves rapidly in high-dimensional video settings.
- Deriving that the NR and RC potentials together enforce the teacher stop-gradient Fisher objective without additional assumptions on residual boundedness or contraction.
Circularity Check
No circularity: gradient analysis presented as independent motivation for objective
full rationale
The paper's central derivation is a gradient analysis under ideal tracking that motivates adopting the teacher stop-gradient Fisher as the distribution-matching objective, with NR/RC dual potentials as the implementation mechanism. This is described as derived rather than fitted or self-referential, and the empirical claims (~3× speedup, motion improvements) are presented as measured outcomes rather than inputs. No self-definitional equations, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling are exhibited in the abstract or described chain. The analysis remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URLhttps://arxiv.org/abs/2412.07583. Cai, X., Huang, Q., Kang, Z., Li, H., Liang, S., Ma, L., Ren, S., Wei, X., Xie, R., and Zhang, T. Longcat-video technical report.arXiv preprint arXiv:2510.22200,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Chen, G., Huang, S., Liu, K., Zhu, J., Qu, X., Chen, P., Cheng, Y ., and Sun, Y . Flash-dmd: Towards high- fidelity few-step image generation with efficient distil- lation and joint reinforcement learning.arXiv preprint arXiv:2511.20549,
-
[3]
Fan, X., Qiu, Z., Wu, Z., Wang, F., Lin, Z., Ren, T., Lin, D., Gong, R., and Yang, L. Phased DMD: few-step dis- tribution matching distillation via score matching within subintervals.arXiv preprint arXiv:2510.27684,
-
[4]
PipeFusion: Patch-level Pipeline Parallelism for Diffusion Transformers Inference
Fang, J., Pan, J., Wang, J., Li, A., and Sun, X. Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference.arXiv preprint arXiv:2405.14430,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
URLhttps: //arxiv.org/abs/1406.2661. Ho, J. and Salimans, T. Classifier-free diffusion guid- ance,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Classifier-Free Diffusion Guidance
URL https://arxiv.org/abs/ 2207.12598. Ho, J., Jain, A., and Abbeel, P. Denoising diffusion prob- abilistic models. InAdvances in Neural Information Processing Systems (NeurIPS), pp. 6840–6851,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Self forcing: Bridging the train-test gap in autoregres- sive video diffusion
Huang, X., Li, Z., He, G., Zhou, M., and Shechtman, E. Self forcing: Bridging the train-test gap in autoregres- sive video diffusion. InAdvances in Neural Information Processing Systems (NeurIPS), 2025a. Huang, Y ., Gong, R., Liu, J., Chen, T., and Liu, X. TFMQ- DM: temporal feature maintenance quantization for dif- fusion models. InIEEE/CVF Conference on...
-
[8]
Decoupled Weight Decay Regularization
URL https://arxiv.org/abs/ 1711.05101. 9 Score Gradient Matching Distillation Luo, W., Huang, Z., Geng, Z., Kolter, J. Z., and Qi, G. One- step diffusion distillation through score implicit matching. InAdvances in Neural Information Processing Systems (NeurIPS),
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Team, T. H. F. M. Hunyuanvideo 1.5 technical report.arXiv preprint arXiv:2511.18870,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
Wan: Open and Advanced Large-Scale Video Generative Models
Wang, A., Ai, B., Wen, B., Mao, C., Xie, C., Chen, D., Yu, F., Zhao, H., Yang, J., Zeng, J., Wang, J., Zhang, J., Zhou, J., Wang, J., Chen, J., Zhu, K., Zhao, K., Yan, K., Huang, L., Meng, X., Zhang, N., Li, P., Wu, P., Chu, R., Feng, R., Zhang, S., Sun, S., Fang, T., Wang, T., Gui, T., Weng, T., Shen, T., Lin, W., Wang, W., Wang, W., Zhou, W., Wang, W., ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
URL https://arxiv.org/abs/2304.11277. Zheng, K., Wang, Y ., Ma, Q., Chen, H., Zhang, J., Balaji, Y ., Chen, J., Liu, M.-Y ., Zhu, J., and Zhang, Q. Large scale diffusion distillation via score-regularized continuous- time consistency. InInternational Conference on Learn- ing Representations (ICLR),
work page internal anchor Pith review Pith/arXiv arXiv
-
[12]
as the objective
10 Score Gradient Matching Distillation A. Additional Proofs A.1. A formal justification of the fake-score perspective We formalize the fake-score perspective without committing to any particular loss form. Let sfake(·, t) be the learned fake score andq θ,t be the generator-induced noisy-state distribution. Define the score-consistency set (a constraint m...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.