
arxiv: 2605.30116 · v1 · pith:D5TRH7WQnew · submitted 2026-05-28 · 💻 cs.CV · cs.LG

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

Pith reviewed 2026-06-29 08:14 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords video diffusion · model distillation · few-step generation · score gradient matching · motion dynamics · distribution matching · generative models

The pith

SGMD distills video diffusion models to 4 steps with roughly 3 times faster training and better motion dynamics by directly matching score gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Distribution Matching Distillation struggles in video settings because the fake score must track an evolving generator at high cost and reverse-KL matching tends to suppress motion. SGMD switches to a fake-score view that optimizes the generator's score directly toward the teacher's score and substitutes a stop-gradient Fisher objective for stability. It supplies two dual potentials, negative-residual for outer-loop correction and residual-contraction for inner-loop tracking, to realize this objective. The resulting 4-step models train about three times faster than DMD2 baselines while showing stronger motion and unchanged temporal consistency. Human raters prefer the outputs on motion quality and overall preference.
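The tracking cost described above can be illustrated with a toy sketch. It is not the SGMD method and not the DMD2 training code: it assumes a 1D generator N(theta, 1), a teacher N(teacher_mean, 1), and a fake score -(x - psi), so that the DMD-style generator gradient reduces to psi - teacher_mean. The function name, learning rates, and step counts are all hypothetical.

```python
def run_alternating(inner_steps, outer_steps=200, lr_gen=0.1, lr_fake=0.2,
                    teacher_mean=2.0):
    """Toy alternating loop (illustrative assumption, not the paper's code).

    Generator: N(theta, 1). Fake score: -(x - psi). Teacher: N(teacher_mean, 1).
    """
    theta = 0.0    # generator mean
    psi = 0.0      # fake-score estimate of the generator mean
    max_lag = 0.0  # largest gap |psi - theta| seen after the inner loop
    for _ in range(outer_steps):
        # Inner loop: the fake score tracks the current generator by
        # descending 0.5 * (psi - theta)^2.
        for _ in range(inner_steps):
            psi -= lr_fake * (psi - theta)
        max_lag = max(max_lag, abs(psi - theta))
        # Outer loop: for unit-variance Gaussians the score difference
        # s_fake(x) - s_teacher(x) equals psi - teacher_mean for every x.
        theta -= lr_gen * (psi - teacher_mean)
    return theta, max_lag


theta_1, lag_1 = run_alternating(inner_steps=1)
theta_5, lag_5 = run_alternating(inner_steps=5)
print(f"1 inner step : theta={theta_1:.4f}, max lag={lag_1:.4f}")
print(f"5 inner steps: theta={theta_5:.4f}, max lag={lag_5:.4f}")
```

In this toy, more inner steps per generator update shrink the tracking lag, but each outer iteration costs proportionally more fake-score updates. That is the trade-off the summary attributes to DMD-style video distillation.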

Core claim

SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately 3× training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency.
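For orientation, the standard Fisher divergence between two densities is written out below, followed by one plausible reading of a "teacher stop-gradient" variant. The second line is an assumption made for illustration. The abstract does not give the exact loss, its weighting over time steps, or which parameters receive gradients, and the symbols are assumed notation rather than the paper's own.

```latex
% Assumed notation: q_{\theta,t} is the generator-induced noisy distribution at time t,
% s_{\mathrm{fake}} is the learned fake score, s_{\mathrm{teacher}} is the frozen teacher
% score, and \operatorname{sg}[\cdot] is the stop-gradient operator.

% Standard Fisher divergence between densities q and p:
D_{\mathrm{F}}(q \,\|\, p)
  = \mathbb{E}_{x \sim q}\!\left[ \left\| \nabla_x \log q(x) - \nabla_x \log p(x) \right\|_2^2 \right]

% Illustrative score-difference loss with the teacher treated as a constant:
\mathcal{L}
  = \mathbb{E}_{t,\; x_t \sim q_{\theta,t}}\!\left[
      \left\| s_{\mathrm{fake}}(x_t, t) - \operatorname{sg}\!\left[ s_{\mathrm{teacher}}(x_t, t) \right] \right\|_2^2
    \right]
```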

What carries the argument

Dual potentials (negative-residual outer-loop correction and residual-contraction inner-loop tracking) that realize score-gradient matching to a fixed teacher Fisher objective.

If this is right

  • 4-step distilled video models exhibit substantially improved motion dynamics.
  • Training cost drops by a factor of approximately three relative to DMD2.
  • Temporal consistency remains comparable to prior distilled models.
  • Human preference shifts toward SGMD on motion quality and overall preference while visual quality and text alignment stay similar.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dual-potential construction could be tested on image diffusion distillation to check whether similar speed-ups appear outside video.
  • Stronger motion preservation may make few-step video models more usable for downstream tasks such as video editing or short-clip synthesis.
  • If the stop-gradient Fisher objective proves robust, it might replace reverse-KL terms in other score-based distillation pipelines.

Load-bearing premise

The gradient analysis under ideal tracking holds and the teacher stop-gradient Fisher supplies a stable distribution-matching objective that the dual potentials can implement reliably in practice.

What would settle it

A training run in which the fake score deviates from close tracking of the teacher yet SGMD still reports the claimed 3× speedup and motion gains would falsify the necessity of the ideal-tracking premise.

Figures

Figures reproduced from arXiv: 2605.30116 by Dahua Lin, Lei Yang, Ruihao Gong, Xianglong Liu, Xiangyu Fan, Yang Yong, Yushi Huang, Zhuguanyu Wu.

Figure 1
Motivating 1D mixture-fitting example. Reverse-KL-style matching tends to produce a conservative fit that avoids low-density regions of the target distribution, while Fisher divergence yields a smoother score-matching signal.
Figure 2
Gradient behaviors under the fake-score perspective. Arrows show the net one-iteration direction on x_fake induced by coupled (θ, ψ) updates: Fisher can be bent by coupling-induced tracking lag; SIM may become conservative; SGMD restores the desired direction via NR and RC.
Figure 3
Qualitative comparison (SGMD vs. DMD2). Under comparable perceptual sharpness and visual quality, SGMD shows clearer temporal progression and larger motion changes across frames while maintaining good temporal consistency. For each 81-frame video, we show frames {0, 16, 32, 48, 64, 80} as a preview.
Original abstract

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and conservative for preserving strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately 3× training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.
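The mode-seeking tendency of reverse KL mentioned in the abstract can be checked numerically on a 1D toy problem. The sketch below is an illustration under assumed settings, not a reproduction of the paper's Figure 1: the bimodal target, the two candidate Gaussian fits, and all function names are hypothetical choices.

```python
import math


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def target_pdf(x):
    # Assumed bimodal target: equal mixture of N(-3, 1) and N(3, 1).
    return 0.5 * normal_pdf(x, -3.0, 1.0) + 0.5 * normal_pdf(x, 3.0, 1.0)


def reverse_kl(mean, std, lo=-20.0, hi=20.0, n=4001):
    # KL(q || p) = integral of q * log(q / p), approximated by a Riemann sum
    # on a uniform grid; q is the Gaussian fit, p is the target.
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        x = lo + i * dx
        q = normal_pdf(x, mean, std)
        p = target_pdf(x)
        if q > 0.0 and p > 0.0:
            total += q * math.log(q / p) * dx
    return total


# Fit A sits on one mode only; fit B is the moment-matched Gaussian
# (mean 0, variance 1 + 3^2 = 10) that covers both modes.
kl_single_mode = reverse_kl(3.0, 1.0)
kl_covering = reverse_kl(0.0, math.sqrt(10.0))
print(f"reverse KL, single-mode fit: {kl_single_mode:.3f}")
print(f"reverse KL, covering fit   : {kl_covering:.3f}")
```

Reverse KL scores the single-mode fit as better (its value is close to log 2) even though that fit ignores half of the target mass. This is the conservative behaviour the abstract associates with reverse-KL-style matching.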

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Score Gradient Matching Distillation (SGMD) to address challenges in DMD-style distillation for few-step video diffusion models. It replaces reverse-KL matching with direct optimization of the fake score toward the teacher via a stop-gradient Fisher objective, motivated by a gradient analysis under ideal tracking. Dual potentials (negative-residual NR for outer-loop correction and residual-contraction RC for inner-loop tracking) are introduced to implement this objective. Empirical claims include an approximately 3× training speedup over DMD2, improved motion dynamics in 4-step models while preserving temporal consistency, and human-study preference for motion quality and overall preference.

Significance. If the gradient analysis and dual-potential implementation prove stable, SGMD could meaningfully reduce the training cost of few-step video diffusion while improving motion fidelity, a practical bottleneck in current distillation pipelines. The public code release is a clear strength for reproducibility.

major comments (2)
  1. [Gradient analysis (abstract and §3)] The motivating analysis is derived under the assumption of ideal tracking; the manuscript provides no explicit derivation or contraction argument showing that the stop-gradient Fisher remains a stable distribution-matching objective once the generator evolves rapidly in high-dimensional video settings.
  2. [Dual potentials (abstract and §4)] No derivation is given establishing that the NR outer-loop correction and RC inner-loop tracking together enforce the intended teacher stop-gradient Fisher objective without additional assumptions on residual boundedness or contraction; the empirical gains could therefore stem from implementation details rather than the claimed objective.
minor comments (2)
  1. The human-study protocol and exact quantitative motion metrics (beyond the ~3× speedup claim) should be reported with error bars and baseline comparisons in the main text or appendix.
  2. Notation for the fake-score and Fisher terms is introduced without a consolidated table of symbols; this would improve readability.

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback on the gradient analysis and dual potentials. We address each major comment below and will revise the manuscript to improve clarity on assumptions and limitations.

point-by-point responses
  1. Referee: [Gradient analysis (abstract and §3)] The motivating analysis is derived under the assumption of ideal tracking; the manuscript provides no explicit derivation or contraction argument showing that the stop-gradient Fisher remains a stable distribution-matching objective once the generator evolves rapidly in high-dimensional video settings.

    Authors: We agree that the gradient analysis in §3 is presented under the ideal tracking assumption, as explicitly noted in the manuscript. This assumption is used to motivate the stop-gradient Fisher objective via a simplified gradient perspective. The manuscript does not include an explicit contraction argument or stability proof for the case where the generator evolves rapidly in high-dimensional video spaces. In the revision, we will expand §3 to more prominently state this assumption and limitation, and we will add a brief discussion of practical stability supported by the observed training dynamics and ablation results. A full theoretical guarantee for non-ideal high-dimensional settings is beyond the current scope.
    revision: partial

  2. Referee: [Dual potentials (abstract and §4)] No derivation is given establishing that the NR outer-loop correction and RC inner-loop tracking together enforce the intended teacher stop-gradient Fisher objective without additional assumptions on residual boundedness or contraction; the empirical gains could therefore stem from implementation details rather than the claimed objective.

    Authors: We acknowledge that the manuscript introduces the NR and RC dual potentials as a practical implementation of the stop-gradient Fisher objective without providing a formal derivation that they enforce it exactly in the absence of assumptions on residual boundedness or contraction. The design is motivated by the gradient analysis and implemented to separate outer-loop correction (NR) from inner-loop tracking (RC). The empirical results, including the reported ~3× training speedup and improved motion dynamics, serve as validation. In the revision, we will add further details in §4 and an appendix on the design rationale, any implicit assumptions, and additional ablations to better separate the contribution of the objective from implementation choices.
    revision: yes

standing simulated objections not resolved
  • Providing an explicit derivation or contraction argument showing stability of the stop-gradient Fisher objective when the generator evolves rapidly in high-dimensional video settings.
  • Deriving that the NR and RC potentials together enforce the teacher stop-gradient Fisher objective without additional assumptions on residual boundedness or contraction.

Circularity Check

0 steps flagged

No circularity: gradient analysis presented as independent motivation for objective

full rationale

The paper's central derivation is a gradient analysis under ideal tracking that motivates adopting the teacher stop-gradient Fisher as the distribution-matching objective, with NR/RC dual potentials as the implementation mechanism. This is described as derived rather than fitted or self-referential, and the empirical claims (~3× speedup, motion improvements) are presented as measured outcomes rather than inputs. No self-definitional equations, fitted inputs renamed as predictions, load-bearing self-citations, or ansatz smuggling are exhibited in the abstract or described chain. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No details on free parameters, axioms, or invented entities are extractable from the abstract alone.


