Salt: Self-Consistent Distribution Matching with Cache-Aware Training for Fast Video Generation
Pith reviewed 2026-05-13 19:35 UTC · model grok-4.3
The pith
Salt distills video models to 2-4 steps by regularizing the endpoint consistency of consecutive denoising updates and conditioning on KV cache states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Self-Consistent Distribution Matching Distillation (SC-DMD) explicitly regularizes the endpoint-consistent composition of consecutive denoising updates so that multi-step rollouts avoid drift. Cache-Distribution-Aware training treats the KV cache as a quality-parameterized condition and adds cache-conditioned feature alignment to steer low-quality autoregressive outputs toward high-quality references. Together, these yield higher-quality video at 2-4 NFEs across the tested non-autoregressive and autoregressive architectures.
What carries the argument
Self-Consistent Distribution Matching Distillation (SC-DMD), which enforces endpoint consistency across consecutive denoising updates, together with cache-conditioned feature alignment that uses the KV cache as a conditioning variable.
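The endpoint-consistency idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact objective: the `student(x, t_from, t_to)` interface, the squared distance, and the stop-gradient placement are all assumptions made for the sketch.

```python
import torch

def sc_dmd_loss(student, x_ts, t_s, t_m, t_e):
    """Illustrative endpoint self-consistency regularizer (hypothetical names).

    student(x, t_from, t_to) maps a noisy sample at t_from to a predicted
    endpoint at t_to. The loss penalizes the gap between the direct jump
    t_s -> t_e and the composed path t_s -> t_m -> t_e, i.e. the
    "semigroup defect" of the student update.
    """
    x_direct = student(x_ts, t_s, t_e)     # one-jump endpoint
    x_mid = student(x_ts, t_s, t_m)        # first half-step
    x_composed = student(x_mid, t_m, t_e)  # endpoint via composition
    # Stopping gradients on one branch is a common stabilization choice.
    return torch.mean((x_direct - x_composed.detach()) ** 2)
```

The loss is zero exactly when the two routes to t_e agree, which is the drift-prevention property the review discusses.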
If this is right
- Low-NFE video quality improves on non-autoregressive backbones such as Wan 2.1.
- Autoregressive real-time models such as Self Forcing gain quality while remaining compatible with existing KV-cache mechanisms.
- Sharp, mode-seeking samples are recovered without the conservative smoothing typical of trajectory consistency distillation.
- The method adds no extra inference cost or memory overhead beyond the original backbone.
Where Pith is reading between the lines
- The same endpoint-consistency idea could be tested on image or audio generation tasks that also rely on multi-step sampling.
- Cache-aware alignment might extend naturally to streaming or online generation where the cache state evolves over time.
- Combining the regularization with other acceleration methods such as step-size scheduling could be checked for additive gains.
Load-bearing premise
That enforcing endpoint consistency on composed denoising updates will prevent drift in full rollouts and that cache-conditioned feature alignment will reliably improve quality without creating new inconsistencies.
What would settle it
Quantitative comparison of motion consistency and perceptual sharpness metrics on identical prompts at 2-4 NFEs between Salt and baseline distribution matching distillation, checking whether trajectory drift or over-smoothing visibly decreases.
Original abstract
Distilling video generation models to extremely low inference budgets (e.g., 2-4 NFEs) is crucial for real-time deployment, yet remains challenging. Trajectory-style consistency distillation often becomes conservative under complex video dynamics, yielding an over-smoothed appearance and weak motion. Distribution matching distillation (DMD) can recover sharp, mode-seeking samples, but its local training signals do not explicitly regularize how denoising updates compose across timesteps, making composed rollouts prone to drift. To overcome this challenge, we propose Self-Consistent Distribution Matching Distillation (SC-DMD), which explicitly regularizes the endpoint-consistent composition of consecutive denoising updates. For real-time autoregressive video generation, we further treat the KV cache as a quality-parameterized condition and propose Cache-Distribution-Aware training. This training scheme applies SC-DMD over multi-step rollouts and introduces a cache-conditioned feature alignment objective that steers low-quality outputs toward high-quality references. Across extensive experiments on both non-autoregressive backbones (e.g., Wan 2.1) and autoregressive real-time paradigms (e.g., Self Forcing), our method, dubbed Salt, consistently improves low-NFE video generation quality while remaining compatible with diverse KV-cache memory mechanisms. Source code will be released at https://github.com/XingtongGe/Salt.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Salt for distilling video generation models to low NFEs (2-4). It proposes Self-Consistent Distribution Matching Distillation (SC-DMD) that adds explicit regularization on the endpoint-consistent composition of consecutive denoising updates to reduce drift in composed rollouts, and Cache-Distribution-Aware training that treats the KV cache as a quality-conditioned input, applies SC-DMD over multi-step autoregressive rollouts, and adds a cache-conditioned feature alignment loss to steer outputs toward high-quality references. Experiments on non-autoregressive backbones (e.g., Wan 2.1) and autoregressive paradigms (e.g., Self Forcing) report consistent quality gains at low NFEs while remaining compatible with diverse KV-cache mechanisms.
Significance. If the added regularization demonstrably closes the composition gap for high-dimensional video dynamics and the cache alignment improves quality without new inconsistencies, the work would meaningfully extend distribution-matching distillation to practical real-time video generation. The compatibility with both non-autoregressive and autoregressive KV-cache setups, plus the promise of open-sourced code, would strengthen its utility for deployment.
major comments (2)
- [§3.1] §3.1 (SC-DMD formulation): the central claim that explicit endpoint-consistent regularization prevents drift in low-NFE rollouts is load-bearing, yet the manuscript provides no derivation showing that the added term closes the composition gap beyond the local signals already present in standard DMD; without this or an ablation isolating the regularization's effect on accumulated error over timesteps, the improvement over baseline DMD remains unverified for complex motions.
- [§4.3] §4.3 (Cache-Distribution-Aware training): the cache-conditioned feature alignment is asserted to steer low-quality outputs toward references without introducing new inconsistencies, but the reported experiments contain no direct metric (e.g., temporal consistency or endpoint mismatch) quantifying whether the alignment term creates fresh drift or artifacts in autoregressive rollouts, which is required to support the claim for real-time paradigms.
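The ablation asked for in the first major comment could be probed with something like the following sketch, which measures how far a K-step composed rollout lands from the direct one-jump endpoint. The `student(x, t_from, t_to)` interface and all names are hypothetical.

```python
import torch

def accumulated_drift(student, x0, schedule):
    """Hypothetical probe: mean squared gap between a composed rollout
    and the direct one-jump endpoint over a timestep schedule.

    schedule: decreasing timesteps [t_0, t_1, ..., t_K].
    A gap of zero means the student's updates compose exactly,
    i.e. no drift accumulates across the rollout.
    """
    x_direct = student(x0, schedule[0], schedule[-1])  # single jump
    x = x0
    for t_a, t_b in zip(schedule[:-1], schedule[1:]):
        x = student(x, t_a, t_b)  # compose consecutive updates
    return torch.mean((x - x_direct) ** 2).item()
```

Plotted against rollout length for baseline DMD versus SC-DMD, this quantity would isolate the regularization's effect on accumulated error.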
minor comments (2)
- [Abstract] The abstract and §1 could more precisely state the exact quantitative metrics (e.g., FVD, CLIP score) and NFE settings used to claim 'consistent improvements'.
- [§3.2] Notation for the KV-cache conditioning in Eq. (X) is introduced without an explicit diagram showing how the cache state is injected into the feature alignment loss.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation and empirical support.
Point-by-point responses
-
Referee: [§3.1] §3.1 (SC-DMD formulation): the central claim that explicit endpoint-consistent regularization prevents drift in low-NFE rollouts is load-bearing, yet the manuscript provides no derivation showing that the added term closes the composition gap beyond the local signals already present in standard DMD; without this or an ablation isolating the regularization's effect on accumulated error over timesteps, the improvement over baseline DMD remains unverified for complex motions.
Authors: We appreciate this observation. In the revised manuscript we have added an explicit derivation in Section 3.1 (and expanded in Appendix A) showing that the endpoint-consistent regularization term penalizes discrepancies between the composed multi-step trajectory and the direct endpoint mapping, thereby addressing the composition gap that is invisible to the per-step local signals of standard DMD. We have also inserted a targeted ablation in Section 4.2 that isolates the regularization's contribution by measuring accumulated temporal error over long rollouts on complex motion sequences, confirming a measurable reduction in drift relative to baseline DMD. revision: yes
-
Referee: [§4.3] §4.3 (Cache-Distribution-Aware training): the cache-conditioned feature alignment is asserted to steer low-quality outputs toward references without introducing new inconsistencies, but the reported experiments contain no direct metric (e.g., temporal consistency or endpoint mismatch) quantifying whether the alignment term creates fresh drift or artifacts in autoregressive rollouts, which is required to support the claim for real-time paradigms.
Authors: We agree that direct quantification is necessary. In the revised Section 4.3 we now report temporal consistency (optical-flow-based frame-to-frame coherence) and endpoint mismatch metrics on autoregressive rollouts. These measurements show that the cache-conditioned feature alignment improves fidelity to high-quality references while keeping both consistency and endpoint error at or below the levels observed with the unaligned baseline, supporting the claim that no new drift is introduced. revision: yes
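As a crude stand-in for the optical-flow-based coherence metric the rebuttal describes (the authors' exact metric is not specified here), frame-to-frame temporal consistency can be approximated by consecutive-frame differences:

```python
import numpy as np

def temporal_consistency(frames):
    """Rough proxy for frame-to-frame coherence (the rebuttal's metric
    is optical-flow based; this simple difference is only a stand-in).

    frames: array of shape (T, H, W, C) with values in [0, 1].
    Returns the mean squared difference between consecutive frames;
    lower values indicate smoother rollouts.
    """
    frames = np.asarray(frames, dtype=np.float64)
    diffs = frames[1:] - frames[:-1]  # (T-1, H, W, C) pairwise deltas
    return float(np.mean(diffs ** 2))
```

A full evaluation would warp frames by estimated optical flow before differencing, so that genuine motion is not penalized as incoherence.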
Circularity Check
No circularity: new regularization terms and training scheme introduced independently
Full rationale
The paper proposes SC-DMD as an explicit regularization of endpoint-consistent composition of denoising updates on top of standard DMD, plus a cache-conditioned feature alignment objective for autoregressive rollouts. These are framed as novel additions to address drift in low-NFE video generation, without any equations or claims reducing to self-citations, fitted parameters renamed as predictions, or ansatzes smuggled from prior author work. The derivation chain builds on established distribution matching principles with independent methodological content that does not collapse by construction to its inputs. No load-bearing steps exhibit self-definitional loops or uniqueness imported from overlapping citations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: distribution matching distillation recovers sharp, mode-seeking samples from teacher models.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relevance unclear. SC-DMD augments DMD with a shortcut self-consistency regularizer L_SC = E[d(x^(1)_{t_e}, x^(2)_{t_e})], where x^(1)_{t_e} = Ψ^{t_s→t_e}_θ(x_{t_s}) and x^(2)_{t_e} is the two-step composition, enforcing a semigroup-defect penalty on the student Euler operator.
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · relevance unclear. Cache-conditioned feature alignment L_align acts on relational matrices R_low and R_ref for mixed K ∈ {2, 4, 8} rollouts.
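Rendered in standard notation, the self-consistency objective quoted in the first entry reads (a reconstruction from the snippet, with t_m denoting the intermediate timestep of the two-step composition):

```latex
\mathcal{L}_{\mathrm{SC}}
  = \mathbb{E}\!\left[\, d\big(x^{(1)}_{t_e},\, x^{(2)}_{t_e}\big) \right],
\qquad
x^{(1)}_{t_e} = \Psi^{\,t_s \to t_e}_{\theta}(x_{t_s}),
\qquad
x^{(2)}_{t_e} = \Psi^{\,t_m \to t_e}_{\theta}\!\big(\Psi^{\,t_s \to t_m}_{\theta}(x_{t_s})\big),
```

so that L_SC vanishes exactly when the student's Euler operator has zero semigroup defect, i.e. when composed updates agree with the direct jump.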
Reference graph
Works this paper leans on
- [1] Boffi, N.M., Albergo, M.S., Vanden-Eijnden, E.: Flow map matching with stochastic interpolants: A mathematical framework for consistency models. arXiv preprint arXiv:2406.07507 (2024)
- [2] Boffi, N.M., Albergo, M.S., Vanden-Eijnden, E.: How to build a consistency model: Learning flow maps via self-distillation. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems
- [3] Cai, X., Wu, Y., Chen, Q., Wu, H., Xiang, L., Wen, H.: Shortcutting pretrained flow matching diffusion models is almost free lunch. arXiv preprint arXiv:2510.17858 (2025)
- [4] Chen, G., Lin, D., Yang, J., Lin, C., Zhu, J., Fan, M., Zhang, H., Chen, S., Chen, Z., Ma, C., et al.: SkyReels-V2: Infinite-length film generative model. arXiv preprint arXiv:2504.13074 (2025)
- [5] Cheng, J., Ma, B., Ren, X., Jin, H.H., Yu, K., Zhang, P., Li, W., Zhou, Y., Zheng, T., Lu, Q.: Phased one-step adversarial equilibrium for video diffusion models. arXiv preprint arXiv:2508.21019 (2025)
- [6] Contributors, L.: Lightx2v: Light video generation inference framework. https://github.com/ModelTC/lightx2v (2025)
- [7] Frans, K., Hafner, D., Levine, S., Abbeel, P.: One step diffusion via shortcut models. arXiv preprint arXiv:2410.12557 (2024)
- [8] Ge, X., Zhang, X., Xu, T., Zhang, Y., Zhang, X., Wang, Y., Zhang, J.: Senseflow: Scaling distribution matching for flow-based text-to-image distillation. arXiv preprint arXiv:2506.00523 (2025)
- [9] Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling. arXiv preprint arXiv:2505.13447 (2025)
- [10] Hastie, T., Tibshirani, R., Friedman, J.H.: The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer (2009)
- [11] He, D., Feng, G., Ge, X., Niu, Y., Zhang, Y., Ma, B., Song, G., Liu, Y., Li, H.: Neighbor GRPO: Contrastive ODE policy optimization aligns flow models. arXiv preprint arXiv:2511.16955 (2025)
- [12] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems 33, 6840–6851 (2020)
- [13] Huang, X., Li, Z., He, G., Zhou, M., Shechtman, E.: Self Forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009 (2025)
- [14] Huang, Y., Ge, X., Gong, R., Lv, C., Zhang, J.: Linvideo: A post-training framework towards O(n) attention in efficient video generation. arXiv preprint arXiv:2510.08318 (2025)
- [17] Huang, Z., He, Y., Yu, J., Zhang, F., Si, C., Jiang, Y., Zhang, Y., Wu, T., Jin, Q., Chanpaisit, N., Wang, Y., Chen, X., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench: Comprehensive benchmark suite for video generative models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (2024)
- [18] Huang, Z., Zhang, F., Xu, X., He, Y., Yu, J., Dong, Z., Ma, Q., Chanpaisit, N., Si, C., Jiang, Y., Wang, Y., Chen, X., Chen, Y.C., Wang, L., Lin, D., Qiao, Y., Liu, Z.: VBench++: Comprehensive and versatile benchmark suite for video generative models. IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)
- [19] Kim, D., Lai, C.H., Liao, W., Murata, N., Takida, Y., Uesaka, T., He, Y., Mitsufuji, Y., Ermon, S.: Consistency trajectory models: Learning probability flow ODE trajectory of diffusion. In: International Conference on Learning Representations, pp. 44493–44525 (2024)
- [20] Lin, S., Xia, X., Ren, Y., Yang, C., Xiao, X., Jiang, L.: Diffusion adversarial post-training for one-step video generation. arXiv preprint arXiv:2501.08316 (2025)
- [21] Lin, S., Yang, C., He, H., Jiang, J., Ren, Y., Xia, X., Zhao, Y., Xiao, X., Jiang, L.: Autoregressive adversarial post-training for real-time interactive video generation. arXiv preprint arXiv:2506.09350 (2025)
- [23] Liu, K., Hu, W., Xu, J., Shan, Y., Lu, S.: Rolling forcing: Autoregressive long video diffusion in real time. arXiv preprint arXiv:2509.25161 (2025)
- [24] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003 (2022)
- [25] Liu, Y., Liu, B., Zhang, Y., Hou, X., Song, G., Liu, Y., You, H.: See further when clear: Curriculum consistency model. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 18103–18112 (2025)
- [26] Lu, Y., Zeng, Y., Li, H., Ouyang, H., Wang, Q., Cheng, K.L., Zhu, J., Cao, H., Zhang, Z., Zhu, X., et al.: Reward forcing: Efficient streaming video generation with rewarded distribution matching distillation. arXiv preprint arXiv:2512.04678 (2025)
- [27] Luo, S., Tan, Y., Huang, L., Li, J., Zhao, H.: Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378 (2023)
- [28] Lv, Z., Si, C., Pan, T., Chen, Z., Wong, K.Y.K., Qiao, Y., Liu, Z.: Dual-expert consistency model for efficient and high-quality video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 14983–14993 (2025)
- [29] Mao, X., Jiang, Z., Wang, F.Y., Zhang, J., Chen, H., Chi, M., Wang, Y., Luo, W.: OSV: One step is enough for high-quality image to video generation. In: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 12585–12594 (2025)
- [30] Nie, W., Berner, J., Ma, N., Liu, C., Xie, S., Vahdat, A.: Transition matching distillation for fast video generation. arXiv preprint arXiv:2601.09881 (2026)
- [31] Ren, Y., Xia, X., Lu, Y., Zhang, J., Wu, J., Xie, P., Wang, X., Xiao, X.: Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686 (2024)
- [32] Sabour, A., Fidler, S., Kreis, K.: Align your flow: Scaling continuous-time flow map distillation. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems
- [33] Song, Y., Dhariwal, P., Chen, M., Sutskever, I.: Consistency models. In: International Conference on Machine Learning, pp. 32211–32252. PMLR (2023)
- [34] Teng, H., Jia, H., Sun, L., Li, L., Li, M., Tang, M., Han, S., Zhang, T., Zhang, W., Luo, W., et al.: MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211 (2025)
- [35] Wan, T., Wang, A., Ai, B., Wen, B., Mao, C., Xie, C.W., Chen, D., Yu, F., Zhao, H., Yang, J., et al.: Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314 (2025)
- [36] Wang, F.Y., Huang, Z., Bergman, A., Shen, D., Gao, P., Lingelbach, M., Sun, K., Bian, W., Song, G., Liu, Y., et al.: Phased consistency models. Advances in Neural Information Processing Systems 37, 83951–84009 (2024)
- [37] Wang, Y., Zhang, H., Xue, T., Qiao, Y., Wang, Y., Xu, C., Chen, X.: Vdot: Efficient unified video creation via optimal transport distillation. arXiv preprint arXiv:2512.06802 (2025)
- [38] Wang, Z., Lu, C., Wang, Y., Bao, F., Li, C., Su, H., Zhu, J.: ProlificDreamer: High-fidelity and diverse text-to-3D generation with variational score distillation. Advances in Neural Information Processing Systems 36, 8406–8441 (2023)
- [39] Yang, S., Huang, W., Chu, R., Xiao, Y., Zhao, Y., Wang, X., Li, M., Xie, E., Chen, Y., Lu, Y., et al.: LongLive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622 (2025)
- [40] Yesiltepe, H., Meral, T.H.S., Akan, A.K., Oktay, K., Yanardag, P.: Infinity-RoPE: Action-controllable infinite video generation emerges from autoregressive self-rollout. arXiv preprint arXiv:2511.20649 (2025)
- [41] Yin, T., Gharbi, M., Park, T., Zhang, R., Shechtman, E., Durand, F., Freeman, B.: Improved distribution matching distillation for fast image synthesis. Advances in Neural Information Processing Systems 37, 47455–47487 (2024)
- [42] Yin, T., Gharbi, M., Zhang, R., Shechtman, E., Durand, F., Freeman, W.T., Park, T.: One-step diffusion with distribution matching distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6613–6623 (2024)
- [43] Yin, T., Zhang, Q., Zhang, R., Freeman, W.T., Durand, F., Shechtman, E., Huang, X.: From slow bidirectional to fast autoregressive video diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 22963–22974 (2025)
- [44] Zhai, Y., Lin, K., Yang, Z., Li, L., Wang, J., Lin, C.C., Doermann, D., Yuan, J., Wang, L.: Motion consistency model: Accelerating video diffusion with disentangled motion-appearance distillation. Advances in Neural Information Processing Systems 37, 111000–111021 (2024)
- [45] Zhao, M., Zhu, H., Wang, Y., Yan, B., Zhang, J., He, G., Yang, L., Li, C., Zhu, J.: Ultravico: Breaking extrapolation limits in video diffusion transformers. arXiv preprint arXiv:2511.20123 (2025)
- [46] Zheng, K., Wang, Y., Ma, Q., Chen, H., Zhang, J., Balaji, Y., Chen, J., Liu, M.Y., Zhu, J., Zhang, Q.: Large scale diffusion distillation via score-regularized continuous-time consistency. arXiv preprint arXiv:2510.08431 (2025)
- [47] Zhu, H., Zhao, M., He, G., Su, H., Li, C., Zhu, J.: Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation. arXiv preprint arXiv:2602.02214 (2026)
discussion (0)