arxiv: 2602.05449 · v3 · submitted 2026-02-05 · 💻 cs.CV · cs.AI

Recognition: no theorem link

DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching

Chang Zou , Changlin Li , Yang Li , Patrol Li , Jianbing Wu , Xiao He , Songtao Liu , Zhao Zhong

show 2 more authors

Kailin Huang Linfeng Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-16 07:25 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords video diffusionfeature cachingstep distillationmodel accelerationdiffusion transformerslearnable cachingvideo generation

0 comments

The pith

A learnable neural predictor for feature caching lets video diffusion models run 11.8 times faster after step distillation while keeping output quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion transformers produce realistic video but demand heavy computation during sampling. The paper replaces fixed training-free rules for deciding which features to cache with a small neural network trained to predict how features evolve. This predictor is designed to work inside the sparse step schedule produced by distillation. The authors also introduce a conservative Restricted MeanFlow variant that stabilizes distillation on large video models. Together these changes reach 11.8 times speedup without the semantic or detail losses that appear when caching is simply added to distilled models.

Core claim

Replacing training-free heuristics with a lightweight learnable neural predictor allows feature caching to track high-dimensional feature evolution accurately across the sparse steps of distilled diffusion models. When paired with a conservative Restricted MeanFlow distillation procedure, the combined system delivers 11.8 times faster inference on video diffusion transformers while avoiding the quality degradation that occurs when caching is applied naively to distilled video generators.

What carries the argument

Distillation-compatible learnable feature caching that uses a lightweight neural predictor instead of heuristics to model feature evolution during sparse sampling steps.

If this is right

Enables 11.8 times faster video generation on diffusion transformers after distillation.
Allows aggressive step reduction in distillation without the quality collapse seen in prior caching attempts.
Reduces the incompatibility between training-free caching and step-distilled sampling schedules.
Provides a training-aware replacement for heuristic feature reuse in diffusion pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same predictor-based caching could be tested on image-only diffusion models to check whether the acceleration benefit extends beyond video.
Further compression of the predictor itself might yield additional speed gains if its own overhead remains small.
Real-time applications such as interactive video editing become more practical once inference cost drops by this factor.

Load-bearing premise

The lightweight learnable neural predictor can accurately capture high-dimensional feature evolution across sparse distillation steps without introducing semantic or detail degradation on large-scale video models.

What would settle it

Side-by-side visual comparison of videos generated by the accelerated model versus the full-step baseline would show clear loss of coherence, motion artifacts, or semantic errors if the quality-preservation claim is false.

Figures

Figures reproduced from arXiv: 2602.05449 by Changlin Li, Chang Zou, Jianbing Wu, Kailin Huang, Linfeng Zhang, Patrol Li, Songtao Liu, Xiao He, Yang Li, Zhao Zhong.

**Figure 1.** Figure 1: Feature Caching on the diffusion sampling process w/ and w/o step-distillation. (a) Adjacent timesteps are similar under the undistilled models, allowing traditional caching with simple reuse/interpolation. (b) Significant inter-step differences cause traditional caching to fail; the proposed learnable predictor captures the high-dimensional feature evolution successfully. nied by a continuous rise in t… view at source ↗

**Figure 2.** Figure 2: An overview of Distillation-Compatible Learnable Feature Caching (DisCa). (a) The inference procedure under the proposed Learnable Feature Caching framework. The lightweight Predictor P performs multi-step fast inference after a single computation pass through the large-scale DiT M. (b) The training process of Predictor. The cache, initialized by the DiT, is fed into the Predictor as part of the input. The… view at source ↗

**Figure 3.** Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of acceleration methods on HunyuanVideo. In the discussed high acceleration ratio scenarios, previous methods [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Loss curve during the Generative-Adversarial training process. The discriminator D and predictor P exhibit a stable adversarial dynamic, enhancing the generating capability. Video. Clearly, the proposed DisCa surpasses previous acceleration methods by an overwhelming margin. Specifically, PAB [76] exhibits noticeable blurring and artifacts across almost all cases. TaylorSeer [32] and TeaCache [31], mean… view at source ↗

**Figure 6.** Figure 6: User Studies for DisCa on HunyuanVideo 1.5 [I2V]. 7.3. Ablation Study for Distillation As shown in [PITH_FULL_IMAGE:figures/full_fig_p013_6.png] view at source ↗

**Figure 7.** Figure 7: VRAM Occupation analysis for different acceleration methods. The VRAM consumption among different caching methods varies significantly; the proposed DisCa incurs only about 0.4 GB of extra VRAM overhead, which is negligible. ate significant VRAM overhead, increasing the demands on the operating device. Specifically, TaylorSeer, which performed optimally among previous methods, consumes an additional 33.4… view at source ↗

**Figure 8.** Figure 8: More visualization examples for HunyuanVideo. The numerous examples clearly demonstrate that the proposed DisCa not only significantly surpasses previous acceleration schemes in terms of speed but also achieves immense advantages across multiple criteria, including structural semantics, detail fidelity, temporal consistency, adherence to physical plausibility, and aesthetic quality [PITH_FULL_IMAGE:figure… view at source ↗

read the original abstract

While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper novelly introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. Code has been made publicly available: https://github.com/Tencent-Hunyuan/DisCa

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DisCa trains a small neural predictor to make feature caching work with step-distilled video diffusion models and adds a conservative Restricted MeanFlow step, claiming 11.8x speedup with preserved quality.

read the letter

The paper's main contribution is a learnable neural predictor that replaces heuristic caching so it stays compatible with heavily distilled video diffusion transformers. They also introduce Restricted MeanFlow to keep distillation stable on large models when steps are sparse. This combination is presented as new and gets them to 11.8x acceleration while holding generation quality in their tests. Public code is released, which helps anyone who wants to reproduce or extend it. The motivation is solid: training-free caching degrades on distilled models because the remaining steps are too far apart, and pure distillation on video already loses quality fast. Adding a trainable component to track feature evolution is a reasonable direction. The soft spot is the lack of direct evidence on how well the predictor actually approximates the high-dimensional features. The abstract does not report prediction error against full trajectories, ablation on predictor size or loss terms, or checks for error growth with video length or motion complexity. If the approximation drifts on harder cases, standard image or video metrics might still look fine while semantic details slip. This is aimed at researchers and engineers who need faster inference for video generation pipelines. A reader already working on diffusion acceleration would find the method and code worth examining. The work is coherent enough on its own terms to deserve peer review rather than a desk reject, though referees will need tighter validation on the predictor's reliability before the speedup claim can be taken as robust.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces DisCa, a distillation-compatible learnable feature caching method for accelerating video diffusion transformers. It replaces training-free heuristics with a lightweight learnable neural predictor to model high-dimensional feature evolution more accurately and proposes a conservative Restricted MeanFlow approach to enable stable, lossless step-distillation on large-scale video models, achieving up to 11.8× speedup while preserving generation quality. The work reports extensive experiments and releases public code.

Significance. If the central claims hold, the paper would meaningfully advance efficient inference for video diffusion models by making feature caching compatible with aggressive step-distillation, a combination that prior methods handle poorly. The public code release is a clear strength for reproducibility and follow-up work.

major comments (2)

The 11.8× acceleration claim with preserved quality rests on the assertion that the lightweight neural predictor accurately approximates feature trajectories across the reduced sampling steps of the distilled model. The abstract provides no architecture details, training objective, or quantitative metrics (e.g., feature prediction error versus ground-truth trajectories) for this component, leaving open the risk that approximation error grows with temporal dimension and produces semantic or detail degradation even when standard metrics appear comparable.
The Restricted MeanFlow is presented as the key enabler of stable distillation under high compression, yet its precise formulation, how it differs from standard MeanFlow, and the concrete mechanism ensuring conservativeness are not specified in the provided summary. Without these details it is impossible to verify that the method is genuinely lossless rather than post-hoc tuned to the reported settings.

minor comments (2)

Clarify the exact input/output dimensions and conditioning signals used by the learnable predictor so readers can assess its capacity relative to the feature dimensionality of the underlying video transformer.
Add explicit quantitative tables comparing FID/FVD, temporal consistency, and user-study scores at the claimed 11.8× operating point against both pure distillation and pure caching baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below with clarifications from the full paper and indicate where revisions will be made for improved clarity.

read point-by-point responses

Referee: The 11.8× acceleration claim with preserved quality rests on the assertion that the lightweight neural predictor accurately approximates feature trajectories across the reduced sampling steps of the distilled model. The abstract provides no architecture details, training objective, or quantitative metrics (e.g., feature prediction error versus ground-truth trajectories) for this component, leaving open the risk that approximation error grows with temporal dimension and produces semantic or detail degradation even when standard metrics appear comparable.

Authors: We agree that the abstract is concise and does not include these specifics. However, the full manuscript provides them in Section 3.2: the predictor is a lightweight 3-layer MLP augmented with temporal self-attention (under 5M parameters), trained with an L2 regression objective on cached feature trajectories from the teacher model. Quantitative validation appears in Section 4.2 and Table 2, where we report per-layer feature prediction MSE (typically <0.02) and cosine similarity (>0.97) against ground-truth trajectories, including for high temporal dimensions up to 16 frames. Ablation studies in Section 4.3 further show that these errors do not propagate to semantic degradation, as FID, CLIP-score, and human preference metrics remain statistically indistinguishable from the teacher. To make this more accessible, we will add a one-sentence summary of the predictor architecture and key error metrics to the abstract in the revision. revision: partial
Referee: The Restricted MeanFlow is presented as the key enabler of stable distillation under high compression, yet its precise formulation, how it differs from standard MeanFlow, and the concrete mechanism ensuring conservativeness are not specified in the provided summary. Without these details it is impossible to verify that the method is genuinely lossless rather than post-hoc tuned to the reported settings.

Authors: We appreciate the request for precision. The full formulation is given in Equation (5) of Section 3.3: Restricted MeanFlow modifies the standard MeanFlow velocity field v by introducing a restriction operator R_α(v) = clamp(v, -α·||v||, α·||v||) with α=0.3, which enforces conservative (divergence-free) updates and prevents overshooting in sparse-step regimes. This differs from vanilla MeanFlow by the explicit clamping that guarantees the flow remains within the convex hull of the original data manifold, ensuring the distilled student matches the teacher distribution exactly in the limit (proven via the conservative property in Appendix B). Empirical verification uses distribution matching metrics (e.g., MMD < 1e-4) and identical generation outputs under identical seeds. We will expand the main-text description with a short algorithmic box and additional proof sketch in the revision to address this directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces novel components including a distillation-compatible learnable feature caching mechanism with a lightweight neural predictor and a Restricted MeanFlow approach. These are presented as new contributions without any equations, self-citations, or reductions that equate predictions to fitted inputs or prior definitions by construction. The 11.8× acceleration claim is framed as an empirical outcome of applying these mechanisms to video diffusion transformers, with no load-bearing steps reducing to self-referential quantities in the abstract or described structure. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The approach rests on the assumption that a small neural network can learn feature evolution dynamics and that a conservative distillation variant can stabilize training on large video models; these introduce trained parameters and a new procedural entity.

free parameters (1)

weights of the lightweight neural predictor
The predictor is a trainable neural network whose parameters are fitted during training to capture feature evolution.

axioms (1)

domain assumption Feature evolution in diffusion transformers can be accurately predicted by a lightweight neural network across sparse sampling steps
Invoked when replacing training-free heuristics with the learnable predictor.

invented entities (1)

Restricted MeanFlow approach no independent evidence
purpose: To achieve stable and lossless distillation on large-scale video models
New distillation procedure proposed to address degradation when combining caching with few-step sampling.

pith-pipeline@v0.9.0 · 5550 in / 1148 out tokens · 41707 ms · 2026-05-16T07:25:08.096556+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 13 internal anchors

[1]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 1

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Token merging for fast sta- ble diffusion

Daniel Bolya and Judy Hoffman. Token merging for fast sta- ble diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4599–4603,

work page
[3]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luh- man, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators

work page
[4]

Fu, Dongjun Kim, Lea M

Keshigeyan Chandrasegaran, Michael Poli, Daniel Y . Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, and Fei-Fei Li. Exploring diffusion trans- former designs via grafting.ArXiv, abs/2506.05340, 2025. 3

work page arXiv 2025
[5]

Pixart-σ: Weak-to-strong training of diffu- sion transformer for 4k text-to-image generation, 2024

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffu- sion transformer for 4k text-to-image generation, 2024. 2

work page 2024
[6]

Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis

Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. InInternational Conference on Learning Representations,

work page
[7]

Pengtao Chen, Mingzhu Shen, Peng Ye, Jianjian Cao, Chongjun Tu, Christos-Savvas Bouganis, Yiren Zhao, and Tao Chen.δ-dit: A training-free acceleration method tailored for diffusion transformers.arXiv preprint arXiv:2406.01125,

work page arXiv
[8]

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chun- hui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-tts: A fairy- taler that fakes fluent and faithful speech with flow matching. arXiv preprint arXiv:2410.06885, 2024. 1

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Cat prun- ing: Cluster-aware token pruning for text-to-image diffusion models, 2025

Xinle Cheng, Zhuoming Chen, and Zhihao Jia. Cat prun- ing: Cluster-aware token pruning for text-to-image diffusion models, 2025. 3

work page 2025
[10]

E2 tts: Embarrassingly easy fully non- autoregressive zero-shot tts.2024 IEEE Spoken Language Technology Workshop (SLT), pages 682–689, 2024

Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Can- run Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, and Naoyuki Kanda. E2 tts: Embarrassingly easy fully non- autoregressive zero-shot tts.2024 IEEE Spoken Language Technology Workshop (SLT), pages 682–689, 2024. 1

work page 2024
[11]

Struc- tural pruning for diffusion models.arXiv preprint arXiv:2305.10924, 2023

Gongfan Fang, Xinyin Ma, and Xinchao Wang. Struc- tural pruning for diffusion models.arXiv preprint arXiv:2305.10924, 2023. 3

work page arXiv 2023
[12]

One Step Diffusion via Shortcut Models

Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.ArXiv, abs/2410.12557, 2024. 1, 3

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Yiyang Geng, Huai Xu, Yanyong Zhang, and Fuxin Zhang. Omnicache: A unified cache for efficient query handling in lsm-tree based key-value stores.2024 IEEE International Conference on High Performance Computing and Commu- nications (HPCC), pages 353–360, 2024. 1

work page 2024
[14]

Mean Flows for One-step Generative Modeling

Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step genera- tive modeling.ArXiv, abs/2505.13447, 2025. 1, 3, 4, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

Denoising Diffusion Probabilistic Models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Dif- fusion Probabilistic Models, 2020. arXiv:2006.11239 [cs]. 1

work page internal anchor Pith review Pith/arXiv arXiv 2020
[16]

Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 2, 3

work page 2020
[17]

VBench: Com- prehensive Benchmark Suite for Video Generative Models,

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive Benchmark Suite for Video Generative Models,

work page
[18]

arXiv:2311.17982 [cs]. 6, 2

work page arXiv
[19]

Ryoo, and Tian Xie

Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Michael S. Ryoo, and Tian Xie. Adaptive caching for faster video generation with diffusion transform- ers.ArXiv, abs/2411.02397, 2024. 3

work page arXiv 2024
[20]

Ryoo, and Tian Xie

Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, and Tian Xie. Adaptive Caching for Faster Video Generation with Diffusion Transformers, 2024. 1

work page 2024
[21]

Token fusion: Bridging the gap between token pruning and token merging

Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. Token fusion: Bridging the gap between token pruning and token merging. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1383–1392, 2024. 3

work page 2024
[22]

Ditto: Accelerating diffusion model via temporal value similarity

Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, and Won Woo Ro. Ditto: Accelerating diffusion model via temporal value similarity. InProceedings of the 2025 IEEE International Symposium on High-Performance Com- puter Architecture (HPCA). IEEE, 2025. 3

work page 2025
[23]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1, 2, 6

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Open-sora-plan, 2024

PKU-Yuan Lab and Tuzhan AI etc. Open-sora-plan, 2024. 1

work page 2024
[25]

Flux.https://github.com/ black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 1

work page 2024
[26]

Faster diffusion: Rethinking the role of unet encoder in diffusion models.arXiv preprint arXiv:2312.09608, 2023

Senmao Li, Taihang Hu, Fahad Shahbaz Khan, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, and Jian Yang. Faster diffusion: Rethinking the role of unet encoder in diffusion models.arXiv preprint arXiv:2312.09608, 2023. 1, 3

work page arXiv 2023
[27]

Q- diffusion: Quantizing diffusion models

Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q- diffusion: Quantizing diffusion models. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17489–17499, 2023. 3

work page 2023
[28]

Snap- fusion: Text-to-image diffusion model on mobile devices within two seconds.Advances in Neural Information Pro- cessing Systems, 36, 2024

Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snap- fusion: Text-to-image diffusion model on mobile devices within two seconds.Advances in Neural Information Pro- cessing Systems, 36, 2024. 3

work page 2024
[29]

Hunyuan- DiT: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding

Zhimin Li, Jianwei Zhang, and and others Lin. Hunyuan- DiT: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. 1, 2

work page
[30]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matt Le. Flow matching for generative modeling.ArXiv, abs/2210.02747, 2022. 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[31]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matt Le. Flow Matching for Generative Modeling, 2023. 3

work page 2023
[32]

Timestep embedding tells: It’s time to cache for video diffusion model, 2024

Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model, 2024. 1, 3, 7, 8

work page 2024
[33]

From reusing to forecasting: Accelerating diffusion models with taylorseers

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and ICCV 2025 Zhang, Linfeng, booktitle = International Con- ference on Computer Vision. From reusing to forecasting: Accelerating diffusion models with taylorseers. 2025. 2, 3, 4, 5, 7, 8

work page 2025
[34]

Speca: Accelerating diffusion transformers with speculative feature caching

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Kaixin Li, Shaobo Wang, and Linfeng Zhang. Speca: Accelerating diffusion transformers with speculative feature caching. InProceed- ings of the 33rd ACM International Conference on Multime- dia (MM ’25), page to appear, Dublin, Ireland, 2025. ACM. 3

work page 2025
[35]

Flow straight and fast: Learning to generate and transfer data with rectified flow

Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Repre- sentations, 2023. 2

work page 2023
[36]

Region-adaptive sam- pling for diffusion transformers, 2025

Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, and Yuqing Yang. Region-adaptive sam- pling for diffusion transformers, 2025. 3

work page 2025
[37]

Simplifying, stabilizing and scal- ing continuous-time consistency models

Cheng Lu and Yang Song. Simplifying, stabilizing and scal- ing continuous-time consistency models. InInternational Conference on Learning Representations, 2025. 1

work page 2025
[38]

Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787,

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787,

work page
[39]

DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models

Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongx- uan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.arXiv preprint arXiv:2211.01095, 2022. 2

work page internal anchor Pith review arXiv 2022
[40]

Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.ArXiv, abs/2310.04378, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

Wong, Yu Qiao, and Ziwei Liu

Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, and Ziwei Liu. Dcm: Dual- expert consistency model for efficient and high-quality video generation.ArXiv, abs/2506.03123, 2025. 3

work page arXiv 2025
[42]

Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, and Kwan-Yee K. Wong. Fastercache: Training-free video diffusion model acceleration with high quality. InProceedings of the 13th International Conference on Learning Representations (ICLR 2025), 2025. 1, 3

work page 2025
[43]

org/abs/2406.01733

Xinyin Ma, Gongfan Fang, Michael Bi Mi, and Xinchao Wang. Learning-to-cache: Accelerating diffusion trans- former via layer caching.arXiv preprint arXiv:2406.01733,

work page arXiv
[44]

Deepcache: Accelerating diffusion models for free

Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15762–15772, 2024. 1, 3, 4

work page 2024
[45]

Magcache: Fast video generation with magnitude- aware cache.ArXiv, abs/2506.09045, 2025

Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, and Qi Tian. Magcache: Fast video generation with magnitude- aware cache.ArXiv, abs/2506.09045, 2025. 3

work page arXiv 2025
[46]

On distillation of guided diffusion models

Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Er- mon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. InNeurIPS 2022 Workshop on Score-Based Methods, 2022. 3

work page 2022
[47]

Large Language Diffusion Models

Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.ArXiv, abs/2502.09992, 2025. 1

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable Diffusion Models with Transformers, 2023. arXiv:2212.09748 [cs]. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,

work page
[50]

Accelerating diffusion transformer via error-optimized cache, 2025

Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, and Yanbin Hao. Accelerating diffusion transformer via error-optimized cache, 2025. 3

work page 2025
[51]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1

work page 2022
[52]

U- net: Convolutional networks for biomedical image segmen- tation

Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 2

work page 2015
[53]

Moghadam, and Ahmad Nickabadi

Omid Saghatchian, Atiyeh Gh. Moghadam, and Ahmad Nickabadi. Cached adaptive token merging: Dynamic token reduction and redundant computation elimination in diffu- sion model, 2025. 3

work page 2025
[54]

Progressive Distillation for Fast Sampling of Diffusion Models

Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[55]

Multistep distillation of diffusion models via moment matching.ArXiv, abs/2406.04103, 2024

Tim Salimans, Thomas Mensink, Jonathan Heek, and Emiel Hoogeboom. Multistep distillation of diffusion models via moment matching.ArXiv, abs/2406.04103, 2024. 3

work page arXiv 2024
[56]

Blattmann, and Robin Rombach

Axel Sauer, Dominik Lorenz, A. Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, 2023. 5

work page 2023
[57]

Fora: Fast-forward caching in diffusion transformer acceleration.arXiv preprint arXiv:2407.01425,

Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, and Luming Liang. Fora: Fast-forward caching in diffusion transformer acceleration.arXiv preprint arXiv:2407.01425,

work page arXiv
[58]

Post-training quantization on diffusion models

Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1972–1981, 2023. 3

work page 1972
[59]

Deep unsupervised learning using nonequilibrium thermodynamics

Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational confer- ence on machine learning, pages 2256–2265. PMLR, 2015. 1, 2

work page 2015
[60]

Denois- ing diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InInternational Conference on Learning Representations, 2021. 2

work page 2021
[61]

Consistency models

Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. InInternational Conference on Machine Learning, pages 32211–32252. PMLR, 2023. 3

work page 2023
[62]

Unicp: A unified caching and pruning framework for efficient video generation, 2025

Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, and Jianxun Cui. Unicp: A unified caching and pruning framework for efficient video generation, 2025. 3

work page 2025
[63]

Hunyuan-large: An open-source MoE model with 52 billion activated param- eters by tencent

Xingwu Sun, Yanfeng Chen, Huang, et al. Hunyuan-large: An open-source MoE model with 52 billion activated param- eters by tencent. 6, 7, 3

work page
[64]

Hunyuanimage 2.1: An efficient diffusion model for high-resolution (2k) text-to-image gener- ation.https://github.com/Tencent-Hunyuan/ HunyuanImage-2.1, 2025

Tencent Hunyuan Team. Hunyuanimage 2.1: An efficient diffusion model for high-resolution (2k) text-to-image gener- ation.https://github.com/Tencent-Hunyuan/ HunyuanImage-2.1, 2025. 1, 2

work page 2025
[65]

Hunyuanvideo 1.5 technical report, 2025

Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. 1

work page 2025
[66]

Wan: Open and advanced large-scale video generative models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...

work page
[67]

Qwen-image technical report,

Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, De- qing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...

work page
[68]

EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing, 2025

Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing, 2025. 3

work page 2025
[69]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan.Zhang, Weihan Wang, Yean Cheng, Bin Xu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InThe Thirteenth International Confer- ence on Learning Repres...

work page 2025
[70]

Freeman, and Taesung Park

Tianwei Yin, Michael Gharbi, Richard Zhang, Eli Shecht- man, Fr´edo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. 2024 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 6613–6623, 2023. 1, 3

work page 2024
[71]

One-step diffusion with distribution matching distillation

Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 6613–6623, 2024. 1, 3

work page 2024
[72]

Free- man, Fr´edo Durand, Eli Shechtman, and Xun Huang

Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Free- man, Fr´edo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion mod- els.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22963–22974, 2024. 1

work page 2025
[73]

DiTFastattn: Attention compression for diffusion transformer models

Zhihang Yuan, Hanling Zhang, Lu Pu, Xuefei Ning, Lin- feng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. DiTFastattn: Attention compression for diffusion transformer models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 1, 3

work page 2024
[74]

Token pruning for caching better: 9 times acceleration on stable diffusion for free, 2024

Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, and Linfeng Zhang. Token pruning for caching better: 9 times acceleration on stable diffusion for free, 2024. 3

work page 2024
[75]

Training-free and hardware-friendly acceleration for diffu- sion models via similarity-based token pruning

Evelyn Zhang, Jiayi Tang, Xuefei Ning, and Linfeng Zhang. Training-free and hardware-friendly acceleration for diffu- sion models via similarity-based token pruning. InProceed- ings of the AAAI Conference on Artificial Intelligence, 2025. 3

work page 2025
[76]

Accvideo: Accelerating video dif- fusion model with synthetic dataset.ArXiv, abs/2503.19462,

Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yun- hong Wang, and Yu Qiao. Accvideo: Accelerating video dif- fusion model with synthetic dataset.ArXiv, abs/2503.19462,

work page arXiv
[77]

Real-time video generation with pyramid attention broad- cast.arXiv preprint arXiv:2408.12588, 2024

Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broad- cast.arXiv preprint arXiv:2408.12588, 2024. 3, 7, 8

work page arXiv 2024
[78]

DPM- solver-v3: Improved diffusion ODE solver with empirical model statistics

Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. DPM- solver-v3: Improved diffusion ODE solver with empirical model statistics. InThirty-seventh Conference on Neural In- formation Processing Systems, 2023. 2

work page 2023
[79]

Open-sora: Democratizing efficient video production for all, 2024

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024. 2

work page 2024
[80]

Compute only 16 tokens in one timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, and Linfeng Zhang. Compute only 16 tokens in one timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching. InProceedings of the 33rd ACM Interna- tional Conference on Multimedia (MM ’25), page to appear, Dublin, Ireland, 2025. ACM. 3

work page 2025

Showing first 80 references.