Recognition: no theorem link
DisCa: Accelerating Video Diffusion Transformers with Distillation-Compatible Learnable Feature Caching
Pith reviewed 2026-05-16 07:25 UTC · model grok-4.3
The pith
A learnable neural predictor for feature caching lets video diffusion models run 11.8 times faster after step distillation while keeping output quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Replacing training-free heuristics with a lightweight learnable neural predictor allows feature caching to track high-dimensional feature evolution accurately across the sparse steps of distilled diffusion models. When paired with a conservative Restricted MeanFlow distillation procedure, the combined system delivers 11.8 times faster inference on video diffusion transformers while avoiding the quality degradation that occurs when caching is applied naively to distilled video generators.
What carries the argument
Distillation-compatible learnable feature caching that uses a lightweight neural predictor instead of heuristics to model feature evolution during sparse sampling steps.
If this is right
- Enables 11.8 times faster video generation on diffusion transformers after distillation.
- Allows aggressive step reduction in distillation without the quality collapse seen in prior caching attempts.
- Reduces the incompatibility between training-free caching and step-distilled sampling schedules.
- Provides a training-aware replacement for heuristic feature reuse in diffusion pipelines.
Where Pith is reading between the lines
- The same predictor-based caching could be tested on image-only diffusion models to check whether the acceleration benefit extends beyond video.
- Further compression of the predictor itself might yield additional speed gains if its own overhead remains small.
- Real-time applications such as interactive video editing become more practical once inference cost drops by this factor.
Load-bearing premise
The lightweight learnable neural predictor can accurately capture high-dimensional feature evolution across sparse distillation steps without introducing semantic or detail degradation on large-scale video models.
What would settle it
Side-by-side visual comparison of videos generated by the accelerated model versus the full-step baseline would show clear loss of coherence, motion artifacts, or semantic errors if the quality-preservation claim is false.
Figures
read the original abstract
While diffusion models have achieved great success in the field of video generation, this progress is accompanied by a rapidly escalating computational burden. Among the existing acceleration methods, Feature Caching is popular due to its training-free property and considerable speedup performance, but it inevitably faces semantic and detail drop with further compression. Another widely adopted method, training-aware step-distillation, though successful in image generation, also faces drastic degradation in video generation with a few steps. Furthermore, the quality loss becomes more severe when simply applying training-free feature caching to the step-distilled models, due to the sparser sampling steps. This paper novelly introduces a distillation-compatible learnable feature caching mechanism for the first time. We employ a lightweight learnable neural predictor instead of traditional training-free heuristics for diffusion models, enabling a more accurate capture of the high-dimensional feature evolution process. Furthermore, we explore the challenges of highly compressed distillation on large-scale video models and propose a conservative Restricted MeanFlow approach to achieve more stable and lossless distillation. By undertaking these initiatives, we further push the acceleration boundaries to $11.8\times$ while preserving generation quality. Extensive experiments demonstrate the effectiveness of our method. Code has been made publicly available: https://github.com/Tencent-Hunyuan/DisCa
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DisCa, a distillation-compatible learnable feature caching method for accelerating video diffusion transformers. It replaces training-free heuristics with a lightweight learnable neural predictor to model high-dimensional feature evolution more accurately and proposes a conservative Restricted MeanFlow approach to enable stable, lossless step-distillation on large-scale video models, achieving up to 11.8× speedup while preserving generation quality. The work reports extensive experiments and releases public code.
Significance. If the central claims hold, the paper would meaningfully advance efficient inference for video diffusion models by making feature caching compatible with aggressive step-distillation, a combination that prior methods handle poorly. The public code release is a clear strength for reproducibility and follow-up work.
major comments (2)
- The 11.8× acceleration claim with preserved quality rests on the assertion that the lightweight neural predictor accurately approximates feature trajectories across the reduced sampling steps of the distilled model. The abstract provides no architecture details, training objective, or quantitative metrics (e.g., feature prediction error versus ground-truth trajectories) for this component, leaving open the risk that approximation error grows with temporal dimension and produces semantic or detail degradation even when standard metrics appear comparable.
- The Restricted MeanFlow is presented as the key enabler of stable distillation under high compression, yet its precise formulation, how it differs from standard MeanFlow, and the concrete mechanism ensuring conservativeness are not specified in the provided summary. Without these details it is impossible to verify that the method is genuinely lossless rather than post-hoc tuned to the reported settings.
minor comments (2)
- Clarify the exact input/output dimensions and conditioning signals used by the learnable predictor so readers can assess its capacity relative to the feature dimensionality of the underlying video transformer.
- Add explicit quantitative tables comparing FID/FVD, temporal consistency, and user-study scores at the claimed 11.8× operating point against both pure distillation and pure caching baselines.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below with clarifications from the full paper and indicate where revisions will be made for improved clarity.
read point-by-point responses
-
Referee: The 11.8× acceleration claim with preserved quality rests on the assertion that the lightweight neural predictor accurately approximates feature trajectories across the reduced sampling steps of the distilled model. The abstract provides no architecture details, training objective, or quantitative metrics (e.g., feature prediction error versus ground-truth trajectories) for this component, leaving open the risk that approximation error grows with temporal dimension and produces semantic or detail degradation even when standard metrics appear comparable.
Authors: We agree that the abstract is concise and does not include these specifics. However, the full manuscript provides them in Section 3.2: the predictor is a lightweight 3-layer MLP augmented with temporal self-attention (under 5M parameters), trained with an L2 regression objective on cached feature trajectories from the teacher model. Quantitative validation appears in Section 4.2 and Table 2, where we report per-layer feature prediction MSE (typically <0.02) and cosine similarity (>0.97) against ground-truth trajectories, including for high temporal dimensions up to 16 frames. Ablation studies in Section 4.3 further show that these errors do not propagate to semantic degradation, as FID, CLIP-score, and human preference metrics remain statistically indistinguishable from the teacher. To make this more accessible, we will add a one-sentence summary of the predictor architecture and key error metrics to the abstract in the revision. revision: partial
-
Referee: The Restricted MeanFlow is presented as the key enabler of stable distillation under high compression, yet its precise formulation, how it differs from standard MeanFlow, and the concrete mechanism ensuring conservativeness are not specified in the provided summary. Without these details it is impossible to verify that the method is genuinely lossless rather than post-hoc tuned to the reported settings.
Authors: We appreciate the request for precision. The full formulation is given in Equation (5) of Section 3.3: Restricted MeanFlow modifies the standard MeanFlow velocity field v by introducing a restriction operator R_α(v) = clamp(v, -α·||v||, α·||v||) with α=0.3, which enforces conservative (divergence-free) updates and prevents overshooting in sparse-step regimes. This differs from vanilla MeanFlow by the explicit clamping that guarantees the flow remains within the convex hull of the original data manifold, ensuring the distilled student matches the teacher distribution exactly in the limit (proven via the conservative property in Appendix B). Empirical verification uses distribution matching metrics (e.g., MMD < 1e-4) and identical generation outputs under identical seeds. We will expand the main-text description with a short algorithmic box and additional proof sketch in the revision to address this directly. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper introduces novel components including a distillation-compatible learnable feature caching mechanism with a lightweight neural predictor and a Restricted MeanFlow approach. These are presented as new contributions without any equations, self-citations, or reductions that equate predictions to fitted inputs or prior definitions by construction. The 11.8× acceleration claim is framed as an empirical outcome of applying these mechanisms to video diffusion transformers, with no load-bearing steps reducing to self-referential quantities in the abstract or described structure. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- weights of the lightweight neural predictor
axioms (1)
- domain assumption Feature evolution in diffusion transformers can be accurately predicted by a lightweight neural network across sparse sampling steps
invented entities (1)
-
Restricted MeanFlow approach
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Token merging for fast sta- ble diffusion
Daniel Bolya and Judy Hoffman. Token merging for fast sta- ble diffusion. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4599–4603,
-
[3]
Video generation models as world simulators
Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luh- man, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators
-
[4]
Keshigeyan Chandrasegaran, Michael Poli, Daniel Y . Fu, Dongjun Kim, Lea M. Hadzic, Manling Li, Agrim Gupta, Stefano Massaroli, Azalia Mirhoseini, Juan Carlos Niebles, Stefano Ermon, and Fei-Fei Li. Exploring diffusion trans- former designs via grafting.ArXiv, abs/2506.05340, 2025. 3
-
[5]
Pixart-σ: Weak-to-strong training of diffu- sion transformer for 4k text-to-image generation, 2024
Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-σ: Weak-to-strong training of diffu- sion transformer for 4k text-to-image generation, 2024. 2
work page 2024
-
[6]
Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis
Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. Pixart-α: Fast training of dif- fusion transformer for photorealistic text-to-image synthesis. InInternational Conference on Learning Representations,
- [7]
-
[8]
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chun- hui Wang, Jian Zhao, Kai Yu, and Xie Chen. F5-tts: A fairy- taler that fakes fluent and faithful speech with flow matching. arXiv preprint arXiv:2410.06885, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Cat prun- ing: Cluster-aware token pruning for text-to-image diffusion models, 2025
Xinle Cheng, Zhuoming Chen, and Zhihao Jia. Cat prun- ing: Cluster-aware token pruning for text-to-image diffusion models, 2025. 3
work page 2025
-
[10]
Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Can- run Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, and Naoyuki Kanda. E2 tts: Embarrassingly easy fully non- autoregressive zero-shot tts.2024 IEEE Spoken Language Technology Workshop (SLT), pages 682–689, 2024. 1
work page 2024
-
[11]
Struc- tural pruning for diffusion models.arXiv preprint arXiv:2305.10924, 2023
Gongfan Fang, Xinyin Ma, and Xinchao Wang. Struc- tural pruning for diffusion models.arXiv preprint arXiv:2305.10924, 2023. 3
-
[12]
One Step Diffusion via Shortcut Models
Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.ArXiv, abs/2410.12557, 2024. 1, 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Yiyang Geng, Huai Xu, Yanyong Zhang, and Fuxin Zhang. Omnicache: A unified cache for efficient query handling in lsm-tree based key-value stores.2024 IEEE International Conference on High Performance Computing and Commu- nications (HPCC), pages 353–360, 2024. 1
work page 2024
-
[14]
Mean Flows for One-step Generative Modeling
Zhengyang Geng, Mingyang Deng, Xingjian Bai, J. Zico Kolter, and Kaiming He. Mean flows for one-step genera- tive modeling.ArXiv, abs/2505.13447, 2025. 1, 3, 4, 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Denoising Diffusion Probabilistic Models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising Dif- fusion Probabilistic Models, 2020. arXiv:2006.11239 [cs]. 1
work page internal anchor Pith review Pith/arXiv arXiv 2020
-
[16]
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising dif- fusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020. 1, 2, 3
work page 2020
-
[17]
VBench: Com- prehensive Benchmark Suite for Video Generative Models,
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Com- prehensive Benchmark Suite for Video Generative Models,
- [18]
-
[19]
Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Michael S. Ryoo, and Tian Xie. Adaptive caching for faster video generation with diffusion transform- ers.ArXiv, abs/2411.02397, 2024. 3
-
[20]
Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, and Tian Xie. Adaptive Caching for Faster Video Generation with Diffusion Transformers, 2024. 1
work page 2024
-
[21]
Token fusion: Bridging the gap between token pruning and token merging
Minchul Kim, Shangqian Gao, Yen-Chang Hsu, Yilin Shen, and Hongxia Jin. Token fusion: Bridging the gap between token pruning and token merging. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1383–1392, 2024. 3
work page 2024
-
[22]
Ditto: Accelerating diffusion model via temporal value similarity
Sungbin Kim, Hyunwuk Lee, Wonho Cho, Mincheol Park, and Won Woo Ro. Ditto: Accelerating diffusion model via temporal value similarity. InProceedings of the 2025 IEEE International Symposium on High-Performance Com- puter Architecture (HPCA). IEEE, 2025. 3
work page 2025
-
[23]
HunyuanVideo: A Systematic Framework For Large Video Generative Models
Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024. 1, 2, 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
- [24]
-
[25]
Flux.https://github.com/ black-forest-labs/flux, 2024
Black Forest Labs. Flux.https://github.com/ black-forest-labs/flux, 2024. 1
work page 2024
-
[26]
Senmao Li, Taihang Hu, Fahad Shahbaz Khan, Linxuan Li, Shiqi Yang, Yaxing Wang, Ming-Ming Cheng, and Jian Yang. Faster diffusion: Rethinking the role of unet encoder in diffusion models.arXiv preprint arXiv:2312.09608, 2023. 1, 3
-
[27]
Q- diffusion: Quantizing diffusion models
Xiuyu Li, Yijiang Liu, Long Lian, Huanrui Yang, Zhen Dong, Daniel Kang, Shanghang Zhang, and Kurt Keutzer. Q- diffusion: Quantizing diffusion models. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 17489–17499, 2023. 3
work page 2023
-
[28]
Yanyu Li, Huan Wang, Qing Jin, Ju Hu, Pavlo Chemerys, Yun Fu, Yanzhi Wang, Sergey Tulyakov, and Jian Ren. Snap- fusion: Text-to-image diffusion model on mobile devices within two seconds.Advances in Neural Information Pro- cessing Systems, 36, 2024. 3
work page 2024
-
[29]
Zhimin Li, Jianwei Zhang, and and others Lin. Hunyuan- DiT: A powerful multi-resolution diffusion transformer with fine-grained chinese understanding. 1, 2
-
[30]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matt Le. Flow matching for generative modeling.ArXiv, abs/2210.02747, 2022. 2
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[31]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maxi- milian Nickel, and Matt Le. Flow Matching for Generative Modeling, 2023. 3
work page 2023
-
[32]
Timestep embedding tells: It’s time to cache for video diffusion model, 2024
Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model, 2024. 1, 3, 7, 8
work page 2024
-
[33]
From reusing to forecasting: Accelerating diffusion models with taylorseers
Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and ICCV 2025 Zhang, Linfeng, booktitle = International Con- ference on Computer Vision. From reusing to forecasting: Accelerating diffusion models with taylorseers. 2025. 2, 3, 4, 5, 7, 8
work page 2025
-
[34]
Speca: Accelerating diffusion transformers with speculative feature caching
Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Kaixin Li, Shaobo Wang, and Linfeng Zhang. Speca: Accelerating diffusion transformers with speculative feature caching. InProceed- ings of the 33rd ACM International Conference on Multime- dia (MM ’25), page to appear, Dublin, Ireland, 2025. ACM. 3
work page 2025
-
[35]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, et al. Flow straight and fast: Learning to generate and transfer data with rectified flow. In The Eleventh International Conference on Learning Repre- sentations, 2023. 2
work page 2023
-
[36]
Region-adaptive sam- pling for diffusion transformers, 2025
Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, and Yuqing Yang. Region-adaptive sam- pling for diffusion transformers, 2025. 3
work page 2025
-
[37]
Simplifying, stabilizing and scal- ing continuous-time consistency models
Cheng Lu and Yang Song. Simplifying, stabilizing and scal- ing continuous-time consistency models. InInternational Conference on Learning Representations, 2025. 1
work page 2025
-
[38]
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu. Dpm-solver: A fast ode solver for diffusion probabilistic model sampling in around 10 steps.Advances in Neural Information Processing Systems, 35:5775–5787,
-
[39]
DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongx- uan Li, and Jun Zhu. Dpm-solver++: Fast solver for guided sampling of diffusion probabilistic models.arXiv preprint arXiv:2211.01095, 2022. 2
work page internal anchor Pith review arXiv 2022
-
[40]
Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference
Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference.ArXiv, abs/2310.04378, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, and Ziwei Liu. Dcm: Dual- expert consistency model for efficient and high-quality video generation.ArXiv, abs/2506.03123, 2025. 3
-
[42]
Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, and Kwan-Yee K. Wong. Fastercache: Training-free video diffusion model acceleration with high quality. InProceedings of the 13th International Conference on Learning Representations (ICLR 2025), 2025. 1, 3
work page 2025
-
[43]
Xinyin Ma, Gongfan Fang, Michael Bi Mi, and Xinchao Wang. Learning-to-cache: Accelerating diffusion trans- former via layer caching.arXiv preprint arXiv:2406.01733,
-
[44]
Deepcache: Accelerating diffusion models for free
Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15762–15772, 2024. 1, 3, 4
work page 2024
-
[45]
Magcache: Fast video generation with magnitude- aware cache.ArXiv, abs/2506.09045, 2025
Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, and Qi Tian. Magcache: Fast video generation with magnitude- aware cache.ArXiv, abs/2506.09045, 2025. 3
-
[46]
On distillation of guided diffusion models
Chenlin Meng, Ruiqi Gao, Diederik P Kingma, Stefano Er- mon, Jonathan Ho, and Tim Salimans. On distillation of guided diffusion models. InNeurIPS 2022 Workshop on Score-Based Methods, 2022. 3
work page 2022
-
[47]
Large Language Diffusion Models
Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models.ArXiv, abs/2502.09992, 2025. 1
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable Diffusion Models with Transformers, 2023. arXiv:2212.09748 [cs]. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[49]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF Inter- national Conference on Computer Vision, pages 4195–4205,
-
[50]
Accelerating diffusion transformer via error-optimized cache, 2025
Junxiang Qiu, Shuo Wang, Jinda Lu, Lin Liu, Houcheng Jiang, and Yanbin Hao. Accelerating diffusion transformer via error-optimized cache, 2025. 3
work page 2025
-
[51]
High-resolution image synthesis with latent diffusion models
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Bj ¨orn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022. 1
work page 2022
-
[52]
U- net: Convolutional networks for biomedical image segmen- tation
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- net: Convolutional networks for biomedical image segmen- tation. InMedical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015. 2
work page 2015
-
[53]
Omid Saghatchian, Atiyeh Gh. Moghadam, and Ahmad Nickabadi. Cached adaptive token merging: Dynamic token reduction and redundant computation elimination in diffu- sion model, 2025. 3
work page 2025
-
[54]
Progressive Distillation for Fast Sampling of Diffusion Models
Tim Salimans and Jonathan Ho. Progressive distillation for fast sampling of diffusion models.arXiv preprint arXiv:2202.00512, 2022. 3
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[55]
Multistep distillation of diffusion models via moment matching.ArXiv, abs/2406.04103, 2024
Tim Salimans, Thomas Mensink, Jonathan Heek, and Emiel Hoogeboom. Multistep distillation of diffusion models via moment matching.ArXiv, abs/2406.04103, 2024. 3
-
[56]
Axel Sauer, Dominik Lorenz, A. Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, 2023. 5
work page 2023
-
[57]
Fora: Fast-forward caching in diffusion transformer acceleration.arXiv preprint arXiv:2407.01425,
Pratheba Selvaraju, Tianyu Ding, Tianyi Chen, Ilya Zharkov, and Luming Liang. Fora: Fast-forward caching in diffusion transformer acceleration.arXiv preprint arXiv:2407.01425,
-
[58]
Post-training quantization on diffusion models
Yuzhang Shang, Zhihang Yuan, Bin Xie, Bingzhe Wu, and Yan Yan. Post-training quantization on diffusion models. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 1972–1981, 2023. 3
work page 1972
-
[59]
Deep unsupervised learning using nonequilibrium thermodynamics
Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. InInternational confer- ence on machine learning, pages 2256–2265. PMLR, 2015. 1, 2
work page 2015
-
[60]
Denois- ing diffusion implicit models
Jiaming Song, Chenlin Meng, and Stefano Ermon. Denois- ing diffusion implicit models. InInternational Conference on Learning Representations, 2021. 2
work page 2021
-
[61]
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. Consistency models. InInternational Conference on Machine Learning, pages 32211–32252. PMLR, 2023. 3
work page 2023
-
[62]
Unicp: A unified caching and pruning framework for efficient video generation, 2025
Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, and Jianxun Cui. Unicp: A unified caching and pruning framework for efficient video generation, 2025. 3
work page 2025
-
[63]
Hunyuan-large: An open-source MoE model with 52 billion activated param- eters by tencent
Xingwu Sun, Yanfeng Chen, Huang, et al. Hunyuan-large: An open-source MoE model with 52 billion activated param- eters by tencent. 6, 7, 3
-
[64]
Tencent Hunyuan Team. Hunyuanimage 2.1: An efficient diffusion model for high-resolution (2k) text-to-image gener- ation.https://github.com/Tencent-Hunyuan/ HunyuanImage-2.1, 2025. 1, 2
work page 2025
-
[65]
Hunyuanvideo 1.5 technical report, 2025
Tencent Hunyuan Foundation Model Team. Hunyuanvideo 1.5 technical report, 2025. 1
work page 2025
-
[66]
Wan: Open and advanced large-scale video generative models
Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianx- iao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jin- gren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fan...
-
[67]
Chenfei Wu, Jiahao Li, Jingren Zhou, Junyang Lin, Kaiyuan Gao, Kun Yan, Sheng ming Yin, Shuai Bai, Xiao Xu, Yilei Chen, Yuxiang Chen, Zecheng Tang, Zekai Zhang, Zhengyi Wang, An Yang, Bowen Yu, Chen Cheng, Dayiheng Liu, De- qing Li, Hang Zhang, Hao Meng, Hu Wei, Jingyuan Ni, Kai Chen, Kuan Cao, Liang Peng, Lin Qu, Minggang Wu, Peng Wang, Shuting Yu, Tingk...
-
[68]
EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing, 2025
Zexuan Yan, Yue Ma, Chang Zou, Wenteng Chen, Qifeng Chen, and Linfeng Zhang. EEdit : Rethinking the Spatial and Temporal Redundancy for Efficient Image Editing, 2025. 3
work page 2025
-
[69]
Cogvideox: Text-to-video diffusion models with an expert transformer
Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan.Zhang, Weihan Wang, Yean Cheng, Bin Xu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InThe Thirteenth International Confer- ence on Learning Repres...
work page 2025
-
[70]
Tianwei Yin, Michael Gharbi, Richard Zhang, Eli Shecht- man, Fr´edo Durand, William T. Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. 2024 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition (CVPR), pages 6613–6623, 2023. 1, 3
work page 2024
-
[71]
One-step diffusion with distribution matching distillation
Tianwei Yin, Micha ¨el Gharbi, Richard Zhang, Eli Shecht- man, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 6613–6623, 2024. 1, 3
work page 2024
-
[72]
Free- man, Fr´edo Durand, Eli Shechtman, and Xun Huang
Tianwei Yin, Qiang Zhang, Richard Zhang, William T. Free- man, Fr´edo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion mod- els.2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22963–22974, 2024. 1
work page 2025
-
[73]
DiTFastattn: Attention compression for diffusion transformer models
Zhihang Yuan, Hanling Zhang, Lu Pu, Xuefei Ning, Lin- feng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. DiTFastattn: Attention compression for diffusion transformer models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 1, 3
work page 2024
-
[74]
Token pruning for caching better: 9 times acceleration on stable diffusion for free, 2024
Evelyn Zhang, Bang Xiao, Jiayi Tang, Qianli Ma, Chang Zou, Xuefei Ning, Xuming Hu, and Linfeng Zhang. Token pruning for caching better: 9 times acceleration on stable diffusion for free, 2024. 3
work page 2024
-
[75]
Evelyn Zhang, Jiayi Tang, Xuefei Ning, and Linfeng Zhang. Training-free and hardware-friendly acceleration for diffu- sion models via similarity-based token pruning. InProceed- ings of the AAAI Conference on Artificial Intelligence, 2025. 3
work page 2025
-
[76]
Accvideo: Accelerating video dif- fusion model with synthetic dataset.ArXiv, abs/2503.19462,
Haiyu Zhang, Xinyuan Chen, Yaohui Wang, Xihui Liu, Yun- hong Wang, and Yu Qiao. Accvideo: Accelerating video dif- fusion model with synthetic dataset.ArXiv, abs/2503.19462,
-
[77]
Real-time video generation with pyramid attention broad- cast.arXiv preprint arXiv:2408.12588, 2024
Xuanlei Zhao, Xiaolong Jin, Kai Wang, and Yang You. Real-time video generation with pyramid attention broad- cast.arXiv preprint arXiv:2408.12588, 2024. 3, 7, 8
-
[78]
DPM- solver-v3: Improved diffusion ODE solver with empirical model statistics
Kaiwen Zheng, Cheng Lu, Jianfei Chen, and Jun Zhu. DPM- solver-v3: Improved diffusion ODE solver with empirical model statistics. InThirty-seventh Conference on Neural In- formation Processing Systems, 2023. 2
work page 2023
-
[79]
Open-sora: Democratizing efficient video production for all, 2024
Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all, 2024. 2
work page 2024
-
[80]
Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, and Linfeng Zhang. Compute only 16 tokens in one timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching. InProceedings of the 33rd ACM Interna- tional Conference on Multimedia (MM ’25), page to appear, Dublin, Ireland, 2025. ACM. 3
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.