pith. machine review for the scientific record.

arxiv: 2604.15911 · v1 · submitted 2026-04-17 · 💻 cs.CV

Recognition: unknown

Efficient Video Diffusion Models: Advancements and Challenges

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords video diffusion · efficient models · attention · categorization · challenges · directions

The pith

A survey that groups efficient video diffusion methods into four paradigms—step distillation, efficient attention, model compression, and cache/trajectory optimization—and outlines open challenges for practical use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion models generate videos by starting with random noise and repeatedly cleaning it up over many steps, but this uses enormous computing power because each step processes both space and time. The paper organizes ways to speed this up into four categories: cutting the total number of cleaning steps, making the internal attention calculations cheaper, shrinking the overall model, and reusing previous calculations through caching or smarter paths. It examines how these choices reduce either the count of steps or the work per step, then flags remaining issues like maintaining quality when combining speedups and the need for better hardware support.
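
To make the division concrete, here is a minimal sketch of the denoising loop that all four paradigms target, written in plain PyTorch. Every name in it (the sampler, the schedule, the toy stand-in denoiser) is a hypothetical placeholder for illustration, not any surveyed method's API.

```python
import torch

@torch.no_grad()
def sample_video(denoiser, num_steps, shape):
    """Euler-style denoising loop; denoiser(x, sigma) predicts the noise in x."""
    sigmas = torch.linspace(1.0, 0.0, num_steps + 1)  # noise level per step
    x = torch.randn(shape) * sigmas[0]                # (frames, C, H, W): pure noise
    for i in range(num_steps):
        # (1) Step distillation shrinks num_steps, cutting the loop count (NFE).
        # (2) Efficient attention and (3) model compression make this call cheaper.
        eps = denoiser(x, sigmas[i])
        # (4) Cache/trajectory methods reuse parts of eps across nearby steps
        #     when features barely change, skipping redundant work.
        x = x + (sigmas[i + 1] - sigmas[i]) * eps     # move toward sigma = 0
    return x

# Toy usage: a closed-form stand-in for a video diffusion transformer.
video = sample_video(lambda x, s: x / (s ** 2 + 1.0).sqrt(),
                     num_steps=8, shape=(16, 4, 32, 32))
print(video.shape)  # torch.Size([16, 4, 32, 32])
```

The loop makes the survey's two cost levers visible: the number of iterations (what step distillation attacks) and the cost of each `denoiser` call (what the other three paradigms attack).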

Core claim

To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.

Load-bearing premise

That the proposed four-class categorization (step distillation, efficient attention, model compression, cache/trajectory optimization) covers all relevant methods completely and without bias, with no significant omissions or overlaps.

Figures

Figures reproduced from arXiv: 2604.15911 by James Kwok, Lichen Bai, Pengfei Wan, Shitong Shao, Zeke Xie.

Figure 1: Left: Distribution of literature across various accelerated sampling algorithms for video diffusion models. Middle: Publication trends and adoption growth of accelerated sampling algorithms for video diffusion models (2022–2026). Right: Comparative growth trends of accelerated sampling algorithms in image versus video diffusion tasks (2022–2026).

Figure 2: Conceptual illustration of efficient video diffusion generation. The main methods are organized into four major …

Figure 3: Overview of step distillation for accelerated video diffusion. The paradigm reduces NFE by distilling multi-step …

Figure 4: Overview of the Self-Forcing algorithm. This framework serves as the foundation for various real-time video generation …

Figure 5: Overview of efficient attention for video diffusion acceleration. The methods reduce per-step overhead via dy…

Figure 6: Overview of representative static attention masks used by different methods under a common illustrative setup with …

Figure 7: Overview of model compression for accelerated video diffusion. The figure highlights quantization-aware training …

Figure 8: Overview of cache and trajectory optimization methods for video diffusion acceleration. The framework integrates …
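
Figure 6's static masks are easy to picture in code: the sparsity pattern is fixed ahead of time from token positions alone, independent of the input. Below is a minimal sketch of one such pattern (each token attends only within a temporal window of nearby frames), assuming a flattened frame-major token layout; it illustrates the general idea, not any specific method in the figure.

```python
import torch

def static_temporal_window_mask(num_frames, tokens_per_frame, window=1):
    """Boolean attention mask fixed ahead of time: a token may attend to tokens
    whose frame index is at most `window` frames away (its own frame included)."""
    n = num_frames * tokens_per_frame
    frame_id = torch.arange(n) // tokens_per_frame          # frame index per token
    allowed = (frame_id[:, None] - frame_id[None, :]).abs() <= window
    return allowed                                          # (n, n) bool

mask = static_temporal_window_mask(num_frames=8, tokens_per_frame=4, window=1)
# Fraction of attention entries kept, versus 1.0 for dense attention:
print(mask.float().mean().item())  # ~0.34 for this toy setup
```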
Original abstract

Video diffusion models have rapidly become the dominant paradigm for high-fidelity generative video synthesis, but their practical deployment remains constrained by severe inference costs. Compared with image generation, video synthesis compounds computation across spatial-temporal token growth and iterative denoising, making attention and memory traffic major bottlenecks in real-world settings. This survey provides a systematic and deployment-oriented review of efficient video diffusion models. We propose a unified categorization that organizes existing methods into four classes of main paradigms, including step distillation, efficient attention, model compression, and cache/trajectory optimization. Building on this categorization, we respectively analyze algorithmic trends of these four paradigms and examine how different design choices target two core objectives: reducing the number of function evaluations and minimizing per-step overhead. Finally, we discuss open challenges and future directions, including quality preservation under composite acceleration, hardware-software co-design, robust real-time long-horizon generation, and open infrastructure for standardized evaluation. To the best of our knowledge, our work is the first comprehensive survey on efficient video diffusion models, offering researchers and engineers a structured overview of the field and its emerging research directions.
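
The abstract's two objectives compose multiplicatively: total sampling cost is roughly the number of function evaluations times the work per evaluation. A back-of-envelope sketch makes the composition explicit; the 50-step baseline, 40 TFLOPs per step, and the reduced 12-TFLOPs step are all made-up illustrative numbers, not figures from the paper.

```python
# Cost model implied by the abstract: total cost ~ NFE x per-step work.
def total_cost(nfe, per_step_tflops):
    return nfe * per_step_tflops

baseline  = total_cost(nfe=50, per_step_tflops=40.0)  # vanilla sampler
distilled = total_cost(nfe=4,  per_step_tflops=40.0)  # step distillation cuts NFE
sparse    = total_cost(nfe=50, per_step_tflops=12.0)  # cheaper steps (e.g. sparse attention)
combined  = total_cost(nfe=4,  per_step_tflops=12.0)  # composite acceleration
print(baseline, distilled, sparse, combined)          # 2000.0 160.0 600.0 48.0
```

The `combined` row is where the survey's open challenge of "quality preservation under composite acceleration" lives: the speedups multiply, but so can their approximation errors.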

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Circularity Check

0 steps flagged

No circularity: survey paper with no derivations or self-referential claims

full rationale

This is a survey paper that organizes existing literature on efficient video diffusion models into four proposed categories (step distillation, efficient attention, model compression, cache/trajectory optimization). It contains no original equations, derivations, fitted parameters, predictions, or mathematical results. The central claim is that the work is the first comprehensive survey, which is a statement of scope and novelty rather than a derived quantity. No self-citations are load-bearing for any result, and the categorization is an organizational framework applied to external methods, not a reduction to the paper's own inputs. The paper is self-contained as a review and scores 0 on circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical derivations, free parameters, axioms, or invented entities; it relies entirely on prior published work in diffusion models for its reviewed methods.

pith-pipeline@v0.9.0 · 5491 in / 986 out tokens · 48394 ms · 2026-05-10T08:25:24.236182+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Exploring Data-Free LoRA Transferability for Video Diffusion Models

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    CASA uses spectral density to arbitrate between preserving the target model's manifold and restoring LoRA alignment, mitigating style degradation and structural collapse in distilled video diffusion models.

Reference graph

Works this paper leans on

247 extracted references · 219 canonical work pages · cited by 1 Pith paper · 20 internal anchors

  1. [4]

    Ganesh Bikshandi, Tri Dao, Pradeep Ramani, Jay Shah, Vijay Thakkar, and Ying Zhang. 2024. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. In Advances in Neural Information Processing Systems 37. Neural Information Processing Systems Foundation, Inc. (NeurIPS), 68658–68685. https://doi.org/10.52202/079017-2193

  2. [16]

    Siyan Chen, Yanfei Chen, Ying Chen, Zhuo Chen, Feng Cheng, Xuyan Chi, Jian Cong, Qinpeng Cui, Qide Dong, Junliang Fan, et al.

  3. [17]

    Seedance 1.5 pro: A Native Audio-Visual Joint Generation Foundation Model. arXiv:2512.13507 https://arxiv.org/abs/2512.13507

  4. [28]

    Juechu Dong, Boyuan Feng, Driss Guessous, Yanbo Liang, and Horace He. 2024. Flex attention: A programming model for generating optimized attention kernels. arXiv:2412.05496 https://arxiv.org/abs/2412.05496

  5. [29]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations. OpenReview.net. https://openreview.net...

  6. [43]

    Zhengyang Geng, Mingyang Deng, Xingjian Bai, J Zico Kolter, and Kaiming He. 2025. Mean flows for one-step generative modeling. arXiv:2505.13447 https://arxiv.org/abs/2505.13447

  7. [49]

    Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. 2024. Block Sparse Attention. https://github.com/mit-han-lab/Block-Sparse-Attention

  8. [50]

    Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. 2026. LTX-2: Efficient Joint Audio-Visual Foundation Model. arXiv:2601.03233 https://arxiv.org/abs/2601.03233

  9. [51]

    Hao-AI-Lab. 2025. FastVideo. https://github.com/hao-ai-lab/FastVideo/tree/main

  10. [57]

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. 2024. VBench: Comprehensive Benchmark Suite for Video Generative Models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)...

  11. [58]

    Ziqi Huang, Fan Zhang, Xiaojie Xu, Yinan He, Jiashuo Yu, Ziyue Dong, Qianli Ma, Nattapol Chanpaisit, Chenyang Si, Yuming Jiang, Yaohui Wang, Xinyuan Chen, Ying-Cong Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. 2026. VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models. IEEE Transactions on Pattern Analysis and Machine Intelligence...

  12. [59]

    Sihui Ji, Xi Chen, Shuai Yang, Xin Tao, Pengfei Wan, and Hengshuang Zhao. 2025. MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives. arXiv:2512.14699 https://arxiv.org/abs/2512.14699

  13. [60]

    Jiaxiu Jiang, Wenbo Li, Jingjing Ren, Yuping Qiu, Yong Guo, Xiaogang Xu, Han Wu, and Wangmeng Zuo. 2025. LoViC: Efficient Long Video Generation with Context Compression. arXiv:2507.12952 https://arxiv.org/abs/2507.12952

  14. [61]

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. 2024. Pyramidal Flow Matching for Efficient Video Generative Modeling. arXiv:2410.05954 https://arxiv.org/abs/2410.05954

  15. [63]

    Kumara Kahatapitiya, Haozhe Liu, Sen He, Ding Liu, Menglin Jia, Chenyang Zhang, Michael S. Ryoo, and Tian Xie. 2024. Adaptive Caching for Faster Video Generation with Diffusion Transformers. arXiv:2411.02397 https://arxiv.org/abs/2411.02397

  16. [64]

    Tero Karras, Samuli Laine, and Timo Aila. 2019. A Style-Based Generator Architecture for Generative Adversarial Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 4396–4405. https://doi.org/10.1109/cvpr.2019.00453

  17. [65]

    Bosung Kim, Kyuhwan Lee, Isu Jeong, Jungmin Cheon, Yeojin Lee, and Seulki Lee. 2025. On-device Sora: Enabling Training-Free Diffusion-based Text-to-Video Generation for Mobile Devices. arXiv:2502.04363 https://arxiv.org/abs/2502.04363

  18. [66]

    Jisoo Kim, Wooseok Seo, Junwan Kim, Seungho Park, Sooyeon Park, and Youngjae Yu. 2025. Vip: Iterative online preference distillation for efficient video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 17235–17245.

  19. [67]

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. 2024. HunyuanVideo: A Systematic Framework For Large Video Generative Models. arXiv:2412.03603 https://arxiv.org/abs/2412.03603

  20. [68]

    Xin Kong, Daniel Watson, Yannick Strümpler, Michael Niemeyer, and Federico Tombari. 2025. CausNVS: Autoregressive Multi-view Diffusion for Flexible 3D Novel View Synthesis. arXiv:2509.06579 https://arxiv.org/abs/2509.06579

  21. [69]

    Black Forest Labs. 2024. FLUX. https://blackforestlabs.ai/

  22. [70]

    Kunyang Li, Mubarak Shah, and Yuzhang Shang. 2026. PackCache: A Training-Free Acceleration Method for Unified Autoregressive Video Generation via Compact KV-Cache. arXiv:2601.04359 https://arxiv.org/abs/2601.04359

  23. [71]

    Wuyang Li, Wentao Pan, Po-Chien Luan, Yang Gao, and Alexandre Alahi. 2025. Stable video infinity: Infinite-length video generation with error recycling. arXiv:2510.09212 https://arxiv.org/abs/2510.09212

  24. [72]

    Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, and Song Han. 2025. Radial Attention: O(n log n) Sparse Attention with Energy Decay for Long Video Generation. arXiv:2506.19852 https://arxiv.org/abs/2506.19852

  25. [73]

    Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. 2025. ARLON: Boosting Diffusion Transformers with Autoregressive Models for Long Video Generation. arXiv:2410.20502 https://arxiv.org/abs/2410.20502

  26. [74]

    Zhiteng Li, Hanxuan Li, Junyi Wu, Kai Liu, Haotong Qin, Linghe Kong, Guihai Chen, Yulun Zhang, and Xiaokang Yang. 2025. DVD-Quant: Data-free Video Diffusion Transformers Quantization. arXiv:2505.18663 https://arxiv.org/abs/2505.18663

  27. [75]

    Zongjian Li, Bin Lin, Yang Ye, Liuhan Chen, Xinhua Cheng, Shenghai Yuan, and Li Yuan. 2024. WF-VAE: Enhancing Video VAE by Wavelet-Driven Energy Flow for Latent Video Diffusion Model. arXiv:2411.17459 https://arxiv.org/abs/2411.17459

  28. [76]

    Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, and Diana Marculescu. 2025. Looking Backward: Streaming Video-to-Video Translation with Feature Banks. arXiv:2405.15757 https://arxiv.org/abs/2405.15757

  29. [77]

    Shanchuan Lin, Xin Xia, Yuxi Ren, Ceyuan Yang, Xuefeng Xiao, and Lu Jiang. 2025. Diffusion Adversarial Post-Training for One-Step Video Generation. arXiv:2501.08316 https://arxiv.org/abs/2501.08316

  30. [78]

    Akide Liu, Zeyu Zhang, Zhexin Li, Xuehai Bai, Yizeng Han, Jiasheng Tang, Yuanjie Xing, Jichao Wu, Mingyang Yang, Weihua Chen, Jiahao He, Yuanyu He, Fan Wang, Gholamreza Haffari, and Bohan Zhuang. 2025. FPSAttention: Training-Aware FP8 and Sparsity Co-Design for Fast Video Diffusion. arXiv:2506.04648 https://arxiv.org/abs/2506.04648

  31. [79]

    Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James T. Kwok, Sumi Helal, and Zeke Xie. 2026. Alignment of Diffusion Models: Fundamentals, Challenges, and Future. Comput. Surveys 58, 9 (March 2026), 1–37. https://doi.org/10.1145/3796982

  32. [80]

    Chao Liu and Arash Vahdat. 2025. On Equivariance and Fast Sampling in Video Diffusion Models Trained with Warped Noise. arXiv:2504.09789 https://arxiv.org/abs/2504.09789

  33. [81]

    Dong Liu, Yanxuan Yu, Jiayi Zhang, Yifan Li, Ben Lengerich, and Ying Nian Wu. 2025. FastCache: Fast Caching for Diffusion Transformer Through Learnable Linear Approximation. arXiv:2505.20353 https://arxiv.org/abs/2505.20353

  34. [82]

    Haosong Liu, Yuge Cheng, Wenxuan Miao, Zihan Liu, Aiyue Chen, Jing Lin, Yiwu Yao, Chen Chen, Jingwen Leng, Yu Feng, and Minyi Guo. 2025. Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers. arXiv:2506.05096 https://arxiv.org/abs/2506.05096

  35. [83]

    Huaize Liu, Wenzhang Sun, Qiyuan Zhang, Donglin Di, Biao Gong, Hao Li, Chen Wei, and Changqing Zou. 2025. Hi-VAE: Efficient Video Autoencoding with Global and Detailed Motion. arXiv:2506.07136 https://arxiv.org/abs/2506.07136

  36. [84]

    Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. 2025. From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers. arXiv:2503.06923 https://arxiv.org/abs/2503.06923

  37. [85]

    Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, and Shijian Lu. 2025. Rolling Forcing: Autoregressive Long Video Diffusion in Real Time. arXiv:2509.25161 https://arxiv.org/abs/2509.25161

  38. [86]

    Kai Liu, Shaoqiu Zhang, Linghe Kong, and Yulun Zhang. 2025. CLQ: Cross-Layer Guided Orthogonal-based Quantization for Diffusion Transformers. arXiv:2509.24416 https://arxiv.org/abs/2509.24416

  39. [87]

    Penghui Liu, Jiangshan Wang, Yutong Shen, Shanhui Mo, Chenyang Qi, and Yue Ma. 2025. MultiMotion: Multi Subject Video Motion Transfer via Video Diffusion Transformer. arXiv:2512.07500 https://arxiv.org/abs/2512.07500

  40. [88]

    Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, and Ziwei Liu.

  41. [89]

    Free4D: Tuning-free 4D Scene Generation with Spatial-Temporal Consistency. arXiv:2503.20785 https://arxiv.org/abs/2503.20785

  42. [90]

    Xingchao Liu, Chengyue Gong, and Qiang Liu. 2022. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv:2209.03003 https://arxiv.org/abs/2209.03003

  43. [91]

    Xinyan Liu, Huihong Shi, Yang Xu, and Zhongfeng Wang. 2026. TaQ-DiT: Time-aware Quantization for Diffusion Transformers. IEEE Transactions on Circuits and Systems for Video Technology (2026), 1–1. https://doi.org/10.1109/tcsvt.2026.3652275

  44. [92]

    Yong Liu, Jinshan Pan, Yinchuan Li, Qingji Dong, Chao Zhu, Yu Guo, and Fei Wang. 2025. UltraVSR: Achieving Ultra-Realistic Video Super-Resolution with Efficient One-Step Diffusion Space. In Proceedings of the 33rd ACM International Conference on Multimedia. ACM, 7785–7794. https://doi.org/10.1145/3746027.3755117

  45. [93]

    Chetwin Low and Weimin Wang. 2025. TalkingMachines: Real-Time Audio-Driven FaceTime-Style Video via Autoregressive Diffusion Models. arXiv:2506.03099 https://arxiv.org/abs/2506.03099

  46. [94]

    Beijia Lu, Ziyi Chen, Jing Xiao, and Jun-Yan Zhu. 2025. Input-Aware Sparse Attention for Real-Time Co-Speech Video Generation. In Proceedings of the SIGGRAPH Asia 2025 Conference Papers. ACM, 1–11. https://doi.org/10.1145/3757377.3763831

  47. [95]

    Guanxing Lu, Zifeng Gao, Tianxing Chen, Wenxun Dai, Ziwei Wang, Wenbo Ding, and Yansong Tang. 2025. ManiCM: Real-time 3D Diffusion Policy via Consistency Model for Robotic Manipulation. arXiv:2406.01586 https://arxiv.org/abs/2406.01586

  48. [96]

    Yunhong Lu, Yanhong Zeng, Haobo Li, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Jiapeng Zhu, Hengyuan Cao, Zhipeng Zhang, Xing Zhu, et al. 2025. Reward Forcing: Efficient Streaming Video Generation with Rewarded Distribution Matching Distillation. arXiv:2512.04678 https://arxiv.org/abs/2512.04678

  49. [97]

    Yihong Luo, Tianyang Hu, Jiacheng Sun, Yujun Cai, and Jing Tang. 2025. Learning Few-Step Diffusion Models by Trajectory Distribution Matching. arXiv:2503.06674 https://arxiv.org/abs/2503.06674

  50. [98]

    Zhengyao Lv, Chenyang Si, Tianlin Pan, Zhaoxi Chen, Kwan-Yee K. Wong, Yu Qiao, and Ziwei Liu. 2025. Dual-Expert Consistency Model for Efficient and High-Quality Video Generation. arXiv:2506.03123 https://arxiv.org/abs/2506.03123

  51. [99]

    Zhengyao Lv, Chenyang Si, Junhao Song, Zhenyu Yang, Yu Qiao, Ziwei Liu, and Kwan-Yee K. Wong. 2024. FasterCache: Training-Free Video Diffusion Model Acceleration with High Quality. arXiv:2410.19355 https://arxiv.org/abs/2410.19355

  52. [100]

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang...

  53. [101]

    Xin Ma, Yaohui Wang, Genyun Jia, Xinyuan Chen, Tien-Tsin Wong, and Cunjian Chen. 2026. Consistent and Controllable Image Animation with Motion Linear Diffusion Transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence(2026), 1–16. https://doi.org/10.1109/tpami.2026.3664227

  54. [102]

    Zehong Ma, Longhui Wei, Feng Wang, Shiliang Zhang, and Qi Tian. 2025. MagCache: Fast Video Generation with Magnitude-Aware Cache. arXiv:2506.09045 https://arxiv.org/abs/2506.09045

  55. [103]

    Erwann Millon. 2025. Krea Realtime 14B: Real-time Video Generation. https://github.com/krea-ai/realtime-video

  56. [104]

    Kepan Nan, Rui Xie, Penghao Zhou, Tiehan Fan, Zhenheng Yang, Zhijie Chen, Xiang Li, Jian Yang, and Ying Tai. 2024. OpenVid-1M: A Large-Scale High-Quality Dataset for Text-to-video Generation. arXiv:2407.02371 https://arxiv.org/abs/2407.02371

  57. [105]

    Open-Sora-Plan. 2024. Mixkit. https://huggingface.co/datasets/LanguageBind/Open-Sora-Plan-v1.1.0

  58. [106]

    William Peebles and Saining Xie. 2023. Scalable Diffusion Models with Transformers. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV). IEEE, 4172–4182. https://doi.org/10.1109/iccv51070.2023.00387

  59. [107]

    Xurui Peng, Hong Liu, Chenqian Yan, Rui Ma, Fangmin Chen, Xing Wang, Zhihua Wu, Songwei Liu, and Mingbao Lin. 2025. ERTACache: Error Rectification and Timesteps Adjustment for Efficient Diffusion. arXiv:2508.21091 https://arxiv.org/abs/2508.21091

  60. [108]

    Liang Qiao, Yue Dai, Yeqi Huang, Hongyu Kan, Jun Shi, and Hong An. 2025. FlashOmni: A Unified Sparse Attention Engine for Diffusion Transformers. arXiv:2509.25401 https://arxiv.org/abs/2509.25401

  61. [109]

    Haonan Qiu, Menghan Xia, Yong Zhang, Yingqing He, Xintao Wang, Ying Shan, and Ziwei Liu. 2023. FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling. arXiv:2310.15169 https://arxiv.org/abs/2310.15169

  62. [110]

    Sucheng Ren, Qihang Yu, Ju He, Alan Yuille, and Liang-Chieh Chen. 2025. Grouping First, Attending Smartly: Training-Free Acceleration for Diffusion Transformers. arXiv:2505.14687 https://arxiv.org/abs/2505.14687

  63. [111]

    Amirmojtaba Sabour, Sanja Fidler, and Karsten Kreis. 2024. Align Your Steps: Optimizing Sampling Schedules in Diffusion Models. arXiv:2404.14507 https://arxiv.org/abs/2404.14507

  64. [112]

    Axel Sauer, Frederic Boesel, Tim Dockhorn, Andreas Blattmann, Patrick Esser, and Robin Rombach. 2024. Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation. In SIGGRAPH Asia 2024 Conference Papers. ACM, 1–11. https://doi.org/10.1145/3680528.3687625

  65. [113]

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. 2024. Adversarial Diffusion Distillation. In Computer Vision – ECCV 2024. Springer Nature Switzerland, 87–103. https://doi.org/10.1007/978-3-031-73016-0_6

  66. [114]

    Shitong Shao, Hongwei Yi, Hanzhong Guo, Tian Ye, Daquan Zhou, Michael Lingelbach, Zhiqiang Xu, and Zeke Xie. 2025. MagicDistillation: Weak-to-Strong Video Distillation for Large-Scale Few-Step Synthesis. arXiv:2503.13319 https://arxiv.org/abs/2503.13319

  67. [115]

    Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, Yan Wang, Haotong Qin, and Hao Tang. 2025. TR-DQ: Time-Rotation Diffusion Quantization. arXiv:2503.06564 https://arxiv.org/abs/2503.06564

  68. [116]

    Leqi Shen, Guoqiang Gong, Tao He, Yifeng Zhang, Pengzhang Liu, Sicheng Zhao, and Guiguang Ding. 2025. FastVID: Dynamic Density Pruning for Fast Video Large Language Models. arXiv:2503.11187 https://arxiv.org/abs/2503.11187

  69. [117]

    Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu.

  70. [118]

    DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance. arXiv:2505.14708 https://arxiv.org/abs/2505.14708

  71. [119]

    Kuaishou. 2024. Kling 2.6. https://app.klingai.com/global/release-notes/c605hp1tzd?type=dialog

  72. [120]

    Gaurav Shrivastava and Abhinav Shrivastava. 2024. Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 7236–7245. https://doi.org/10.1109/cvpr52733.2024.00691

  73. [121]

    SkyTimelapse. 2021. SkyTimelapse. youtube.com/channel/UCtLemFmUPZYItte3PpG7f2Q/videos?reload=9

  74. [122]

    Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever. 2023. Consistency Models. In International Conference on Machine Learning. PMLR, 32211–32252

  75. [123]

    Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. 2021. Score-Based Generative Modeling through Stochastic Differential Equations. In International Conference on Learning Representations. OpenReview.net. https://openreview.net/forum?id=PxTIG12RRHS

  76. [124]

    K Soomro. 2012. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 https://arxiv.org/abs/1212.0402

  77. [125]

    Stability.ai. 2024. Introducing Stable Diffusion 3.5. https://stability.ai/news/introducing-stable-diffusion-3-5

  78. [126]

    Wenzhang Sun, Qirui Hou, Donglin Di, Jiahui Yang, Yongjia Ma, and Jianxun Cui. 2025. UniCP: A Unified Caching and Pruning Framework for Efficient Video Generation. In Proceedings of the 7th ACM International Conference on Multimedia in Asia. ACM, 1–7. https://doi.org/10.1145/3743093.3770981

  79. [127]

    Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Shunyu Liu, and Dacheng Tao. 2025. VORTA: Efficient Video Diffusion via Routing Sparse Attention. arXiv:2505.18809 https://arxiv.org/abs/2505.18809

  80. [128]

    Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, and Dacheng Tao. 2024. AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration. arXiv:2412.11706 https://arxiv.org/abs/2412.11706

Showing first 80 references.