pith. machine review for the scientific record.

arxiv: 2605.14513 · v1 · submitted 2026-05-14 · 💻 cs.CV · cs.AI

Recognition: unknown

HASTE: Training-Free Video Diffusion Acceleration via Head-Wise Adaptive Sparse Attention

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video diffusion · sparse attention · training-free acceleration · DiT · head-wise adaptation · temporal mask reuse · error-guided calibration

The pith

Head-wise adaptive sparse attention accelerates pretrained video diffusion models by up to 1.93× without retraining, reusing temporal masks across timesteps and calibrating sparsity per head.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video diffusion transformers are slowed by full quadratic attention, and existing training-free sparse methods still pay repeated costs to predict masks and apply uniform sparsity thresholds across heads. The paper introduces a head-wise adaptive framework that reuses attention masks when query-key drift stays low and adjusts per-head top-p thresholds by minimizing measured output error under a global sparsity budget. If the approach works, it delivers substantial wall-clock speedups on large models like Wan2.1 at 720P while keeping generated video quality competitive with the dense baseline. A sympathetic reader would care because the changes require no retraining and plug directly into existing checkpoints, lowering the barrier to running high-quality video generation on available hardware.

Core claim

The central claim is that two plug-in components—Temporal Mask Reuse, which skips mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-p thresholds by minimizing model-output error under a global sparsity budget—together produce a head-wise adaptive sparse attention scheme that consistently improves prior training-free methods and reaches up to 1.93 times speedup at 720p on Wan2.1-1.3B and Wan2.1-14B models while preserving competitive video quality and similarity metrics.
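
To make the top-p mechanism concrete: in block-sparse attention of this kind, each head keeps, for every query block, only the smallest set of key blocks whose estimated attention mass reaches that head's threshold. The sketch below is a minimal PyTorch illustration under assumed shapes; the function name, the block-score estimator, and the per-head threshold tensor are illustrative, not the paper's implementation.

    import torch

    def topp_block_mask(block_scores: torch.Tensor, top_p: torch.Tensor) -> torch.Tensor:
        """Per-head top-p selection over key blocks.
        block_scores: [heads, q_blocks, k_blocks] estimated attention logits.
        top_p:        [heads] per-head thresholds in (0, 1].
        Returns a boolean mask of the same shape as block_scores."""
        probs = torch.softmax(block_scores, dim=-1)
        sorted_p, order = torch.sort(probs, dim=-1, descending=True)
        cum = torch.cumsum(sorted_p, dim=-1)
        # keep blocks until the cumulative mass first reaches the head's threshold
        keep_sorted = (cum - sorted_p) < top_p.view(-1, 1, 1)
        mask = torch.zeros_like(keep_sorted)
        mask.scatter_(-1, order, keep_sorted)
        return mask

    scores = torch.randn(12, 64, 64)        # e.g. 12 heads, 64x64 query/key blocks
    thresholds = torch.full((12,), 0.90)    # uniform here; HASTE calibrates these per head
    sparsity = 1.0 - topp_block_mask(scores, thresholds).float().mean().item()

Lowering a head's threshold prunes more key blocks and raises its sparsity; the paper's contribution is in deciding when that mask needs recomputing and how low each head's threshold can safely go.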

What carries the argument

A head-wise adaptive sparse attention framework that uses Temporal Mask Reuse to skip mask prediction when query-key drift is low, and Error-guided Budgeted Calibration to set per-head sparsity thresholds that minimize measured model-output error under a fixed global budget.
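
A minimal sketch of the reuse decision, assuming the drift signal is computed from mean-pooled queries and keys cached at the last refresh (the pooling, the relative-L2 statistic, and the threshold value are assumptions for illustration; the paper's exact criterion may differ):

    import torch

    def should_reuse_mask(q, k, cached_q_pool, cached_k_pool, tau: float = 0.1):
        """Per-head reuse decision from mean-pooled query/key drift.
        q, k:                  [heads, tokens, dim] at the current timestep.
        cached_q_pool/_k_pool: [heads, dim] pooled stats from the last mask refresh.
        Returns a boolean tensor [heads]: True = reuse the cached mask, skip prediction."""
        q_pool, k_pool = q.mean(dim=1), k.mean(dim=1)  # cheap pooled summary
        drift_q = (q_pool - cached_q_pool).norm(dim=-1) / (cached_q_pool.norm(dim=-1) + 1e-6)
        drift_k = (k_pool - cached_k_pool).norm(dim=-1) / (cached_k_pool.norm(dim=-1) + 1e-6)
        return torch.maximum(drift_q, drift_k) < tau

Heads flagged True keep their previous mask for the current step; the others rerun mask prediction and overwrite the pooled cache, so only the drifting heads pay the prediction cost.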

Load-bearing premise

That measured model-output error under a global sparsity budget reliably tracks perceptual video quality across heads, and that query-key drift between adjacent denoising steps remains small enough for masks to be reused without visible artifacts.

What would settle it

Running the method on Wan2.1 models at 720P and checking for a clear drop in perceptual quality metrics such as FVD, for visible temporal flickering or artifacts traceable to reused masks, or for a failure to realize the reported wall-clock speedup on standard inference hardware. Any of these would undercut the claim; their absence would support it.
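
One cheap way to probe the flickering failure mode named above is to compare a frame-to-frame stability statistic between dense and sparse generations of the same prompt and seed; this is a generic diagnostic, not a metric taken from the paper.

    import torch

    def temporal_flicker(frames: torch.Tensor) -> float:
        """Mean squared frame-to-frame difference for a video tensor
        frames: [T, C, H, W] with values in [0, 1]; higher suggests more flicker."""
        diffs = frames[1:] - frames[:-1]
        return diffs.pow(2).mean().item()

    # flicker_gap = temporal_flicker(sparse_video) - temporal_flicker(dense_video)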

Figures

Figures reproduced from arXiv: 2605.14513 by Fei Chao, Jing Xu, Rongrong Ji, Xiawu Zheng, Xuzhe Zheng, Yuexiao Ma.

Figure 1: Qualitative comparison on Wan2.1-14B text-to-video generation at 480P. We compare dense attention with two representative sparse-attention baselines, XAttention [35] and SVG2 [36], and their variants enhanced by our method. These results show that our method better preserves the video quality while further improving inference efficiency.

Figure 2: Temporal mask similarity is heterogeneous across prompts, layers, and heads. (a) Prompt-level curves for randomly sampled prompts still vary substantially after averaging over layers. (b) A layer-step mask-IoU heatmap for one randomly sampled prompt shows that the same heterogeneity also appears across layers. (c) Per-head mask-IoU heatmap within a randomly selected layer shows that this variation persists…

Figure 3: Head-wise threshold-induced sparsity and error response curves on Wan2.1-1.3B with XAttention [35], measured at a randomly selected layer and denoising timestep. Each polyline contains seven operating points corresponding to top-p thresholds {1.00, 0.95, 0.90, 0.85, 0.80, 0.70, 0.65}. From left to right, the panels show attention-output MSE, model-output MSE on denoising velocity, and head sparsity, all as…

Figure 4: Overview of the proposed head-wise sparse-attention framework. Left: Temporal Mask Reuse (TMR) reduces online mask-prediction overhead by deciding, for each head, whether to reuse the previous sparse mask or refresh it using a lightweight query-key stability signal. Right: Error-guided Budgeted Calibration (EBC) operates offline, measuring candidate operating points for each head and selecting head-specific…

Figure 5: Relationship between adjacent-step mask IoU and query-key drift. Both the full-token drift and the mean-pooled drift show clear negative correlations with mask IoU, indicating that the cheaper mean-pooled statistic preserves the predictive trend of the full-token drift while requiring much less cache memory. Each point corresponds to one sampled head-step pair from either conditional or unconditional branc…

Figure 6: Realized sparsity over denoising timesteps for several randomly selected heads under a fixed sparsification threshold. Even for the same head, the achieved sparsity varies substantially across timesteps, while different heads exhibit different trajectories. This empirical observation supports interval-based timestep sampling in calibration, rather than measuring each head at only one denoising step…
Original abstract

Diffusion-based video generation has advanced substantially in visual fidelity and temporal coherence, but practical deployment remains limited by the quadratic complexity of full attention. Training-free sparse attention is attractive because it accelerates pretrained models without retraining, yet existing online top-$p$ sparse attention still spends non-negligible cost on mask prediction and applies shared thresholds despite strong head-level heterogeneity. We show that these two overlooked factors limit the practical speed-quality trade-off of training-free sparse attention in Video DiTs. To address them, we introduce a head-wise adaptive framework with two plug-in components: Temporal Mask Reuse, which skips unnecessary mask prediction based on query-key drift, and Error-guided Budgeted Calibration, which assigns per-head top-$p$ thresholds by minimizing measured model-output error under a global sparsity budget. On Wan2.1-1.3B and Wan2.1-14B, our method consistently improves XAttention and SVG2, achieving up to 1.93 times speedup at 720P while maintaining competitive video quality and similarity metrics.
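
The calibration the abstract describes can be pictured as a budgeted selection over measured operating points: each head's error and sparsity are measured offline at a few candidate thresholds, and thresholds are then chosen so that total measured error stays small while average sparsity meets the budget. The greedy sketch below is one simple way to solve that selection (the paper's own calibration appears to use an ILP over the measured points); the arrays and their ordering are assumed inputs, not the authors' code.

    import numpy as np

    def calibrate_thresholds(error: np.ndarray, sparsity: np.ndarray, budget: float) -> np.ndarray:
        """Greedy budgeted calibration sketch.
        error[h, c], sparsity[h, c]: measured model-output error and head sparsity for
        head h at candidate threshold c, ordered from densest (c = 0) to sparsest.
        Returns one candidate index per head whose mean sparsity meets `budget`."""
        n_heads, n_cands = error.shape
        choice = np.zeros(n_heads, dtype=int)          # start fully dense
        while sparsity[np.arange(n_heads), choice].mean() < budget:
            best_head, best_cost = None, np.inf
            for h in range(n_heads):
                c = choice[h] + 1
                if c >= n_cands:
                    continue                           # head already at its sparsest setting
                d_err = error[h, c] - error[h, choice[h]]
                d_sp = sparsity[h, c] - sparsity[h, choice[h]]
                if d_sp <= 0:
                    continue
                cost = d_err / d_sp                    # error paid per unit of sparsity gained
                if cost < best_cost:
                    best_head, best_cost = h, cost
            if best_head is None:                      # budget unreachable with these candidates
                break
            choice[best_head] += 1
        return choice

Heads whose measured error grows slowly with sparsity are pushed to aggressive thresholds first, while sensitive heads stay near dense; this is the speed-quality trade-off the head-wise calibration is meant to exploit.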

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces HASTE, a training-free head-wise adaptive sparse attention framework for accelerating pretrained Video DiTs. It adds two plug-in components—Temporal Mask Reuse (skipping mask prediction via query-key drift) and Error-guided Budgeted Calibration (per-head top-p thresholds chosen to minimize measured model-output error under a global sparsity budget)—and reports that the method improves XAttention and SVG2 on Wan2.1-1.3B and Wan2.1-14B, reaching up to 1.93× speedup at 720P while preserving competitive video quality and similarity metrics.

Significance. If the speedup and quality claims are substantiated with rigorous ablations and perceptual validation, the work would offer a practical, training-free acceleration path for large video generation models. The explicit handling of head-level heterogeneity and the reuse of masks address two real bottlenecks in online sparse attention for DiTs; the training-free nature is a clear strength.

major comments (3)
  1. [Experimental results (abstract and §4)] The central empirical claim (up to 1.93× speedup at 720P with competitive quality) rests on the Error-guided Budgeted Calibration, yet the manuscript provides no error bars, exact per-metric tables, or ablation isolating the calibration objective from the global sparsity budget. Without these, it is impossible to verify that the reported gains are statistically reliable or that the per-head thresholds actually improve the speed-quality frontier over uniform top-p baselines.
  2. [Error-guided Budgeted Calibration] The calibration minimizes a scalar model-output error (latent-space L2 or equivalent) subject to the global sparsity budget. This proxy is not shown to correlate with perceptual video quality, especially temporal coherence and motion artifacts; latent error frequently decouples from visible flickering once temporal attention patterns shift across denoising steps. The paper must demonstrate that the chosen thresholds preserve human-visible quality (e.g., via FVD, user studies, or temporal stability metrics) rather than only frame-wise similarity scores.
  3. [Temporal Mask Reuse] Temporal Mask Reuse assumes query-key drift remains small enough for safe mask reuse across timesteps. Drift magnitude typically increases as noise decreases and semantic structure appears; if this assumption fails for later denoising steps, visible artifacts can appear that are invisible to the calibration objective. The manuscript should quantify drift statistics and show that reuse does not degrade temporal consistency on the evaluated models.
minor comments (2)
  1. [Abstract] The abstract states “competitive video quality and similarity metrics” without naming the concrete metrics (FVD, CLIP-T, etc.) or reporting numerical values; this should be clarified in the abstract and results section.
  2. [Method overview] Notation for the global sparsity budget and per-head top-p thresholds should be introduced with a single consistent symbol set and a small illustrative diagram showing how the budget is allocated across heads.
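
On minor comment 2, one candidate formalization with a single symbol set (the notation here is assumed, not taken from the manuscript): head $h \in \{1,\dots,H\}$ receives a threshold $p_h$ from a candidate set $\mathcal{P}$, $\widehat{E}_h(p)$ and $\widehat{s}_h(p)$ are the offline-measured model-output error and sparsity of head $h$ at threshold $p$, and $B$ is the global sparsity budget:

    \begin{aligned}
      \min_{\,p_1,\dots,p_H \in \mathcal{P}} \quad & \sum_{h=1}^{H} \widehat{E}_h(p_h) \\
      \text{subject to} \quad & \frac{1}{H} \sum_{h=1}^{H} \widehat{s}_h(p_h) \;\ge\; B .
    \end{aligned}

The small illustrative diagram the referee requests could then show how lowering $p_h$ for error-tolerant heads buys the sparsity that lets sensitive heads stay near $p_h = 1$.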

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will incorporate the suggested analyses and ablations in the revised manuscript to strengthen the empirical validation.

Point-by-point responses
  1. Referee: [Experimental results (abstract and §4)] The central empirical claim (up to 1.93× speedup at 720P with competitive quality) rests on the Error-guided Budgeted Calibration, yet the manuscript provides no error bars, exact per-metric tables, or ablation isolating the calibration objective from the global sparsity budget. Without these, it is impossible to verify that the reported gains are statistically reliable or that the per-head thresholds actually improve the speed-quality frontier over uniform top-p baselines.

    Authors: We agree that error bars, full tables, and isolating ablations are needed for rigor. In revision we will add standard deviations over 3 random seeds for all reported metrics, include complete per-metric tables, and provide a dedicated ablation comparing Error-guided Budgeted Calibration against uniform top-p under identical global sparsity budgets. This will isolate the per-head adaptation benefit and confirm statistical reliability of the 1.93× speedup. revision: yes

  2. Referee: [Error-guided Budgeted Calibration] The calibration minimizes a scalar model-output error (latent-space L2 or equivalent) subject to the global sparsity budget. This proxy is not shown to correlate with perceptual video quality, especially temporal coherence and motion artifacts; latent error frequently decouples from visible flickering once temporal attention patterns shift across denoising steps. The paper must demonstrate that the chosen thresholds preserve human-visible quality (e.g., via FVD, user studies, or temporal stability metrics) rather than only frame-wise similarity scores.

    Authors: We acknowledge that latent L2 is a proxy and will expand evaluation in revision by reporting Fréchet Video Distance (FVD) and temporal stability metrics (e.g., frame-to-frame difference variance). We will also add discussion of observed correlation between the calibration objective and these perceptual metrics on Wan2.1. While full user studies exceed current scope, we will include additional qualitative temporal-coherence examples and note that our similarity metrics already remain competitive; the revision will prioritize the requested quantitative perceptual metrics. revision: partial

  3. Referee: [Temporal Mask Reuse] Temporal Mask Reuse assumes query-key drift remains small enough for safe mask reuse across timesteps. Drift magnitude typically increases as noise decreases and semantic structure appears; if this assumption fails for later denoising steps, visible artifacts can appear that are invisible to the calibration objective. The manuscript should quantify drift statistics and show that reuse does not degrade temporal consistency on the evaluated models.

    Authors: We will add a new subsection quantifying query-key drift magnitude (L2 distance between consecutive Q/K) across all denoising timesteps for both Wan2.1-1.3B and 14B. We will also report temporal consistency metrics (e.g., temporal PSNR and motion artifact scores) with and without mask reuse to demonstrate that reuse preserves coherence. Our internal checks show drift remains below the threshold that triggers visible artifacts, but the explicit statistics will directly address the concern. revision: yes
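
The drift statistics promised in this response could be reported from traces like the following sketch, where per-head queries and keys are captured at each denoising step; the relative-L2 form and the max over Q/K are assumptions for illustration, and the authors may report a different distance.

    import torch

    def _rel_change(cur: torch.Tensor, prev: torch.Tensor) -> torch.Tensor:
        """Per-head relative L2 change between two [heads, tokens, dim] tensors."""
        num = (cur - prev).flatten(1).norm(dim=-1)
        den = prev.flatten(1).norm(dim=-1) + 1e-6
        return num / den

    def drift_trace(q_per_step, k_per_step) -> torch.Tensor:
        """q_per_step, k_per_step: lists of [heads, tokens, dim] tensors, one per timestep.
        Returns [steps - 1, heads] drift magnitudes; plotting the mean over heads against
        the timestep index shows whether drift grows as denoising proceeds."""
        rows = []
        for t in range(1, len(q_per_step)):
            dq = _rel_change(q_per_step[t], q_per_step[t - 1])
            dk = _rel_change(k_per_step[t], k_per_step[t - 1])
            rows.append(torch.maximum(dq, dk))
        return torch.stack(rows)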

Circularity Check

0 steps flagged

No circularity; empirical calibration and reuse rules are independent of claimed outputs

Full rationale

The paper introduces two plug-in components (Temporal Mask Reuse based on query-key drift and Error-guided Budgeted Calibration that selects per-head top-p thresholds by minimizing measured model-output error under a global sparsity budget). These are procedural heuristics whose parameters are set by direct measurement on the target model rather than by any derivation that would build the reported speedup or quality metrics into the inputs by construction. No equations are presented that equate a 'prediction' to a fitted quantity; the central claims are empirical speedups (up to 1.93× at 720P) validated on Wan2.1 models with competitive similarity metrics. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling appear in the provided text. The claims are therefore tested against external benchmarks rather than against the method's own construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that attention heads exhibit strong heterogeneity that can be exploited for per-head sparsity without retraining; no new mathematical axioms or invented entities are introduced.

free parameters (1)
  • global sparsity budget
    Used to distribute per-head top-p thresholds while minimizing measured output error
axioms (1)
  • domain assumption: Head-level heterogeneity in attention patterns is stable enough to allow adaptive thresholds that preserve output quality
    Invoked to justify why shared thresholds are suboptimal and why per-head calibration improves the trade-off

pith-pipeline@v0.9.0 · 5492 in / 1202 out tokens · 52259 ms · 2026-05-15T05:14:18.704869+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 11 internal anchors

  1. [1]

    Dicache: Let diffusion model determine its own cache.arXiv preprint arXiv:2508.17356, 2025

    Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Dahua Lin, and Jiaqi Wang. Dicache: Let diffusion model determine its own cache.arXiv preprint arXiv:2508.17356, 2025

  2. [2]

    RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy. arXiv preprint arXiv:2505.21036, 2025

    Aiyue Chen, Bin Dong, Jingru Li, Jing Lin, Kun Tian, Yiwu Yao, and Gongyi Wang. RainFusion: Adaptive Video Generation Acceleration via Multi-Dimensional Visual Redundancy, June 2025. URL http://arxiv.org/abs/2505.21036. arXiv:2505.21036 [cs]

  3. [3]

    RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention

    Aiyue Chen, Yaofu Liu, Junjian Huang, Guang Lian, Yiwu Yao, Wangli Lan, Jing Lin, Zhixin Ma, Tingting Zhou, and Harry Yang. RainFusion2.0: Temporal-Spatial Awareness and Hardware-Efficient Block-wise Sparse Attention, December 2025. URLhttp://arxiv.org/abs/2512.24086. arXiv:2512.24086 [cs]

  4. [4]

    Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers, June 2025

    Pengtao Chen, Xianfang Zeng, Maosen Zhao, Peng Ye, Mingzhu Shen, Wei Cheng, Gang Yu, and Tao Chen. Sparse-vDiT: Unleashing the Power of Sparse Attention to Accelerate Video Diffusion Transformers, June 2025. URLhttp://arxiv.org/abs/2506.03065. arXiv:2506.03065 [cs]

  5. [5]

    Hicache: Training-free acceleration of diffusion models via hermite polynomial-based feature caching.arXiv preprint arXiv:2508.16984, 2025

    Liang Feng, Shikang Zheng, Jiacheng Liu, Yuqi Lin, Qinming Zhou, Peiliang Cai, Xinyu Wang, Junjie Chen, Chang Zou, Yue Ma, et al. Hicache: Training-free acceleration of diffusion models via hermite polynomial-based feature caching.arXiv preprint arXiv:2508.16984, 2025

  6. [6]

    One Step Diffusion via Shortcut Models

    Kevin Frans, Danijar Hafner, Sergey Levine, and Pieter Abbeel. One step diffusion via shortcut models.arXiv preprint arXiv:2410.12557, 2024

  7. [7]

    BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation, September 2025

    Youping Gu, Xiaolong Li, Yuhao Hu, Minqi Chen, and Bohan Zhuang. BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation, September 2025. URLhttp://arxiv.org/abs/2508.10774 . arXiv:2508.10774 [cs]

  8. [8]

    Block Sparse Attention

    Junxian Guo, Haotian Tang, Shang Yang, Zhekai Zhang, Zhijian Liu, and Song Han. Block Sparse Attention. https://github.com/mit-han-lab/Block-Sparse-Attention, 2024

  9. [9]

    Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models.Advances in neural information processing systems, 33:6840–6851, 2020

  10. [10]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  11. [11]

    HunyuanVideo: A Systematic Framework For Large Video Generative Models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, Kathrina Wu, Qin Lin, Junkun Yuan, Yanxin Long, Aladdin Wang, Andong Wang, Changlin Li, Duojun Huang, Fang Yang, Hao Tan, Hongmei Wang, Jacob Song, Jiawang Bai, Jianbing Wu, Jinbao Xue, Joey Wang, Kai Wang, Mengyang Liu, Pengyu Li, Shuai Li, ...

  12. [12]

    Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024

    Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. Svdquant: Absorbing outliers by low-rank components for 4-bit diffusion models.arXiv preprint arXiv:2411.05007, 2024

  13. [13]

    Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation, August 2025

    Qirui Li, Guangcong Zheng, Qi Zhao, Jie Li, Bin Dong, Yiwu Yao, and Xi Li. Compact Attention: Exploiting Structured Spatio-Temporal Sparsity for Fast Video Generation, August 2025. URL http://arxiv.org/abs/2508.12969. arXiv:2508.12969 [cs]

  14. [14]

    Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation, December 2025

    Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, and Song Han. Radial Attention: $O(n\log n)$ Sparse Attention with Energy Decay for Long Video Generation, December 2025. URL http://arxiv.org/abs/2506.19852. arXiv:2506.19852 [cs]

  15. [15]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  16. [16]

    Timestep embedding tells: It’s time to cache for video diffusion model

    Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7353–7363, 2025

  17. [17]

    Freqca: Accelerating diffusion models via frequency-aware caching.arXiv preprint arXiv:2510.08669, 2025

    Jiacheng Liu, Peiliang Cai, Qinming Zhou, Yuqi Lin, Deyang Kong, Benhao Huang, Yupei Pan, Haowen Xu, Chang Zou, Junshu Tang, et al. Freqca: Accelerating diffusion models via frequency-aware caching.arXiv preprint arXiv:2510.08669, 2025

  18. [18]

    From reusing to forecasting: Accelerating diffusion models with taylorseers

    Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with taylorseers. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15853–15863, 2025

  19. [20]

    Attention Sparsity is Input-Stable: Training-Free Sparse Attention for Video Generation via Offline Sparsity Profiling and Online QK Co-Clustering

    Jiayi Luo, Jiayu Chen, Jiankun Wang, Cong Wang, Hanxin Zhu, Qingyun Sun, Chen Gao, Zhibo Chen, and Jianxin Li. Training-Free Sparse Attention for Fast Video Generation via Offline Layer-Wise Sparsity Profiling and Online Bidirectional Co-Clustering, March 2026. URLhttp://arxiv.org/abs/2603.18636. arXiv:2603.18636 [cs]

  20. [21]

    Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model, February 2025

    Guoqing Ma, Haoyang Huang, Kun Yan, Liangyu Chen, Nan Duan, Shengming Yin, Changyi Wan, Ranchen Ming, Xiaoniu Song, Xing Chen, Yu Zhou, Deshan Sun, Deyu Zhou, Jian Zhou, Kaijun Tan, Kang An, Mei Chen, Wei Ji, Qiling Wu, Wen Sun, Xin Han, Yanan Wei, Zheng Ge, Aojie Li, Bin Wang, Bizhu Huang, Bo Wang, Brian Li, Changxing Miao, Chen Xu, Chenfei Wu, Chenguang...

  21. [22]

    Latte: Latent Diffusion Transformer for Video Generation

    Xin Ma, Yaohui Wang, Xinyuan Chen, Gengyun Jia, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

  22. [23]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  23. [24]

    DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance, May

    Xuan Shen, Chenxia Han, Yufa Zhou, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, and Jiuxiang Gu. DraftAttention: Fast Video Diffusion via Low-Resolution Attention Guidance, May

  24. [25]

    Draftattention: Fast video diffusion via low-resolution attention guidance

    URLhttp://arxiv.org/abs/2505.14708. arXiv:2505.14708 [cs]

  25. [26]

    LiteAttention: A Temporal Sparse Attention for Diffusion Transformers, November 2025

    Dor Shmilovich, Tony Wu, Aviad Dahan, and Yuval Domb. LiteAttention: A Temporal Sparse Attention for Diffusion Transformers, November 2025. URLhttp://arxiv.org/abs/2511.11062. arXiv:2511.11062 [cs]

  26. [27]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models.arXiv preprint arXiv:2010.02502, 2020

  27. [28]

    AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration, December 2024

    Wenhao Sun, Rong-Cheng Tu, Jingyi Liao, Zhao Jin, and Dacheng Tao. AsymRnR: Video Diffusion Transformers Acceleration with Asymmetric Reduction and Restoration, December 2024. URLhttp://arxiv.org/abs/2412.1

  28. [29]

    arXiv:2412.11706 [cs]

  29. [30]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  30. [31]

    Objective video quality assessment

    Zhou Wang, Hamid R Sheikh, Alan C Bovik, et al. Objective video quality assessment. InThe handbook of video databases: design and applications, volume 41, pages 1041–1078. CRC press Boca Raton, 2003

  31. [32]

    Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

    Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity.IEEE transactions on image processing, 13(4):600–612, 2004

  32. [33]

    Vmoba: Mixture-of-block attention for video diffusion models.arXiv preprint arXiv:2506.23858,

    Jianzong Wu, Liang Hou, Haotian Yang, Xin Tao, Ye Tian, Pengfei Wan, Di Zhang, and Yunhai Tong. Vmoba: Mixture-of-block attention for video diffusion models.arXiv preprint arXiv:2506.23858, 2025

  33. [34]

    USV: Unified Sparsification for Accelerating Video Diffusion Models, December 2025

    Xinjian Wu, Hongmei Wang, Yuan Zhou, and Qinglin Lu. USV: Unified Sparsification for Accelerating Video Diffusion Models, December 2025. URLhttp://arxiv.org/abs/2512.05754. arXiv:2512.05754 [cs]

  34. [35]

    Sparse videogen: Acceler- ating video diffusion transformers with spatial-temporal sparsity.arXiv preprint arXiv:2502.01776,

    Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, and Song Han. Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity, April 2025. URLhttp://arxiv.org/abs/2502.01776. arXiv:2502.01776 [cs]

  35. [36]

    Training-free and adaptive sparse attention for efficient long video generation.arXiv preprint arXiv:2502.21079,

    Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and Adaptive Sparse Attention for Efficient Long Video Generation, February 2025. URL http://arxiv.org/abs/2502.21079. arXiv:2502.21079 [cs]

  36. [37]

    Xattention: Block sparse attention with antidiagonal scoring.arXiv preprint arXiv:2503.16428,

    Ruyi Xu, Guangxuan Xiao, Haofeng Huang, Junxian Guo, and Song Han. XAttention: Block Sparse Attention with Antidiagonal Scoring, March 2025. URLhttp://arxiv.org/abs/2503.16428. arXiv:2503.16428 [cs]

  37. [38]

    Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

    Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, and Ion Stoica. Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation, October 2025. URLhttp://arxiv.org/abs/2505.18875. arXiv:2505.18875 [cs]

  38. [39]

    CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Yuxuan Zhang, Weihan Wang, Yean Cheng, Bin Xu, Xiaotao Gu, Yuxiao Dong, and Jie Tang. CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer, March 2025. URL http://arxiv.org/abs/2408.06072. arXiv:2408.06072 [cs]

  39. [40]

    Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

    Tianwei Yin, Michaël Gharbi, Taesung Park, Richard Zhang, Eli Shechtman, Fredo Durand, and William T Freeman. Improved distribution matching distillation for fast image synthesis.Advances in neural information processing systems, 37:47455–47487, 2024

  40. [41]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

  41. [42]

    Bidirectional sparse attention for faster video diffusion training.arXiv preprint arXiv:2509.01085,

    Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, and Hao Zhang. Bidirectional Sparse Attention for Faster Video Diffusion Training, September 2025. URLhttp://arxiv.org/abs/2509.01085. arXiv:2509.01085 [cs]

  42. [43]

    Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization.arXiv preprint arXiv:2411.10958, 2024

    Jintao Zhang, Haofeng Huang, Pengle Zhang, Jia Wei, Jun Zhu, and Jianfei Chen. Sageattention2: Efficient attention with thorough outlier smoothing and per-thread int4 quantization.arXiv preprint arXiv:2411.10958, 2024

  43. [44]

    Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration.arXiv preprint arXiv:2410.02367, 2024

    Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration.arXiv preprint arXiv:2410.02367, 2024

  44. [45]

    SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention

    Jintao Zhang, Haoxu Wang, Kai Jiang, Shuo Yang, Kaiwen Zheng, Haocheng Xi, Ziteng Wang, Hongzhou Zhu, Min Zhao, Ion Stoica, Joseph E. Gonzalez, Jun Zhu, and Jianfei Chen. SLA: Beyond Sparsity in Diffusion Transformers via Fine-Tunable Sparse-Linear Attention, 2025. URLhttps://arxiv.org/abs/2509.24006. Version Number: 2

  45. [46]

    Spargeattention: Accurate and training-free sparse attention accelerating any model inference

    Jintao Zhang, Chendong Xiang, Haofeng Huang, Jia Wei, Haocheng Xi, Jun Zhu, and Jianfei Chen. SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference, October 2025. URL http://arxiv.org/abs/2502.18137. arXiv:2502.18137 [cs]

  46. [47]

    VSA: Faster Video Diffusion with Trainable Sparse Attention, October 2025

    Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, and Hao Zhang. VSA: Faster Video Diffusion with Trainable Sparse Attention, October 2025. URL http://arxiv.org/abs/2505.13389. arXiv:2505.13389 [cs]

  47. [48]

    Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025

    Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025

  48. [49]

    The unreasonable effectiveness of deep features as a perceptual metric

    Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018

  49. [50]

    Ride the Wave: Precision-Allocated Sparse Attention for Smooth Video Generation

    Wentai Zhang, Ronghui Xi, Shiyao Peng, Jiayu Huang, Haoran Luo, Zichen Tang, et al. Ride the wave: Precision-allocated sparse attention for smooth video generation.arXiv preprint arXiv:2604.12219, 2026

  50. [51]

    Training-Free Efficient Video Generation via Dynamic Token Carving, November 2025

    Yuechen Zhang, Jinbo Xing, Bin Xia, Shaoteng Liu, Bohao Peng, Xin Tao, Pengfei Wan, Eric Lo, and Jiaya Jia. Training-Free Efficient Video Generation via Dynamic Token Carving, November 2025. URL http://arxiv.org/abs/2505.16864. arXiv:2505.16864 [cs]

  51. [52]

    Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation.arXiv preprint arXiv:2406.02540, 2024

    Tianchen Zhao, Tongcheng Fang, Haofeng Huang, Enshu Liu, Rui Wan, Widyadewi Soedarmadji, Shiyao Li, Zinan Lin, Guohao Dai, Shengen Yan, et al. Vidit-q: Efficient and accurate quantization of diffusion transformers for image and video generation.arXiv preprint arXiv:2406.02540, 2024

  52. [53]

    PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models, June 2025

    Tianchen Zhao, Ke Hong, Xinhao Yang, Xuefeng Xiao, Huixia Li, Feng Ling, Ruiqi Xie, Siqi Chen, Hongyu Zhu, Yichong Zhang, and Yu Wang. PAROAttention: Pattern-Aware ReOrdering for Efficient Sparse and Quantized Attention in Visual Generation Models, June 2025. URL http://arxiv.org/abs/2506.16054. arXiv:2506.16054 [cs]

  53. [54]

    cross-head interaction terms within the same layer, as shown in Eq. (37)

  54. [55]

    higher-order propagation terms represented byRl in Eq. (28)

  55. [56]

    Therefore, the additive objective is not an exact decomposition of the full network error

    cross-layer interactions, since the ILP sums measurements over all layers and heads while treating them as independently selectable operating points. Therefore, the additive objective is not an exact decomposition of the full network error. 24 MAC-AutoML Practical implication for calibration.Our ILP should thus be interpreted as optimizing a measurement-d...