arxiv: 2605.31057 · v1 · pith:MQFK65VLnew · submitted 2026-05-29 · 💻 cs.CV · cs.LG

LVSA: Training-Free Sparse Attention for Long Video Diffusion

Pith reviewed 2026-06-28 22:33 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords sparse attention · video diffusion · long video generation · training-free · block-sparse attention · rotating anchors · temporal artifacts · compute reduction

The pith

LVSA sparse attention reduces long video diffusion compute by up to 3.33 times without any retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Dense self-attention scales quadratically with sequence length and, beyond the training length, video diffusion models that use it drift toward near-static, repetitive output. LVSA replaces it with a training-free block-sparse pattern that combines structured windows with rotating global anchors, removing the fixed-grid bias behind long-range temporal artifacts. Against dense attention, the reported compute reductions are 3.17× on Wan 2.1 1.3B at a 6× horizon, 2.98× on Wan 2.1 14B at a 6× horizon, and 3.33× on HunyuanVideo 1.5 at a 1.5× horizon; quality is reported as neutral at training length and improved at longer lengths. LVSA also enables 2× horizon generation on HunyuanVideo 1.5, which runs out of memory on a single GPU with dense attention, and delivers NPU speedups of up to 2.71× on Wan 2.2 A14B and 3.24× on Wan 2.1 1.3B.

Core claim

LVSA is a training-free, model-agnostic block-sparse attention for video diffusion transformers. It combines a structured window pattern with rotating global anchors to remove the fixed-grid bias that causes long-range temporal artifacts. With a FlashInfer kernel, LVSA reduces compute relative to dense attention by up to 3.17× on Wan 2.1 1.3B at a 6× horizon, 2.98× on Wan 2.1 14B at a 6× horizon, and 3.33× on HunyuanVideo 1.5 at a 1.5× horizon. It remains quality-neutral at the training horizon and quality-positive at extended lengths, enables 2× horizon generation on HunyuanVideo 1.5 that otherwise runs out of memory, and provides speedups of up to 2.41× over RIFLEx and 3.27× over UltraViCo on Wan 2.1 1.3B.

What carries the argument

Structured window pattern combined with rotating global anchors in block-sparse attention
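
As a rough illustration of how a local window and rotating global anchors can be combined into one sparsity mask, here is a minimal frame-level sketch. It is not the authors' kernel: the function names, the sizes (16 frames, window radius 1, period 4), and the frame-granularity mask are all assumptions made for illustration, and the real method operates on token blocks through a FlashInfer kernel. The rotation rule follows the description in the Figure 2 caption (the anchor set shifts by one position per denoising step and wraps modulo the frame count).

```python
import numpy as np

def rotating_anchor_frames(num_frames, period, step):
    """Frames acting as global anchors at a given denoising step.

    Assumption: anchors are spaced `period` frames apart and the whole set
    shifts by one frame per step, wrapping modulo `num_frames`.
    """
    base = np.arange(0, num_frames, period)
    return (base + step) % num_frames

def frame_mask(num_frames, window, period, step):
    """Boolean [query_frame, key_frame] mask: local window OR global anchor."""
    idx = np.arange(num_frames)
    # Each query frame attends to key frames within `window` frames of itself.
    local = np.abs(idx[:, None] - idx[None, :]) <= window
    # Every query frame additionally attends to the current anchor frames.
    is_anchor = np.zeros(num_frames, dtype=bool)
    is_anchor[rotating_anchor_frames(num_frames, period, step)] = True
    return local | is_anchor[None, :]

# Hypothetical sizes, chosen only to keep the example small.
T, WINDOW, T_PER = 16, 1, 4

mask = frame_mask(T, WINDOW, T_PER, step=0)
density = float(mask.mean())

# Count how often each frame serves as an anchor over one full period.
anchor_counts = np.zeros(T, dtype=int)
for s in range(T_PER):
    anchor_counts[rotating_anchor_frames(T, T_PER, s)] += 1

print(f"kept fraction of frame pairs: {density:.3f}")
print(f"ideal compute reduction (1/density): {1.0 / density:.2f}x")
print("anchor visits per frame over one period:", anchor_counts.tolist())
```

The 1/density figure is only an upper bound under the assumption that cost is proportional to the number of kept frame pairs; measured speedups depend on kernel overheads and the non-attention parts of the model.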

If this is right

  • Compute is reduced by up to 3.17× on Wan 2.1 1.3B at a 6× horizon
  • Compute is reduced by up to 2.98× on Wan 2.1 14B at a 6× horizon
  • Compute is reduced by up to 3.33× on HunyuanVideo 1.5 at a 1.5× horizon
  • 2× horizon generation becomes possible on HunyuanVideo 1.5, a setting that exceeds single-GPU memory with dense attention
  • Speedups reach 2.41× versus RIFLEx and 3.27× versus UltraViCo on Wan 2.1 1.3B

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The rotation schedule could be tested on other transformer-based generative tasks that suffer from similar fixed-pattern attention biases.
  • VQeval could be applied to re-score existing long-video methods that currently receive inflated scores from metrics that reward looping output.
  • Because LVSA requires no retraining, the same pattern can be dropped into additional video diffusion models beyond the three evaluated here.
  • The method may combine with other inference optimizations such as quantization to produce further efficiency gains.

Load-bearing premise

The specific window sizes and rotation schedule remove fixed-grid bias and long-range artifacts without introducing new failure modes.

What would settle it

If videos generated with LVSA at six times the training horizon showed the same rate of repetitive looping as videos generated with dense attention, the quality claim would not hold.
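
One simple way to put a number on "near-static" output is to measure how much consecutive frames differ. The sketch below is an assumed stand-in, not VQeval: the function names, the threshold of 0.01, and the synthetic random "videos" are hypothetical and serve only to show the shape of such a check.

```python
import numpy as np

def frame_changes(video):
    """Mean absolute pixel change between each pair of consecutive frames.

    `video` is an array shaped [frames, height, width] with values in [0, 1].
    """
    video = np.asarray(video, dtype=np.float64)
    return np.abs(np.diff(video, axis=0)).mean(axis=(1, 2))

def frozen_rate(video, threshold=0.01):
    """Fraction of consecutive frame pairs that are (almost) identical.

    The threshold is an arbitrary illustrative value, not a calibrated one.
    """
    return float((frame_changes(video) < threshold).mean())

rng = np.random.default_rng(0)

# Synthetic stand-ins for real generations: one frozen clip, one changing clip.
still_frame = rng.random((8, 8))
static_video = np.repeat(still_frame[None, :, :], 8, axis=0)
moving_video = rng.random((8, 8, 8))

print("frozen rate, static clip:", frozen_rate(static_video))
print("frozen rate, moving clip:", frozen_rate(moving_video))
```

A check like this only catches frozen output; detecting loops with longer periods would need comparisons between frames further apart.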

Figures

Figures reproduced from arXiv: 2605.31057 by Gael Glorian, Hongsheng Liu, Ioannis Lamprou, Yujie Yuan, Zhen Zhang.

Figure 1: Basic versus expanded window pattern. The basic adaptive window (a) wastes attention budget …
Figure 2: Rotating periodic global frames with T_per = 4. The set G_s shifts by one position per denoising step and wraps modulo T. Over any T_per consecutive steps, each frame appears as a global anchor exactly once. See …
Figure 3: HunyuanVideo 1.5 at 2× horizon (257 frames), generated by LVSA-FI on a single 80GB GPU; dense attention is infeasible at this setting due to OOM (…
Figure 4: Wall-time scaling across three video DiTs (Wan 2.1 1.3B, Wan 2.1 14B, HunyuanVideo 1.5) …
Figure 5: Wan 2.1 1.3B at a 6× horizon (481 frames), prompt cat window, same seed. Frames 20, 200, 380, 460 shown for each backend. Dense converges to near-static output (the cat barely moves across ∼440 frames) while LVSA produces genuine pose and lighting variation. This is the failure mode VBench-Long’s subject consistency rewards and VQeval correctly penalizes. imaging quality dimension tells the opposite sto…
Original abstract

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State of the art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias which causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which is otherwise out-of-memory on a single GPU. Moreover, LVSA provides speedups up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality in a fair way, we introduce VQeval, a tool properly scoring loopy video failures, which instead are rewarded in state of the art evaluators like VBench-Long. LVSA is quality-neutral for generation at training horizon length and quality-positive at extended lengths.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces LVSA, a training-free block-sparse attention pattern for video diffusion transformers that combines structured local windows with rotating global anchors to mitigate fixed-grid bias and long-range temporal artifacts. It reports concrete inference speedups (up to 3.17× on Wan 2.1 1.3B at 6× horizon, 2.98× on Wan 2.1 14B, 3.33× on HunyuanVideo) versus dense attention, additional gains versus RIFLEx/UltraViCo, out-of-memory relief for longer sequences, cross-hardware results on NPUs, and introduces VQeval as a metric that penalizes repetitive “frozen” outputs better than VBench-Long; quality is claimed neutral at training length and positive at extended lengths.

Significance. If the empirical claims hold under scrutiny, LVSA would constitute a practical, model-agnostic engineering contribution that materially extends the usable context length of existing video diffusion models at inference time without retraining, directly addressing the quadratic cost and temporal degradation problems that currently limit long-video generation.

major comments (3)
  1. [Abstract / Method] Abstract and method description: the central claim that the specific combination of structured windows plus rotating global anchors removes fixed-grid bias and long-range artifacts (while remaining quality-neutral/positive) is load-bearing for all reported speedups and quality statements, yet the manuscript supplies no quantitative ablation comparing rotating versus fixed anchors, no sensitivity sweeps on window size or rotation period, and no comparison against alternative sparse patterns; without these controls the reported 3.17×/2.98×/3.33× speedups and VQeval gains could be artifacts of hyper-parameter choices rather than a general property of the method.
  2. [Evaluation / VQeval] Evaluation section: the abstract states concrete speedups and quality improvements but reports neither error bars nor the number of random seeds used; likewise, no validation data (correlation coefficients, human-study results) are provided showing that VQeval scores align with human judgments on loopy-video failures, undermining the assertion that LVSA is “quality-positive at extended lengths.”
  3. [Results / Comparisons] Results tables (implied by abstract numbers): the speedups versus RIFLEx and UltraViCo (2.41× and 3.27×) are presented without accompanying implementation details, hyper-parameter matching protocol, or confirmation that the baseline kernels were run under identical memory and batch settings, making it impossible to assess whether the gains are attributable to LVSA or to unstated engineering differences.
minor comments (2)
  1. [Implementation details] The abstract mentions FlashInfer and NPU results but does not specify the exact kernel configuration or memory layout changes that enable the reported speedups; a short paragraph or table entry would improve reproducibility.
  2. [Method] Notation for the rotation schedule and anchor placement is introduced without an accompanying diagram or pseudocode; a single figure would clarify the pattern for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation and evidence.

Point-by-point responses
  1. Referee: [Abstract / Method] Abstract and method description: the central claim that the specific combination of structured windows plus rotating global anchors removes fixed-grid bias and long-range artifacts (while remaining quality-neutral/positive) is load-bearing for all reported speedups and quality statements, yet the manuscript supplies no quantitative ablation comparing rotating versus fixed anchors, no sensitivity sweeps on window size or rotation period, and no comparison against alternative sparse patterns; without these controls the reported 3.17×/2.98×/3.33× speedups and VQeval gains could be artifacts of hyper-parameter choices rather than a general property of the method.

    Authors: We agree that direct quantitative ablations would provide stronger support for the design choices. The motivation for rotating anchors (to break fixed-grid bias) is explained in the method, but we will add an explicit ablation of rotating versus fixed anchors, sensitivity sweeps over window size and rotation period, and comparisons against additional sparse patterns (e.g., random and strided) in a new experimental subsection. revision: yes

  2. Referee: [Evaluation / VQeval] Evaluation section: the abstract states concrete speedups and quality improvements but reports neither error bars nor the number of random seeds used; likewise, no validation data (correlation coefficients, human-study results) are provided showing that VQeval scores align with human judgments on loopy-video failures, undermining the assertion that LVSA is “quality-positive at extended lengths.”

    Authors: We will add the number of random seeds used and error bars to all reported metrics. While VQeval is motivated by directly penalizing low temporal variance (unlike VBench-Long), we acknowledge the absence of explicit validation data. We will include a small human preference study with correlation coefficients in an appendix of the revision. revision: yes

  3. Referee: [Results / Comparisons] Results tables (implied by abstract numbers): the speedups versus RIFLEx and UltraViCo (2.41× and 3.27×) are presented without accompanying implementation details, hyper-parameter matching protocol, or confirmation that the baseline kernels were run under identical memory and batch settings, making it impossible to assess whether the gains are attributable to LVSA or to unstated engineering differences.

    Authors: All baselines were evaluated using their official implementations under identical hardware, batch size, and memory settings as the dense and LVSA runs. We will add an appendix with full hyper-parameter tables, kernel versions, and explicit confirmation of matched experimental conditions. revision: yes
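
The error bars and human-correlation figures promised in the responses above amount to a small amount of bookkeeping. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the Spearman implementation assumes no tied values, and every number in it is a made-up placeholder rather than a result from the paper.

```python
import math
from statistics import mean, stdev

def mean_and_stderr(values):
    """Sample mean and standard error of the mean across seeds."""
    return mean(values), stdev(values) / math.sqrt(len(values))

def ranks(values):
    """Rank positions starting at 1 (assumes all values are distinct)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = float(rank)
    return result

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# Placeholder values for illustration only; NOT measurements from the paper.
placeholder_speedups = [3.0, 3.2, 3.1]          # one value per random seed
placeholder_metric = [0.10, 0.40, 0.35, 0.80]   # automatic score per video
placeholder_human = [1.0, 3.0, 2.0, 4.0]        # human rating per video

avg, se = mean_and_stderr(placeholder_speedups)
rho = spearman_rho(placeholder_metric, placeholder_human)
print(f"speedup: {avg:.2f} +/- {se:.2f} over {len(placeholder_speedups)} seeds")
print(f"Spearman rho between metric and human ratings: {rho:.2f}")
```

With real data containing ties, a library routine that handles tied ranks would be the safer choice.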

Circularity Check

0 steps flagged

No circularity: empirical engineering pattern with no derivation chain or self-referential steps

Full rationale

The paper presents LVSA as a training-free, model-agnostic block-sparse attention design that combines a structured window pattern with rotating global anchors. No equations, fitted parameters, predictions derived from subsets of data, or load-bearing self-citations appear in the abstract or described method. The central claims rest on empirical speedups and quality evaluations across three models rather than any mathematical derivation that reduces to its own inputs by construction. Design choices are stated directly as an engineering solution without invoking uniqueness theorems, ansatzes from prior self-work, or renaming of known results. This is the common case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no equations, hyperparameters, or modeling assumptions are stated. The method implicitly introduces at least window size, anchor count, and rotation schedule as design choices, but none are quantified or justified here.


Reference graph

Works this paper leans on

14 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    VBench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. VBench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  2. [2]

    DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    Sam Ade Jacobs, Masahiro Tanaka, Chengming Zhang, Minjia Zhang, Shuaiwen Leon Song, Samyam Rajbhandari, and Yuxiong He. Deepspeed ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023

  3. [3]

    HunyuanVideo: A systematic framework for large video generative models

    Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

  4. [4]

    Radial attention: O(n log n) sparse attention with energy decay for long video generation

    Xingyang Li, Muyang Li, Tianle Cai, Haocheng Xi, Shuo Yang, Yujun Lin, Lvmin Zhang, Songlin Yang, Jinbo Hu, Kelly Peng, Maneesh Agrawala, Ion Stoica, Kurt Keutzer, and Song Han. Radial attention: O(n log n) sparse attention with energy decay for long video generation. arXiv preprint arXiv:2506.19852, 2025

  5. [5]

    Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, Jiayu Wang, Jingfeng Zhang, Jingren Zhou, Jinkai Wang, Jixuan Chen, Kai Zhu, Kang Zhao, Keyu Yan, Lianghua Huang, Mengyang Feng, Ningyi Zhang, Pandeng Li, Pingyu Wu, Ruihang Chu, Ruili Feng, Shiwei Zhang, Siyang Sun, Tao Fang, T...

  6. [6]

    Video is worth a thousand images: Exploring the latest trends in long video generation

    Faraz Waseem and Muhammad Shahzad. Video is worth a thousand images: Exploring the latest trends in long video generation. ACM Comput. Surv., 58(6), December 2025

  7. [7]

    Sparse VideoGen: Accelerating video diffusion transformers with spatial-temporal sparsity

    Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, et al. Sparse videogen: Accelerating video diffusion transformers with spatial-temporal sparsity. arXiv preprint arXiv:2502.01776, 2025

  8. [8]

    Training-free and adaptive sparse attention for efficient long video generation

    Yifei Xia, Suhan Ling, Fangcheng Fu, Yujie Wang, Huixia Li, Xuefeng Xiao, and Bin Cui. Training-free and adaptive sparse attention for efficient long video generation. In ICCV, 2025

  9. [9]

    Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

    Shuo Yang, Haocheng Xi, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, et al. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation. arXiv preprint arXiv:2505.18875, 2025

  10. [10]

    vllm-omni: Fully disaggregated serving for any-to-any multimodal models

    Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models. arXiv preprint arXiv:2602.02204, 2026

  11. [11]

    Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration, 2025

    Jintao Zhang, Jia Wei, Haofeng Huang, Pengle Zhang, Jun Zhu, and Jianfei Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration, 2025

  12. [12]

    Fast video generation with sliding tile attention

    Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention. arXiv preprint arXiv:2502.04507, 2025

  13. [13]

    Riflex: A free lunch for length extrapolation in video diffusion transformers

    Min Zhao, Guande He, Yixiao Chen, Hongzhou Zhu, Chongxuan Li, and Jun Zhu. Riflex: A free lunch for length extrapolation in video diffusion transformers. arXiv preprint arXiv:2502.15894, 2025

  14. [14]

    Ultravico: Breaking extrapolation limits in video diffusion transformers

    Min Zhao, Hongzhou Zhu, Yingze Wang, Bokai Yan, Jintao Zhang, Guande He, Ling Yang, Chongxuan Li, and Jun Zhu. Ultravico: Breaking extrapolation limits in video diffusion transformers. arXiv preprint arXiv:2511.20123, 2025