pith. sign in

arxiv: 2605.20659 · v1 · pith:ZVWK3MVCnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

Pith reviewed 2026-05-21 05:51 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords diffusion transformerssparse attentionlow-rank attentionrotary position embeddingsvideo generationefficient inferencelong sequence modeling3D position encoding
0
0 comments X

The pith

RoPeSLR decouples DiT attention into sparse semantic spikes and low-rank background via 3D RoPE to cut quadratic costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformers for video generation face a quadratic attention bottleneck that grows prohibitive on long sequences. The paper establishes that attention patterns separate into a limited set of high-value semantic connections and a low-rank background that can be handled with far less computation. RoPeSLR injects 3D positional information into a head-wise low-rank form to keep relative distance awareness intact, unlike plain linear attention. This structure supports very high sparsity while preserving output quality, directly addressing the scaling barrier for ultra-long video synthesis.

Core claim

Under empirically validated assumptions the DiT attention manifold decouples into a high-frequency semantic spike set bounded by O(L to the 3/2) sparsity and an extreme low-rank O(d_h log L) background continuum. RoPeSLR therefore replaces standard linear attention with a head-wise low-rank parameterization equipped with learnable 3D absolute positional embedding injection that synthesizes long-range relative distance decay.

What carries the argument

3D RoPE-driven Sparse-LowRank attention, a head-wise low-rank parameterization with learnable 3D absolute positional embedding injection that preserves relative-position structure at high sparsity.

If this is right

  • At 90 percent sparsity the method uses up to 10 times fewer FLOPs on Wan2.1-1.3B.
  • It produces a 2.26 times end-to-end inference speedup on 100K-plus token sequences in HunyuanVideo-13B.
  • Generation fidelity remains near lossless with less than 1.3 percent average VBench degradation.
  • Sub-quadratic sparsity and sub-linear rank growth allow scaling to longer sequences without the usual cost explosion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same structural split could be tested on other long-context generative tasks such as audio or 3D synthesis.
  • Further compression might be obtained by making the low-rank component adaptive to content rather than fixed per head.
  • Pairing the attention change with existing quantization or caching methods would likely multiply the observed speedups.

Load-bearing premise

The attention weights inside diffusion transformers consistently separate into a small sparse high-frequency component plus a low-rank background that holds across models and sequence lengths.

What would settle it

Compute the numerical rank and the number of large-magnitude entries in attention matrices from a trained DiT on long video sequences and check whether they stay near O(L to the 3/2) spikes and O(d_h log L) rank.

Figures

Figures reproduced from arXiv: 2605.20659 by Kun Yuan, Renjia Deng, Yixiang Cai, Yutong He, Yuxi Liu, Zekun Zhang.

Figure 1
Figure 1. Figure 1: Video generation examples on Wan2.1-T2V-1.3B-480P. Each row shows the full attention baseline [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Analysis of HunyuanVideo post-attention matrices across 960 diverse layer-head combinations, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: HunyuanVideo post-softmax matrices are decomposed using a 90% per-row energy threshold. Across [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Statistical analysis of cosine similarity between outputs from three distinct attention variants [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Stage I training loss curves over 500 steps for three configurations: VMoBA with linear attention [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Evolution of QK Spectral Concentration Driven by 3D RoPE. We evaluate three pre￾trained DiTs (Wan2.1-1.3B, Wan2.1-14B, and HunyuanVideo-13B) using 3 representative text prompts at 480 × 832 resolution, 49 frames. The X-axis is the RoPE dimension index m ∈ [1, dk/2]. We display the maximum interaction magnitude (approximated via the 99th percentile to filter sparse outliers and evaluate the background) and … view at source ↗
Figure 7
Figure 7. Figure 7: Training MSE loss under different low-rank dimensions. [PITH_FULL_IMAGE:figures/full_fig_p034_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Stage-I training loss on Wan2.1-T2V-1.3B under different stage-I lengths (400, 600, 800, and 1000 [PITH_FULL_IMAGE:figures/full_fig_p035_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Overall training loss on Wan2.1-T2V-1.3B under different stage-I lengths (400, 600, 800, and 1000 [PITH_FULL_IMAGE:figures/full_fig_p036_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: FLOPs and latency comparison of attention approximation methods for the non-top-k branch on [PITH_FULL_IMAGE:figures/full_fig_p036_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Analysis of causal self-attention matrices from Qwen2-7B-Instruct under a 90% energy threshold [PITH_FULL_IMAGE:figures/full_fig_p037_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Gate activation intensity maps of our RoPeSLR on Wan2.1-T2V-1.3B (480P), shown for five [PITH_FULL_IMAGE:figures/full_fig_p038_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Mechanistic Interpretability via Gram Spectral Analysis. Methodology: Evaluated on Wan2.1-T2V-1.3B (Layer 15, Denoising Step 10) at 480 × 832 resolution. SVD is applied to the pre-activation output Olr over a latent grid of 18 × 30 × 52 (L = 28, 080). Analysis: (Top) The eigenspectrum (energy distribution) shows that 3D PE induces an exponential decay aligning with Full Attention, preventing the spectral … view at source ↗
read the original abstract

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose \textbf{RoPeSLR}, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90\% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3\% average VBench degradation).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework for Diffusion Transformers to mitigate O(L^2) attention complexity in long-sequence video generation. It claims that under empirically validated assumptions the DiT attention manifold decouples into a high-frequency semantic spike set bounded by O(L^{3/2}) sparsity and an extreme low-rank O(d_h log L) background continuum. The method replaces standard linear attention with a head-wise low-rank parameterization plus learnable 3D Absolute Positional Embedding injection to preserve relative distance decay. Experiments report up to 10× fewer FLOPs at 90% sparsity on Wan2.1-1.3B and 2.26× end-to-end speedup on 100K+ token HunyuanVideo-13B sequences with <1.3% average VBench degradation.

Significance. If the claimed manifold decoupling and sub-quadratic/sub-linear scaling hold under the tested regimes, the approach could enable practical scaling of DiT-based video generators to ultra-long sequences while retaining near-lossless fidelity. The explicit integration of 3D RoPE structure into a sparse-low-rank factorization distinguishes it from generic sparse or linear attention hybrids and offers a concrete path toward sub-quadratic inference for high-resolution video synthesis.

major comments (2)
  1. [Abstract] Abstract: the central complexity and fidelity claims rest on the structural prior that the attention manifold admits a decoupling into O(L^{3/2})-sparse spikes and O(d_h log L) low-rank continuum “under empirically validated assumptions.” No section, table, or figure in the manuscript provides direct quantitative verification (e.g., measured effective rank or retained non-zero count versus sequence length) that the observed values remain within these bounds for the Wan2.1-1.3B and HunyuanVideo-13B models at the reported lengths.
  2. [§4 (Experiments)] The 90% sparsity operating point and the resulting 10× FLOPs reduction are presented as direct consequences of the O(L^{3/2}) bound, yet the manuscript does not report an ablation that isolates the contribution of the learned 3D absolute PE injection versus the low-rank parameterization alone. Without this isolation it is unclear whether the near-lossless VBench result is attributable to the claimed structural prior or to post-hoc tuning of the sparsity pattern.
minor comments (2)
  1. [§3.2] Notation for the head-wise low-rank matrices and the 3D absolute PE injection should be introduced with explicit dimensions (e.g., rank r_h per head) to make the O(d_h log L) claim immediately verifiable from the equations.
  2. [Table 2] The manuscript should include a table reporting measured effective rank and actual sparsity level versus theoretical bounds for each evaluated sequence length.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report, as well as the recommendation for major revision. The comments help clarify where additional evidence is needed to support the central claims. We address each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central complexity and fidelity claims rest on the structural prior that the attention manifold admits a decoupling into O(L^{3/2})-sparse spikes and O(d_h log L) low-rank continuum “under empirically validated assumptions.” No section, table, or figure in the manuscript provides direct quantitative verification (e.g., measured effective rank or retained non-zero count versus sequence length) that the observed values remain within these bounds for the Wan2.1-1.3B and HunyuanVideo-13B models at the reported lengths.

    Authors: We acknowledge that while the manuscript describes the decoupling under empirically validated assumptions, it does not include explicit quantitative verification such as plots or tables of measured effective rank or retained non-zero attention weights versus sequence length for the specific models and lengths reported. To directly address this, we will add a new figure and accompanying analysis in the revised manuscript (or supplementary material) that reports these measurements for Wan2.1-1.3B and HunyuanVideo-13B across the tested sequence lengths. This will provide the requested verification that the observed values remain consistent with the stated O(L^{3/2}) sparsity and O(d_h log L) rank bounds. revision: yes

  2. Referee: [§4 (Experiments)] The 90% sparsity operating point and the resulting 10× FLOPs reduction are presented as direct consequences of the O(L^{3/2}) bound, yet the manuscript does not report an ablation that isolates the contribution of the learned 3D absolute PE injection versus the low-rank parameterization alone. Without this isolation it is unclear whether the near-lossless VBench result is attributable to the claimed structural prior or to post-hoc tuning of the sparsity pattern.

    Authors: We agree that an ablation isolating the learned 3D absolute PE injection from the low-rank parameterization would strengthen the attribution of results to the structural prior. The current experiments evaluate the full RoPeSLR framework at the 90% sparsity point but do not separately ablate these two components. We will add this ablation study to the revised Section 4, reporting performance when the 3D PE injection is removed while retaining the low-rank structure (and vice versa) on the Wan2.1-1.3B model. The results will clarify the individual contributions and support that the fidelity preservation stems from the integration guided by the claimed manifold properties rather than sparsity tuning alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper states a structural prior on attention manifold decoupling under empirically validated assumptions and then designs RoPeSLR around it, but no equations, self-citations, or fitted parameters are shown reducing the final performance claims (FLOPs reduction, speedup, VBench scores) back to the inputs by construction. The reported metrics are measured outcomes on external models (Wan2.1-1.3B, HunyuanVideo-13B) after applying the method, not tautological restatements of observed sparsity/rank statistics. The derivation remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on an unproven structural decomposition of the attention manifold and the assumption that a learnable 3D absolute PE can seamlessly restore relative distance decay inside the low-rank component.

free parameters (1)
  • learnable 3D Absolute Positional Embedding parameters
    Introduced per-head to synthesize long-range relative distance decay; their values are learned rather than derived from first principles.
axioms (1)
  • ad hoc to paper DiT attention manifold admits decoupling into O(L^{3/2}) sparse semantic spikes and O(d_h log L) low-rank background
    Invoked to justify the sparse-lowrank design; described as empirically validated but without derivation in the provided abstract.

pith-pipeline@v0.9.0 · 5863 in / 1476 out tokens · 32928 ms · 2026-05-21T05:51:23.148845+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes
    ?
    echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by O(L^{3/2}) sparsity) and an extreme low-rank (O(d_h log L)) background continuum.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 3 internal anchors

  1. [1]

    P. Chen, X. Zeng, M. Zhao, P. Ye, M. Shen, W. Cheng, G. Yu, and T. Chen. Sparse-vdit: Unleashing the power of sparse attention to accelerate video diffusion transformers, 2025

  2. [2]

    R. Chen, K. G. Mills, L. Jiang, C. Gao, and D. Niu. Re-ttention: Ultra sparse visual generation via attention statistical reshape, 2025

  3. [3]

    Choromanski, V

    K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with performers. InICLR 2021, 2021

  4. [4]

    T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023

  5. [5]

    Q. Fan, H. Huang, and R. He. Breaking the low-rank dilemma of linear attention, 2024

  6. [6]

    T. Fang, H. Zhang, R. Xie, Z. Han, X. Tao, T. Zhao, P. Wan, W. Ding, W. Ouyang, X. Ning, and Y. Wang. Salad: Achieve high-sparsity attention via efficient linear attention tuning for video diffusion transformer, 2026

  7. [7]

    A. Han, J. Li, W. Huang, M. Hong, A. Takeda, P. Jawanpuria, and B. Mishra. Sltrain: a sparse plus low-rank approach for parameter and memory efficient pretraining, 2024

  8. [8]

    Huang, Y

    Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023

  9. [9]

    B. Kong, J. Liang, Y. Liu, R. Deng, and K. Yuan. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure, 2025

  10. [10]

    W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. Tao, Q...

  11. [11]

    X. Li, M. Li, T. Cai, H. Xi, S. Yang, Y. Lin, L. Zhang, S. Yang, J. Hu, K. Peng, M. Agrawala, I. Stoica, K. Keutzer, and S. Han. Radial attention:o(nlogn)sparse attention with energy decay for long video generation, 2025

  12. [12]

    Y. Liu, Y. Hu, Z. Zhang, K. Jiang, and K. Yuan. Mixture of distributions matters: Dynamic sparse attention for efficient video diffusion transformers, 2026

  13. [13]

    H. Lu, G. Yang, N. Fei, Y. Huo, Z. Lu, P. Luo, and M. Ding. Vdt: General-purpose video diffusion transformers via mask modeling.arXiv preprint arXiv:2305.13311, 2023

  14. [14]

    X. Ma, Y. Wang, X. Chen, G. Jia, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion transformer for video generation, 2024

  15. [15]

    K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y. Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024

  16. [16]

    Peebles and S

    W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 11

  17. [17]

    Rahaman, A

    N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks, 2019

  18. [18]

    Shmilovich, T

    D. Shmilovich, T. Wu, A. Dahan, and Y. Domb. Liteattention: A temporal sparse attention for diffusion transformers, 2025

  19. [19]

    Sitzmann, J

    V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions, 2020

  20. [20]

    J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864, 2021

  21. [21]

    X. Tan, Y. Chen, Y. Jiang, X. Chen, K. Yan, N. Duan, Y. Zhu, D. Jiang, and H. Xu. Dsv: Exploiting dynamic sparsity to accelerate large-scale video dit training. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), volume 1, pages 101–116, 2026

  22. [22]

    Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T

    M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, June 2020

  23. [23]

    K. Team. Kling-omni technical report, 2025

  24. [24]

    T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

  25. [25]

    S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020

  26. [26]

    J. Wu, L. Hou, H. Yang, X. Tao, Y. Tian, P. Wan, D. Zhang, and Y. Tong. Vmoba: Mixture-of-block attention for video diffusion models, 2025

  27. [27]

    H. Xi, S. Yang, Y. Zhao, C. Xu, M. Li, X. Li, Y. Lin, H. Cai, J. Zhang, D. Li, J. Chen, I. Stoica, K. Keutzer, and S. Han. Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity. In International Conference on Machine Learning (ICML). PMLR, 2025

  28. [28]

    Y. Xia, S. Ling, F. Fu, Y. Wang, H. Li, X. Xiao, and B. Cui. Training-free and adaptive sparse attention for efficient long video generation, 2025

  29. [29]

    E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y. Lin, Z. Zhang, M. Li, L. Zhu, Y. Lu, and S. Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformers, 2024

  30. [30]

    S. Yang, H. Xi, Y. Zhao, M. Li, J. Zhang, H. Cai, Y. Lin, X. Li, C. Xu, K. Peng, J. Chen, S. Han, K. Keutzer, and I. Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation, 2025

  31. [31]

    Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations (ICLR), 2025

  32. [32]

    Zhang, H

    J. Zhang, H. Wang, K. Jiang, S. Yang, K. Zheng, H. Xi, Z. Wang, H. Zhu, M. Zhao, I. Stoica, J. E. Gonzalez, J. Zhu, and J. Chen. Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention, 2025

  33. [33]

    Zhang, H

    J. Zhang, H. Wang, K. Jiang, K. Zheng, Y. Jiang, I. Stoica, J. Chen, J. Zhu, and J. E. Gonzalez. Sla2: Sparse-linear attention with learnable routing and qat, 2026

  34. [34]

    Zhang, J

    J. Zhang, J. Wei, H. Huang, P. Zhang, J. Zhu, and J. Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. InInternational Conference on Learning Representations (ICLR), 2025

  35. [35]

    Zhang, C

    J. Zhang, C. Xiang, H. Huang, J. Wei, H. Xi, J. Zhu, and J. Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference, 2025. 12

  36. [36]

    Zhang, Y

    P. Zhang, Y. Chen, H. Huang, W. Lin, Z. Liu, I. Stoica, E. Xing, and H. Zhang. Vsa: Faster video diffusion with trainable sparse attention. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

  37. [37]

    Zheng, X

    Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y. Zhou, T. Li, and Y. You. Open-sora: Democratizing efficient video production for all, 2024. 13 Appendix Contents 1 Introduction 2 2 Related work 3 3 Preliminary 4 4 Method 4 4.1 Theoretical Foundation: The Sparse-Low-Rank Theorem . . . . . . . . . . . . . . . . . . . . . 4 4.2 Positional Context Recov...

  38. [38]

    We prove the sparse branch has a deterministic sparsity guarantee: at most⌊1/τ⌋non-zero entries per row

    We first define an energy thresholdτ to split the full attention matrixA into two disjoint components: a high-energy spike setΩτ (captured by the sparse branch) and a low-energy smooth background setΩc τ (approximated by the low-rank compensator). We prove the sparse branch has a deterministic sparsity guarantee: at most⌊1/τ⌋non-zero entries per row

  39. [39]

    Leveraging the mathematical structure of 3D RoPE, we prove the pre-softmax QK logit matrix can be exactly expanded as a Fourier series over spatiotemporal frequencies, where each frequency term corresponds to a matrix of rank at most 2

  40. [40]

    Under a empirically validated spectral concentration assumption, high-frequency interaction coefficients in the background set decay exponentially, allowing us to truncate the Fourier series to get a low-rank approximation of the pre-softmax background matrix with controllable error

  41. [41]

    We use positive random features (FAVOR+)[3] to linearize the softmax exponential kernel while preserving the low-rank structure, and perform error propagation to bound the final approximation error of the background branch

  42. [42]

    A.2 Preliminaries and Notation We first formalize all symbols and foundational definitions with zero ambiguity, consistent with standard video diffusion transformer literature

    Our final main theorem establishes the existence of this decomposition: the sparse branch has guaranteed high sparsity, and the low rank branch has a rank bound that is linear in head dimensiondh and only logarithmic in sequence lengthL, confirming extreme low-rankness for long video sequences. A.2 Preliminaries and Notation We first formalize all symbols...

  43. [43]

    to approximateAbg only on the background setΩc τ. To correct the inaccurate outputs produced byˆAlowrank on the spike setΩτ, we define the residual sparse branch ˜Asparse as an exact error compensator: ˜Asparse(p, q) = ( A(p, q)− ˆAlowrank(p, q)if(p, q)∈Ω τ 0otherwise The final reconstructed attention matrix is then guaranteed as: ˆAfinal = ˜Asparse + ˆAl...

  44. [44]

    Theresidual sparse branch ˜Asparse is sparse, with exactly the same support as the original spike set Ωτ, and thus has at most⌊1/τ⌋non-zero entries per row

  45. [45]

    Thelow-rank branch ˆAlowrank is a globally defined, unmasked matrix that acts as a low-rank approxi- mator for the background targetAbg. It operates with a provable bottleneck rank bound: R=O dh · τ E 2 ·log τ E ·logL With probability at least1−δfail, the low-rank approximator tightly bounds the background target onΩc τ: max (p,q)∈Ωcτ | ˆAlowrank(p, q)−A ...

  46. [46]

    Sub-quadratic Sparsity:The residual sparse branch˜Asparse, which acts as an exact error com- pensator on the spike set (˜Asparse(p, q) = A(p, q) − ˆAlowrank(p, q)for( p, q) ∈ Ωτ), has total non-zero entries bounded by NNZ(˜Asparse) =O(L 3 2 )

  47. [47]

    Sub-linear Rank:The globally unmasked low-rank branchˆAlowrank, which acts as a dense approxi- mator for the background targetAbg, requires a bottleneck rank ofR=O(dh ·logL)

  48. [48]

    Asymptotic Error Bound:With high probability1 −δ fail, the global reconstruction error is uniformly bounded bymaxp,q | ˆAfinal(p, q)−A(p, q)|=O L− 1 2 . Proof.We prove each statement sequentially: Proof of Statement 1 (Sparse Branch Complexity).Givenτ = c L 1 2 , by Theorem 2, the total number of non-zero entries is: NNZ( ˜Asparse)≤L· 1 τ =L· $ L 1 2 c % ...

  49. [49]

    Compared to the rapid scaling of the sequence itself, this dictates that the low-rank branch becomes increasingly compact and heavily compressed as video length expands

    Sub-linear Rank Bottleneck: The rank required to closely capture the global background continuum grows only logarithmically (R = O(logL )). Compared to the rapid scaling of the sequence itself, this dictates that the low-rank branch becomes increasingly compact and heavily compressed as video length expands. Together, these bounds prove that RoPeSLR activ...

  50. [50]

    It enables anexact Fourier series expansionof the pre-softmax QK inner product over relative positions (Lemma 2), which decomposes the full attention matrix into a sum of rank-≤ 2frequency- specific matrices (Lemma 3)

  51. [51]

    Leveraging this property,Lemma 4proves that the background matrix can be approximated with a rank bound ofO(dh log(1/δ)), which is independent of sequence lengthL

    Its standard exponential frequency schedule inducesexponential decay of high-frequency interaction coefficients(empirically validated in Section B.1). Leveraging this property,Lemma 4proves that the background matrix can be approximated with a rank bound ofO(dh log(1/δ)), which is independent of sequence lengthL. Remark 5(Understanding the Token-Wise Cont...

  52. [52]

    However, because Theorem 3 guarantees that the background continuumAbg is exceptionally low-rank, it admits a low-dimensional matrix factorization

    Algebraic Factorization of Global Aggregation:While trigonometric identities explicitly decouple the pre-softmax 3D RoPE logits into absolute spatial bases, the non-linear softmax operation typically breaks this pairwise separation. However, because Theorem 3 guarantees that the background continuumAbg is exceptionally low-rank, it admits a low-dimensiona...

  53. [53]

    perceives

    Amortized Decoding via Dense State Encapsulation:WhileCk represents a global sequence-level sum, it does not require explicit re-computation if the tokenXp already "perceives" the macroscopic state. In deep DiTs,Xp is a dense semantic capsule: it has accumulated global context via preceding self-attention layers and is strictly modulated by global timeste...

  54. [54]

    Empirical Verification:The structural validity of this proxy is confirmed by our Stage-I alignment objective Lalign = ∥Ototal −AV∥ 2 F, which converges rapidly to a minimal floor (Figure 5). Mechanistically, ourGram spectral analysis (Appendix B.8)proves that the MLP’s output eigenvectors perfectly match the geometric standing waves of the ground-truth ba...

  55. [55]

    For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh

    Cubic Acceleration of Spectral Decay (Rank Inequality): As derived in Lemma 4, the truncation index Mk required to bound the error depends on the exponential decay constantρ. For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh. For DiTs using 3D RoPE, the dimension is partitioned (dt, dx, dy). Assuming a uniform partitiondk = ...

  56. [56]

    In LLMs, tasks like in-context learning scatter semantic spikes across the causal history, causing this universal bound to be vacuous in practice

    Empirical Tightness of the Deterministic Sparsity Bound: While the theoreticalO(L3/2)sparsity upper bound (derived viaNNZ≤L· ⌊ 1/τ⌋) is a deterministic property holding for any row-stochastic attention matrix, its practical utility heavily depends on the data domain. In LLMs, tasks like in-context learning scatter semantic spikes across the causal history...

  57. [57]

    an airplane taking off

    Empirical Evidence: We provide experimental evidence to validate this remark, see figure 2 and 11 for more details. 31 B Additional Experiments B.1 Empirical Validation of Spectral Concentration (Assumption 2) B.1.1 Spectral Evolution and Justification of Exponential Decay To verify the exponential decayρm0 k (ρk < 1) proposed in Assumption 2, we measure ...

  58. [58]

    Green AI

    Context Retrieval. Olin = ϕ(Q)KVctx. Here, ϕ(Q) ∈R L×dh and KVctx ∈R dh×dh. The FLOPs are 2BHLd 2 h. Therefore, the baseline linear attention branch fundamentally imposes anO(d2 h)bottleneck per token: Clinear_branch = 4BHLd 2 h Assuming identical sparse branches and gating mechanisms, the efficiency disparity between RoPeSLR and standard linear hybrids l...