RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

Kun Yuan; Renjia Deng; Yixiang Cai; Yutong He; Yuxi Liu; Zekun Zhang

arxiv: 2605.20659 · v1 · pith:ZVWK3MVCnew · submitted 2026-05-20 · 💻 cs.CV · cs.LG

RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers

Yuxi Liu , Zekun Zhang , Yixiang Cai , Renjia Deng , Yutong He , Kun Yuan This is my paper

Pith reviewed 2026-05-21 05:51 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords diffusion transformerssparse attentionlow-rank attentionrotary position embeddingsvideo generationefficient inferencelong sequence modeling3D position encoding

0 comments

The pith

RoPeSLR decouples DiT attention into sparse semantic spikes and low-rank background via 3D RoPE to cut quadratic costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Diffusion transformers for video generation face a quadratic attention bottleneck that grows prohibitive on long sequences. The paper establishes that attention patterns separate into a limited set of high-value semantic connections and a low-rank background that can be handled with far less computation. RoPeSLR injects 3D positional information into a head-wise low-rank form to keep relative distance awareness intact, unlike plain linear attention. This structure supports very high sparsity while preserving output quality, directly addressing the scaling barrier for ultra-long video synthesis.

Core claim

Under empirically validated assumptions the DiT attention manifold decouples into a high-frequency semantic spike set bounded by O(L to the 3/2) sparsity and an extreme low-rank O(d_h log L) background continuum. RoPeSLR therefore replaces standard linear attention with a head-wise low-rank parameterization equipped with learnable 3D absolute positional embedding injection that synthesizes long-range relative distance decay.

What carries the argument

3D RoPE-driven Sparse-LowRank attention, a head-wise low-rank parameterization with learnable 3D absolute positional embedding injection that preserves relative-position structure at high sparsity.

If this is right

At 90 percent sparsity the method uses up to 10 times fewer FLOPs on Wan2.1-1.3B.
It produces a 2.26 times end-to-end inference speedup on 100K-plus token sequences in HunyuanVideo-13B.
Generation fidelity remains near lossless with less than 1.3 percent average VBench degradation.
Sub-quadratic sparsity and sub-linear rank growth allow scaling to longer sequences without the usual cost explosion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same structural split could be tested on other long-context generative tasks such as audio or 3D synthesis.
Further compression might be obtained by making the low-rank component adaptive to content rather than fixed per head.
Pairing the attention change with existing quantization or caching methods would likely multiply the observed speedups.

Load-bearing premise

The attention weights inside diffusion transformers consistently separate into a small sparse high-frequency component plus a low-rank background that holds across models and sequence lengths.

What would settle it

Compute the numerical rank and the number of large-magnitude entries in attention matrices from a trained DiT on long video sequences and check whether they stay near O(L to the 3/2) spikes and O(d_h log L) rank.

Figures

Figures reproduced from arXiv: 2605.20659 by Kun Yuan, Renjia Deng, Yixiang Cai, Yutong He, Yuxi Liu, Zekun Zhang.

**Figure 2.** Figure 2: Analysis of HunyuanVideo post-attention matrices across 960 diverse layer-head combinations, [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: HunyuanVideo post-softmax matrices are decomposed using a 90% per-row energy threshold. Across [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Statistical analysis of cosine similarity between outputs from three distinct attention variants [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Stage I training loss curves over 500 steps for three configurations: VMoBA with linear attention [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: Evolution of QK Spectral Concentration Driven by 3D RoPE. We evaluate three pretrained DiTs (Wan2.1-1.3B, Wan2.1-14B, and HunyuanVideo-13B) using 3 representative text prompts at 480 × 832 resolution, 49 frames. The X-axis is the RoPE dimension index m ∈ [1, dk/2]. We display the maximum interaction magnitude (approximated via the 99th percentile to filter sparse outliers and evaluate the background) and … view at source ↗

**Figure 7.** Figure 7: Training MSE loss under different low-rank dimensions. [PITH_FULL_IMAGE:figures/full_fig_p034_7.png] view at source ↗

**Figure 8.** Figure 8: Stage-I training loss on Wan2.1-T2V-1.3B under different stage-I lengths (400, 600, 800, and 1000 [PITH_FULL_IMAGE:figures/full_fig_p035_8.png] view at source ↗

**Figure 9.** Figure 9: Overall training loss on Wan2.1-T2V-1.3B under different stage-I lengths (400, 600, 800, and 1000 [PITH_FULL_IMAGE:figures/full_fig_p036_9.png] view at source ↗

**Figure 10.** Figure 10: FLOPs and latency comparison of attention approximation methods for the non-top-k branch on [PITH_FULL_IMAGE:figures/full_fig_p036_10.png] view at source ↗

**Figure 11.** Figure 11: Analysis of causal self-attention matrices from Qwen2-7B-Instruct under a 90% energy threshold [PITH_FULL_IMAGE:figures/full_fig_p037_11.png] view at source ↗

**Figure 12.** Figure 12: Gate activation intensity maps of our RoPeSLR on Wan2.1-T2V-1.3B (480P), shown for five [PITH_FULL_IMAGE:figures/full_fig_p038_12.png] view at source ↗

**Figure 13.** Figure 13: Mechanistic Interpretability via Gram Spectral Analysis. Methodology: Evaluated on Wan2.1-T2V-1.3B (Layer 15, Denoising Step 10) at 480 × 832 resolution. SVD is applied to the pre-activation output Olr over a latent grid of 18 × 30 × 52 (L = 28, 080). Analysis: (Top) The eigenspectrum (energy distribution) shows that 3D PE induces an exponential decay aligning with Full Attention, preventing the spectral … view at source ↗

read the original abstract

Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose \textbf{RoPeSLR}, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90\% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3\% average VBench degradation).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RoPeSLR targets the RoPE preservation issue in sparse-linear attention for 3D video DiTs with a head-wise low-rank plus absolute PE trick, but the load-bearing manifold decoupling claim needs concrete verification.

read the letter

The main point is that this work identifies a specific failure mode when applying linear attention to 3D RoPE in long video DiTs and tries to fix it by swapping in head-wise low-rank parameterization plus learnable 3D absolute positional embeddings to recover relative distance decay. That combination is the actual novelty here, going past generic sparse or low-rank extensions by tying the design directly to the RoPE structure. They report practical gains: 10x fewer FLOPs at 90% sparsity on Wan2.1-1.3B and 2.26x end-to-end speedup on 100K+ token HunyuanVideo sequences with under 1.3% VBench drop. Those numbers, if they hold, matter for anyone scaling video generation past current length limits. The paper does a decent job laying out why standard linear attention breaks the orthogonal relative-position properties and why injecting absolute PE inside the low-rank path might restore the needed decay without full quadratic cost. The structural prior about splitting attention into an O(L^{3/2})-sparse semantic spike set and an O(d_h log L) low-rank background is presented as empirically checked, which at least gives a clear rationale for the sparsity and rank targets. That said, the central worry is whether that decoupling actually matches the observed ranks and sparsity patterns in the real models and sequence lengths they test. The abstract leans on those assumptions to justify both the complexity bounds and the near-lossless fidelity claim, yet without the derivations, rank plots, or ablation tables it is difficult to judge how tight the bounds are or how sensitive the results are to the post-hoc sparsity schedule. If the low-rank part needs higher rank in practice or the sparse part has to keep more entries to avoid artifacts, the sub-quadratic guarantee and the speedup numbers weaken. This is aimed at groups working on efficient DiT inference for video, especially those already dealing with long sequences. A reader who cares about position encoding in linear attention variants will pick up the specific 3D RoPE handling idea. It is worth sending for peer review so the empirical validation of the manifold split and the exact implementation details can be checked properly.

Referee Report

2 major / 2 minor

Summary. The paper proposes RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework for Diffusion Transformers to mitigate O(L^2) attention complexity in long-sequence video generation. It claims that under empirically validated assumptions the DiT attention manifold decouples into a high-frequency semantic spike set bounded by O(L^{3/2}) sparsity and an extreme low-rank O(d_h log L) background continuum. The method replaces standard linear attention with a head-wise low-rank parameterization plus learnable 3D Absolute Positional Embedding injection to preserve relative distance decay. Experiments report up to 10× fewer FLOPs at 90% sparsity on Wan2.1-1.3B and 2.26× end-to-end speedup on 100K+ token HunyuanVideo-13B sequences with <1.3% average VBench degradation.

Significance. If the claimed manifold decoupling and sub-quadratic/sub-linear scaling hold under the tested regimes, the approach could enable practical scaling of DiT-based video generators to ultra-long sequences while retaining near-lossless fidelity. The explicit integration of 3D RoPE structure into a sparse-low-rank factorization distinguishes it from generic sparse or linear attention hybrids and offers a concrete path toward sub-quadratic inference for high-resolution video synthesis.

major comments (2)

[Abstract] Abstract: the central complexity and fidelity claims rest on the structural prior that the attention manifold admits a decoupling into O(L^{3/2})-sparse spikes and O(d_h log L) low-rank continuum “under empirically validated assumptions.” No section, table, or figure in the manuscript provides direct quantitative verification (e.g., measured effective rank or retained non-zero count versus sequence length) that the observed values remain within these bounds for the Wan2.1-1.3B and HunyuanVideo-13B models at the reported lengths.
[§4 (Experiments)] The 90% sparsity operating point and the resulting 10× FLOPs reduction are presented as direct consequences of the O(L^{3/2}) bound, yet the manuscript does not report an ablation that isolates the contribution of the learned 3D absolute PE injection versus the low-rank parameterization alone. Without this isolation it is unclear whether the near-lossless VBench result is attributable to the claimed structural prior or to post-hoc tuning of the sparsity pattern.

minor comments (2)

[§3.2] Notation for the head-wise low-rank matrices and the 3D absolute PE injection should be introduced with explicit dimensions (e.g., rank r_h per head) to make the O(d_h log L) claim immediately verifiable from the equations.
[Table 2] The manuscript should include a table reporting measured effective rank and actual sparsity level versus theoretical bounds for each evaluated sequence length.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report, as well as the recommendation for major revision. The comments help clarify where additional evidence is needed to support the central claims. We address each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central complexity and fidelity claims rest on the structural prior that the attention manifold admits a decoupling into O(L^{3/2})-sparse spikes and O(d_h log L) low-rank continuum “under empirically validated assumptions.” No section, table, or figure in the manuscript provides direct quantitative verification (e.g., measured effective rank or retained non-zero count versus sequence length) that the observed values remain within these bounds for the Wan2.1-1.3B and HunyuanVideo-13B models at the reported lengths.

Authors: We acknowledge that while the manuscript describes the decoupling under empirically validated assumptions, it does not include explicit quantitative verification such as plots or tables of measured effective rank or retained non-zero attention weights versus sequence length for the specific models and lengths reported. To directly address this, we will add a new figure and accompanying analysis in the revised manuscript (or supplementary material) that reports these measurements for Wan2.1-1.3B and HunyuanVideo-13B across the tested sequence lengths. This will provide the requested verification that the observed values remain consistent with the stated O(L^{3/2}) sparsity and O(d_h log L) rank bounds. revision: yes
Referee: [§4 (Experiments)] The 90% sparsity operating point and the resulting 10× FLOPs reduction are presented as direct consequences of the O(L^{3/2}) bound, yet the manuscript does not report an ablation that isolates the contribution of the learned 3D absolute PE injection versus the low-rank parameterization alone. Without this isolation it is unclear whether the near-lossless VBench result is attributable to the claimed structural prior or to post-hoc tuning of the sparsity pattern.

Authors: We agree that an ablation isolating the learned 3D absolute PE injection from the low-rank parameterization would strengthen the attribution of results to the structural prior. The current experiments evaluate the full RoPeSLR framework at the 90% sparsity point but do not separately ablate these two components. We will add this ablation study to the revised Section 4, reporting performance when the 3D PE injection is removed while retaining the low-rank structure (and vice versa) on the Wan2.1-1.3B model. The results will clarify the individual contributions and support that the fidelity preservation stems from the integration guided by the claimed manifold properties rather than sparsity tuning alone. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper states a structural prior on attention manifold decoupling under empirically validated assumptions and then designs RoPeSLR around it, but no equations, self-citations, or fitted parameters are shown reducing the final performance claims (FLOPs reduction, speedup, VBench scores) back to the inputs by construction. The reported metrics are measured outcomes on external models (Wan2.1-1.3B, HunyuanVideo-13B) after applying the method, not tautological restatements of observed sparsity/rank statistics. The derivation remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim depends on an unproven structural decomposition of the attention manifold and the assumption that a learnable 3D absolute PE can seamlessly restore relative distance decay inside the low-rank component.

free parameters (1)

learnable 3D Absolute Positional Embedding parameters
Introduced per-head to synthesize long-range relative distance decay; their values are learned rather than derived from first principles.

axioms (1)

ad hoc to paper DiT attention manifold admits decoupling into O(L^{3/2}) sparse semantic spikes and O(d_h log L) low-rank background
Invoked to justify the sparse-lowrank design; described as empirically validated but without derivation in the provided abstract.

pith-pipeline@v0.9.0 · 5863 in / 1476 out tokens · 32928 ms · 2026-05-21T05:51:23.148845+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by O(L^{3/2}) sparsity) and an extreme low-rank (O(d_h log L)) background continuum.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 3 internal anchors

[1]

P. Chen, X. Zeng, M. Zhao, P. Ye, M. Shen, W. Cheng, G. Yu, and T. Chen. Sparse-vdit: Unleashing the power of sparse attention to accelerate video diffusion transformers, 2025

work page 2025
[2]

R. Chen, K. G. Mills, L. Jiang, C. Gao, and D. Niu. Re-ttention: Ultra sparse visual generation via attention statistical reshape, 2025

work page 2025
[3]

Choromanski, V

K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with performers. InICLR 2021, 2021

work page 2021
[4]

T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023

work page 2023
[5]

Q. Fan, H. Huang, and R. He. Breaking the low-rank dilemma of linear attention, 2024

work page 2024
[6]

T. Fang, H. Zhang, R. Xie, Z. Han, X. Tao, T. Zhao, P. Wan, W. Ding, W. Ouyang, X. Ning, and Y. Wang. Salad: Achieve high-sparsity attention via efficient linear attention tuning for video diffusion transformer, 2026

work page 2026
[7]

A. Han, J. Li, W. Huang, M. Hong, A. Takeda, P. Jawanpuria, and B. Mishra. Sltrain: a sparse plus low-rank approach for parameter and memory efficient pretraining, 2024

work page 2024
[8]

Huang, Y

Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023

work page 2023
[9]

B. Kong, J. Liang, Y. Liu, R. Deng, and K. Yuan. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure, 2025

work page 2025
[10]

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. Tao, Q...

work page 2024
[11]

X. Li, M. Li, T. Cai, H. Xi, S. Yang, Y. Lin, L. Zhang, S. Yang, J. Hu, K. Peng, M. Agrawala, I. Stoica, K. Keutzer, and S. Han. Radial attention:o(nlogn)sparse attention with energy decay for long video generation, 2025

work page 2025
[12]

Y. Liu, Y. Hu, Z. Zhang, K. Jiang, and K. Yuan. Mixture of distributions matters: Dynamic sparse attention for efficient video diffusion transformers, 2026

work page 2026
[13]

H. Lu, G. Yang, N. Fei, Y. Huo, Z. Lu, P. Luo, and M. Ding. Vdt: General-purpose video diffusion transformers via mask modeling.arXiv preprint arXiv:2305.13311, 2023

work page arXiv 2023
[14]

X. Ma, Y. Wang, X. Chen, G. Jia, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion transformer for video generation, 2024

work page 2024
[15]

K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y. Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 11

work page 2023
[17]

Rahaman, A

N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks, 2019

work page 2019
[18]

Shmilovich, T

D. Shmilovich, T. Wu, A. Dahan, and Y. Domb. Liteattention: A temporal sparse attention for diffusion transformers, 2025

work page 2025
[19]

Sitzmann, J

V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions, 2020

work page 2020
[20]

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[21]

X. Tan, Y. Chen, Y. Jiang, X. Chen, K. Yan, N. Duan, Y. Zhu, D. Jiang, and H. Xu. Dsv: Exploiting dynamic sparsity to accelerate large-scale video dit training. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), volume 1, pages 101–116, 2026

work page 2026
[22]

Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T

M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, June 2020

work page arXiv 2006
[23]

K. Team. Kling-omni technical report, 2025

work page 2025
[24]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

work page 2025
[25]

S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006
[26]

J. Wu, L. Hou, H. Yang, X. Tao, Y. Tian, P. Wan, D. Zhang, and Y. Tong. Vmoba: Mixture-of-block attention for video diffusion models, 2025

work page 2025
[27]

H. Xi, S. Yang, Y. Zhao, C. Xu, M. Li, X. Li, Y. Lin, H. Cai, J. Zhang, D. Li, J. Chen, I. Stoica, K. Keutzer, and S. Han. Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity. In International Conference on Machine Learning (ICML). PMLR, 2025

work page 2025
[28]

Y. Xia, S. Ling, F. Fu, Y. Wang, H. Li, X. Xiao, and B. Cui. Training-free and adaptive sparse attention for efficient long video generation, 2025

work page 2025
[29]

E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y. Lin, Z. Zhang, M. Li, L. Zhu, Y. Lu, and S. Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformers, 2024

work page 2024
[30]

S. Yang, H. Xi, Y. Zhao, M. Li, J. Zhang, H. Cai, Y. Lin, X. Li, C. Xu, K. Peng, J. Chen, S. Han, K. Keutzer, and I. Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation, 2025

work page 2025
[31]

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025
[32]

Zhang, H

J. Zhang, H. Wang, K. Jiang, S. Yang, K. Zheng, H. Xi, Z. Wang, H. Zhu, M. Zhao, I. Stoica, J. E. Gonzalez, J. Zhu, and J. Chen. Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention, 2025

work page 2025
[33]

Zhang, H

J. Zhang, H. Wang, K. Jiang, K. Zheng, Y. Jiang, I. Stoica, J. Chen, J. Zhu, and J. E. Gonzalez. Sla2: Sparse-linear attention with learnable routing and qat, 2026

work page 2026
[34]

Zhang, J

J. Zhang, J. Wei, H. Huang, P. Zhang, J. Zhu, and J. Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025
[35]

Zhang, C

J. Zhang, C. Xiang, H. Huang, J. Wei, H. Xi, J. Zhu, and J. Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference, 2025. 12

work page 2025
[36]

Zhang, Y

P. Zhang, Y. Chen, H. Huang, W. Lin, Z. Liu, I. Stoica, E. Xing, and H. Zhang. Vsa: Faster video diffusion with trainable sparse attention. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025
[37]

Zheng, X

Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y. Zhou, T. Li, and Y. You. Open-sora: Democratizing efficient video production for all, 2024. 13 Appendix Contents 1 Introduction 2 2 Related work 3 3 Preliminary 4 4 Method 4 4.1 Theoretical Foundation: The Sparse-Low-Rank Theorem . . . . . . . . . . . . . . . . . . . . . 4 4.2 Positional Context Recov...

work page 2024
[38]

We prove the sparse branch has a deterministic sparsity guarantee: at most⌊1/τ⌋non-zero entries per row

We first define an energy thresholdτ to split the full attention matrixA into two disjoint components: a high-energy spike setΩτ (captured by the sparse branch) and a low-energy smooth background setΩc τ (approximated by the low-rank compensator). We prove the sparse branch has a deterministic sparsity guarantee: at most⌊1/τ⌋non-zero entries per row

work page
[39]

Leveraging the mathematical structure of 3D RoPE, we prove the pre-softmax QK logit matrix can be exactly expanded as a Fourier series over spatiotemporal frequencies, where each frequency term corresponds to a matrix of rank at most 2

work page
[40]

Under a empirically validated spectral concentration assumption, high-frequency interaction coefficients in the background set decay exponentially, allowing us to truncate the Fourier series to get a low-rank approximation of the pre-softmax background matrix with controllable error

work page
[41]

We use positive random features (FAVOR+)[3] to linearize the softmax exponential kernel while preserving the low-rank structure, and perform error propagation to bound the final approximation error of the background branch

work page
[42]

A.2 Preliminaries and Notation We first formalize all symbols and foundational definitions with zero ambiguity, consistent with standard video diffusion transformer literature

Our final main theorem establishes the existence of this decomposition: the sparse branch has guaranteed high sparsity, and the low rank branch has a rank bound that is linear in head dimensiondh and only logarithmic in sequence lengthL, confirming extreme low-rankness for long video sequences. A.2 Preliminaries and Notation We first formalize all symbols...

work page
[43]

to approximateAbg only on the background setΩc τ. To correct the inaccurate outputs produced byˆAlowrank on the spike setΩτ, we define the residual sparse branch ˜Asparse as an exact error compensator: ˜Asparse(p, q) = ( A(p, q)− ˆAlowrank(p, q)if(p, q)∈Ω τ 0otherwise The final reconstructed attention matrix is then guaranteed as: ˆAfinal = ˜Asparse + ˆAl...

work page
[44]

Theresidual sparse branch ˜Asparse is sparse, with exactly the same support as the original spike set Ωτ, and thus has at most⌊1/τ⌋non-zero entries per row

work page
[45]

Thelow-rank branch ˆAlowrank is a globally defined, unmasked matrix that acts as a low-rank approxi- mator for the background targetAbg. It operates with a provable bottleneck rank bound: R=O dh · τ E 2 ·log τ E ·logL With probability at least1−δfail, the low-rank approximator tightly bounds the background target onΩc τ: max (p,q)∈Ωcτ | ˆAlowrank(p, q)−A ...

work page
[46]

Sub-quadratic Sparsity:The residual sparse branch˜Asparse, which acts as an exact error com- pensator on the spike set (˜Asparse(p, q) = A(p, q) − ˆAlowrank(p, q)for( p, q) ∈ Ωτ), has total non-zero entries bounded by NNZ(˜Asparse) =O(L 3 2 )

work page
[47]

Sub-linear Rank:The globally unmasked low-rank branchˆAlowrank, which acts as a dense approxi- mator for the background targetAbg, requires a bottleneck rank ofR=O(dh ·logL)

work page
[48]

Asymptotic Error Bound:With high probability1 −δ fail, the global reconstruction error is uniformly bounded bymaxp,q | ˆAfinal(p, q)−A(p, q)|=O L− 1 2 . Proof.We prove each statement sequentially: Proof of Statement 1 (Sparse Branch Complexity).Givenτ = c L 1 2 , by Theorem 2, the total number of non-zero entries is: NNZ( ˜Asparse)≤L· 1 τ =L· $ L 1 2 c % ...

work page
[49]

Compared to the rapid scaling of the sequence itself, this dictates that the low-rank branch becomes increasingly compact and heavily compressed as video length expands

Sub-linear Rank Bottleneck: The rank required to closely capture the global background continuum grows only logarithmically (R = O(logL )). Compared to the rapid scaling of the sequence itself, this dictates that the low-rank branch becomes increasingly compact and heavily compressed as video length expands. Together, these bounds prove that RoPeSLR activ...

work page
[50]

It enables anexact Fourier series expansionof the pre-softmax QK inner product over relative positions (Lemma 2), which decomposes the full attention matrix into a sum of rank-≤ 2frequency- specific matrices (Lemma 3)

work page
[51]

Leveraging this property,Lemma 4proves that the background matrix can be approximated with a rank bound ofO(dh log(1/δ)), which is independent of sequence lengthL

Its standard exponential frequency schedule inducesexponential decay of high-frequency interaction coefficients(empirically validated in Section B.1). Leveraging this property,Lemma 4proves that the background matrix can be approximated with a rank bound ofO(dh log(1/δ)), which is independent of sequence lengthL. Remark 5(Understanding the Token-Wise Cont...

work page
[52]

However, because Theorem 3 guarantees that the background continuumAbg is exceptionally low-rank, it admits a low-dimensional matrix factorization

Algebraic Factorization of Global Aggregation:While trigonometric identities explicitly decouple the pre-softmax 3D RoPE logits into absolute spatial bases, the non-linear softmax operation typically breaks this pairwise separation. However, because Theorem 3 guarantees that the background continuumAbg is exceptionally low-rank, it admits a low-dimensiona...

work page
[53]

perceives

Amortized Decoding via Dense State Encapsulation:WhileCk represents a global sequence-level sum, it does not require explicit re-computation if the tokenXp already "perceives" the macroscopic state. In deep DiTs,Xp is a dense semantic capsule: it has accumulated global context via preceding self-attention layers and is strictly modulated by global timeste...

work page
[54]

Empirical Verification:The structural validity of this proxy is confirmed by our Stage-I alignment objective Lalign = ∥Ototal −AV∥ 2 F, which converges rapidly to a minimal floor (Figure 5). Mechanistically, ourGram spectral analysis (Appendix B.8)proves that the MLP’s output eigenvectors perfectly match the geometric standing waves of the ground-truth ba...

work page
[55]

For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh

Cubic Acceleration of Spectral Decay (Rank Inequality): As derived in Lemma 4, the truncation index Mk required to bound the error depends on the exponential decay constantρ. For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh. For DiTs using 3D RoPE, the dimension is partitioned (dt, dx, dy). Assuming a uniform partitiondk = ...

work page
[56]

In LLMs, tasks like in-context learning scatter semantic spikes across the causal history, causing this universal bound to be vacuous in practice

Empirical Tightness of the Deterministic Sparsity Bound: While the theoreticalO(L3/2)sparsity upper bound (derived viaNNZ≤L· ⌊ 1/τ⌋) is a deterministic property holding for any row-stochastic attention matrix, its practical utility heavily depends on the data domain. In LLMs, tasks like in-context learning scatter semantic spikes across the causal history...

work page
[57]

an airplane taking off

Empirical Evidence: We provide experimental evidence to validate this remark, see figure 2 and 11 for more details. 31 B Additional Experiments B.1 Empirical Validation of Spectral Concentration (Assumption 2) B.1.1 Spectral Evolution and Justification of Exponential Decay To verify the exponential decayρm0 k (ρk < 1) proposed in Assumption 2, we measure ...

work page arXiv
[58]

Green AI

Context Retrieval. Olin = ϕ(Q)KVctx. Here, ϕ(Q) ∈R L×dh and KVctx ∈R dh×dh. The FLOPs are 2BHLd 2 h. Therefore, the baseline linear attention branch fundamentally imposes anO(d2 h)bottleneck per token: Clinear_branch = 4BHLd 2 h Assuming identical sparse branches and gating mechanisms, the efficiency disparity between RoPeSLR and standard linear hybrids l...

work page

[1] [1]

P. Chen, X. Zeng, M. Zhao, P. Ye, M. Shen, W. Cheng, G. Yu, and T. Chen. Sparse-vdit: Unleashing the power of sparse attention to accelerate video diffusion transformers, 2025

work page 2025

[2] [2]

R. Chen, K. G. Mills, L. Jiang, C. Gao, and D. Niu. Re-ttention: Ultra sparse visual generation via attention statistical reshape, 2025

work page 2025

[3] [3]

Choromanski, V

K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with performers. InICLR 2021, 2021

work page 2021

[4] [4]

T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023

work page 2023

[5] [5]

Q. Fan, H. Huang, and R. He. Breaking the low-rank dilemma of linear attention, 2024

work page 2024

[6] [6]

T. Fang, H. Zhang, R. Xie, Z. Han, X. Tao, T. Zhao, P. Wan, W. Ding, W. Ouyang, X. Ning, and Y. Wang. Salad: Achieve high-sparsity attention via efficient linear attention tuning for video diffusion transformer, 2026

work page 2026

[7] [7]

A. Han, J. Li, W. Huang, M. Hong, A. Takeda, P. Jawanpuria, and B. Mishra. Sltrain: a sparse plus low-rank approach for parameter and memory efficient pretraining, 2024

work page 2024

[8] [8]

Huang, Y

Z. Huang, Y. He, J. Yu, F. Zhang, C. Si, Y. Jiang, Y. Zhang, T. Wu, Q. Jin, N. Chanpaisit, Y. Wang, X. Chen, L. Wang, D. Lin, Y. Qiao, and Z. Liu. Vbench: Comprehensive benchmark suite for video generative models, 2023

work page 2023

[9] [9]

B. Kong, J. Liang, Y. Liu, R. Deng, and K. Yuan. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure, 2025

work page 2025

[10] [10]

W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. Tao, Q...

work page 2024

[11] [11]

X. Li, M. Li, T. Cai, H. Xi, S. Yang, Y. Lin, L. Zhang, S. Yang, J. Hu, K. Peng, M. Agrawala, I. Stoica, K. Keutzer, and S. Han. Radial attention:o(nlogn)sparse attention with energy decay for long video generation, 2025

work page 2025

[12] [12]

Y. Liu, Y. Hu, Z. Zhang, K. Jiang, and K. Yuan. Mixture of distributions matters: Dynamic sparse attention for efficient video diffusion transformers, 2026

work page 2026

[13] [13]

H. Lu, G. Yang, N. Fei, Y. Huo, Z. Lu, P. Luo, and M. Ding. Vdt: General-purpose video diffusion transformers via mask modeling.arXiv preprint arXiv:2305.13311, 2023

work page arXiv 2023

[14] [14]

X. Ma, Y. Wang, X. Chen, G. Jia, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion transformer for video generation, 2024

work page 2024

[15] [15]

K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y. Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Peebles and S

W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 11

work page 2023

[17] [17]

Rahaman, A

N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks, 2019

work page 2019

[18] [18]

Shmilovich, T

D. Shmilovich, T. Wu, A. Dahan, and Y. Domb. Liteattention: A temporal sparse attention for diffusion transformers, 2025

work page 2025

[19] [19]

Sitzmann, J

V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions, 2020

work page 2020

[20] [20]

J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[21] [21]

X. Tan, Y. Chen, Y. Jiang, X. Chen, K. Yan, N. Duan, Y. Zhu, D. Jiang, and H. Xu. Dsv: Exploiting dynamic sparsity to accelerate large-scale video dit training. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), volume 1, pages 101–116, 2026

work page 2026

[22] [22]

Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T

M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, June 2020

work page arXiv 2006

[23] [23]

K. Team. Kling-omni technical report, 2025

work page 2025

[24] [24]

T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....

work page 2025

[25] [25]

S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2006

[26] [26]

J. Wu, L. Hou, H. Yang, X. Tao, Y. Tian, P. Wan, D. Zhang, and Y. Tong. Vmoba: Mixture-of-block attention for video diffusion models, 2025

work page 2025

[27] [27]

H. Xi, S. Yang, Y. Zhao, C. Xu, M. Li, X. Li, Y. Lin, H. Cai, J. Zhang, D. Li, J. Chen, I. Stoica, K. Keutzer, and S. Han. Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity. In International Conference on Machine Learning (ICML). PMLR, 2025

work page 2025

[28] [28]

Y. Xia, S. Ling, F. Fu, Y. Wang, H. Li, X. Xiao, and B. Cui. Training-free and adaptive sparse attention for efficient long video generation, 2025

work page 2025

[29] [29]

E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y. Lin, Z. Zhang, M. Li, L. Zhu, Y. Lu, and S. Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformers, 2024

work page 2024

[30] [30]

S. Yang, H. Xi, Y. Zhao, M. Li, J. Zhang, H. Cai, Y. Lin, X. Li, C. Xu, K. Peng, J. Chen, S. Han, K. Keutzer, and I. Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation, 2025

work page 2025

[31] [31]

Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025

[32] [32]

Zhang, H

J. Zhang, H. Wang, K. Jiang, S. Yang, K. Zheng, H. Xi, Z. Wang, H. Zhu, M. Zhao, I. Stoica, J. E. Gonzalez, J. Zhu, and J. Chen. Sla: Beyond sparsity in diffusion transformers via fine-tunable sparse-linear attention, 2025

work page 2025

[33] [33]

Zhang, H

J. Zhang, H. Wang, K. Jiang, K. Zheng, Y. Jiang, I. Stoica, J. Chen, J. Zhu, and J. E. Gonzalez. Sla2: Sparse-linear attention with learnable routing and qat, 2026

work page 2026

[34] [34]

Zhang, J

J. Zhang, J. Wei, H. Huang, P. Zhang, J. Zhu, and J. Chen. Sageattention: Accurate 8-bit attention for plug-and-play inference acceleration. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025

[35] [35]

Zhang, C

J. Zhang, C. Xiang, H. Huang, J. Wei, H. Xi, J. Zhu, and J. Chen. Spargeattention: Accurate and training-free sparse attention accelerating any model inference, 2025. 12

work page 2025

[36] [36]

Zhang, Y

P. Zhang, Y. Chen, H. Huang, W. Lin, Z. Liu, I. Stoica, E. Xing, and H. Zhang. Vsa: Faster video diffusion with trainable sparse attention. InAdvances in Neural Information Processing Systems (NeurIPS), 2025

work page 2025

[37] [37]

Zheng, X

Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y. Zhou, T. Li, and Y. You. Open-sora: Democratizing efficient video production for all, 2024. 13 Appendix Contents 1 Introduction 2 2 Related work 3 3 Preliminary 4 4 Method 4 4.1 Theoretical Foundation: The Sparse-Low-Rank Theorem . . . . . . . . . . . . . . . . . . . . . 4 4.2 Positional Context Recov...

work page 2024

[38] [38]

We prove the sparse branch has a deterministic sparsity guarantee: at most⌊1/τ⌋non-zero entries per row

We first define an energy thresholdτ to split the full attention matrixA into two disjoint components: a high-energy spike setΩτ (captured by the sparse branch) and a low-energy smooth background setΩc τ (approximated by the low-rank compensator). We prove the sparse branch has a deterministic sparsity guarantee: at most⌊1/τ⌋non-zero entries per row

work page

[39] [39]

Leveraging the mathematical structure of 3D RoPE, we prove the pre-softmax QK logit matrix can be exactly expanded as a Fourier series over spatiotemporal frequencies, where each frequency term corresponds to a matrix of rank at most 2

work page

[40] [40]

Under a empirically validated spectral concentration assumption, high-frequency interaction coefficients in the background set decay exponentially, allowing us to truncate the Fourier series to get a low-rank approximation of the pre-softmax background matrix with controllable error

work page

[41] [41]

We use positive random features (FAVOR+)[3] to linearize the softmax exponential kernel while preserving the low-rank structure, and perform error propagation to bound the final approximation error of the background branch

work page

[42] [42]

A.2 Preliminaries and Notation We first formalize all symbols and foundational definitions with zero ambiguity, consistent with standard video diffusion transformer literature

Our final main theorem establishes the existence of this decomposition: the sparse branch has guaranteed high sparsity, and the low rank branch has a rank bound that is linear in head dimensiondh and only logarithmic in sequence lengthL, confirming extreme low-rankness for long video sequences. A.2 Preliminaries and Notation We first formalize all symbols...

work page

[43] [43]

to approximateAbg only on the background setΩc τ. To correct the inaccurate outputs produced byˆAlowrank on the spike setΩτ, we define the residual sparse branch ˜Asparse as an exact error compensator: ˜Asparse(p, q) = ( A(p, q)− ˆAlowrank(p, q)if(p, q)∈Ω τ 0otherwise The final reconstructed attention matrix is then guaranteed as: ˆAfinal = ˜Asparse + ˆAl...

work page

[44] [44]

Theresidual sparse branch ˜Asparse is sparse, with exactly the same support as the original spike set Ωτ, and thus has at most⌊1/τ⌋non-zero entries per row

work page

[45] [45]

Thelow-rank branch ˆAlowrank is a globally defined, unmasked matrix that acts as a low-rank approxi- mator for the background targetAbg. It operates with a provable bottleneck rank bound: R=O dh · τ E 2 ·log τ E ·logL With probability at least1−δfail, the low-rank approximator tightly bounds the background target onΩc τ: max (p,q)∈Ωcτ | ˆAlowrank(p, q)−A ...

work page

[46] [46]

Sub-quadratic Sparsity:The residual sparse branch˜Asparse, which acts as an exact error com- pensator on the spike set (˜Asparse(p, q) = A(p, q) − ˆAlowrank(p, q)for( p, q) ∈ Ωτ), has total non-zero entries bounded by NNZ(˜Asparse) =O(L 3 2 )

work page

[47] [47]

Sub-linear Rank:The globally unmasked low-rank branchˆAlowrank, which acts as a dense approxi- mator for the background targetAbg, requires a bottleneck rank ofR=O(dh ·logL)

work page

[48] [48]

Asymptotic Error Bound:With high probability1 −δ fail, the global reconstruction error is uniformly bounded bymaxp,q | ˆAfinal(p, q)−A(p, q)|=O L− 1 2 . Proof.We prove each statement sequentially: Proof of Statement 1 (Sparse Branch Complexity).Givenτ = c L 1 2 , by Theorem 2, the total number of non-zero entries is: NNZ( ˜Asparse)≤L· 1 τ =L· $ L 1 2 c % ...

work page

[49] [49]

Compared to the rapid scaling of the sequence itself, this dictates that the low-rank branch becomes increasingly compact and heavily compressed as video length expands

Sub-linear Rank Bottleneck: The rank required to closely capture the global background continuum grows only logarithmically (R = O(logL )). Compared to the rapid scaling of the sequence itself, this dictates that the low-rank branch becomes increasingly compact and heavily compressed as video length expands. Together, these bounds prove that RoPeSLR activ...

work page

[50] [50]

It enables anexact Fourier series expansionof the pre-softmax QK inner product over relative positions (Lemma 2), which decomposes the full attention matrix into a sum of rank-≤ 2frequency- specific matrices (Lemma 3)

work page

[51] [51]

Leveraging this property,Lemma 4proves that the background matrix can be approximated with a rank bound ofO(dh log(1/δ)), which is independent of sequence lengthL

Its standard exponential frequency schedule inducesexponential decay of high-frequency interaction coefficients(empirically validated in Section B.1). Leveraging this property,Lemma 4proves that the background matrix can be approximated with a rank bound ofO(dh log(1/δ)), which is independent of sequence lengthL. Remark 5(Understanding the Token-Wise Cont...

work page

[52] [52]

However, because Theorem 3 guarantees that the background continuumAbg is exceptionally low-rank, it admits a low-dimensional matrix factorization

Algebraic Factorization of Global Aggregation:While trigonometric identities explicitly decouple the pre-softmax 3D RoPE logits into absolute spatial bases, the non-linear softmax operation typically breaks this pairwise separation. However, because Theorem 3 guarantees that the background continuumAbg is exceptionally low-rank, it admits a low-dimensiona...

work page

[53] [53]

perceives

Amortized Decoding via Dense State Encapsulation:WhileCk represents a global sequence-level sum, it does not require explicit re-computation if the tokenXp already "perceives" the macroscopic state. In deep DiTs,Xp is a dense semantic capsule: it has accumulated global context via preceding self-attention layers and is strictly modulated by global timeste...

work page

[54] [54]

Empirical Verification:The structural validity of this proxy is confirmed by our Stage-I alignment objective Lalign = ∥Ototal −AV∥ 2 F, which converges rapidly to a minimal floor (Figure 5). Mechanistically, ourGram spectral analysis (Appendix B.8)proves that the MLP’s output eigenvectors perfectly match the geometric standing waves of the ground-truth ba...

work page

[55] [55]

For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh

Cubic Acceleration of Spectral Decay (Rank Inequality): As derived in Lemma 4, the truncation index Mk required to bound the error depends on the exponential decay constantρ. For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh. For DiTs using 3D RoPE, the dimension is partitioned (dt, dx, dy). Assuming a uniform partitiondk = ...

work page

[56] [56]

In LLMs, tasks like in-context learning scatter semantic spikes across the causal history, causing this universal bound to be vacuous in practice

Empirical Tightness of the Deterministic Sparsity Bound: While the theoreticalO(L3/2)sparsity upper bound (derived viaNNZ≤L· ⌊ 1/τ⌋) is a deterministic property holding for any row-stochastic attention matrix, its practical utility heavily depends on the data domain. In LLMs, tasks like in-context learning scatter semantic spikes across the causal history...

work page

[57] [57]

an airplane taking off

Empirical Evidence: We provide experimental evidence to validate this remark, see figure 2 and 11 for more details. 31 B Additional Experiments B.1 Empirical Validation of Spectral Concentration (Assumption 2) B.1.1 Spectral Evolution and Justification of Exponential Decay To verify the exponential decayρm0 k (ρk < 1) proposed in Assumption 2, we measure ...

work page arXiv

[58] [58]

Green AI

Context Retrieval. Olin = ϕ(Q)KVctx. Here, ϕ(Q) ∈R L×dh and KVctx ∈R dh×dh. The FLOPs are 2BHLd 2 h. Therefore, the baseline linear attention branch fundamentally imposes anO(d2 h)bottleneck per token: Clinear_branch = 4BHLd 2 h Assuming identical sparse branches and gating mechanisms, the efficiency disparity between RoPeSLR and standard linear hybrids l...

work page