RoPeSLR: 3D RoPE-driven Sparse-LowRank Attention for Efficient Diffusion Transformers
Pith reviewed 2026-05-21 05:51 UTC · model grok-4.3
The pith
RoPeSLR decouples DiT attention into sparse semantic spikes and low-rank background via 3D RoPE to cut quadratic costs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under empirically validated assumptions the DiT attention manifold decouples into a high-frequency semantic spike set bounded by O(L to the 3/2) sparsity and an extreme low-rank O(d_h log L) background continuum. RoPeSLR therefore replaces standard linear attention with a head-wise low-rank parameterization equipped with learnable 3D absolute positional embedding injection that synthesizes long-range relative distance decay.
What carries the argument
3D RoPE-driven Sparse-LowRank attention, a head-wise low-rank parameterization with learnable 3D absolute positional embedding injection that preserves relative-position structure at high sparsity.
If this is right
- At 90 percent sparsity the method uses up to 10 times fewer FLOPs on Wan2.1-1.3B.
- It produces a 2.26 times end-to-end inference speedup on 100K-plus token sequences in HunyuanVideo-13B.
- Generation fidelity remains near lossless with less than 1.3 percent average VBench degradation.
- Sub-quadratic sparsity and sub-linear rank growth allow scaling to longer sequences without the usual cost explosion.
Where Pith is reading between the lines
- The same structural split could be tested on other long-context generative tasks such as audio or 3D synthesis.
- Further compression might be obtained by making the low-rank component adaptive to content rather than fixed per head.
- Pairing the attention change with existing quantization or caching methods would likely multiply the observed speedups.
Load-bearing premise
The attention weights inside diffusion transformers consistently separate into a small sparse high-frequency component plus a low-rank background that holds across models and sequence lengths.
What would settle it
Compute the numerical rank and the number of large-magnitude entries in attention matrices from a trained DiT on long video sequences and check whether they stay near O(L to the 3/2) spikes and O(d_h log L) rank.
Figures
read the original abstract
Diffusion Transformers (DiTs) have revolutionized high-fidelity video generation, yet their $\mathcal{O}(L^2)$ attention complexity poses a formidable bottleneck for long-sequence synthesis. While recent sparse-linear attention hybrids aim to mitigate this, their performance severely degrades at extreme sparsity due to the "RoPE Dilemma": standard linear attention fails to preserve the orthogonal relative-position structure of 3D Rotary Position Embeddings (RoPE), neutralizing vital distance awareness. To address this, we propose \textbf{RoPeSLR}, a 3D RoPE-guided Sparse-LowRank attention framework. We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by $\mathcal{O}(L^{3/2})$ sparsity) and an extreme low-rank ($\mathcal{O}(d_h \log L)$) background continuum. Guided by this structural prior, RoPeSLR eschews standard linear attention for a head-wise low-rank parameterization equipped with a learnable 3D Absolute Positional Embedding (PE) injection, seamlessly synthesizing long-range relative distance decay. By guaranteeing sub-quadratic sparsity and sub-linear rank growth, RoPeSLR is exceptionally suited for scaling to ultra-long video inference. Extensive evaluations validate this scalable superiority: at 90\% sparsity, RoPeSLR achieves up to $10\times$ fewer FLOPs on Wan2.1-1.3B and delivers a $2.26\times$ end-to-end inference speedup on the ultra-long 100K+ token sequences of HunyuanVideo-13B, all while maintaining near-lossless generation fidelity (less than 1.3\% average VBench degradation).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes RoPeSLR, a 3D RoPE-guided Sparse-LowRank attention framework for Diffusion Transformers to mitigate O(L^2) attention complexity in long-sequence video generation. It claims that under empirically validated assumptions the DiT attention manifold decouples into a high-frequency semantic spike set bounded by O(L^{3/2}) sparsity and an extreme low-rank O(d_h log L) background continuum. The method replaces standard linear attention with a head-wise low-rank parameterization plus learnable 3D Absolute Positional Embedding injection to preserve relative distance decay. Experiments report up to 10× fewer FLOPs at 90% sparsity on Wan2.1-1.3B and 2.26× end-to-end speedup on 100K+ token HunyuanVideo-13B sequences with <1.3% average VBench degradation.
Significance. If the claimed manifold decoupling and sub-quadratic/sub-linear scaling hold under the tested regimes, the approach could enable practical scaling of DiT-based video generators to ultra-long sequences while retaining near-lossless fidelity. The explicit integration of 3D RoPE structure into a sparse-low-rank factorization distinguishes it from generic sparse or linear attention hybrids and offers a concrete path toward sub-quadratic inference for high-resolution video synthesis.
major comments (2)
- [Abstract] Abstract: the central complexity and fidelity claims rest on the structural prior that the attention manifold admits a decoupling into O(L^{3/2})-sparse spikes and O(d_h log L) low-rank continuum “under empirically validated assumptions.” No section, table, or figure in the manuscript provides direct quantitative verification (e.g., measured effective rank or retained non-zero count versus sequence length) that the observed values remain within these bounds for the Wan2.1-1.3B and HunyuanVideo-13B models at the reported lengths.
- [§4 (Experiments)] The 90% sparsity operating point and the resulting 10× FLOPs reduction are presented as direct consequences of the O(L^{3/2}) bound, yet the manuscript does not report an ablation that isolates the contribution of the learned 3D absolute PE injection versus the low-rank parameterization alone. Without this isolation it is unclear whether the near-lossless VBench result is attributable to the claimed structural prior or to post-hoc tuning of the sparsity pattern.
minor comments (2)
- [§3.2] Notation for the head-wise low-rank matrices and the 3D absolute PE injection should be introduced with explicit dimensions (e.g., rank r_h per head) to make the O(d_h log L) claim immediately verifiable from the equations.
- [Table 2] The manuscript should include a table reporting measured effective rank and actual sparsity level versus theoretical bounds for each evaluated sequence length.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report, as well as the recommendation for major revision. The comments help clarify where additional evidence is needed to support the central claims. We address each major comment below and indicate the revisions we will make to the manuscript.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central complexity and fidelity claims rest on the structural prior that the attention manifold admits a decoupling into O(L^{3/2})-sparse spikes and O(d_h log L) low-rank continuum “under empirically validated assumptions.” No section, table, or figure in the manuscript provides direct quantitative verification (e.g., measured effective rank or retained non-zero count versus sequence length) that the observed values remain within these bounds for the Wan2.1-1.3B and HunyuanVideo-13B models at the reported lengths.
Authors: We acknowledge that while the manuscript describes the decoupling under empirically validated assumptions, it does not include explicit quantitative verification such as plots or tables of measured effective rank or retained non-zero attention weights versus sequence length for the specific models and lengths reported. To directly address this, we will add a new figure and accompanying analysis in the revised manuscript (or supplementary material) that reports these measurements for Wan2.1-1.3B and HunyuanVideo-13B across the tested sequence lengths. This will provide the requested verification that the observed values remain consistent with the stated O(L^{3/2}) sparsity and O(d_h log L) rank bounds. revision: yes
-
Referee: [§4 (Experiments)] The 90% sparsity operating point and the resulting 10× FLOPs reduction are presented as direct consequences of the O(L^{3/2}) bound, yet the manuscript does not report an ablation that isolates the contribution of the learned 3D absolute PE injection versus the low-rank parameterization alone. Without this isolation it is unclear whether the near-lossless VBench result is attributable to the claimed structural prior or to post-hoc tuning of the sparsity pattern.
Authors: We agree that an ablation isolating the learned 3D absolute PE injection from the low-rank parameterization would strengthen the attribution of results to the structural prior. The current experiments evaluate the full RoPeSLR framework at the 90% sparsity point but do not separately ablate these two components. We will add this ablation study to the revised Section 4, reporting performance when the 3D PE injection is removed while retaining the low-rank structure (and vice versa) on the Wan2.1-1.3B model. The results will clarify the individual contributions and support that the fidelity preservation stems from the integration guided by the claimed manifold properties rather than sparsity tuning alone. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper states a structural prior on attention manifold decoupling under empirically validated assumptions and then designs RoPeSLR around it, but no equations, self-citations, or fitted parameters are shown reducing the final performance claims (FLOPs reduction, speedup, VBench scores) back to the inputs by construction. The reported metrics are measured outcomes on external models (Wan2.1-1.3B, HunyuanVideo-13B) after applying the method, not tautological restatements of observed sparsity/rank statistics. The derivation remains self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable 3D Absolute Positional Embedding parameters
axioms (1)
- ad hoc to paper DiT attention manifold admits decoupling into O(L^{3/2}) sparse semantic spikes and O(d_h log L) low-rank background
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking echoes?
echoesECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
We establish that under empirically validated assumptions, the DiT attention manifold admits a decoupling into a high-frequency semantic spike set (bounded by O(L^{3/2}) sparsity) and an extreme low-rank (O(d_h log L)) background continuum.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
P. Chen, X. Zeng, M. Zhao, P. Ye, M. Shen, W. Cheng, G. Yu, and T. Chen. Sparse-vdit: Unleashing the power of sparse attention to accelerate video diffusion transformers, 2025
work page 2025
-
[2]
R. Chen, K. G. Mills, L. Jiang, C. Gao, and D. Niu. Re-ttention: Ultra sparse visual generation via attention statistical reshape, 2025
work page 2025
-
[3]
K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with performers. InICLR 2021, 2021
work page 2021
-
[4]
T. Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023
work page 2023
-
[5]
Q. Fan, H. Huang, and R. He. Breaking the low-rank dilemma of linear attention, 2024
work page 2024
-
[6]
T. Fang, H. Zhang, R. Xie, Z. Han, X. Tao, T. Zhao, P. Wan, W. Ding, W. Ouyang, X. Ning, and Y. Wang. Salad: Achieve high-sparsity attention via efficient linear attention tuning for video diffusion transformer, 2026
work page 2026
-
[7]
A. Han, J. Li, W. Huang, M. Hong, A. Takeda, P. Jawanpuria, and B. Mishra. Sltrain: a sparse plus low-rank approach for parameter and memory efficient pretraining, 2024
work page 2024
- [8]
-
[9]
B. Kong, J. Liang, Y. Liu, R. Deng, and K. Yuan. Cr-net: Scaling parameter-efficient training with cross-layer low-rank structure, 2025
work page 2025
-
[10]
W. Kong, Q. Tian, Z. Zhang, R. Min, Z. Dai, J. Zhou, J. Xiong, X. Li, B. Wu, J. Zhang, K. Wu, Q. Lin, J. Yuan, Y. Long, A. Wang, A. Wang, C. Li, D. Huang, F. Yang, H. Tan, H. Wang, J. Song, J. Bai, J. Wu, J. Xue, J. Wang, K. Wang, M. Liu, P. Li, S. Li, W. Wang, W. Yu, X. Deng, Y. Li, Y. Chen, Y. Cui, Y. Peng, Z. Yu, Z. He, Z. Xu, Z. Zhou, Z. Xu, Y. Tao, Q...
work page 2024
-
[11]
X. Li, M. Li, T. Cai, H. Xi, S. Yang, Y. Lin, L. Zhang, S. Yang, J. Hu, K. Peng, M. Agrawala, I. Stoica, K. Keutzer, and S. Han. Radial attention:o(nlogn)sparse attention with energy decay for long video generation, 2025
work page 2025
-
[12]
Y. Liu, Y. Hu, Z. Zhang, K. Jiang, and K. Yuan. Mixture of distributions matters: Dynamic sparse attention for efficient video diffusion transformers, 2026
work page 2026
- [13]
-
[14]
X. Ma, Y. Wang, X. Chen, G. Jia, Z. Liu, Y.-F. Li, C. Chen, and Y. Qiao. Latte: Latent diffusion transformer for video generation, 2024
work page 2024
-
[15]
K. Nan, R. Xie, P. Zhou, T. Fan, Z. Yang, Z. Chen, X. Li, J. Yang, and Y. Tai. Openvid-1m: A large-scale high-quality dataset for text-to-video generation.arXiv preprint arXiv:2407.02371, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
W. Peebles and S. Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 11
work page 2023
-
[17]
N. Rahaman, A. Baratin, D. Arpit, F. Draxler, M. Lin, F. A. Hamprecht, Y. Bengio, and A. Courville. On the spectral bias of neural networks, 2019
work page 2019
-
[18]
D. Shmilovich, T. Wu, A. Dahan, and Y. Domb. Liteattention: A temporal sparse attention for diffusion transformers, 2025
work page 2025
-
[19]
V. Sitzmann, J. N. P. Martel, A. W. Bergman, D. B. Lindell, and G. Wetzstein. Implicit neural representations with periodic activation functions, 2020
work page 2020
-
[20]
J. Su, Y. Lu, S. Pan, A. Murtadha, B. Wen, and Y. Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864, 2021
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[21]
X. Tan, Y. Chen, Y. Jiang, X. Chen, K. Yan, N. Duan, Y. Zhu, D. Jiang, and H. Xu. Dsv: Exploiting dynamic sparsity to accelerate large-scale video dit training. InProceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), volume 1, pages 101–116, 2026
work page 2026
-
[22]
M. Tancik, P. P. Srinivasan, B. Mildenhall, S. Fridovich-Keil, N. Raghavan, U. Singhal, R. Ramamoorthi, J. T. Barron, and R. Ng. Fourier features let networks learn high frequency functions in low dimensional domains. arXiv preprint arXiv:2006.10739, June 2020
-
[23]
K. Team. Kling-omni technical report, 2025
work page 2025
-
[24]
T. Wan, A. Wang, B. Ai, B. Wen, C. Mao, C.-W. Xie, D. Chen, F. Yu, H. Zhao, J. Yang, J. Zeng, J. Wang, J. Zhang, J. Zhou, J. Wang, J. Chen, K. Zhu, K. Zhao, K. Yan, L. Huang, M. Feng, N. Zhang, P. Li, P. Wu, R. Chu, R. Feng, S. Zhang, S. Sun, T. Fang, T. Wang, T. Gui, T. Weng, T. Shen, W. Lin, W. Wang, W. Wang, W. Zhou, W. Wang, W. Shen, W. Yu, X. Shi, X....
work page 2025
-
[25]
S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2006
-
[26]
J. Wu, L. Hou, H. Yang, X. Tao, Y. Tian, P. Wan, D. Zhang, and Y. Tong. Vmoba: Mixture-of-block attention for video diffusion models, 2025
work page 2025
-
[27]
H. Xi, S. Yang, Y. Zhao, C. Xu, M. Li, X. Li, Y. Lin, H. Cai, J. Zhang, D. Li, J. Chen, I. Stoica, K. Keutzer, and S. Han. Sparse video-gen: Accelerating video diffusion transformers with spatial-temporal sparsity. In International Conference on Machine Learning (ICML). PMLR, 2025
work page 2025
-
[28]
Y. Xia, S. Ling, F. Fu, Y. Wang, H. Li, X. Xiao, and B. Cui. Training-free and adaptive sparse attention for efficient long video generation, 2025
work page 2025
-
[29]
E. Xie, J. Chen, J. Chen, H. Cai, H. Tang, Y. Lin, Z. Zhang, M. Li, L. Zhu, Y. Lu, and S. Han. Sana: Efficient high-resolution image synthesis with linear diffusion transformers, 2024
work page 2024
-
[30]
S. Yang, H. Xi, Y. Zhao, M. Li, J. Zhang, H. Cai, Y. Lin, X. Li, C. Xu, K. Peng, J. Chen, S. Han, K. Keutzer, and I. Stoica. Sparse videogen2: Accelerate video generation with sparse attention via semantic-aware permutation, 2025
work page 2025
-
[31]
Z. Yang, J. Teng, W. Zheng, M. Ding, S. Huang, J. Xu, Y. Yang, W. Hong, X. Zhang, G. Feng, D. Yin, Y. Zhang, W. Wang, Y. Cheng, B. Xu, X. Gu, Y. Dong, and J. Tang. Cogvideox: Text-to-video diffusion models with an expert transformer. InInternational Conference on Learning Representations (ICLR), 2025
work page 2025
- [32]
- [33]
- [34]
- [35]
- [36]
-
[37]
Z. Zheng, X. Peng, T. Yang, C. Shen, S. Li, H. Liu, Y. Zhou, T. Li, and Y. You. Open-sora: Democratizing efficient video production for all, 2024. 13 Appendix Contents 1 Introduction 2 2 Related work 3 3 Preliminary 4 4 Method 4 4.1 Theoretical Foundation: The Sparse-Low-Rank Theorem . . . . . . . . . . . . . . . . . . . . . 4 4.2 Positional Context Recov...
work page 2024
-
[38]
We first define an energy thresholdτ to split the full attention matrixA into two disjoint components: a high-energy spike setΩτ (captured by the sparse branch) and a low-energy smooth background setΩc τ (approximated by the low-rank compensator). We prove the sparse branch has a deterministic sparsity guarantee: at most⌊1/τ⌋non-zero entries per row
-
[39]
Leveraging the mathematical structure of 3D RoPE, we prove the pre-softmax QK logit matrix can be exactly expanded as a Fourier series over spatiotemporal frequencies, where each frequency term corresponds to a matrix of rank at most 2
-
[40]
Under a empirically validated spectral concentration assumption, high-frequency interaction coefficients in the background set decay exponentially, allowing us to truncate the Fourier series to get a low-rank approximation of the pre-softmax background matrix with controllable error
-
[41]
We use positive random features (FAVOR+)[3] to linearize the softmax exponential kernel while preserving the low-rank structure, and perform error propagation to bound the final approximation error of the background branch
-
[42]
Our final main theorem establishes the existence of this decomposition: the sparse branch has guaranteed high sparsity, and the low rank branch has a rank bound that is linear in head dimensiondh and only logarithmic in sequence lengthL, confirming extreme low-rankness for long video sequences. A.2 Preliminaries and Notation We first formalize all symbols...
-
[43]
to approximateAbg only on the background setΩc τ. To correct the inaccurate outputs produced byˆAlowrank on the spike setΩτ, we define the residual sparse branch ˜Asparse as an exact error compensator: ˜Asparse(p, q) = ( A(p, q)− ˆAlowrank(p, q)if(p, q)∈Ω τ 0otherwise The final reconstructed attention matrix is then guaranteed as: ˆAfinal = ˜Asparse + ˆAl...
-
[44]
Theresidual sparse branch ˜Asparse is sparse, with exactly the same support as the original spike set Ωτ, and thus has at most⌊1/τ⌋non-zero entries per row
-
[45]
Thelow-rank branch ˆAlowrank is a globally defined, unmasked matrix that acts as a low-rank approxi- mator for the background targetAbg. It operates with a provable bottleneck rank bound: R=O dh · τ E 2 ·log τ E ·logL With probability at least1−δfail, the low-rank approximator tightly bounds the background target onΩc τ: max (p,q)∈Ωcτ | ˆAlowrank(p, q)−A ...
-
[46]
Sub-quadratic Sparsity:The residual sparse branch˜Asparse, which acts as an exact error com- pensator on the spike set (˜Asparse(p, q) = A(p, q) − ˆAlowrank(p, q)for( p, q) ∈ Ωτ), has total non-zero entries bounded by NNZ(˜Asparse) =O(L 3 2 )
-
[47]
Sub-linear Rank:The globally unmasked low-rank branchˆAlowrank, which acts as a dense approxi- mator for the background targetAbg, requires a bottleneck rank ofR=O(dh ·logL)
-
[48]
Asymptotic Error Bound:With high probability1 −δ fail, the global reconstruction error is uniformly bounded bymaxp,q | ˆAfinal(p, q)−A(p, q)|=O L− 1 2 . Proof.We prove each statement sequentially: Proof of Statement 1 (Sparse Branch Complexity).Givenτ = c L 1 2 , by Theorem 2, the total number of non-zero entries is: NNZ( ˜Asparse)≤L· 1 τ =L· $ L 1 2 c % ...
-
[49]
Sub-linear Rank Bottleneck: The rank required to closely capture the global background continuum grows only logarithmically (R = O(logL )). Compared to the rapid scaling of the sequence itself, this dictates that the low-rank branch becomes increasingly compact and heavily compressed as video length expands. Together, these bounds prove that RoPeSLR activ...
-
[50]
It enables anexact Fourier series expansionof the pre-softmax QK inner product over relative positions (Lemma 2), which decomposes the full attention matrix into a sum of rank-≤ 2frequency- specific matrices (Lemma 3)
-
[51]
Its standard exponential frequency schedule inducesexponential decay of high-frequency interaction coefficients(empirically validated in Section B.1). Leveraging this property,Lemma 4proves that the background matrix can be approximated with a rank bound ofO(dh log(1/δ)), which is independent of sequence lengthL. Remark 5(Understanding the Token-Wise Cont...
-
[52]
Algebraic Factorization of Global Aggregation:While trigonometric identities explicitly decouple the pre-softmax 3D RoPE logits into absolute spatial bases, the non-linear softmax operation typically breaks this pairwise separation. However, because Theorem 3 guarantees that the background continuumAbg is exceptionally low-rank, it admits a low-dimensiona...
-
[53]
Amortized Decoding via Dense State Encapsulation:WhileCk represents a global sequence-level sum, it does not require explicit re-computation if the tokenXp already "perceives" the macroscopic state. In deep DiTs,Xp is a dense semantic capsule: it has accumulated global context via preceding self-attention layers and is strictly modulated by global timeste...
-
[54]
Empirical Verification:The structural validity of this proxy is confirmed by our Stage-I alignment objective Lalign = ∥Ototal −AV∥ 2 F, which converges rapidly to a minimal floor (Figure 5). Mechanistically, ourGram spectral analysis (Appendix B.8)proves that the MLP’s output eigenvectors perfectly match the geometric standing waves of the ground-truth ba...
-
[55]
For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh
Cubic Acceleration of Spectral Decay (Rank Inequality): As derived in Lemma 4, the truncation index Mk required to bound the error depends on the exponential decay constantρ. For LLMs using 1D RoPE with head dimensiondh, the decay rate isρLLM = 10000−α/dh. For DiTs using 3D RoPE, the dimension is partitioned (dt, dx, dy). Assuming a uniform partitiondk = ...
-
[56]
Empirical Tightness of the Deterministic Sparsity Bound: While the theoreticalO(L3/2)sparsity upper bound (derived viaNNZ≤L· ⌊ 1/τ⌋) is a deterministic property holding for any row-stochastic attention matrix, its practical utility heavily depends on the data domain. In LLMs, tasks like in-context learning scatter semantic spikes across the causal history...
-
[57]
Empirical Evidence: We provide experimental evidence to validate this remark, see figure 2 and 11 for more details. 31 B Additional Experiments B.1 Empirical Validation of Spectral Concentration (Assumption 2) B.1.1 Spectral Evolution and Justification of Exponential Decay To verify the exponential decayρm0 k (ρk < 1) proposed in Assumption 2, we measure ...
-
[58]
Context Retrieval. Olin = ϕ(Q)KVctx. Here, ϕ(Q) ∈R L×dh and KVctx ∈R dh×dh. The FLOPs are 2BHLd 2 h. Therefore, the baseline linear attention branch fundamentally imposes anO(d2 h)bottleneck per token: Clinear_branch = 4BHLd 2 h Assuming identical sparse branches and gating mechanisms, the efficiency disparity between RoPeSLR and standard linear hybrids l...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.