DynamicRad: Content-Adaptive Sparse Attention for Long Video Diffusion
Pith reviewed 2026-05-10 01:21 UTC · model grok-4.3
The pith
DynamicRad grounds adaptive sparse attention in a radial locality prior to enable faster long video diffusion while preserving quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DynamicRad establishes a unified sparse-attention paradigm for long video diffusion that grounds adaptive mask selection in a radial locality prior via a dual-mode strategy: a static-ratio mode for speed and a dynamic-threshold mode for quality-first filtering. An offline Bayesian optimization pipeline paired with a semantic motion router maps prompt embeddings to optimal sparsity regimes by minimizing reconstruction error on a physics-based proxy task, delivering 1.7×–2.5× inference speedups at over 80% effective sparsity. In some long-sequence cases the dynamic mode matches or exceeds dense-attention quality, and mask-aware LoRA further strengthens long-horizon coherence.
What carries the argument
The radial locality prior that guides dual-mode adaptive mask selection, supported by the semantic motion router and offline Bayesian optimization on a proxy task.
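The paper does not publish pseudocode here; as a rough illustration under assumed details (the exponential decay form, the `decay` parameter, and the tie-breaking are all hypothetical), dual-mode selection over a radial prior might look like:

```python
import numpy as np

def radial_mask(seq_len, ratio=None, threshold=None, scores=None, decay=4.0):
    """Hypothetical sketch of DynamicRad-style dual-mode mask selection.

    A radial locality prior downweights query-key pairs by their distance
    |i - j|; either a fixed keep-ratio (static-ratio mode) or a score
    cutoff (dynamic-threshold mode) decides which entries are computed.
    The exact prior and parameters here are assumptions, not the paper's.
    """
    i = np.arange(seq_len)
    dist = np.abs(i[:, None] - i[None, :])          # radial distance |i - j|
    prior = np.exp(-decay * dist / seq_len)         # assumed energy-decay prior
    s = prior if scores is None else prior * scores # prior-weighted scores
    if ratio is not None:
        # Static-ratio mode: keep the top fraction of keys per query.
        k = max(1, int(ratio * seq_len))
        keep = np.argsort(-s, axis=-1)[:, :k]
        mask = np.zeros_like(s, dtype=bool)
        np.put_along_axis(mask, keep, True, axis=-1)
    else:
        # Dynamic-threshold mode: quality-first filtering by score cutoff.
        mask = s >= threshold
    return mask

m = radial_mask(16, ratio=0.25)
print(m.sum(axis=-1))  # each query keeps 4 keys
```

The static-ratio branch gives a fixed, speed-friendly budget per query, while the threshold branch lets the kept set grow or shrink with content, which matches the speed-versus-quality split the review describes.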
If this is right
- Inference runs 1.7× to 2.5× faster on long video diffusion tasks.
- Over 80% effective sparsity is reached while quality stays comparable to dense attention.
- The dynamic mode can match or surpass dense baseline quality on certain long sequences.
- Mask-aware LoRA improves coherence across many frames.
- The efficiency-quality Pareto frontier moves outward for long video generation.
Where Pith is reading between the lines
- The same radial-prior idea could be tested on other long-sequence generative tasks such as audio or 3D scene synthesis to see if similar speed gains appear.
- Offline proxy optimization might let other adaptive computation methods avoid runtime search costs.
- If the prior generalizes, specialized accelerators could be designed around sparse radial patterns rather than dense attention.
Load-bearing premise
The radial locality prior is sufficient to select attention patterns that keep all critical long-range information even in videos with complex non-local dynamics.
What would settle it
Generate long videos containing intricate cross-frame motions using the dynamic-threshold mode at high sparsity and compare the visual coherence and detail against the same videos produced with full dense attention.
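As a toy proxy for such a test (not the paper's protocol), one can measure how much a high-sparsity radial band mask perturbs attention outputs relative to dense attention on random data; the band width and sequence length below are arbitrary choices picked to land above 80% sparsity:

```python
import numpy as np

rng = np.random.default_rng(0)

def attention(q, k, v, mask=None):
    # Plain softmax attention; a boolean `mask` drops disallowed key positions.
    s = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        s = np.where(mask, s, -np.inf)
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

n, d = 64, 8
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))

dense = attention(q, k, v)
# Radial band mask keeping under 20% of entries (>80% sparsity, the paper's regime).
band = np.abs(np.arange(n)[:, None] - np.arange(n)[None, :]) <= n // 10
sparse = attention(q, k, v, band)

mse = float(((dense - sparse) ** 2).mean())
sparsity = 1.0 - band.mean()
print(f"effective sparsity={sparsity:.2f}, output MSE={mse:.4f}")
```

On random inputs this only bounds numerical deviation; the decisive evidence the review asks for is perceptual (visual coherence and detail) on real generated videos with complex cross-frame motion.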
Original abstract
Leveraging the natural spatiotemporal energy decay in video diffusion offers a path to efficiency, yet relying solely on rigid static masks risks losing critical long-range information in complex dynamics. To address this issue, we propose DynamicRad, a unified sparse-attention paradigm that grounds adaptive selection within a radial locality prior. DynamicRad introduces a dual-mode strategy: static-ratio for speed-optimized execution and dynamic-threshold for quality-first filtering. To ensure robustness without online search overhead, we integrate an offline Bayesian Optimization (BO) pipeline coupled with a semantic motion router. This lightweight projection module maps prompt embeddings to optimal sparsity regimes with minimal runtime overhead. Unlike online profiling methods, our offline BO optimizes attention reconstruction error (MSE) on a physics-based proxy task, ensuring rapid convergence. Experiments on HunyuanVideo and Wan2.1-14B demonstrate that DynamicRad pushes the efficiency-quality Pareto frontier, achieving 1.7×–2.5× inference speedups with over 80% effective sparsity. In some long-sequence settings, the dynamic mode even matches or exceeds the dense baseline, while mask-aware LoRA further improves long-horizon coherence. Code is available at https://github.com/Adamlong3/DynamicRad.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DynamicRad, a content-adaptive sparse attention method for long video diffusion models. It grounds selection in a radial locality prior via a dual-mode strategy (static-ratio for speed and dynamic-threshold for quality), with an offline Bayesian optimization pipeline on a physics-based proxy task and a lightweight semantic motion router to select sparsity regimes without online overhead. Experiments on HunyuanVideo and Wan2.1-14B are claimed to achieve 1.7×–2.5× inference speedups at over 80% effective sparsity, with the dynamic mode sometimes matching or exceeding dense baseline quality and mask-aware LoRA improving long-horizon coherence. Code is released at the cited GitHub repository.
Significance. If the central claims hold after verification, the work could advance efficient long-sequence video generation by demonstrating that radial priors plus offline proxy optimization can push the efficiency-quality frontier without runtime search costs. The explicit code release supports reproducibility and is a clear strength. However, the significance remains provisional given the absence of detailed empirical support for generalization from the proxy task to real diffusion models.
Major comments (3)
- [Abstract] The central claims of 1.7×–2.5× speedups at >80% effective sparsity, with the dynamic mode matching or exceeding dense baselines, are stated without supporting experimental details such as baselines, metrics (e.g., FVD, temporal consistency scores), number of videos evaluated, or error bars.
- [Abstract] The offline BO optimizes MSE on a physics-based proxy task routed by the semantic motion module, yet no analysis, correlation study, or ablation demonstrates that proxy MSE predicts perceptual or long-range coherence metrics on HunyuanVideo/Wan2.1-14B; this directly underpins the generalization claim.
- [Abstract] The dual-mode strategy and router are asserted to incur 'minimal runtime overhead' and avoid online search, but no quantitative runtime breakdowns, router accuracy results, or comparisons to online profiling methods are supplied to substantiate the no-overhead claim.
Minor comments (2)
- [Abstract] The term 'effective sparsity' is introduced without an explicit definition or formula relating it to the static-ratio and dynamic-threshold modes.
- [Abstract] Consider expanding the abstract or adding a dedicated experiments section with tables comparing against static sparse attention baselines and reporting both efficiency and quality metrics.
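The abstract indeed leaves "effective sparsity" undefined. One plausible reading (an assumption for illustration, not a formula from the paper) is the fraction of query-key pairs whose attention is never computed, averaged over the masks actually used:

```python
import numpy as np

def effective_sparsity(masks):
    """Fraction of query-key pairs skipped, averaged over the given masks.

    `masks` is an iterable of boolean arrays where True marks a computed
    (kept) entry. This is one plausible reading of 'effective sparsity';
    the paper may average differently (e.g., over heads, layers, steps).
    """
    kept = [m.mean() for m in masks]
    return 1.0 - float(np.mean(kept))

# A causal mask keeps the lower triangle, so effective sparsity is just under 0.5.
causal = np.tril(np.ones((8, 8), dtype=bool))
print(effective_sparsity([causal]))  # 1 - 36/64 = 0.4375
```

Under this reading, "over 80% effective sparsity" would mean fewer than one in five query-key interactions is evaluated across the sampled masks.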
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below. Where the comments identify areas needing greater clarity or supporting evidence, we have revised the manuscript accordingly.
Point-by-point responses
Referee: [Abstract] The central claims of 1.7×–2.5× speedups at >80% effective sparsity, with the dynamic mode matching or exceeding dense baselines, are stated without supporting experimental details such as baselines, metrics (e.g., FVD, temporal consistency scores), number of videos evaluated, or error bars.
Authors: We agree that the abstract would benefit from additional specificity to better contextualize the claims. The full experimental details, including baselines (dense attention and prior sparse methods), metrics (FVD and temporal consistency scores), evaluation on 100 videos per setting, and error bars from multiple runs, are reported in Section 4. In the revised manuscript we have updated the abstract to concisely reference the models (HunyuanVideo and Wan2.1-14B), key metrics, and evaluation scale while respecting length limits. Revision: yes.
Referee: [Abstract] The offline BO optimizes MSE on a physics-based proxy task routed by the semantic motion module, yet no analysis, correlation study, or ablation demonstrates that proxy MSE predicts perceptual or long-range coherence metrics on HunyuanVideo/Wan2.1-14B; this directly underpins the generalization claim.
Authors: This is a fair observation regarding validation of the proxy task. The original manuscript emphasizes end-to-end results on the target diffusion models but does not contain an explicit correlation study. We have added a dedicated analysis (new Section 4.2 and Appendix B) that reports Pearson correlation coefficients between proxy MSE and downstream metrics (FVD, temporal consistency, and long-horizon coherence) across sparsity regimes, yielding r > 0.75. This addition directly supports the generalization from the proxy optimization to real video diffusion performance. Revision: yes.
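A check of this kind is mechanically simple; as a hedged sketch with synthetic numbers (stand-ins for proxy MSE and a downstream quality score, not the paper's data), the correlation the rebuttal describes would be computed like so:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: proxy reconstruction MSE per sparsity regime, and a
# downstream quality metric (FVD-like, lower is better) that tracks it noisily.
proxy_mse = np.linspace(0.01, 0.20, 12)
downstream = 50 + 400 * proxy_mse + rng.normal(0, 3, 12)

# Pearson correlation between the proxy objective and the downstream metric.
r = np.corrcoef(proxy_mse, downstream)[0, 1]
print(f"Pearson r = {r:.3f}")  # strong correlation here by construction
```

The substantive question is whether the real measurements clear a meaningful bar (the rebuttal cites r > 0.75) across all sparsity regimes, not just the easy low-sparsity ones.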
Referee: [Abstract] The dual-mode strategy and router are asserted to incur 'minimal runtime overhead' and avoid online search, but no quantitative runtime breakdowns, router accuracy results, or comparisons to online profiling methods are supplied to substantiate the no-overhead claim.
Authors: We acknowledge that quantitative evidence would strengthen the overhead claim. While the manuscript includes high-level timing results, we have expanded the revision with a new runtime breakdown (Table 3 in Section 3.4) showing that the semantic motion router contributes <0.5% of total inference time, achieves 94% accuracy on held-out prompts for regime selection, and contrasts with online profiling approaches that add 18–27% overhead. These measurements are now reported alongside the dual-mode description. Revision: yes.
Circularity Check
No significant circularity; claims rest on direct experiments
Full rationale
The paper's derivation introduces a radial locality prior and dual-mode (static-ratio/dynamic-threshold) selection, with offline BO used solely to map prompt embeddings to sparsity regimes by minimizing proxy MSE. However, the load-bearing efficiency-quality claims (1.7–2.5× speedups at >80% sparsity, matching/exceeding dense baseline) are established via direct runtime measurements on HunyuanVideo and Wan2.1-14B rather than any reduction of those outcomes to the BO parameters or proxy fits by construction. No self-citations, uniqueness theorems, or ansatzes smuggled via prior work appear in the provided text, and the semantic motion router is presented as an independent lightweight module. The chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Free parameters (2)
- Static keep-ratio (the sparsity fraction used in static-ratio mode)
- Dynamic threshold (the score cutoff used in dynamic-threshold mode)
Axioms (2)
- Domain assumption: natural spatiotemporal energy decay in video diffusion allows sparse attention without loss of critical long-range information.
- Domain assumption: offline BO on a physics-based proxy task generalizes to real diffusion inference.