pith. sign in

arxiv: 2504.17768 · v3 · pith:Q2KJM2DXnew · submitted 2025-04-24 · 💻 cs.CL · cs.LG

The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs

classification 💻 cs.CL cs.LG
keywords sparseattentionmethodssparsityanalysiscostduringestimation
0
0 comments X
read the original abstract

Sparse attention offers a promising strategy to extend long-context capabilities in Transformer LLMs, yet its efficiency-accuracy trade-offs remain unclear due to the lack of comprehensive evaluation. We address this gap with the largest-scale empirical analysis to date of training-free sparse attention, evaluating six methods across multiple model families and sizes, sequences up to 128K tokens, and sparsity levels up to 0.95 (i.e., $1/20$ attention budget) on nine diverse tasks. We first organise the rapidly evolving landscape of sparse attention methods into a taxonomy along four design axes. Our analysis then yields actionable insights: 1) sparse attention is effective: larger sparse models outperform smaller dense ones at equivalent cost, improving the Pareto frontier; 2) for the training-free methods we study, fine-grained per-query importance estimation during prefilling remains impractical-due to both the cost of estimation and the lack of sparse kernels that translate fine-grained sparsity into wall-clock gains-forcing a task-dependent choice between global-to-token and block-to-block selection. Instead, during decoding, token-to-page selection becomes feasible, enabling better generalisation and higher sparsity tolerance; 3) longer sequences tolerate higher sparsity, suggesting that fixed-budget methods in production are suboptimal. Together, these findings provide practical guidance for deploying sparse attention and methodological recommendations for future evaluations. Our code is available at https://github.com/PiotrNawrot/sparse-frontier.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures

    cs.DC 2026-04 unverdicted novelty 7.0

    AsyncSparse presents BCSR and WCSR kernels that use TMA and warp specialization to accelerate SpMM, outperforming prior libraries by 1.47-6.24x on SuiteSparse and achieving 2.66x end-to-end speedup on Qwen2.5-7B at 90...

  2. VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

    cs.LG 2026-04 unverdicted novelty 5.0

    VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.

  3. Lifelong In-Context Learning with Transformers Requires Parametric Forms of Attention

    cs.LG 2026-06 unverdicted novelty 3.0

    Argues that parametric attention forms are necessary for lifelong in-context learning in transformers to maintain constant memory footprint over arbitrary sequence lengths.