BLASST: Dynamic BLocked Attention Sparsity via Softmax Thresholding
The growing demand for long-context inference capabilities in Large Language Models (LLMs) has intensified the computational and memory bottlenecks inherent to the self-attention mechanism. To address this challenge, we introduce BLASST, a drop-in, dynamic sparse attention mechanism that accelerates inference by using only a fixed scalar threshold to skip attention blocks. Our method targets practical inference deployment by removing the barriers to adoption present in existing works. As such, BLASST eliminates training requirements, avoids expensive pre-computation passes, accelerates both prefill and decode across all major attention variants (MHA, GQA, MQA, and MLA), provides optimized support for modern hardware, and easily integrates into existing frameworks. This is achieved by reusing online softmax statistics to identify negligible attention scores, skipping softmax, value block loads, and the subsequent matrix multiplication. We demonstrate the BLASST algorithm by delivering optimized kernels with negligible latency overhead. Our automated threshold calibration procedure reveals a simple inverse relationship between optimal threshold and context length, meaning we require only a single threshold each for prefill and decode per model. Preserving benchmark accuracy, we demonstrate a 1.52x speedup for prefill at 71.9% sparsity and a 1.48x speedup for decode at 73.2% sparsity on modern GPUs.
Forward citations
Cited by 3 Pith papers
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
- VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
  VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.