pith. machine review for the scientific record.

arxiv: 2602.03216 · v2 · submitted 2026-02-03 · 💻 cs.CL · cs.LG

Recognition: unknown

Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

keywords: attention, token, sparse, inference, long-context, dynamic, interleaved, layers

The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves the accuracy-latency trade-off, achieving up to $3.23\times$ attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
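
The compress/attend/decompress flow described in the abstract can be illustrated with a minimal single-head sketch in plain Python. This is not the authors' implementation. The abstract does not say how tokens are selected or what unselected positions receive, so two things here are assumptions: the per-token importance `scores` are supplied by the caller, and unselected positions get a zero attention output (so that, in a transformer with residual connections, their hidden states would pass through unchanged and remain available to later layers). The function names `dense_causal_attention` and `token_sparse_attention` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dense_causal_attention(Q, K, V):
    """Plain causal attention for one head.

    Q, K, V are lists of n vectors (lists of floats). Stands in for any
    dense kernel; a real system would call an optimized implementation.
    """
    d = len(Q[0])
    out = []
    for i, q in enumerate(Q):
        # Token i attends only to positions 0..i (causal mask).
        logits = [
            sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        w = softmax(logits)
        out.append(
            [sum(w[j] * V[j][c] for j in range(i + 1)) for c in range(len(V[0]))]
        )
    return out


def token_sparse_attention(Q, K, V, scores, keep):
    """Compress Q/K/V to `keep` tokens, attend, then scatter back.

    ASSUMPTIONS (not from the source): `scores` is a caller-supplied
    importance value per token, and unselected positions receive zeros.
    """
    n = len(Q)
    # Pick the top-`keep` tokens, then restore original order so the
    # causal structure of the reduced sequence is preserved.
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:keep]
    idx = sorted(top)

    # Compress: gather the selected rows of Q, K, V.
    Qs = [Q[i] for i in idx]
    Ks = [K[i] for i in idx]
    Vs = [V[i] for i in idx]

    # Attend on the reduced token set with an unmodified dense kernel.
    out_small = dense_causal_attention(Qs, Ks, Vs)

    # Decompress: scatter outputs back to the original sequence length.
    out = [[0.0] * len(V[0]) for _ in range(n)]
    for pos, row in zip(idx, out_small):
        out[pos] = row
    return out, idx
```

Because the selection is recomputed on every call and nothing is evicted, a later layer (or another head) can pass different `scores` and pick a different token set, which is the reversibility the abstract contrasts with permanent eviction. With `keep` tokens out of `n`, the attention cost in this sketch drops from roughly `n * n` to `keep * keep` score computations.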

This paper has not been read by Pith yet.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Revisiting RaBitQ and TurboQuant: A Symmetric Comparison of Methods, Theory, and Experiments

    cs.LG 2026-04 unverdicted novelty 5.0

    RaBitQ outperforms TurboQuant in most tested settings for inner-product estimation, nearest-neighbor search, and KV cache quantization, while several TurboQuant runtime and recall results could not be reproduced from ...