arxiv: 2502.08363 · v3 · pith:SX65M2EKnew · submitted 2025-02-12 · 💻 cs.CL · cs.AI

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

keywords: attention, accuracy, during, elements, inference, sparsifying, theta, thresholding
We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain a desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide an extensive evaluation on natural language processing tasks, showing that Top-$\theta$ achieves a 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading accuracy by no more than 1%.
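
The thresholding idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `calibrate_threshold` and `top_theta_attention` are hypothetical, the calibration rule (averaging each row's k-th largest score over calibration data) is an assumed stand-in for the paper's procedure, and the compensation techniques mentioned in the abstract are omitted entirely. Keeping each row's maximum score is also an assumption, added only so that no row ends up empty.

```python
import numpy as np

def calibrate_threshold(scores, k):
    """Assumed calibration rule: one static threshold for a head, taken as
    the mean of the k-th largest pre-softmax score over calibration rows."""
    kth_largest = np.sort(scores, axis=-1)[:, -k]  # k-th largest per row
    return float(kth_largest.mean())

def top_theta_attention(q, k_mat, v, theta):
    """Single-head attention that keeps only scores >= theta.

    q: (n_q, d), k_mat: (n_k, d), v: (n_k, d_v). Returns (output, keep_mask).
    """
    scores = q @ k_mat.T / np.sqrt(q.shape[-1])
    keep = scores >= theta
    # Assumption: always keep each row's maximum so no row is fully masked.
    keep |= scores == scores.max(axis=-1, keepdims=True)
    masked = np.where(keep, scores, -np.inf)
    # Numerically stable softmax over the surviving elements only.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, keep

# Usage: calibrate on one batch of scores, then reuse theta at inference.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 16))
k_mat = rng.standard_normal((32, 16))
v = rng.standard_normal((32, 16))
calib_scores = q @ k_mat.T / np.sqrt(q.shape[-1])
theta = calibrate_threshold(calib_scores, k=4)
out, keep = top_theta_attention(q, k_mat, v, theta)
print(out.shape, keep.sum(axis=-1))  # kept elements per row vary around k
```

Unlike top-k, the comparison against a fixed `theta` needs no per-row sort or selection at inference time. The price is that the number of kept elements per row is only approximately k.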
