pith. machine review for the scientific record. sign in

arxiv: 2504.20966 · v4 · submitted 2025-04-29 · 💻 cs.LG

Recognition: unknown

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Authors on Pith no claims yet
classification 💻 cs.LG
keywords softpickattentionsinksoftmaxactivationslowermassivemodels
0
0 comments X
read the original abstract

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0\% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and creates sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion shows how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code: https://github.com/zaydzuhri/softpick-attention

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  2. A Single Layer to Explain Them All:Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  3. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  4. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  5. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.