arxiv: 2504.20966 · v4 · submitted 2025-04-29 · 💻 cs.LG

Softpick: No Attention Sink, No Massive Activations with Rectified Softmax

Pith reviewed 2026-05-22 18:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords attention mechanism · softmax replacement · attention sink · quantization · transformer · rectified activation · sparse attention

The pith

Softpick replaces softmax with a rectified non-sum-to-one function to remove attention sinks and massive activations in transformers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces softpick as a drop-in replacement for softmax in transformer attention layers. Softpick applies rectification and drops the requirement that outputs sum to one, which the experiments show eliminates attention sink entirely in 340M and 1.8B models. The resulting hidden states exhibit lower kurtosis and the attention maps become sparse. When models are quantized, softpick versions outperform standard softmax models on benchmarks, with the gap widening at lower bit widths. The authors argue this change opens routes to more efficient quantization, low-precision training, and pruning without sacrificing core performance.

Core claim

Softpick is a rectified function that does not sum to one and serves as a direct substitute for softmax inside attention. When substituted, it produces zero attention sink rate across tested model sizes, yields hidden states with markedly lower kurtosis, and generates sparse attention maps. Models using softpick maintain or exceed baseline accuracy after quantization, particularly at low bit precisions, while preserving training dynamics in the reported scales.

What carries the argument

Softpick, a rectified non-sum-to-one replacement for softmax that removes the normalization constraint while preserving non-negativity.
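
This page does not spell out the function itself. As a rough sketch, assuming the form ReLU(e^x - 1) / (sum |e^x - 1| + eps), which is recalled from the paper and should be checked against the authors' repository, the contrast with softmax can be shown in plain Python. The helper names, the `eps` value, and the choice to clamp the shift at zero are assumptions of this sketch, not the authors' implementation.

```python
import math

def softmax(scores):
    # Standard softmax with the usual max shift for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softpick(scores, eps=1e-6):
    # Assumed form: relu(exp(x_i) - 1) / (sum_j |exp(x_j) - 1| + eps).
    # exp(x - m) - exp(-m) equals exp(-m) * (exp(x) - 1), so the shift
    # cancels in the ratio. Clamping m at 0 keeps exp(-m) from overflowing
    # when every score is very negative (a choice made for this sketch).
    m = max(max(scores), 0.0)
    shifted = [math.exp(s - m) - math.exp(-m) for s in scores]
    denom = sum(abs(v) for v in shifted) + eps
    return [max(v, 0.0) / denom for v in shifted]

scores = [2.0, -1.0, 0.5, -3.0]
sm = softmax(scores)   # dense, sums to one
sp = softpick(scores)  # negative scores map to exactly zero
print(sm, sum(sm))
print(sp, sum(sp))
```

On this input the two negative scores receive exactly zero weight and the row sums to less than one, which is the behavior the core claim describes.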

If this is right

  • Attention maps become sparse without extra regularization.
  • Hidden-state kurtosis drops, reducing the incidence of massive activations.
  • Quantized softpick models show higher accuracy than quantized softmax models, especially below 8 bits.
  • The method supports further work on low-precision training and pruning because the attention mechanism no longer forces dense normalization.
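
To make the kurtosis point concrete, here is a small sketch of how a single massive activation inflates the statistic. It uses the plain (non-excess) Pearson definition and synthetic values; the paper's exact measurement protocol is not described on this page, so both are assumptions.

```python
def kurtosis(values):
    # Pearson kurtosis: fourth central moment divided by squared variance.
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2)

# Synthetic "hidden state": bounded values with alternating sign.
plain = [(-1.0) ** i * (1 + (i % 5) * 0.25) for i in range(1000)]
# Same vector with one entry replaced by a massive activation.
spiked = plain[:-1] + [500.0]

k_plain = kurtosis(plain)
k_spiked = kurtosis(spiked)
print(k_plain, k_spiked)
```

One outlier among a thousand entries is enough to raise kurtosis by orders of magnitude, which is why the statistic serves as a proxy for massive activations.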

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Because softpick removes the sum-to-one constraint, it may allow attention weights to be interpreted more directly as unnormalized importance scores rather than probabilities.
  • Lower kurtosis in activations could reduce the dynamic range required for fixed-point arithmetic, suggesting possible gains in custom hardware accelerators.
  • The observed sparsity might combine with existing pruning methods to produce attention layers that are both sparse and sink-free.
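
The dynamic-range argument can be illustrated with a toy symmetric absmax quantizer. This is a generic scheme chosen for the sketch, not the quantization method evaluated in the paper, and the data are synthetic.

```python
def quantize_dequantize(values, bits):
    # Symmetric absmax quantization: one scale for the whole tensor.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [round(v / scale) * scale for v in values]

def mean_abs_error(values, bits):
    deq = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, deq)) / len(values)

# Synthetic activations spread over [-1, 1].
typical = [((i * 37) % 101 - 50) / 50.0 for i in range(512)]
# Same tensor with one massive activation stretching the range.
with_outlier = typical[:-1] + [200.0]

err_typical = mean_abs_error(typical, bits=4)
err_outlier = mean_abs_error(with_outlier, bits=4)
print(err_typical, err_outlier)
```

With the outlier present, the 4-bit grid is stretched so far that every ordinary value rounds to zero, so the error on the bulk of the tensor grows sharply.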

Load-bearing premise

The changes to normalization and rectification preserve the model's ability to learn useful representations without hidden degradation that would appear only at larger scales or on different tasks.

What would settle it

Training a model larger than 1.8B parameters with softpick and measuring whether attention sink rate remains zero or whether downstream benchmark scores fall below the softmax baseline at matched precision.
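
A sketch of how such a sink-rate check might be computed follows. It assumes one common definition: a head counts as a sink head when the attention its queries give to the first token, averaged over queries, exceeds a threshold. The threshold of 0.3, the function names, and the toy attention maps are all assumptions for illustration.

```python
def first_token_share(attn):
    # attn: causal attention map for one head; row i holds the weights
    # query i assigns to keys 0..i.
    return sum(row[0] for row in attn) / len(attn)

def sink_rate(heads, threshold=0.3):
    # Fraction of heads whose average first-token weight exceeds threshold.
    flagged = sum(1 for attn in heads if first_token_share(attn) > threshold)
    return flagged / len(heads)

sink_head = [[1.0], [0.9, 0.1], [0.8, 0.1, 0.1], [0.85, 0.05, 0.05, 0.05]]
diffuse_head = [[1.0], [0.1, 0.9], [0.0, 0.2, 0.8], [0.0, 0.05, 0.25, 0.7]]
# Rows need not sum to one, as with a non-normalized attention function.
unnormalized_head = [[0.0], [0.0, 0.6], [0.0, 0.1, 0.5], [0.0, 0.0, 0.2, 0.4]]

heads = [sink_head, diffuse_head, unnormalized_head]
rate = sink_rate(heads)
print(rate)
```

Running the same measurement on a larger softpick model and on a matched softmax baseline would give the comparison described above.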

Original abstract

We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and creates sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion shows how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code: https://github.com/zaydzuhri/softpick-attention

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces softpick, a rectified and non-sum-to-one replacement for softmax in transformer attention. It claims this eliminates attention sinks and massive activations, as shown by consistent 0% sink rates, lower kurtosis in hidden states, sparser attention maps, and superior performance of quantized models (especially at low bit precisions) on 340M and 1.8B parameter models. The work also discusses potential benefits for quantization, low-precision training, sparsity, pruning, and interpretability.

Significance. If the empirical results hold under broader validation, softpick could meaningfully advance efficient transformer deployment by addressing attention sink and activation magnitude issues without auxiliary mechanisms, with particular value for quantization and low-precision regimes. The consistent findings across two model scales provide a solid starting point, though the absence of machine-checked proofs or parameter-free derivations limits the strength of the assessment.

major comments (2)
  1. [Experiments] Experiments section: The reported 0% sink rate and quantization gains on 340M/1.8B models are promising, but the manuscript supplies no ablation that isolates the effect of removing the sum-to-one constraint or the rectification step. This is load-bearing for the central claim because downstream linear layers now receive non-convex-combination inputs whose magnitude and variance are no longer bounded by standard softmax properties; any implicit rescaling learned during training could mask degradation that appears only at greater depth or width.
  2. [Method] Method and analysis sections: There is no examination of how the modified attention matrix interacts with LayerNorm or residual connections under the non-sum-to-one property. The skeptic's concern that this may alter gradient flow and output scaling in ways invisible at 1.8B scale is therefore unaddressed, leaving open the possibility that the observed kurtosis reduction and sink elimination rely on compensatory mechanisms that fail to generalize.
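
The scaling concern in both major comments can be made slightly more precise with a short worked bound. It assumes softpick weights of the form $a_{ij} = \mathrm{ReLU}(e^{x_{ij}} - 1) / \sum_k |e^{x_{ik}} - 1|$, which is recalled from the paper rather than stated on this page; the symbols $o_i$, $v_j$, and $s_i$ are notation introduced for this sketch.

```latex
\begin{align}
o_i &= \sum_j a_{ij} v_j, \qquad a_{ij} \ge 0, \qquad s_i = \sum_j a_{ij}, \\
\lVert o_i \rVert &\le \sum_j a_{ij} \lVert v_j \rVert \le s_i \max_j \lVert v_j \rVert, \\
\text{softmax: } s_i &= 1, \qquad
\text{softpick (assumed form): } 0 \le s_i \le 1
\quad \text{since } \mathrm{ReLU}(t) \le |t| .
\end{align}
```

Under that assumption the upper bound on the attention output is no looser than with softmax. What changes is that $s_i$ can shrink toward zero, so the open question is how downstream layers respond to outputs that can be much smaller than a convex combination would be.
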
minor comments (2)
  1. [Abstract] Abstract: The claim that softpick 'creates sparse attention maps' would benefit from a quantitative metric (e.g., average number of non-zero entries per row or entropy) rather than qualitative description.
  2. [Experiments] The link to the GitHub repository is provided but the manuscript does not specify which exact training hyperparameters, data mixtures, or evaluation protocols were used, making reproduction and direct comparison to softmax baselines more difficult.
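
The two metrics suggested in the first minor comment could be computed along these lines. This is a sketch with hypothetical function names; renormalizing each row before taking the entropy is a choice made here so that rows that do not sum to one can still be compared.

```python
import math

def nonzero_per_row(attn, tol=0.0):
    # Average number of entries per row that exceed tol.
    return sum(sum(1 for w in row if w > tol) for row in attn) / len(attn)

def row_entropy(row):
    # Shannon entropy (in nats) of a row after renormalizing it to sum to one.
    total = sum(row)
    if total == 0.0:
        return 0.0
    probs = [w / total for w in row if w > 0.0]
    return -sum(p * math.log(p) for p in probs)

dense = [[0.25, 0.25, 0.25, 0.25]]
sparse = [[0.0, 0.6, 0.0, 0.2]]

print(nonzero_per_row(dense), row_entropy(dense[0]))
print(nonzero_per_row(sparse), row_entropy(sparse[0]))
```

Reporting either number per layer and per head would turn the qualitative sparsity claim into something a reader can compare across models.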

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and will revise the paper to incorporate additional ablations and analysis as suggested.

point-by-point responses
  1. Referee: [Experiments] Experiments section: The reported 0% sink rate and quantization gains on 340M/1.8B models are promising, but the manuscript supplies no ablation that isolates the effect of removing the sum-to-one constraint or the rectification step. This is load-bearing for the central claim because downstream linear layers now receive non-convex-combination inputs whose magnitude and variance are no longer bounded by standard softmax properties; any implicit rescaling learned during training could mask degradation that appears only at greater depth or width.

    Authors: We agree that an ablation isolating the rectification step from the removal of the sum-to-one constraint would strengthen the central claims. In the revised manuscript we will add experiments comparing four variants on the 340M model: standard softmax, rectified but sum-to-one, non-rectified and non-sum-to-one, and the full softpick. These controls will clarify whether the observed zero sink rate and kurtosis reduction require both modifications or can be achieved by either alone. We note that the current results already show consistent gains across two scales, but the requested ablations will make the contribution of each design choice explicit. revision: yes

  2. Referee: [Method] Method and analysis sections: There is no examination of how the modified attention matrix interacts with LayerNorm or residual connections under the non-sum-to-one property. The skeptic's concern that this may alter gradient flow and output scaling in ways invisible at 1.8B scale is therefore unaddressed, leaving open the possibility that the observed kurtosis reduction and sink elimination rely on compensatory mechanisms that fail to generalize.

    Authors: We acknowledge that a direct examination of interactions with LayerNorm and residual streams under the non-sum-to-one regime is missing. In the revision we will add a short analysis subsection that reports per-layer attention-output norms, hidden-state kurtosis trajectories, and gradient-norm statistics for both softmax and softpick models. These measurements will be included for both the 340M and 1.8B scales to address concerns about compensatory mechanisms and generalization. While the existing empirical results already demonstrate stable training and improved quantized performance, the added diagnostics should mitigate the skeptic's concern. revision: yes

standing simulated objections not resolved
  • The manuscript provides no machine-checked proofs or parameter-free derivations for the elimination of attention sinks; the work is empirical and such formal results remain an open direction for future theoretical analysis.

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements

full rationale

The paper introduces softpick as a rectified non-sum-to-one attention replacement and validates it solely through training runs and measurements on 340M and 1.8B models. Reported outcomes (0% sink rate, reduced kurtosis, quantization gains) are direct observations from the trained networks rather than any derivation, prediction, or first-principles claim that reduces to fitted parameters or self-citations by construction. No load-bearing mathematical steps or uniqueness theorems appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The method rests on the standard transformer attention formulation plus the new rectification and non-normalization choices. No free parameters are explicitly fitted in the abstract description. The invented entity is the softpick function itself, introduced to solve the sink and activation problems.

axioms (1)
  • domain assumption: Standard scaled dot-product attention mechanism remains valid when softmax is replaced by a non-sum-to-one rectified function.
    Implicit in treating softpick as a drop-in replacement without retraining the entire architecture from scratch.
invented entities (1)
  • softpick function (no independent evidence)
    purpose: Rectified non-normalized replacement for softmax to eliminate attention sinks and massive activations.
    Newly defined operation whose properties are demonstrated empirically rather than derived from prior theory.

pith-pipeline@v0.9.0 · 5665 in / 1365 out tokens · 38661 ms · 2026-05-22T18:06:48.875350+00:00 · methodology

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 conditional novelty 7.0

    Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.

  2. A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.

  3. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  4. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  5. Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor

    cs.LG 2026-05 conditional novelty 6.0

    Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.

  6. OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

    cs.LG 2026-05 unverdicted novelty 6.0

    OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.

  7. Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers

    cs.LG 2026-03 unverdicted novelty 6.0

    Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.

  8. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    cs.CL 2025-05 conditional novelty 6.0

    Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.

  9. Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

    cs.CL 2026-02 unverdicted novelty 5.0

    Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

  10. Attention Sinks and Outliers in Attention Residuals

    cs.LG 2026-05 unverdicted novelty 4.0

    OASIS mitigates attention sinks and outliers in AttnResidual models via Softmax1 null space and inter-layer signals, reporting norm and kurtosis reductions plus large gains in quantized perplexity and task accuracy.