Softpick: No Attention Sink, No Massive Activations with Rectified Softmax
Pith reviewed 2026-05-22 18:06 UTC · model grok-4.3
The pith
Softpick replaces softmax with a rectified non-sum-to-one function to remove attention sinks and massive activations in transformers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Softpick is a rectified function that does not sum to one and serves as a direct substitute for softmax inside attention. When substituted, it produces a 0% attention sink rate at both tested model sizes (340M and 1.8B parameters), yields hidden states with markedly lower kurtosis, and generates sparse attention maps. Models using softpick maintain or exceed baseline accuracy after quantization, particularly at low bit precisions, while preserving training dynamics at the reported scales.
What carries the argument
Softpick, a rectified non-sum-to-one replacement for softmax that removes the normalization constraint while preserving non-negativity.
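A minimal sketch of the function, assuming the formula quoted further down this page: Softpick(x)_i = ReLU(e^{x_i} - 1) / sum_j |e^{x_j} - 1|. The max-shift and the `eps` guard are assumptions added here for numerical safety, and the helper names are illustrative rather than taken from the authors' repository.

```python
import math

def softpick(scores, eps=1e-6):
    """Softpick over one row of attention scores (a list of floats).

    Formula as quoted on this page:
        ReLU(exp(x_i) - 1) / sum_j |exp(x_j) - 1|
    The shift by m and the eps guard are assumptions, not confirmed details.
    """
    m = max(max(scores), 0.0)  # shift so that no exponent is positive
    # exp(x - m) - exp(-m) equals (exp(x) - 1) * exp(-m);
    # the common factor exp(-m) cancels between numerator and denominator.
    shifted = [math.exp(x - m) - math.exp(-m) for x in scores]
    denom = sum(abs(s) for s in shifted) + eps
    return [max(s, 0.0) / denom for s in shifted]

def softmax(scores):
    """Standard softmax, for comparison."""
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

row = [2.0, 0.0, -1.0, 1.0]
picked = softpick(row)
print(picked)             # scores at or below zero receive exactly 0.0
print(sum(picked))        # below 1: mass in the denominator from negative scores is not redistributed
print(sum(softmax(row)))  # softmax rows always sum to one
```

Under this reading, a row whose scores are all negative maps to all zeros, so a head can attend to nothing instead of parking weight on one token.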
If this is right
- Attention maps become sparse without extra regularization.
- Hidden-state kurtosis drops, reducing the incidence of massive activations.
- Quantized softpick models show higher accuracy than quantized softmax models, especially below 8 bits.
- The method supports further work on low-precision training and pruning because the attention mechanism no longer forces dense normalization.
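To make the kurtosis claim concrete, here is a small sketch of how a single massive activation inflates kurtosis. The page does not say which estimator the paper uses; this assumes the plain Pearson (non-excess) population kurtosis, and the toy data is invented for illustration.

```python
def kurtosis(values):
    """Pearson kurtosis: fourth central moment divided by squared variance."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0.0:
        raise ValueError("kurtosis is undefined for constant input")
    fourth = sum((v - mean) ** 4 for v in values) / n
    return fourth / (var ** 2)

# Toy "hidden state": 1000 entries alternating between +1 and -1.
flat = [(-1.0) ** i for i in range(1000)]
# Same vector with one entry replaced by a massive activation.
spiky = flat[:-1] + [500.0]

print(kurtosis(flat))   # 1.0
print(kurtosis(spiky))  # far larger: one outlier dominates the fourth moment
```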
Where Pith is reading between the lines
- Because softpick removes the sum-to-one constraint, it may allow attention weights to be interpreted more directly as unnormalized importance scores rather than probabilities.
- Lower kurtosis in activations could reduce the dynamic range required for fixed-point arithmetic, suggesting possible gains in custom hardware accelerators.
- The observed sparsity might combine with existing pruning methods to produce attention layers that are both sparse and sink-free.
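The dynamic-range point can be illustrated with a toy quantizer. This sketch assumes symmetric absmax integer quantization, a common scheme that this page does not confirm the paper uses; the function names and data are hypothetical.

```python
def quantize_dequantize(values, bits):
    """Round-trip values through symmetric absmax integer quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0.0:
        return list(values)
    # Clip to the signed integer range, then map back to floats.
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def mean_abs_error(values, bits):
    restored = quantize_dequantize(values, bits)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

# 21 evenly spaced values in [-1, 1], then the same values plus one outlier.
smooth = [i / 10.0 - 1.0 for i in range(21)]
with_outlier = smooth + [100.0]

print(mean_abs_error(smooth, 4))
print(mean_abs_error(with_outlier, 4))  # larger: the outlier stretches the scale
```

With the outlier present, the scale is set by the single large value, so every small value rounds to zero at 4 bits. That is the mechanism by which lower kurtosis could help low-precision formats.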
Load-bearing premise
The changes to normalization and rectification preserve the model's ability to learn useful representations without hidden degradation that would appear only at larger scales or on different tasks.
What would settle it
Training a model larger than 1.8B parameters with softpick and measuring whether attention sink rate remains zero or whether downstream benchmark scores fall below the softmax baseline at matched precision.
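The page does not define how sink rate is computed. As an illustration only, the sketch below assumes one common convention: a head counts as a sink head when the average attention its queries place on the first token exceeds a threshold. The function name, the nested-list layout, and the 0.3 threshold are all assumptions, not values confirmed by this page.

```python
def sink_rate(attn_maps, threshold=0.3):
    """Fraction of heads whose mean attention to token 0 exceeds threshold.

    attn_maps: list of heads; each head is a list of rows (one per query);
    each row is a list of attention weights over keys.
    """
    sink_heads = 0
    for head in attn_maps:
        first_token_mass = sum(row[0] for row in head) / len(head)
        if first_token_mass > threshold:
            sink_heads += 1
    return sink_heads / len(attn_maps)

# Two toy causal heads over four tokens.
sink_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.1, 0.1, 0.0],
    [0.7, 0.1, 0.1, 0.1],
]
local_head = [
    [1.0, 0.0, 0.0, 0.0],
    [0.05, 0.95, 0.0, 0.0],
    [0.05, 0.05, 0.9, 0.0],
    [0.05, 0.05, 0.05, 0.85],
]

print(sink_rate([sink_head, local_head]))  # 0.5
```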
Original abstract
We introduce softpick, a rectified, not sum-to-one, drop-in replacement for softmax in transformer attention mechanisms that eliminates attention sink and massive activations. Our experiments with 340M and 1.8B parameter models demonstrate that softpick achieves 0% sink rate consistently. The softpick transformers produce hidden states with significantly lower kurtosis and creates sparse attention maps. Quantized models using softpick outperform softmax on standard benchmarks, with a particularly pronounced advantage at lower bit precisions. Our analysis and discussion shows how softpick has the potential to open new possibilities for quantization, low-precision training, sparsity optimization, pruning, and interpretability. Our code: https://github.com/zaydzuhri/softpick-attention
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces softpick, a rectified and non-sum-to-one replacement for softmax in transformer attention. It claims this eliminates attention sinks and massive activations, as shown by consistent 0% sink rates, lower kurtosis in hidden states, sparser attention maps, and superior performance of quantized models (especially at low bit precisions) on 340M and 1.8B parameter models. The work also discusses potential benefits for quantization, low-precision training, sparsity, pruning, and interpretability.
Significance. If the empirical results hold under broader validation, softpick could meaningfully advance efficient transformer deployment by addressing attention sink and activation magnitude issues without auxiliary mechanisms, with particular value for quantization and low-precision regimes. The consistent findings across two model scales provide a solid starting point, though the absence of machine-checked proofs or parameter-free derivations limits the strength of the assessment.
major comments (2)
- [Experiments] Experiments section: The reported 0% sink rate and quantization gains on 340M/1.8B models are promising, but the manuscript supplies no ablation that isolates the effect of removing the sum-to-one constraint or the rectification step. This is load-bearing for the central claim because downstream linear layers now receive non-convex-combination inputs whose magnitude and variance are no longer bounded by standard softmax properties; any implicit rescaling learned during training could mask degradation that appears only at greater depth or width.
- [Method] Method and analysis sections: There is no examination of how the modified attention matrix interacts with LayerNorm or residual connections under the non-sum-to-one property. The skeptic's concern that this may alter gradient flow and output scaling in ways invisible at 1.8B scale is therefore unaddressed, leaving open the possibility that the observed kurtosis reduction and sink elimination rely on compensatory mechanisms that fail to generalize.
minor comments (2)
- [Abstract] Abstract: The claim that softpick 'creates sparse attention maps' would benefit from a quantitative metric (e.g., average number of non-zero entries per row or entropy) rather than qualitative description.
- [Experiments] The link to the GitHub repository is provided but the manuscript does not specify which exact training hyperparameters, data mixtures, or evaluation protocols were used, making reproduction and direct comparison to softmax baselines more difficult.
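The two sparsity metrics suggested in the first minor comment can be sketched as follows. Because softpick rows need not sum to one, the entropy here is computed on each row after renormalizing it by its own mass; that choice, the `tol` parameter, and the toy matrices are assumptions made for illustration.

```python
import math

def mean_nonzeros_per_row(attn, tol=0.0):
    """Average count of entries strictly greater than tol in each row."""
    return sum(sum(1 for w in row if w > tol) for row in attn) / len(attn)

def mean_row_entropy(attn):
    """Average Shannon entropy (in nats) of each row, renormalized to sum to one."""
    total = 0.0
    for row in attn:
        mass = sum(row)
        if mass == 0.0:
            continue  # an all-zero row contributes zero entropy
        total += -sum((w / mass) * math.log(w / mass) for w in row if w > 0.0)
    return total / len(attn)

dense = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
sparse = [[0.7, 0.0, 0.0, 0.2], [0.0, 0.0, 0.0, 0.0]]

print(mean_nonzeros_per_row(dense), mean_row_entropy(dense))
print(mean_nonzeros_per_row(sparse), mean_row_entropy(sparse))
```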
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and will revise the paper to incorporate additional ablations and analysis as suggested.
- Referee: [Experiments] Experiments section: The reported 0% sink rate and quantization gains on 340M/1.8B models are promising, but the manuscript supplies no ablation that isolates the effect of removing the sum-to-one constraint or the rectification step. This is load-bearing for the central claim because downstream linear layers now receive non-convex-combination inputs whose magnitude and variance are no longer bounded by standard softmax properties; any implicit rescaling learned during training could mask degradation that appears only at greater depth or width.
  Authors: We agree that an ablation isolating the rectification step from the removal of the sum-to-one constraint would strengthen the central claims. In the revised manuscript we will add experiments comparing four variants on the 340M model: standard softmax, rectified but sum-to-one, non-rectified and non-sum-to-one, and the full softpick. These controls will clarify whether the observed zero sink rate and kurtosis reduction require both modifications or can be achieved by either alone. We note that the current results already show consistent gains across two scales, but the requested ablations will make the contribution of each design choice explicit.
  revision: yes
- Referee: [Method] Method and analysis sections: There is no examination of how the modified attention matrix interacts with LayerNorm or residual connections under the non-sum-to-one property. The skeptic's concern that this may alter gradient flow and output scaling in ways invisible at 1.8B scale is therefore unaddressed, leaving open the possibility that the observed kurtosis reduction and sink elimination rely on compensatory mechanisms that fail to generalize.
  Authors: We acknowledge that a direct examination of interactions with LayerNorm and residual streams under the non-sum-to-one regime is missing. In the revision we will add a short analysis subsection that reports per-layer attention-output norms, hidden-state kurtosis trajectories, and gradient-norm statistics for both softmax and softpick models. These measurements will be included for both the 340M and 1.8B scales to address concerns about compensatory mechanisms and generalization. While the existing empirical results already demonstrate stable training and improved quantized performance, the added diagnostics should mitigate the skeptic's concern.
  revision: yes
- The manuscript provides no machine-checked proofs or parameter-free derivations for the elimination of attention sinks; the work is empirical and such formal results remain an open direction for future theoretical analysis.
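The four ablation variants named in the first author response could look roughly like the sketch below. The page gives a formula only for the full softpick, so the two intermediate variants are one plausible reading of "rectified but sum-to-one" and "non-rectified and non-sum-to-one"; every function name here is hypothetical. The unshifted exponentials are used for brevity and suit only small scores.

```python
import math

def relu(v):
    return max(v, 0.0)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(x - m) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rectified_sum_to_one(scores):
    # Assumed variant: keep the ReLU, renormalize so a non-empty row sums to one.
    num = [relu(math.exp(x) - 1.0) for x in scores]
    total = sum(num)
    return [n / total if total > 0.0 else 0.0 for n in num]

def signed_non_sum_to_one(scores, eps=1e-6):
    # Assumed variant: keep the softpick denominator, drop the ReLU.
    num = [math.exp(x) - 1.0 for x in scores]
    denom = sum(abs(n) for n in num) + eps
    return [n / denom for n in num]  # weights may be negative

def softpick(scores, eps=1e-6):
    # Full softpick as quoted on this page; eps is an added guard.
    num = [math.exp(x) - 1.0 for x in scores]
    denom = sum(abs(n) for n in num) + eps
    return [relu(n) / denom for n in num]

VARIANTS = {
    "softmax": softmax,
    "rectified, sum-to-one": rectified_sum_to_one,
    "non-rectified, non-sum-to-one": signed_non_sum_to_one,
    "softpick": softpick,
}

row = [2.0, 0.0, -1.0, 1.0]
for name, fn in VARIANTS.items():
    weights = fn(row)
    print(name, [round(w, 3) for w in weights], round(sum(weights), 3))
```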
Circularity Check
No circularity: results are direct empirical measurements
The paper introduces softpick as a rectified non-sum-to-one attention replacement and validates it solely through training runs and measurements on 340M and 1.8B models. Reported outcomes (0% sink rate, reduced kurtosis, quantization gains) are direct observations from the trained networks rather than any derivation, prediction, or first-principles claim that reduces to fitted parameters or self-citations by construction. No load-bearing mathematical steps or uniqueness theorems appear in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Standard scaled dot-product attention mechanism remains valid when softmax is replaced by a non-sum-to-one rectified function.
invented entities (1)
- softpick function: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  $\mathrm{Softpick}(x)_i = \frac{\mathrm{ReLU}(e^{x_i} - 1)}{\sum_j |e^{x_j} - 1|}$ ... removes the strict requirement to sum to one, which is the main cause of attention sink.
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  We propose the softpick function as a drop-in replacement to softmax in attention.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 10 Pith papers
- A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
  Massive activations first appear in a single ME Layer due to RMSNorm and FFN, remain invariant thereafter, and a simple softening method raises LLM performance while reducing attention sinks.
- A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
  Massive activations originate in a specific ME Layer across LLM families; reducing their token rigidity via a targeted method boosts performance and mitigates attention sinks.
- FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
  FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
  The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
- Most Transformer Modifications Still Do Not Transfer at 1-3B: A 2020-2026 Update to Narang et al. (2021) with Downstream Evaluation and a Noise Floor
  Empirical update to prior work shows most of 20 recent Transformer modifications do not transfer at 1-3B scales when measured with downstream CLIMB-12 tasks, multi-seed noise floor, and cross-scale stability.
- OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
  OScaR mitigates token norm imbalance via canalized rotation and omni-token scaling to enable near-lossless INT2 KV cache quantization with up to 3x decoding speedup and 5.3x memory reduction.
- Attention Sinks Induce Gradient Sinks: Massive Activations as Gradient Regulators in Transformers
  Attention sinks induce gradient sinks under causal masking, with massive activations serving as adaptive RMSNorm regulators that attenuate localized gradient pressure in Transformer training.
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
  Applying a head-specific sigmoid gate after SDPA in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks.
- Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
  Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.
- Attention Sinks and Outliers in Attention Residuals
  OASIS mitigates attention sinks and outliers in AttnResidual models via Softmax1 null space and inter-layer signals, reporting norm and kurtosis reductions plus large gains in quantized perplexity and task accuracy.