SSA: Improving Performance With a Better Scoring Function
Pith reviewed 2026-05-18 21:55 UTC · model grok-4.3
The pith
Replacing Softmax with Scaled Signed Averaging in attention improves transformer generalization under distribution shifts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We analyze the generalization failures of transformers in in-context learning under distribution shifts and identify Softmax as a contributing factor. We propose Scaled Signed Averaging (SSA) as a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.
What carries the argument
Scaled Signed Averaging (SSA), a scoring function for attention that uses scaled signed averages to replace the standard Softmax operation.
If this is right
- SSA leads to significantly better results on in-context learning tasks with distribution shifts.
- Transformer models using SSA outperform those with Softmax on multiple NLP benchmarks.
- SSA works effectively in both decoder-only and encoder-only architectures.
- The approach addresses generalization issues without changing the overall transformer structure.
Where Pith is reading between the lines
- This suggests that attention mechanisms can be tuned for better generalization without major architectural changes.
- The method might extend to other attention-based models in vision or other domains.
- Long-term, it could influence the design of more stable large language models.
Load-bearing premise
Softmax is a primary cause of the observed generalization failures under distribution shifts rather than a symptom or unrelated factor.
What would settle it
If replacing Softmax with SSA fails to improve or worsens results on the distribution-shifted ICL tasks and benchmarks, the central claim would be falsified.
Original abstract
While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as a contributing factor. We propose Scaled Signed Averaging (SSA), a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes generalization failures of transformers under distribution shifts in in-context learning (ICL), identifies the Softmax scoring function in attention as a contributing factor, and proposes Scaled Signed Averaging (SSA) as a novel alternative attention scoring function. It claims that SSA mitigates these failures, significantly improves performance on the authors' ICL tasks, and outperforms standard Softmax-based transformers on several NLP benchmarks and linguistic probing tasks in both decoder-only and encoder-only architectures.
Significance. If the empirical claims are substantiated with quantitative results, error bars, and controlled comparisons, the work could provide a useful alternative to Softmax attention and help isolate the role of scoring functions in ICL robustness. The proposal of an independent new scoring function (SSA) is a concrete, falsifiable contribution that avoids circularity in its definition.
major comments (2)
- [Abstract] The central claim of significant improvement and outperformance on ICL tasks and NLP benchmarks comes with no quantitative results, error bars, baseline comparisons, or derivation of SSA, so the empirical support for the performance gains cannot be assessed from the available text.
- [Experimental section] No controlled ablation is reported that holds training protocol, architecture, and optimization fixed while substituting only the attention scoring function (SSA vs. Softmax vs. other non-Softmax alternatives such as ReLU attention or sparsemax) across the distribution-shift regimes. Without this, the attribution of ICL failures primarily to Softmax remains correlational.
minor comments (1)
- [Method] The mathematical definition of SSA should be presented with explicit equations early in the method section to support reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.
Point-by-point responses
- Referee: [Abstract] The central claim of significant improvement and outperformance on ICL tasks and NLP benchmarks comes with no quantitative results, error bars, baseline comparisons, or derivation of SSA, so the empirical support for the performance gains cannot be assessed from the available text.
  Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will expand the abstract to report key quantitative improvements (e.g., accuracy gains on the distribution-shift ICL tasks and on the NLP benchmarks), note the presence of error bars in the main results, reference the primary baselines, and briefly characterize the SSA derivation. The full derivation and all supporting numbers already appear in Sections 3 and 4; the abstract revision will simply surface them at the front of the paper.
  Revision: yes
- Referee: [Experimental section] No controlled ablation is reported that holds training protocol, architecture, and optimization fixed while substituting only the attention scoring function (SSA vs. Softmax vs. other non-Softmax alternatives such as ReLU attention or sparsemax) across the distribution-shift regimes. Without this, the attribution of ICL failures primarily to Softmax remains correlational.
  Authors: We acknowledge that a fully isolated ablation strengthens causal claims. While our existing experiments already keep architecture and optimization fixed when swapping scoring functions, we will add an explicit controlled ablation subsection in the revised version. This new study will hold training protocol, model size, optimizer, and data identical while comparing SSA, Softmax, ReLU attention, and sparsemax on the same distribution-shift ICL regimes, thereby providing direct evidence for the role of the scoring function.
  Revision: yes
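The controlled comparison discussed above amounts to holding everything fixed and swapping only the scoring function. The following is a minimal sketch of that idea for a single attention query, not the authors' experimental code. Several pieces are assumptions made for illustration: the SSA form is our reading of the parametrized form quoted in the Lean-theorem section of this page, the default values of `b` and `n` are arbitrary, "ReLU attention" is taken to be ReLU divided by sequence length (one common variant), and the names `attend` and `SCORING_FUNCTIONS` are hypothetical.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [v / s for v in e]

def ssa(z, b=1.0, n=1.0):
    # Assumed reading of SSA: f(x) = (1 + b|x|)^(sgn(x) * n), then normalise.
    f = [(1.0 + b * abs(x)) ** (math.copysign(1.0, x) * n) for x in z]
    s = sum(f)
    return [v / s for v in f]

def relu_attention(z):
    # Assumed variant: ReLU divided by sequence length (weights need not sum to 1).
    return [max(x, 0.0) / len(z) for x in z]

def sparsemax(z):
    # Euclidean projection of z onto the probability simplex.
    zs = sorted(z, reverse=True)
    cumsum, k_z, support_sum = 0.0, 0, 0.0
    for k, v in enumerate(zs, start=1):
        cumsum += v
        if 1.0 + k * v > cumsum:  # position k is still in the support
            k_z, support_sum = k, cumsum
    tau = (support_sum - 1.0) / k_z
    return [max(x - tau, 0.0) for x in z]

def attend(query, keys, values, score_fn):
    """Single-query scaled dot-product attention; only score_fn varies."""
    scale = math.sqrt(len(query))
    logits = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = score_fn(logits)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]

SCORING_FUNCTIONS = {
    "softmax": softmax,
    "ssa": ssa,
    "relu": relu_attention,
    "sparsemax": sparsemax,
}

# Toy inputs shared by every condition, so any difference comes from score_fn.
query = [1.0, 0.0]
keys = [[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

results = {name: attend(query, keys, values, fn)
           for name, fn in SCORING_FUNCTIONS.items()}
for name, out in results.items():
    print(name, out)
```

In a real ablation the same substitution would happen inside every attention head of the trained model, with training protocol, model size, optimizer, and data left identical across conditions.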
Circularity Check
SSA introduced as independent ansatz; no derivation reduces to inputs or self-citations
full rationale
The manuscript proposes Scaled Signed Averaging (SSA) as a novel attention scoring function motivated by observed ICL failures under distribution shift. The abstract and described claims present SSA as an explicit replacement for softmax without any equations that define SSA in terms of the target performance metrics or that fit parameters to the evaluation data and then relabel the fit as a prediction. No self-citation chain is invoked to justify uniqueness or to smuggle in an ansatz; the central argument remains an empirical claim that is externally falsifiable by substituting other scoring functions under matched protocols. Because the derivation chain does not collapse to its own inputs by construction, the paper is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Softmax is a contributing factor to transformer failures under simple distribution shifts in in-context learning
invented entities (1)
- Scaled Signed Averaging (SSA): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
  unclear: relation between the paper passage and the cited Recognition theorem.
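We replace the exponential in the scoring function by a parametrized form: $x \mapsto (1 + b|x|)^{\operatorname{sgn}(x)\,n}$ ... $\operatorname{score}(z_i) = \frac{(1 + b|z_i|)^{\operatorname{sgn}(z_i)\,n}}{\sum_j (1 + b|z_j|)^{\operatorname{sgn}(z_j)\,n}}$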
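To make the quoted form concrete, here is a minimal sketch in plain Python. It assumes the passage means f(x) = (1 + b|x|)^(sgn(x)·n), normalised by the sum of f over all logits; the quote is abbreviated, so this reading is an assumption. The function name `ssa_scores` and the default values `b=1.0`, `n=1.0` are illustrative and not taken from the paper.

```python
import math

def ssa_scores(z, b=1.0, n=1.0):
    """Normalise a list of attention logits with the assumed SSA form."""
    # log f(x) = sgn(x) * n * log(1 + b|x|); working in log space avoids overflow.
    log_f = [math.copysign(1.0, x) * n * math.log1p(b * abs(x)) for x in z]
    m = max(log_f)
    f = [math.exp(v - m) for v in log_f]
    total = sum(f)
    return [v / total for v in f]

def softmax(z):
    """Standard Softmax, for comparison."""
    m = max(z)
    e = [math.exp(x - m) for x in z]
    total = sum(e)
    return [v / total for v in e]

logits = [10.0, 0.0]
print("ssa    ", ssa_scores(logits))
print("softmax", softmax(logits))
```

With b = n = 1, logits [1, -1] give f values 2 and 1/2, so the scores are about 0.8 and 0.2. Under this reading the unnormalised weight grows polynomially in the logit rather than exponentially, so a single large logit dominates less than it does under Softmax.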
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A. Training details
Additional training information: We use the Adam optimizer (Diederik, ...) and a learning rate of 10^-4 for all models.
Computational resources: We used Nvidia A100 GPUs to train the different versions of transformer models from scratch, and Nvidia Volta GPUs (V100, 7.8 TFLOPS DP) for the fine-tuning of LLaMA 3.1 8B involved in these experiments.
Fine-tuning: LLaMA 3.1 8B was fine-tuned on 26,000 randomly generated sequenc...