
arxiv: 2508.14685 · v4 · submitted 2025-08-20 · 💻 cs.CL

SSA: Improving Performance With a Better Scoring Function

Pith reviewed 2026-05-18 21:55 UTC · model grok-4.3

classification 💻 cs.CL
keywords attention scoring · softmax · in-context learning · distribution shifts · transformer generalization · NLP benchmarks · Scaled Signed Averaging

The pith

Replacing Softmax with Scaled Signed Averaging in attention improves transformer generalization under distribution shifts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformer models often fail to generalize when the data distribution shifts slightly, even though they show strong in-context learning on training distributions. The authors trace part of this issue to the Softmax function inside the attention layers. They develop Scaled Signed Averaging as a new scoring function to replace Softmax. Experiments show that models with this change handle distribution shifts more reliably and achieve higher scores on various NLP tasks and probes. These gains appear in both decoder-only and encoder-only setups.

Core claim

We analyze the generalization failures of transformers in in-context learning under distribution shifts and identify Softmax as a contributing factor. We propose Scaled Signed Averaging (SSA) as a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.

What carries the argument

Scaled Signed Averaging (SSA), a scoring function for attention that uses scaled signed averages to replace the standard Softmax operation.
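
To make concrete where a scoring function sits inside attention, here is a minimal Python sketch of single-query scaled dot-product attention with the scoring function passed in as an argument. It implements only the standard Softmax; the SSA formula is not given on this page and is not reproduced or guessed at here. The names `attend` and `scoring_fn`, and the toy vectors, are illustrative assumptions rather than identifiers or data from the paper.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
ScoringFn = Callable[[Sequence[float]], Vector]


def softmax(scores: Sequence[float]) -> Vector:
    """Standard Softmax: exponentiate, then normalize so the weights sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query: Vector, keys: List[Vector], values: List[Vector],
           scoring_fn: ScoringFn = softmax) -> Vector:
    """Single-query scaled dot-product attention with a pluggable scoring function."""
    scale = 1.0 / math.sqrt(len(query))
    # Raw query-key similarities, one per key.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    # The scoring function turns raw similarities into mixing weights.
    weights = scoring_fn(scores)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


# Toy inputs (hypothetical, for illustration only).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output = attend(query, keys, values)   # Softmax-weighted mix of the values
weights = softmax([1.0, 2.0, 3.0])     # larger scores receive larger weights
```

Under this framing, replacing Softmax amounts to passing a different `scoring_fn`, which is consistent with the claim that the change leaves the rest of the transformer structure untouched.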

If this is right

  • SSA leads to significantly better results on in-context learning tasks with distribution shifts.
  • Transformer models using SSA outperform those with Softmax on multiple NLP benchmarks.
  • SSA works effectively in both decoder-only and encoder-only architectures.
  • The approach addresses generalization issues without changing the overall transformer structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This suggests that attention mechanisms can be tuned for better generalization without major architectural changes.
  • The method might extend to other attention-based models in vision or other domains.
  • Long-term, it could influence the design of more stable large language models.

Load-bearing premise

Softmax is a primary cause of the observed generalization failures under distribution shifts rather than a symptom or unrelated factor.

What would settle it

If replacing Softmax with SSA fails to improve or worsens results on the distribution-shifted ICL tasks and benchmarks, the central claim would be falsified.

Figures

Figures reproduced from arXiv: 2508.14685 by Jérôme Bolte, Nicholas Asher, Omar Naim, Swarnadeep Bhar.

Figure 1: Plots showing examples of boundary values
Figure 2: Attention maps for an ICL example for the task "every" of type
Figure 3: (Left) Comparison plot showing the evolution of MSE for SSA and Softmax-based models (12L8AH)
Figure 4: Heatmaps showing the evolution of errors for the 12L8AH model with Softmax (Left) and SSA (Right)
Figure 5: Figures showing plots for the base func
Figure 6: Heatmap showing the evolution of errors for
Figure 8: Heatmap showing the evolution of errors for
Figure 9: Plot showing how many tokens have a norm
Figure 11: Evolution of MSE for various models with
Original abstract

While transformer models exhibit strong in-context learning (ICL) abilities, they often fail to generalize under simple distribution shifts. We analyze these failures and identify Softmax, the scoring function in the attention mechanism, as a contributing factor. We propose Scaled Signed Averaging (SSA), a novel attention scoring function that mitigates these failures. SSA significantly improves performance on our ICL tasks and outperforms transformer models with Softmax on several NLP benchmarks and linguistic probing tasks, in both decoder-only and encoder-only architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper analyzes generalization failures of transformers under distribution shifts in in-context learning (ICL), identifies the Softmax scoring function in attention as a contributing factor, and proposes Scaled Signed Averaging (SSA) as a novel alternative attention scoring function. It claims that SSA mitigates these failures, significantly improves performance on the authors' ICL tasks, and outperforms standard Softmax-based transformers on several NLP benchmarks and linguistic probing tasks in both decoder-only and encoder-only architectures.

Significance. If the empirical claims are substantiated with quantitative results, error bars, and controlled comparisons, the work could provide a useful alternative to Softmax attention and help isolate the role of scoring functions in ICL robustness. The proposal of an independent new scoring function (SSA) is a concrete, falsifiable contribution that avoids circularity in its definition.

major comments (2)
  1. [Abstract] The central claim of significant improvement and outperformance on ICL tasks and NLP benchmarks supplies no quantitative results, error bars, baseline comparisons, or derivation of SSA, so the empirical support for the performance gains cannot be assessed from the available text.
  2. [Experimental section] No controlled ablation is reported that holds training protocol, architecture, and optimization fixed while substituting only the attention scoring function (SSA vs. Softmax vs. other non-softmax alternatives such as ReLU attention or sparsemax) across the distribution-shift regimes; without this, the attribution of ICL failures primarily to Softmax remains correlational.
minor comments (1)
  1. [Method] The mathematical definition of SSA should be presented with explicit equations early in the method section to support reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each major comment below and outline the revisions we will make to strengthen the paper.

  1. Referee: [Abstract] The central claim of significant improvement and outperformance on ICL tasks and NLP benchmarks supplies no quantitative results, error bars, baseline comparisons, or derivation of SSA, so the empirical support for the performance gains cannot be assessed from the available text.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised manuscript we will expand the abstract to report key quantitative improvements (e.g., accuracy gains on the distribution-shift ICL tasks and on the NLP benchmarks), note the presence of error bars in the main results, reference the primary baselines, and briefly characterize the SSA derivation. The full derivation and all supporting numbers already appear in Sections 3 and 4; the abstract revision will simply surface them at the front of the paper. revision: yes

  2. Referee: [Experimental section] No controlled ablation is reported that holds training protocol, architecture, and optimization fixed while substituting only the attention scoring function (SSA vs. Softmax vs. other non-softmax alternatives such as ReLU attention or sparsemax) across the distribution-shift regimes; without this, the attribution of ICL failures primarily to Softmax remains correlational.

    Authors: We acknowledge that a fully isolated ablation strengthens causal claims. While our existing experiments already keep architecture and optimization fixed when swapping scoring functions, we will add an explicit controlled ablation subsection in the revised version. This new study will hold training protocol, model size, optimizer, and data identical while comparing SSA, Softmax, ReLU attention, and sparsemax on the same distribution-shift ICL regimes, thereby providing direct evidence for the role of the scoring function. revision: yes
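
The shape of such an ablation, in which only the scoring function varies, can be sketched as follows. This is a toy Python illustration under loud assumptions: it applies Softmax, a length-normalized ReLU variant, and sparsemax to the same raw scores, and models a "distribution shift" simply as rescaling those scores. None of this is the paper's protocol, no training is involved, and SSA is omitted because its definition is not given on this page. The names `run_ablation`, `relu_scaled`, and `shift_scale` are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def relu_scaled(scores: Sequence[float]) -> Vector:
    """ReLU scores divided by sequence length (one common ReLU-attention
    variant; an assumption here). The weights need not sum to 1."""
    n = len(scores)
    return [max(s, 0.0) / n for s in scores]


def sparsemax(scores: Sequence[float]) -> Vector:
    """Euclidean projection of the scores onto the probability simplex."""
    z = sorted(scores, reverse=True)
    cumsum, tau = 0.0, 0.0
    for k, zk in enumerate(z, start=1):
        cumsum += zk
        if 1 + k * zk > cumsum:      # index k is still in the support
            tau = (cumsum - 1) / k   # threshold for the current support size
    return [max(s - tau, 0.0) for s in scores]


SCORING_FNS: Dict[str, Callable[[Sequence[float]], Vector]] = {
    "softmax": softmax,
    "relu": relu_scaled,
    "sparsemax": sparsemax,
}


def run_ablation(scores: Sequence[float],
                 shift_scale: float) -> Dict[str, Dict[str, Vector]]:
    """Apply every scoring function to identical inputs, so that the
    scoring function is the only thing that differs between arms."""
    shifted = [shift_scale * s for s in scores]  # toy stand-in for a shift
    return {
        name: {"in_dist": fn(scores), "shifted": fn(shifted)}
        for name, fn in SCORING_FNS.items()
    }


results = run_ablation([0.5, 0.4, -1.0], shift_scale=10.0)
for name, arms in results.items():
    print(name, arms["in_dist"], arms["shifted"])
```

In a real study each arm would be a separately trained model sharing seed, data, optimizer, and architecture; the dictionary of scoring functions is the single point of variation that the referee's comment asks for.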

Circularity Check

0 steps flagged

SSA introduced as independent ansatz; no derivation reduces to inputs or self-citations


The manuscript proposes Scaled Signed Averaging (SSA) as a novel attention scoring function motivated by observed ICL failures under distribution shift. The abstract and described claims present SSA as an explicit replacement for softmax without any equations that define SSA in terms of the target performance metrics or that fit parameters to the evaluation data and then relabel the fit as a prediction. No self-citation chain is invoked to justify uniqueness or to smuggle in an ansatz; the central argument remains an empirical claim that is externally falsifiable by substituting other scoring functions under matched protocols. Because the derivation chain does not collapse to its own inputs by construction, the paper is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that Softmax contributes to the identified ICL failures and on the newly introduced SSA scoring function; the abstract mentions no free parameters.

axioms (1)
  • domain assumption: Softmax is a contributing factor to transformer failures under simple distribution shifts in in-context learning
    Stated directly in the abstract, which identifies Softmax as a contributing factor.
invented entities (1)
  • Scaled Signed Averaging (SSA) · no independent evidence
    purpose: Alternative attention scoring function to replace Softmax and mitigate generalization failures
    Newly proposed in the paper with no independent evidence provided in the abstract.



