A Appendix A.1 Theoretical Foundations In this section, we summarize previous results establishing why sigmoid attention leads to more stable training than softmax attention

URL: https://arxiv · 2021 · arXiv 2303.06296

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models

cs.LG · 2026-04-29 · unverdicted · novelty 4.0

Sigmoid attention replaces softmax in single-cell foundation models to deliver better representations, faster training, and stability, backed by bounded derivatives, a diagonal Jacobian, and a new efficient GPU kernel.

citing papers explorer

Showing 2 of 2 citing papers.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
cs.CL · 2026-05-11 · unverdicted · none · ref 18
Better Models, Faster Training: Sigmoid Attention for single-cell Foundation Models
cs.LG · 2026-04-29 · unverdicted · none · ref 22
