Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
On the expressivity role of LayerNorm in Transformers' attention
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
citation-role summary
background 1
citation-polarity summary
fields
cs.LG 2roles
background 1polarities
background 1representative citing papers
citing papers explorer
-
Nexusformer: Nonlinear Attention Expansion for Stable and Inheritable Transformer Scaling
Nexusformer uses a three-stage nonlinear mapping in attention to enable stable, inheritable scaling of transformers, matching baseline perplexity with up to 41.5% less compute when growing from 240M to 440M parameters.
- Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation