Mixture of Activations mixes activation functions token-adaptively in FFNs via lightweight gates, strictly more expressive than fixed or learnable activations, and yields lower pretraining loss from 0.12B to 2B models.
Polynomial composition activations: Unleashing the dynamics of large language models.arXiv preprint arXiv:2411.03884, 2024
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations
Mixture of Activations mixes activation functions token-adaptively in FFNs via lightweight gates, strictly more expressive than fixed or learnable activations, and yields lower pretraining loss from 0.12B to 2B models.