On the Number of Linear Regions of Deep Neural Networks. arXiv, 2014.
2 Pith papers cite this work.
Representative citing papers:
- Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
  Applying a head-specific sigmoid gate after scaled dot-product attention (SDPA) in LLMs boosts performance and stability by adding non-linearity and query-dependent sparse modulation while reducing attention sinks (a minimal sketch of this gating follows the list).
- A Transfer Learning Evaluation of Deep Neural Networks for Image Classification
  Empirical comparison of transfer learning performance across eleven pre-trained models on five image datasets using accuracy, time, and size metrics (a fine-tuning sketch also follows the list).
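To make the first summary concrete, here is a minimal PyTorch sketch of head-specific sigmoid gating applied to the SDPA output. The module name `GatedAttention`, the `gate_proj` layer, and computing the gate from the same hidden state that produces the query are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Causal multi-head attention with a head-specific sigmoid gate after SDPA."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        # Assumed gate projection: one sigmoid gate value per channel of each
        # head, computed from the same hidden state that produces the query.
        self.gate_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (batch, seq, d_model) -> (batch, heads, seq, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # Query-dependent gate, applied elementwise to each head's SDPA output;
        # near-zero gates let a head switch itself off token by token, which is
        # the sparse modulation the summary describes.
        g = torch.sigmoid(self.gate_proj(x))
        g = g.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        gated = g * attn
        # Merge heads and project back to the model dimension.
        gated = gated.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(gated)

layer = GatedAttention(d_model=64, n_heads=4)
y = layer(torch.randn(2, 16, 64))  # (batch=2, seq=16, d_model=64)
```

The gate sits between SDPA and the output projection, so it adds non-linearity after the attention-weighted value mixing rather than inside the softmax itself.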
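For the second summary, a minimal sketch of the usual transfer-learning recipe being evaluated: load an ImageNet-pre-trained backbone, freeze its features, and swap in a new classifier head. ResNet-18 stands in here for any of the eleven models, and `num_classes` is a placeholder; neither is taken from the paper's protocol.

```python
import torch.nn as nn
from torchvision import models

num_classes = 10  # hypothetical target-dataset class count
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
for p in model.parameters():
    p.requires_grad = False                      # freeze pre-trained features
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head
```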