Citing papers explorer: 4 representative citing papers.
-
Gating Enables Curvature: A Geometric Expressivity Gap in Attention
Gated attention can induce non-flat, positively curved geometries on the Fisher-Rao manifold of representations, a regime that ungated attention cannot reach.
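A minimal sketch of the architectural contrast behind this claim, assuming "gating" means an input-dependent elementwise sigmoid gate on the attention output; the gate parameterization Wg is an illustrative assumption, not necessarily the paper's construction:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    # Standard (ungated) scaled dot-product self-attention.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

def gated_attention(X, Wq, Wk, Wv, Wg):
    # Same head, but an input-dependent sigmoid gate in (0, 1) modulates
    # the output elementwise; per the summary, this extra nonlinearity is
    # what separates the gated geometry from the ungated one.
    G = 1.0 / (1.0 + np.exp(-(X @ Wg)))
    return G * attention(X, Wq, Wk, Wv)
```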
-
How Does Attention Help? Insights from Random Matrices on Signal Recovery from Sequence Models
Attention pooling produces a bulk spectrum given by free multiplicative convolution and exhibits two phase transitions for signal recovery; the optimal pooling weights are the top eigenvector of the positional correlation matrix R.
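A numerical sketch of the eigenvector claim under an assumed rank-one data model (a planted positional profile u plus noise; the model and all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, n = 16, 64, 500                    # positions, feature dim, samples
u = rng.standard_normal(T)
u /= np.linalg.norm(u)                   # planted positional signal profile

# Each sequence carries u in a random feature direction, plus noise.
X = (u[None, :, None] * rng.standard_normal((n, 1, d))
     + 0.5 * rng.standard_normal((n, T, d)))

# Empirical positional correlation matrix R (T x T), averaged over
# samples and feature dimensions.
R = np.einsum('ntd,nsd->ts', X, X) / (n * d)

# Pooling weights = top eigenvector of R, per the summary's claim.
w = np.linalg.eigh(R)[1][:, -1]
pooled = np.einsum('t,ntd->nd', w, X)    # weighted pooling over positions
print(abs(w @ u))                        # close to 1: w recovers u
```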
-
Continuous transformations of probability measures and their transport representations
Transformations F of probability measures that are Lipschitz continuous with respect to the Wasserstein distance admit continuous transport maps f(·, μ) such that F(μ) = f(·, μ)_# μ, i.e., F(μ) is the pushforward of μ under f(·, μ).
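A toy instance of the transport representation, with measures encoded as sample arrays; the specific map f(x, μ) = x + mean(μ) is an illustrative choice (the theorem asserts the existence of such an f for any Wasserstein-Lipschitz F):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, mu_samples):
    # Transport map for this toy example: shift by the measure's mean.
    return x + mu_samples.mean()

def F(mu_samples):
    # Measure-level transformation, realized as the pushforward of mu
    # under f(., mu): apply the transport map to every sample.
    return f(mu_samples, mu_samples)

mu = rng.normal(loc=2.0, scale=1.0, size=10_000)  # empirical measure
nu = F(mu)                                        # F(mu) = f(., mu)_# mu
print(nu.mean(), 2 * mu.mean())                   # equal: both ~4.0
```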
-
Progressive Approximation in Deep Residual Networks: Theory and Validation
Residual networks admit progressive approximation trajectories with monotonically decreasing error, enabling useful predictions from any depth after a single training run via the LPA principle.
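A minimal sketch of the anytime-prediction pattern the summary describes: tap the representation after every residual block and decode each tap with one shared readout. The tanh blocks and linear readout are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(2)
d, depth = 8, 6
Ws = [0.1 * rng.standard_normal((d, d)) for _ in range(depth)]

def forward_with_taps(x):
    # Return the representation after every residual block, so a
    # prediction is available at any depth from a single trained model.
    taps, h = [], x
    for W in Ws:
        h = h + np.tanh(W @ h)           # residual update x_{l+1} = x_l + h_l(x_l)
        taps.append(h.copy())
    return taps

readout = rng.standard_normal(d)         # shared linear readout (assumption)
x = rng.standard_normal(d)
preds = [readout @ h for h in forward_with_taps(x)]
print(preds)                             # one usable prediction per depth
```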