Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.
arXiv preprint arXiv:2411.04551 , year=
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.
Derives transformer-like dual-filter inference layers from first-principles optimal control on nonlinear discrete and linear Gaussian sequence models.
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
Lipschitz continuous transformations F of probability measures w.r.t. Wasserstein distance admit continuous transport maps f(·,μ) such that F(μ) = f(·,μ)_# μ.
Explicit constructions approximate diffeomorphisms and pushforward measures via continuity equation flows with perceptron velocity fields of piecewise constant weights, using polar-like decompositions and probabilistic methods for regular maps.
In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.
Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.
Models multi-head transformer data flow as time-dependent Wasserstein gradient flows of an attention-capturing interaction energy, with proofs on omega-limit stationary points and stability under weight and input perturbations.
citing papers explorer
-
Reachability and asymptotics of Gaussian Transformer dynamics
Gaussian distributions are invariant under the mean-field Transformer flow, reducing infinite-dimensional dynamics to a bilinear control system on mean and covariance with explicit reachability and stability results.
-
Transformer-like Inference from Optimal Control
Derives transformer-like dual-filter inference layers from first-principles optimal control on nonlinear discrete and linear Gaussian sequence models.
-
Perceptrons and localization of attention's mean-field landscape
In the mean-field limit of attention with perceptron blocks, critical points of the energy landscape are generically atomic and localized on subsets of the unit sphere.
-
Exact Sequence Interpolation with Transformers
Transformers with O(sum m^j) blocks and O(d sum m^j) parameters can exactly interpolate any finite dataset of input sequences in R^d to output sequences of lengths m^j.
-
Propagation of Chaos in Contextual Flow Maps
Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.
-
Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows
Models multi-head transformer data flow as time-dependent Wasserstein gradient flows of an attention-capturing interaction energy, with proofs on omega-limit stationary points and stability under weight and input perturbations.