The Mean-Field Dynamics of Transformers
8 Pith papers cite this work.

Citing papers explorer

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
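
As a rough numerical companion (the gap-counting function N_n is from the paper; the Gaussian logits and the ε = 1/β counting window below are choices of mine for illustration), this sketch shows how the number of near-maximal scores governs the inverse temperature at which softmax attention concentrates on its argmax:

```python
# Minimal sketch (not the paper's construction): count scores within 1/beta
# of the maximum and watch softmax mass concentrate as beta grows.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
scores = rng.standard_normal(n)            # attention logits for one query

for beta in [1, 10, 100, 1000]:
    w = np.exp(beta * (scores - scores.max()))   # stabilized softmax weights
    w /= w.sum()
    # N_n(eps): how many scores lie within eps of the maximum ("gap counting")
    N_eps = int(np.sum(scores >= scores.max() - 1.0 / beta))
    print(f"beta={beta:5d}  mass on argmax={w.max():.3f}  N_n(1/beta)={N_eps}")
```

When many near-ties remain in the 1/β window, the mass on the argmax stays small until β is much larger, which is the mechanism behind N_n-dependent critical scales.
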
Uniform Scaling Limits in AdamW-Trained Transformers
AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1} + L^{-1/3} H^{-1/2}) as depth L and the number of heads H increase, with bounds independent of the number of tokens.
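
A minimal sketch of the depth-scaling mechanism only (the AdamW training, backward variables, and head scaling in the paper are not modeled; the block update f and the 1/L step rule are stand-ins I chose): a residual update with step size 1/L is an Euler scheme for an ODE, so hidden states approach the ODE flow at rate O(L^{-1}) as depth grows.

```python
# Residual update x_{l+1} = x_l + f(x_l)/L as an Euler scheme for dx/dt = f(x).
import numpy as np

def f(x):
    return np.tanh(x)  # stand-in for one transformer block's update

def forward(x0, L):
    x = x0
    for _ in range(L):
        x = x + f(x) / L
    return x

x0 = np.array([0.5, -1.0])
ref = forward(x0, 1 << 16)   # very fine discretization as the "ODE" reference
for L in [8, 32, 128, 512]:
    err = np.linalg.norm(forward(x0, L) - ref)
    print(f"L={L:4d}  |x_L - x_ODE| = {err:.2e}")   # decays roughly like 1/L
```
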
Diffusion Operator Geometry of Feedforward Representations
A Gaussian-kernel diffusion operator on feature clouds yields closed-form class affinities and spectra in Gaussian models, with provably smooth observables under perturbations.
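
A minimal sketch, with the two-class Gaussian mixture and the bandwidth sigma chosen by me for illustration: build the Gaussian-kernel affinity on a feature cloud, normalize it into a row-stochastic diffusion operator, and read off class affinities and the leading spectrum.

```python
# Gaussian-kernel diffusion operator on a toy two-class feature cloud.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 5)),    # class 0 feature cloud
               rng.normal(+2, 1, (100, 5))])   # class 1 feature cloud
labels = np.repeat([0, 1], 100)

sigma = 1.5
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-D2 / (2 * sigma**2))                      # Gaussian kernel
P = K / K.sum(axis=1, keepdims=True)                  # diffusion (Markov) operator

same = labels[:, None] == labels[None, :]
print(f"within-class affinity {K[same].mean():.3f}, cross-class {K[~same].mean():.3f}")

eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]
print("leading diffusion eigenvalues:", np.round(eigvals[:5], 3))
```
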
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
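
A toy Euler–Maruyama sketch, not the paper's SPDE setup: one-dimensional particles with a softmax self-attention drift and a single shared (common) multiplicative noise, where the noise strength sigma and the score kernel are my choices. When the common noise is strong relative to the drift, pairwise differences contract and the dispersion energy decays.

```python
# Interacting particles with attention drift plus a *common* multiplicative noise.
import numpy as np

rng = np.random.default_rng(2)
n, dt, T = 50, 1e-3, 5.0
sigma = 2.0             # common-noise strength (coercive regime)
beta = 1.0              # inverse temperature of the toy attention kernel
x = rng.normal(0, 1, n)

for step in range(int(T / dt)):
    W = np.exp(beta * np.outer(x, x))            # toy attention scores
    W /= W.sum(axis=1, keepdims=True)
    drift = W @ x - x                            # pull toward the attended mean
    dB = np.sqrt(dt) * rng.standard_normal()     # ONE Brownian increment, shared
    x = x + drift * dt + sigma * x * dB          # multiplicative common noise
    if step % 1000 == 0:
        print(f"t={step * dt:4.1f}  dispersion energy={np.var(x):.3e}")
```
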
Spectral Selection in Symmetric Self-Attention Dynamics
Symmetric self-attention dynamics select the dominant eigendirection of the value matrix V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.
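
A minimal simulation sketch (keeping tokens on the unit sphere and taking a diagonal V are simplifications of mine): with one dominant positive eigenvalue, the self-attention flow acts like a power iteration, and tokens align with the corresponding eigendirection.

```python
# Symmetric self-attention dynamics selecting the top eigendirection of V.
import numpy as np

rng = np.random.default_rng(3)
d, n, dt, steps = 4, 30, 0.05, 2000
V = np.diag([2.0, 0.5, 0.3, 0.1])   # symmetric V, one dominant positive eigenvalue
e1 = np.eye(d)[0]                    # its eigendirection

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for _ in range(steps):
    A = np.exp(X @ X.T)                              # attention weights, beta = 1
    A /= A.sum(axis=1, keepdims=True)
    X = X + dt * (A @ X) @ V                         # symmetric self-attention drift
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # keep tokens on the sphere

print(f"mean |<x_i, e1>| after the flow: {np.abs(X @ e1).mean():.3f}  (1.0 = aligned)")
```
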
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).
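
A one-dimensional numerical sketch with key = query = value = identity (my simplification of the stated projection): as β grows, the softmax attention output approaches its hardmax push-forward, and the empirical Wasserstein-1 distance between the two can be compared against the √(log(β+1)/β) factor in the bound.

```python
# Softmax attention output vs. its hardmax (beta -> infinity) push-forward in 1-D.
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(500)

def attention_output(x, beta):
    S = beta * np.outer(x, x)                 # scores with K = Q = V = identity
    S -= S.max(axis=1, keepdims=True)         # stabilized softmax
    W = np.exp(S)
    W /= W.sum(axis=1, keepdims=True)
    return W @ x

hard = x[np.argmax(np.outer(x, x), axis=1)]   # hardmax push-forward
for beta in [1, 10, 100, 1000]:
    soft = attention_output(x, beta)
    w1 = np.abs(np.sort(soft) - np.sort(hard)).mean()   # 1-D Wasserstein-1
    rate = np.sqrt(np.log(beta + 1) / beta)
    print(f"beta={beta:5d}  W1={w1:.4f}  sqrt(log(beta+1)/beta)={rate:.4f}")
```
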
Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention
Multi-head self-attention dynamics admit a non-decreasing energy functional under suitable score-matrix conditions, with closed-form clustering thresholds and monotonic entropy production in simplified regimes.
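
A minimal monotonicity check (the energy functional below and the symmetric single-head setup are stand-ins of mine, not necessarily the paper's): along a discretized self-attention flow with symmetric scores, track E(X) = Σ_i log Σ_j exp(β⟨x_i, x_j⟩) and report its smallest increment, which should be non-negative up to discretization error.

```python
# Monitoring a candidate energy functional along discretized attention dynamics.
import numpy as np

rng = np.random.default_rng(5)
n, d, beta, dt = 40, 3, 2.0, 0.01
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

def energy(X):
    # E(X) = sum_i log sum_j exp(beta <x_i, x_j>)
    return np.log(np.exp(beta * X @ X.T).sum(axis=1)).sum()

increments = []
prev = energy(X)
for _ in range(500):
    W = np.exp(beta * X @ X.T)
    W /= W.sum(axis=1, keepdims=True)
    X = X + dt * (W @ X - X)                         # drift toward attended mean
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    cur = energy(X)
    increments.append(cur - prev)
    prev = cur

print(f"final energy per token: {prev / n:.3f}")
print(f"smallest energy increment: {min(increments):.2e}  (>= 0 expected up to O(dt^2))")
```
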
Explanation of Dynamic Physical Field Predictions using WassersteinGrad: Application to Autoregressive Weather Forecasting
WassersteinGrad aggregates perturbed gradient attribution maps via their entropic Wasserstein barycenter to avoid blurring from geometric shifts in explanations of autoregressive weather forecasts.
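
A minimal sketch built on the POT library (this is my reading of the summary, not the paper's exact algorithm; the toy 1-D "attribution maps" and the regularization reg are illustrative): perturbed maps are normalized to histograms and aggregated via their entropic Wasserstein barycenter, which keeps a sharp, correctly placed peak where a pixelwise mean would blur geometrically shifted explanations.

```python
# Entropic Wasserstein barycenter of shifted toy attribution maps (POT library).
import numpy as np
import ot  # Python Optimal Transport: pip install pot

grid = np.linspace(0.0, 1.0, 100)

def bump(center, width=0.05):
    """A toy attribution map: a normalized Gaussian bump on the grid."""
    a = np.exp(-((grid - center) ** 2) / (2 * width**2))
    return a / a.sum()

# Perturbed explanations: the same feature, geometrically shifted per run.
maps = np.stack([bump(0.42), bump(0.50), bump(0.58)], axis=1)  # (dim, n_maps)

M = ot.dist(grid.reshape(-1, 1), grid.reshape(-1, 1))  # squared Euclidean cost
M /= M.max()

bary = ot.bregman.barycenter(maps, M, reg=2e-3)   # entropic Wasserstein barycenter
mean = maps.mean(axis=1)                          # naive pixelwise aggregation

print(f"barycenter peak     {bary.max():.4f} at x={grid[bary.argmax()]:.2f}")
print(f"pixelwise-mean peak {mean.max():.4f} at x={grid[mean.argmax()]:.2f}")
```

The barycenter transports the shifted bumps onto a single sharp mode, while the pixelwise mean spreads the same mass across all three shifted locations.
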