The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
The mean-field dynamics of transformers
14 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (14)
verdicts: UNVERDICTED (14)
roles: background (2)
polarities: background (2)
representative citing papers
In every dimension d≥2 there exists a unique β_*^{(d)}>0 such that the uniform density on the sphere is the unique global minimizer of the USA free energy up to the linear-stability threshold K_# for β≤β_*, yielding a continuous transition, while for β>β_* the uniform density is not globally minimiz
Attention in minimal transformers under corruption performs in-context empirical Bayes via a single kernel-weighted posterior mean step followed by depth-driven particle dynamics refinement.
AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.
AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H increase, with bounds independent of token number.
A Gaussian-kernel diffusion operator on feature clouds yields closed-form class affinities and spectra in Gaussian models, with provably smooth observables under perturbations.
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
Symmetric self-attention dynamics select the dominant eigendirection of V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.
Introduces the Patnaik-Pearson intrinsic dimension estimator, proves some of its properties, relates it to HTSR/SETOL for Pareto spectra, and applies it to track embedding dimension evolution in BERT-base and DeepSeek-R1-Distill-Qwen-1.
Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.
In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).
Multi-head self-attention dynamics admit a non-decreasing energy functional under suitable score-matrix conditions, with closed-form clustering thresholds and monotonic entropy production in simplified regimes.
WassersteinGrad aggregates perturbed gradient attribution maps via their entropic Wasserstein barycenter to avoid blurring from geometric shifts in explanations of autoregressive weather forecasts.
Auxiliary variables prevent mode collapse in mean-field transformers, with the limit distribution being the pushforward of the auxiliary distribution, and positional encoding and prompt insertion have universality of representation.
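Several of the summaries above concern clustering and energy monotonicity in self-attention particle dynamics on the sphere. The following is a minimal sketch of that behavior, not the method of any listed paper. It assumes the unnormalized self-attention (USA) model with identity query, key and value matrices, where each token follows the tangential projection of (1/n) Σ_j exp(β⟨x_i, x_j⟩) x_j. The function names, the explicit Euler scheme with renormalization, and all parameter values are illustrative assumptions.

```python
import numpy as np


def usa_step(x, beta, dt):
    """One explicit Euler step of the assumed USA dynamics, then renormalize."""
    gram = x @ x.T                        # pairwise inner products <x_i, x_j>
    weights = np.exp(beta * gram)         # unnormalized attention weights
    drift = weights @ x / x.shape[0]      # (1/n) sum_j exp(beta <x_i, x_j>) x_j
    # Remove the radial component so the drift is tangent to the sphere at x_i.
    radial = np.sum(drift * x, axis=1, keepdims=True)
    x_new = x + dt * (drift - radial * x)
    # Retract back onto the unit sphere.
    return x_new / np.linalg.norm(x_new, axis=1, keepdims=True)


def interaction_energy(x, beta):
    """Energy (1 / (2 beta n^2)) sum_{i,j} exp(beta <x_i, x_j>) (assumed form)."""
    n = x.shape[0]
    return np.exp(beta * (x @ x.T)).sum() / (2.0 * beta * n * n)


def simulate(n=32, d=3, beta=1.0, dt=0.05, steps=2000, seed=0):
    """Evolve n random tokens on the unit sphere in R^d; return start and end states."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, d))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    x0 = x.copy()
    for _ in range(steps):
        x = usa_step(x, beta, dt)
    return x0, x


if __name__ == "__main__":
    x0, xT = simulate()
    print("energy at start:", interaction_energy(x0, 1.0))
    print("energy at end:  ", interaction_energy(xT, 1.0))
    print("smallest pairwise inner product at end:", (xT @ xT.T).min())
```

Under these assumptions the dynamics are a rescaled gradient ascent of the interaction energy, so with a small step size the printed energy should rise between start and end. A smallest pairwise inner product close to 1 would indicate that the tokens have collapsed toward a single cluster.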
citing papers explorer
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models