The Mean-Field Dynamics of Transformers
8 Pith papers cite this work.

Citing papers explorer

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention
The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.
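
As a rough numerical companion (the gap-counting function N_n is from the paper; the Gaussian logits and the ε = 1/β counting window below are choices of mine for illustration), this sketch shows how the number of near-maximal scores governs the inverse temperature at which softmax attention concentrates on its argmax:

```python
# Minimal sketch (not the paper's construction): count scores within 1/beta
# of the maximum and watch softmax mass concentrate as beta grows.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
scores = rng.standard_normal(n)            # attention logits for one query

for beta in [1, 10, 100, 1000]:
    w = np.exp(beta * (scores - scores.max()))   # stabilized softmax weights
    w /= w.sum()
    # N_n(eps): how many scores lie within eps of the maximum ("gap counting")
    N_eps = int(np.sum(scores >= scores.max() - 1.0 / beta))
    print(f"beta={beta:5d}  mass on argmax={w.max():.3f}  N_n(1/beta)={N_eps}")
```

When many near-ties remain in the 1/β window, the mass on the argmax stays small until β is much larger, which is the mechanism behind N_n-dependent critical scales.
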
Uniform Scaling Limits in AdamW-Trained Transformers
AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1} + L^{-1/3} H^{-1/2}) as depth L and the number of heads H increase, with bounds independent of the number of tokens.
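
A minimal sketch of the depth-scaling mechanism only (the AdamW training, backward variables, and head scaling in the paper are not modeled; the block update f and the 1/L step rule are stand-ins I chose): a residual update with step size 1/L is an Euler scheme for an ODE, so hidden states approach the ODE flow at rate O(L^{-1}) as depth grows.

```python
# Residual update x_{l+1} = x_l + f(x_l)/L as an Euler scheme for dx/dt = f(x).
import numpy as np

def f(x):
    return np.tanh(x)  # stand-in for one transformer block's update

def forward(x0, L):
    x = x0
    for _ in range(L):
        x = x + f(x) / L
    return x

x0 = np.array([0.5, -1.0])
ref = forward(x0, 1 << 16)   # very fine discretization as the "ODE" reference
for L in [8, 32, 128, 512]:
    err = np.linalg.norm(forward(x0, L) - ref)
    print(f"L={L:4d}  |x_L - x_ODE| = {err:.2e}")   # decays roughly like 1/L
```
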
Diffusion Operator Geometry of Feedforward Representations
A Gaussian-kernel diffusion operator on feature clouds yields closed-form class affinities and spectra in Gaussian models, with provably smooth observables under perturbations.
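
A minimal sketch, with the two-class Gaussian mixture and the bandwidth sigma chosen by me for illustration: build the Gaussian-kernel affinity on a feature cloud, normalize it into a row-stochastic diffusion operator, and read off class affinities and the leading spectrum.

```python
# Gaussian-kernel diffusion operator on a toy two-class feature cloud.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 1, (100, 5)),    # class 0 feature cloud
               rng.normal(+2, 1, (100, 5))])   # class 1 feature cloud
labels = np.repeat([0, 1], 100)

sigma = 1.5
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
K = np.exp(-D2 / (2 * sigma**2))                      # Gaussian kernel
P = K / K.sum(axis=1, keepdims=True)                  # diffusion (Markov) operator

same = labels[:, None] == labels[None, :]
print(f"within-class affinity {K[same].mean():.3f}, cross-class {K[~same].mean():.3f}")

eigvals = np.sort(np.linalg.eigvals(P).real)[::-1]
print("leading diffusion eigenvalues:", np.round(eigvals[:5], 3))
```
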
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.
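
A toy Euler–Maruyama sketch, not the paper's SPDE setup: one-dimensional particles with a softmax self-attention drift and a single shared (common) multiplicative noise, where the noise strength sigma and the score kernel are my choices. When the common noise is strong relative to the drift, pairwise differences contract and the dispersion energy decays.

```python
# Interacting particles with attention drift plus a *common* multiplicative noise.
import numpy as np

rng = np.random.default_rng(2)
n, dt, T = 50, 1e-3, 5.0
sigma = 2.0             # common-noise strength (coercive regime)
beta = 1.0              # inverse temperature of the toy attention kernel
x = rng.normal(0, 1, n)

for step in range(int(T / dt)):
    W = np.exp(beta * np.outer(x, x))            # toy attention scores
    W /= W.sum(axis=1, keepdims=True)
    drift = W @ x - x                            # pull toward the attended mean
    dB = np.sqrt(dt) * rng.standard_normal()     # ONE Brownian increment, shared
    x = x + drift * dt + sigma * x * dB          # multiplicative common noise
    if step % 1000 == 0:
        print(f"t={step * dt:4.1f}  dispersion energy={np.var(x):.3e}")
```
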
Spectral Selection in Symmetric Self-Attention Dynamics
Symmetric self-attention dynamics select the dominant eigendirection of the value matrix V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.
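
A minimal simulation sketch (keeping tokens on the unit sphere and taking a diagonal V are simplifications of mine): with one dominant positive eigenvalue, the self-attention flow acts like a power iteration, and tokens align with the corresponding eigendirection.

```python
# Symmetric self-attention dynamics selecting the top eigendirection of V.
import numpy as np

rng = np.random.default_rng(3)
d, n, dt, steps = 4, 30, 0.05, 2000
V = np.diag([2.0, 0.5, 0.3, 0.1])   # symmetric V, one dominant positive eigenvalue
e1 = np.eye(d)[0]                    # its eigendirection

X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

for _ in range(steps):
    A = np.exp(X @ X.T)                              # attention weights, beta = 1
    A /= A.sum(axis=1, keepdims=True)
    X = X + dt * (A @ X) @ V                         # symmetric self-attention drift
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # keep tokens on the sphere

print(f"mean |<x_i, e1>| after the flow: {np.abs(X @ e1).mean():.3f}  (1.0 = aligned)")
```
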
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).
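
A one-dimensional numerical sketch with key = query = value = identity (my simplification of the stated projection): as β grows, the softmax attention output approaches its hardmax push-forward, and the empirical Wasserstein-1 distance between the two can be compared against the √(log(β+1)/β) factor in the bound.

```python
# Softmax attention output vs. its hardmax (beta -> infinity) push-forward in 1-D.
import numpy as np

rng = np.random.default_rng(4)
x = rng.standard_normal(500)

def attention_output(x, beta):
    S = beta * np.outer(x, x)                 # scores with K = Q = V = identity
    S -= S.max(axis=1, keepdims=True)         # stabilized softmax
    W = np.exp(S)
    W /= W.sum(axis=1, keepdims=True)
    return W @ x

hard = x[np.argmax(np.outer(x, x), axis=1)]   # hardmax push-forward
for beta in [1, 10, 100, 1000]:
    soft = attention_output(x, beta)
    w1 = np.abs(np.sort(soft) - np.sort(hard)).mean()   # 1-D Wasserstein-1
    rate = np.sqrt(np.log(beta + 1) / beta)
    print(f"beta={beta:5d}  W1={w1:.4f}  sqrt(log(beta+1)/beta)={rate:.4f}")
```
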
Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention
Multi-head self-attention dynamics admit a non-decreasing energy functional under suitable score-matrix conditions, with closed-form clustering thresholds and monotonic entropy production in simplified regimes.
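
A minimal monotonicity check (the energy functional below and the symmetric single-head setup are stand-ins of mine, not necessarily the paper's): along a discretized self-attention flow with symmetric scores, track E(X) = Σ_i log Σ_j exp(β⟨x_i, x_j⟩) and report its smallest increment, which should be non-negative up to discretization error.

```python
# Monitoring a candidate energy functional along discretized attention dynamics.
import numpy as np

rng = np.random.default_rng(5)
n, d, beta, dt = 40, 3, 2.0, 0.01
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)

def energy(X):
    # E(X) = sum_i log sum_j exp(beta <x_i, x_j>)
    return np.log(np.exp(beta * X @ X.T).sum(axis=1)).sum()

increments = []
prev = energy(X)
for _ in range(500):
    W = np.exp(beta * X @ X.T)
    W /= W.sum(axis=1, keepdims=True)
    X = X + dt * (W @ X - X)                         # drift toward attended mean
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    cur = energy(X)
    increments.append(cur - prev)
    prev = cur

print(f"final energy per token: {prev / n:.3f}")
print(f"smallest energy increment: {min(increments):.2e}  (>= 0 expected up to O(dt^2))")
```
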
Explanation of Dynamic Physical Field Predictions using WassersteinGrad: Application to Autoregressive Weather Forecasting
WassersteinGrad aggregates perturbed gradient attribution maps via their entropic Wasserstein barycenter to avoid blurring from geometric shifts in explanations of autoregressive weather forecasts.
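
A minimal sketch built on the POT library (this is my reading of the summary, not the paper's exact algorithm; the toy 1-D "attribution maps" and the regularization reg are illustrative): perturbed maps are normalized to histograms and aggregated via their entropic Wasserstein barycenter, which keeps a sharp, correctly placed peak where a pixelwise mean would blur geometrically shifted explanations.

```python
# Entropic Wasserstein barycenter of shifted toy attribution maps (POT library).
import numpy as np
import ot  # Python Optimal Transport: pip install pot

grid = np.linspace(0.0, 1.0, 100)

def bump(center, width=0.05):
    """A toy attribution map: a normalized Gaussian bump on the grid."""
    a = np.exp(-((grid - center) ** 2) / (2 * width**2))
    return a / a.sum()

# Perturbed explanations: the same feature, geometrically shifted per run.
maps = np.stack([bump(0.42), bump(0.50), bump(0.58)], axis=1)  # (dim, n_maps)

M = ot.dist(grid.reshape(-1, 1), grid.reshape(-1, 1))  # squared Euclidean cost
M /= M.max()

bary = ot.bregman.barycenter(maps, M, reg=2e-3)   # entropic Wasserstein barycenter
mean = maps.mean(axis=1)                          # naive pixelwise aggregation

print(f"barycenter peak     {bary.max():.4f} at x={grid[bary.argmax()]:.2f}")
print(f"pixelwise-mean peak {mean.max():.4f} at x={grid[mean.argmax()]:.2f}")
```

The barycenter transports the shifted bumps onto a single sharp mode, while the pixelwise mean spreads the same mass across all three shifted locations.
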