The mean-field dynamics of transformers

Philippe Rigollet, The Mean-Field Dynamics of Transformers · 2025 · arXiv 2512.01868

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

A Unified Framework for Critical Scaling of Inverse Temperature in Self-Attention

stat.ML · 2026-05-12 · unverdicted · novelty 8.0

The upper-tail accumulation scale derived from the gap-counting function N_n sets the critical inverse temperature for softmax attention concentration, unifying prior conflicting laws as special cases of different N_n.

Phase transitions for the noisy transformer model in arbitrary dimension

math.AP · 2026-06-03 · unverdicted · novelty 7.0

In every dimension d≥2 there exists a unique β_*^{(d)}>0 such that the uniform density on the sphere is the unique global minimizer of the USA free energy up to the linear-stability threshold K_# for β≤β_*, yielding a continuous transition, while for β>β_* the uniform density is not globally minimiz

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Attention in minimal transformers under corruption performs in-context empirical Bayes via a single kernel-weighted posterior mean step followed by depth-driven particle dynamics refinement.

The physics of AI weather models

physics.ao-ph · 2026-05-22 · unverdicted · novelty 7.0

AI weather models may simulate the atmosphere via particle positions in latent space whose updates follow gradient flow on a learned free energy functional rather than conventional physical equations.

Uniform Scaling Limits in AdamW-Trained Transformers

stat.ML · 2026-05-11 · unverdicted · novelty 7.0

AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H increase, with bounds independent of token number.

Diffusion Operator Geometry of Feedforward Representations

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

A Gaussian-kernel diffusion operator on feature clouds yields closed-form class affinities and spectra in Gaussian models, with provably smooth observables under perturbations.

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

math.PR · 2026-04-29 · unverdicted · novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention drift.

Spectral Selection in Symmetric Self-Attention Dynamics

math.DS · 2026-04-28 · unverdicted · novelty 7.0

Symmetric self-attention dynamics select the dominant eigendirection of V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.

Patnaik-Pearson intrinsic dimension for internal representations of neural networks

math.ST · 2026-06-17 · unverdicted · novelty 6.0 · 2 refs

Introduces the Patnaik-Pearson intrinsic dimension estimator, proves some of its properties, relates it to HTSR/SETOL for Pareto spectra, and applies it to track embedding dimension evolution in BERT-base and DeepSeek-R1-Distill-Qwen-1.

Propagation of Chaos in Contextual Flow Maps

cs.LG · 2026-05-16 · unverdicted · novelty 6.0

Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.

Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

math.AP · 2026-05-11 · unverdicted · novelty 6.0

In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).

Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

cs.LG · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

Multi-head self-attention dynamics admit a non-decreasing energy functional under suitable score-matrix conditions, with closed-form clustering thresholds and monotonic entropy production in simplified regimes.

Explanation of Dynamic Physical Field Predictions using WassersteinGrad: Application to Autoregressive Weather Forecasting

stat.ML · 2026-04-24 · unverdicted · novelty 6.0

WassersteinGrad aggregates perturbed gradient attribution maps via their entropic Wasserstein barycenter to avoid blurring from geometric shifts in explanations of autoregressive weather forecasts.

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

cs.LG · 2026-05-28 · unverdicted · novelty 5.0

Auxiliary variables prevent mode collapse in mean-field transformers: the limit distribution is the pushforward of the auxiliary distribution, and positional encoding and prompt insertion both have universality of representation.

