A Unified Perspective on the Dynamics of Deep Transformers

Gabriel Peyré; José Antonio Carrillo; Pierre Ablin; Valérie Castin

arxiv: 2501.18322 · v2 · pith:RYDGGZC5new · submitted 2025-01-30 · 💻 cs.LG · math.AP

Valérie Castin, Pierre Ablin, José Antonio Carrillo, Gabriel Peyré

classification 💻 cs.LG math.AP

keywords attention, data, gaussian, transformer, dynamics, initial, transformers, analysis

Transformers, which are state-of-the-art in most machine learning tasks, represent the data as sequences of vectors called tokens. This representation is then exploited by the attention function, which learns dependencies between tokens and is key to the success of Transformers. However, the iterative application of attention across layers induces complex dynamics that remain to be fully understood. To analyze these dynamics, we identify each input sequence with a probability measure and model its evolution as a Vlasov equation called Transformer PDE, whose velocity field is non-linear in the probability measure. Our first set of contributions focuses on compactly supported initial data. We show the Transformer PDE is well-posed and is the mean-field limit of an interacting particle system, thus generalizing and extending previous analysis to several variants of self-attention: multi-head attention, L2 attention, Sinkhorn attention, Sigmoid attention, and masked attention--leveraging a conditional Wasserstein framework. In a second set of contributions, we are the first to study non-compactly supported initial conditions, by focusing on Gaussian initial data. Again for different types of attention, we show that the Transformer PDE preserves the space of Gaussian measures, which allows us to analyze the Gaussian case theoretically and numerically to identify typical behaviors. This Gaussian analysis captures the evolution of data anisotropy through a deep Transformer. In particular, we highlight a clustering phenomenon that parallels previous results in the non-normalized discrete case.
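
The interacting particle system mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the equations, so the form used here is an assumption: each token x_i moves with velocity sum_j softmax_j(<Q x_i, K x_j>) V x_j, a form commonly used for single-head softmax self-attention, and depth is treated as time through an explicit Euler step. The names `attention_velocity`, `euler_step`, `Q`, `K`, `V`, and `dt` are hypothetical and chosen for illustration only.

```python
import math


def attention_velocity(tokens, Q, K, V):
    """Velocity of each token under single-head softmax self-attention.

    Assumed form (not taken from the abstract):
        v_i = sum_j softmax_j(<Q x_i, K x_j>) * (V x_j)
    tokens is a list of vectors; Q, K, V are matrices given as lists of rows.
    """
    def matvec(M, x):
        return [sum(m * xj for m, xj in zip(row, x)) for row in M]

    queries = [matvec(Q, x) for x in tokens]
    keys = [matvec(K, x) for x in tokens]
    values = [matvec(V, x) for x in tokens]
    dim = len(values[0])

    velocities = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # The velocity is a softmax-weighted average of the value vectors.
        velocities.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
        )
    return velocities


def euler_step(tokens, Q, K, V, dt):
    """One explicit Euler step, playing the role of one layer of depth."""
    vel = attention_velocity(tokens, Q, K, V)
    return [[x + dt * v for x, v in zip(tok, vt)] for tok, vt in zip(tokens, vel)]
```

Because the velocity of each token depends on the tokens only through weighted averages over the whole sequence, it can be rewritten in terms of the empirical measure of the tokens, which is what makes a mean-field description natural. The sketch leaves out the variants listed in the abstract (multi-head, L2, Sinkhorn, Sigmoid, and masked attention).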

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kinetic theory for Transformers and the lost-in-the-middle phenomenon
math.AP 2026-05 conditional novelty 8.0

A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.

Transformer-like Inference from Optimal Control
cs.LG 2026-05 unverdicted novelty 7.0

Derives transformer-like dual-filter inference layers from first-principles optimal control on nonlinear discrete and linear Gaussian sequence models.

Uniform Scaling Limits in AdamW-Trained Transformers
stat.ML 2026-05 unverdicted novelty 7.0

AdamW-trained transformer hidden states and backpropagated variables converge uniformly in L2 to a forward-backward ODE system (McKean-Vlasov when non-causal) at rate O(L^{-1}+L^{-1/3}H^{-1/2}) as depth L and heads H ...

Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
math.PR 2026-04 unverdicted novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-atten...

Spectral Selection in Symmetric Self-Attention Dynamics
math.DS 2026-04 unverdicted novelty 7.0

Symmetric self-attention dynamics select the dominant eigendirection of V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.

Preconditioned Regularized Wasserstein Proximal Sampling
stat.ML 2025-09 unverdicted novelty 7.0

A preconditioned regularized Wasserstein proximal sampling algorithm is introduced for particle-based approximation of Gibbs distributions, featuring a PDE-derived kernel formulation and non-asymptotic convergence ana...

Propagation of Chaos in Contextual Flow Maps
cs.LG 2026-05 unverdicted novelty 6.0

Derives forward and backward propagation-of-chaos bounds for finite vs. infinite-context transformers modeled as contextual flow maps, achieving Wasserstein rate n^{-1/d} generally and n^{-1/2} for transformer-like cases.

Multi-Headed Transformer Architectures as Time-dependent Wasserstein Gradient Flows
cs.LG 2026-05 unverdicted novelty 6.0

Models multi-head transformer data flow as time-dependent Wasserstein gradient flows of an attention-capturing interaction energy, with proofs on omega-limit stationary points and stability under weight and input pert...

Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
math.AP 2026-05 unverdicted novelty 6.0

In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
math.OC 2026-05 unverdicted novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

Quantitative Clustering in Mean-Field Transformer Models
cs.LG 2025-04 unverdicted novelty 5.0

Mean-field transformer models synchronize to a Dirac point mass exponentially fast with explicit quantitative rates under suitable parameter assumptions.