pith. machine review for the scientific record.

arxiv: 2504.14697 · v3 · submitted 2025-04-20 · 💻 cs.LG · math.AP · math.DS · stat.ML

Recognition: unknown

Quantitative Clustering in Mean-Field Transformer Models

Philippe Rigollet, Shi Chen, Yury Polyanskiy, Zhengjiang Lin

Authors on Pith: no claims yet
classification: 💻 cs.LG · math.AP · math.DS · stat.ML
keywords: models · transformer · clustering · mean-field · quantitative · akin · assumptions · asymptotic
abstract

The evolution of tokens through deep transformer models can be modeled as an interacting particle system that has been shown to exhibit an asymptotic clustering behavior akin to the synchronization phenomenon in Kuramoto models. In this work, we investigate the long-time clustering of mean-field transformer models. More precisely, under suitable assumptions on the transformer model parameters, we establish that any suitably regular mean-field initialization synchronizes exponentially fast to a Dirac point mass, with explicit quantitative convergence rates.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Kinetic theory for Transformers and the lost-in-the-middle phenomenon

    math.AP 2026-05 conditional novelty 8.0

    A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.

  2. Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

    cs.LG 2026-05 unverdicted novelty 7.0

Multi-head self-attention is modeled as a gradient flow with a non-decreasing energy functional under conditions on score matrices, yielding closed-form clustering thresholds in simplified regimes and monotonic entropy production.

  3. Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models

    math.PR 2026-04 unverdicted novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-attention.

  4. Spectral Selection in Symmetric Self-Attention Dynamics

    math.DS 2026-04 unverdicted novelty 7.0

    Symmetric self-attention dynamics select the dominant eigendirection of V, producing homogeneous alignment when one positive eigenvalue dominates or sign-split polarization when V is negative definite.

  5. Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime

    math.AP 2026-05 unverdicted novelty 6.0

    In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).

  6. Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention

    cs.LG 2026-05 unverdicted novelty 6.0

    Multi-head self-attention dynamics admit a non-decreasing energy functional under suitable score-matrix conditions, with closed-form clustering thresholds and monotonic entropy production in simplified regimes.