Perceptrons and localization of attention's mean-field landscape

Antonio \'Alvarez-L\'opez, Borjan Geshkovski, Dom\`enec Ruiz-Balet

classification 💻 cs.LG math.OC

keywords spheregradientmean-fieldseensystemunitatomicattention

read the original abstract

The forward pass of a Transformer can be seen as an interacting particle system on the unit sphere: time plays the role of layers, particles that of token embeddings, and the unit sphere idealizes layer normalization. In some weight settings the system can even be seen as a gradient flow for an explicit energy, and one can make sense of the infinite context length (mean-field) limit thanks to Wasserstein gradient flows. In this paper we study the effect of the perceptron block in this setting, and show that critical points are generically atomic and localized on subsets of the sphere.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Kinetic theory for Transformers and the lost-in-the-middle phenomenon
math.AP 2026-05 conditional novelty 8.0

A mean-field kinetic theory derivation produces a closed-form U-shaped token retrieval profile that explains the lost-in-the-middle phenomenon in Transformers.
Stochastic Scaling Limits and Synchronization by Noise in Deep Transformer Models
math.PR 2026-04 unverdicted novelty 7.0

Transformers converge pathwise to a stochastic particle system and SPDE in the scaling limit, exhibiting synchronization by noise and exponential energy dissipation when common noise is coercive relative to self-atten...
Quantifying Concentration Phenomena of Mean-Field Transformers in the Low-Temperature Regime
math.AP 2026-05 unverdicted novelty 6.0

In the low-temperature regime, the token distribution in mean-field transformers concentrates onto the push-forward under a key-query-value projection with Wasserstein distance scaling as √(log(β+1)/β) exp(Ct) + exp(-ct).