Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Transformers perform inference by iteratively transforming token representations across layers. This layerwise computation has been studied empirically, and recent mean-field theories of Transformer dynamics explain how attention can drive token distributions toward clustering. However, existing mean-field analyses largely treat model parameters as prescribed, leaving open how training reshapes this clustering picture. We study this question in a noisy mean-field Transformer in which only a parameter-linear FFN is trained under $L^2$ regularization. We find and analyze a training-induced phase in the dynamics: after initially following attention-driven clustering, the token distribution can leave the clustered regime near the final layers. Our mathematical analysis is based on an entropy-regularized interaction energy that captures the clustering bias of attention. More broadly, our results point toward a training-aware mean-field theory of Transformer dynamics, in which training and inference dynamics are treated together.
fields
cs.LG 1
years
2026 1
verdicts
UNVERDICTED 1
representative citing papers
- Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables
Auxiliary variables prevent mode collapse in mean-field transformers, with the limit distribution being the pushforward of the auxiliary distribution, and positional encoding and prompt insertion have universality of representation.