Gradient Flow Structure and Quantitative Dynamics of Multi-Head Self-Attention
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-11 00:52 UTC · model grok-4.3
The pith
Under suitable conditions on the score matrices, multi-head self-attention admits an energy functional that is non-decreasing along its gradient flow dynamics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under suitable conditions on the score matrices that eliminate radial shadow terms, the natural multi-head energy functional is non-decreasing along both flat and spherical dynamics. In the scalar-head regime with equiangular token configurations, the paper derives a closed-form critical inverse temperature governing clustering, establishes super-additive clustering rates for heterogeneous heads, and proves an entropy production identity under which attention entropy increases monotonically as clustering progresses.
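As a concrete anchor, the closed-form threshold echoed in the Lean theorem notes further down this page reads as follows; the symbols α, n, and c*(H) are carried over from those notes and are not defined on this page:

β* = (1/(2α)) ln[ c*(H)² (n−1) / (1 − c*(H)²) ],  with c*(2) = (√5−1)/2 = 1/φ.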
What carries the argument
The multi-head energy functional, together with the radial shadow terms that obstruct per-head monotonicity: projections of each head's output onto token directions.
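The page describes the radial shadow terms only verbally. A minimal formal sketch, assuming tokens x_i lie on the unit sphere and writing f_h(x_i) for head h's output at token i (both symbols are illustrative, not the paper's notation):

shadow_{h,i} = ⟨f_h(x_i), x_i⟩ x_i,  f_h(x_i) = f_h(x_i)^⊥ + shadow_{h,i},

so only the tangential part f_h(x_i)^⊥ moves tokens along the sphere, while the sign-indefinite radial projection is what obstructs per-head monotonicity arguments.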
If this is right
- Tokens form clusters as the multi-head energy increases toward equilibrium.
- A closed-form expression exists for the critical inverse temperature that triggers clustering in equiangular scalar-head cases.
- Heterogeneous heads produce super-additive clustering rates compared with homogeneous ones (one hedged reading of this claim is sketched after this list).
- ReLU and softmax attention exhibit separated clustering times in the linearized dynamics.
- Attention entropy grows monotonically to its equilibrium value as clustering advances.
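On the super-additivity bullet above: the paper's precise statement is not reproduced on this page, but one natural reading, assuming λ(β) denotes the clustering rate of a single scalar head at inverse temperature β and λ(β_1, …, β_H) that of the coupled heterogeneous dynamics (both symbols are assumptions for illustration), is

λ(β_1, …, β_H) ≥ Σ_h λ(β_h),  with strict inequality when the β_h differ,

i.e. heterogeneous heads cluster faster than the sum of their individual single-head rates would suggest.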
Where Pith is reading between the lines
- The monotonicity framework might be used to certify stability of new attention variants before training.
- Quantitative clustering times could help predict when transformers begin to memorize or overfit during optimization.
- The entropy production identity may connect to information-theoretic analyses of representation collapse in deeper networks.
Load-bearing premise
Suitable conditions on the score matrices exist that remove radial shadow terms and allow the energy functional to be non-decreasing.
What would settle it
A concrete simulation or calculation in which the multi-head energy decreases along the dynamics despite the score matrices satisfying the paper's sufficient conditions for monotonicity.
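A minimal numerical sketch of such a check, in Python. The update rule below is a generic multi-head spherical softmax-attention dynamic with identity score and value matrices, and the candidate energy is the standard per-head softmax interaction energy summed over heads; the paper's exact E_multi, its sufficient condition on the score matrices, and its dynamics may differ, so this is an illustration of the test, not a reproduction of the paper's setup.

```python
# Sketch of a monotonicity check for a candidate multi-head attention energy.
# Assumptions (not from the paper): identity score/value matrices per head,
# candidate energy E = sum_h (1/(2*beta_h*n^2)) * sum_{i,j} exp(beta_h * <x_i, x_j>),
# explicit-Euler integration of the spherical dynamics.
import numpy as np

def spherical_attention_step(X, betas, dt=1e-3):
    """One explicit-Euler step of multi-head spherical softmax attention.

    X: (n, d) array of unit-norm tokens; betas: per-head inverse temperatures.
    """
    G = X @ X.T                          # Gram matrix of inner products <x_i, x_j>
    V = np.zeros_like(X)
    for beta in betas:                   # each head: softmax-weighted average of tokens
        W = np.exp(beta * G)
        W /= W.sum(axis=1, keepdims=True)
        V += W @ X
    # project the drift onto the tangent space of the sphere at each token
    radial = np.sum(V * X, axis=1, keepdims=True)
    X_new = X + dt * (V - radial * X)
    return X_new / np.linalg.norm(X_new, axis=1, keepdims=True)

def candidate_energy(X, betas):
    """Stand-in multi-head energy: per-head softmax interaction energies, summed."""
    n = X.shape[0]
    G = X @ X.T
    return sum(np.exp(beta * G).sum() / (2 * beta * n**2) for beta in betas)

rng = np.random.default_rng(0)
n, d = 32, 8
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
betas = [1.0, 4.0]                       # two heterogeneous heads

energies = [candidate_energy(X, betas)]
for _ in range(2000):
    X = spherical_attention_step(X, betas)
    energies.append(candidate_energy(X, betas))

drops = sum(e2 < e1 - 1e-12 for e1, e2 in zip(energies, energies[1:]))
print(f"energy decreased on {drops} of {len(energies) - 1} steps")
```

A genuine counterexample would additionally need to verify that the chosen score matrices satisfy the paper's sufficient condition, which this sketch does not encode.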
Original abstract
Transformer self-attention can be interpreted as a gradient flow on the unit sphere, in which tokens evolve under softmax interaction potentials and tend to form clusters. While prior work has established clustering behavior for single-head attention, the multi-head setting remains less understood due to geometric interference between heads, which invalidates standard monotonicity arguments. In this work, we develop a theoretical framework for multi-head self-attention dynamics and resolve several open questions. We show that, under suitable conditions on the score matrices, a natural multi-head energy functional is non-decreasing along both flat and spherical dynamics. We identify the key obstruction to per-head monotonicity as radial shadow terms, which are projections of each head's output onto token directions, persisting even under orthogonality assumptions. We introduce a sufficient condition ensuring monotonicity and establish robustness to approximate orthogonality. In a simplified scalar-head regime with equiangular token configurations, we derive a closed-form expression for the critical inverse temperature governing clustering behavior, and show that heterogeneous heads exhibit super-additive clustering rates. In this regime, we also prove a separation in clustering time between ReLU and softmax attention in the linearized dynamics. Finally, we establish an entropy production identity and show that attention entropy increases monotonically toward equilibrium as clustering progresses. Our results provide a unified perspective on the dynamics of multi-head attention and clarify the mechanisms underlying clustering and stability in transformer models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops a theoretical framework for multi-head self-attention in transformers, viewing token evolution as gradient flow on the unit sphere under softmax potentials. It claims that under suitable conditions on the score matrices, a natural multi-head energy functional is non-decreasing along both flat and spherical dynamics; identifies radial shadow terms (projections of head outputs onto token directions) as the obstruction to per-head monotonicity; supplies a sufficient condition for monotonicity together with robustness under approximate orthogonality. In a simplified scalar-head regime restricted to equiangular token configurations, it derives a closed-form critical inverse temperature governing clustering, shows super-additive clustering rates for heterogeneous heads, establishes a separation in clustering time between ReLU and softmax in the linearized dynamics, and proves an entropy production identity under which attention entropy increases monotonically toward equilibrium.
Significance. If the central claims hold, the work supplies a unified geometric perspective on multi-head attention that resolves interference issues left open by single-head analyses. The closed-form critical temperature, super-additivity result, and entropy production identity constitute concrete quantitative strengths that yield falsifiable predictions on clustering rates and stability; these are load-bearing contributions that could guide both theoretical understanding and practical transformer design. The conditional character of the monotonicity theorem, however, restricts immediate generality.
major comments (2)
- [Abstract] Abstract and the section introducing the multi-head energy: the non-decreasing property is asserted only under 'suitable conditions' on the score matrices that cancel radial shadow terms. The manuscript supplies a sufficient condition and proves robustness to approximate orthogonality, yet provides no verification that typical learned or random score matrices satisfy the condition, nor that the flow itself preserves it. This renders the central monotonicity claim conditional and limits its resolution of the geometric interference problem for generic multi-head attention.
- [Simplified scalar-head regime] The simplified scalar-head regime with equiangular token configurations: the closed-form critical inverse temperature and the separation between ReLU and softmax clustering times are derived under this restrictive assumption. The paper does not address stability of these expressions under perturbations away from equiangularity or under the full multi-head dynamics, weakening the quantitative claims.
minor comments (1)
- Notation for the score matrices and radial shadow projections could be introduced with an explicit equation or diagram in the early sections to improve readability for readers unfamiliar with the geometric setup.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. The comments correctly identify the conditional nature of the monotonicity result and the scope of the simplified regime. We address each major comment point by point below, indicating where revisions will be made to improve clarity and transparency without altering the core contributions.
Point-by-point responses
- Referee: [Abstract] Abstract and the section introducing the multi-head energy: the non-decreasing property is asserted only under 'suitable conditions' on the score matrices that cancel radial shadow terms. The manuscript supplies a sufficient condition and proves robustness to approximate orthogonality, yet provides no verification that typical learned or random score matrices satisfy the condition, nor that the flow itself preserves it. This renders the central monotonicity claim conditional and limits its resolution of the geometric interference problem for generic multi-head attention.
Authors: We agree that the non-decreasing property of the multi-head energy is established under a sufficient condition on the score matrices that cancels the radial shadow terms, as explicitly stated in the abstract and the relevant theorem. The manuscript derives this condition, identifies the radial shadows as the obstruction, and proves robustness under approximate orthogonality. However, we did not verify whether typical learned or random score matrices satisfy the condition, nor prove that the dynamics preserve it. In the revised version, we will add a dedicated remark in the discussion section clarifying that the result is conditional on this sufficient condition, noting the open question of its prevalence in practice, and suggesting that post-training checks could be performed. This preserves the framework's value in resolving interference when the condition holds while being fully transparent about its scope. revision: partial
- Referee: [Simplified scalar-head regime] The simplified scalar-head regime with equiangular token configurations: the closed-form critical inverse temperature and the separation between ReLU and softmax clustering times are derived under this restrictive assumption. The paper does not address stability of these expressions under perturbations away from equiangularity or under the full multi-head dynamics, weakening the quantitative claims.
Authors: The closed-form critical inverse temperature, the ReLU-softmax separation in clustering times, super-additivity for heterogeneous heads, and the entropy production identity are all derived specifically within the simplified scalar-head regime under equiangular token configurations. This assumption enables exact solvability and yields concrete quantitative predictions. The general multi-head energy framework and monotonicity results are developed separately and do not rely on equiangularity. We acknowledge that we have not analyzed the stability of these expressions under perturbations away from equiangularity or their extension to the full multi-head dynamics. In the revision, we will insert a paragraph in the simplified-regime section explicitly stating the restrictive nature of the assumption, its role in obtaining closed forms, and that the results serve as exact benchmarks for this case, with stability under perturbations left for future work. revision: partial
Circularity Check
No significant circularity; derivations are conditional on explicit assumptions without reduction to inputs
Full rationale
The paper establishes non-decreasing multi-head energy along gradient flows by introducing a sufficient condition on score matrices that cancels radial shadow terms, then proves the result directly from the energy functional and dynamics equations. In the scalar-head regime it derives a closed-form critical inverse temperature from the linearized system under equiangular configurations. These steps are self-contained mathematical arguments from the stated gradient-flow structure and do not rely on self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations. The assumptions are declared up front and the claims are explicitly conditional, which is standard non-circular practice.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Self-attention tokens evolve under softmax interaction potentials on the unit sphere
- standard math Gradient flows admit monotonicity arguments when energy functionals are non-decreasing
Lean theorems connected to this paper
- IndisputableMonolith/Constants.lean · phi_golden_ratio echoes: For H=2, the threshold is c*(2) = (√5−1)/2 = 1/φ, the reciprocal of the golden ratio. β* = (1/(2α)) ln[c*(H)²(n−1)/(1−c*(H)²)] (Theorem 19)
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel echoes: dE_multi/dt = (1/n) Σ_i ∥ẋ_i∥² ≥ 0 (Theorem 11); per-head monotonicity under Radial Dominance (Theorem 17)
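For the first echo, the identity c*(2) = 1/φ is a one-line check: with φ = (1+√5)/2,

1/φ = 2/(1+√5) = 2(√5−1)/((√5+1)(√5−1)) = 2(√5−1)/4 = (√5−1)/2 ≈ 0.618.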
Reference graph
Works this paper leans on
- [1] L. Ambrosio, N. Gigli, and G. Savaré. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhäuser, 2005.
- [2] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
- [3] G. Bruno, F. Pasqualotto, and A. Agazzi. Emergence of meta-stable clustering in mean-field transformers. In ICLR, 2025.
- [4] S. Chen, Z. Lin, Y. Polyanskiy, and P. Rigollet. Quantitative clustering in mean-field transformer models. arXiv:2504.14697, 2025.
- [5] S. Chen, Z. Lin, Y. Polyanskiy, and P. Rigollet. Critical attention scaling in long-context transformers. In ICLR, 2026.
- [6] Z. Chen, Y. Polyanskiy, and P. Rigollet. Clustering with Wasserstein–Fisher–Rao gradient flows. In NeurIPS Workshop, 2025.
- [7] C. Criscitiello, Q. Rebjock, A. D. McRae, and N. Boumal. Synchronization on circles and spheres with nonlinear interactions. arXiv:2405.18273, 2024.
- [8] B. Geshkovski, H. Koubbi, Y. Polyanskiy, and P. Rigollet. Dynamic metastability in the self-attention model. arXiv:2410.06833, 2024.
- [9] B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. The emergence of clusters in self-attention dynamics. In NeurIPS, 2023. arXiv:2305.05465.
- [10] B. Geshkovski, C. Letrouit, Y. Polyanskiy, and P. Rigollet. A mathematical perspective on transformers. Bulletin of the American Mathematical Society, 62(3):427–479, 2025. arXiv:2312.10794.
- [11]
- [12] N. Karagodin, S. Ge, Y. Polyanskiy, and P. Rigollet. Normalization in attention dynamics. In NeurIPS, 2025.
- [13] N. Karagodin, Y. Polyanskiy, and P. Rigollet. Clustering in causal attention masking. In NeurIPS, 2024.
- [14] Y. Kuramoto. Self-entrainment of a population of coupled non-linear oscillators. In International Symposium on Mathematical Problems in Theoretical Physics, 1975.
- [15] Y. Polyanskiy, P. Rigollet, and A. Yao. Synchronization of mean-field models on the circle. arXiv:2507.22857, 2025.
- [16] P. Rigollet. The mean-field dynamics of transformers. arXiv:2512.01868, 2025.
- [17]
- [18] A. Tomihari and R. Karakida. Recurrent self-attention dynamics: An energy-agnostic perspective from Jacobians. In NeurIPS, 2025.
- [19] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, 2017.
- [20] M. Wortsman, J. Lee, J. Gilmer, and S. Kornblith. Replacing softmax with ReLU in vision transformers. arXiv:2309.08586, 2023.
- [21] S. Zhai, T. Likhomanenko, E. Littwin, D. Busbridge, J. Ramapuram, Y. Zhang, J. Gu, and J. M. Susskind. Stabilizing transformer training by preventing attention entropy collapse. In ICML, 2023.