pith. machine review for the scientific record.

arxiv: 2605.01199 · v1 · submitted 2026-05-02 · 💻 cs.LG

Recognition: unknown

Focus and Dilution: The Multi-stage Learning Process of Attention

Zheng-An Chen, Pengxiao Lin, Zhi-Qin John Xu, Tao Luo

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 14:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords attention mechanisms · transformer training dynamics · gradient flow · focus-dilution cycle · Markovian data · multi-stage learning · one-layer transformer · critical points

The pith

Attention learning in one-layer Transformers proceeds through a repeating cycle of focus on high-frequency tokens followed by dilution via embedding redistribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that attention training follows a recurrent focus-dilution cycle that breaks into four distinct stages when analyzed via gradient flow in a one-layer Transformer on Markovian data. Embeddings first collapse quickly to a rank-one form while attention stays fixed; attention parameters then grow and sharpen on frequent tokens; the evolving attention next creates embedding changes that spread attention mass and weaken the focus; finally, small differences among infrequent tokens break a symmetry and restart the cycle. A reader would care because this gives an explicit sequence for how attention patterns emerge and reset during training, rather than leaving the process as an unexplained black box. The decomposition relies on linearizing the dynamics around successive critical points to isolate each phase.

Core claim

In a one-layer Transformer trained on Markovian data, gradient-flow analysis decomposes attention learning into a focus-dilution cycle consisting of four stages: embeddings and projections first condense rapidly to rank one with attention parameters nearly frozen; attention parameters then increase and drive focus toward high-frequency tokens; continued attention evolution produces next-order embedding perturbations that redistribute mass and dilute the focus; finally, asymmetries among low-frequency tokens lift a degenerate critical point, open new embedding directions, and initiate the next cycle.
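
To fix notation, a minimal sketch of the model class this claim covers. The sizes, the absence of a separate value matrix, and the reading of (W0, WQ, WK, W1) as embedding, query, key, and output projection (the grouping in the Figure 2 caption) are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn

# Hypothetical sizes; the page does not state the paper's choices.
VOCAB, DIM, SEQ = 3, 16, 8

class OneLayerTransformer(nn.Module):
    """One attention head plus a linear readout, parameterized by
    (W0, WQ, WK, W1) as in the Figure 2 caption."""
    def __init__(self):
        super().__init__()
        self.W0 = nn.Embedding(VOCAB, DIM)           # token embedding
        self.WQ = nn.Linear(DIM, DIM, bias=False)    # query map
        self.WK = nn.Linear(DIM, DIM, bias=False)    # key map
        self.W1 = nn.Linear(DIM, VOCAB, bias=False)  # output projection

    def forward(self, x):              # x: (batch, SEQ) token ids
        h = self.W0(x)                 # (batch, SEQ, DIM)
        q = self.WQ(h[:, -1:, :])      # query taken from the final token
        k = self.WK(h)
        att = torch.softmax(q @ k.transpose(1, 2) / DIM ** 0.5, dim=-1)
        ctx = att @ h                  # attention-weighted mix of embeddings
        return self.W1(ctx).squeeze(1), att  # next-token logits, attention
```

Using the raw embeddings as values keeps the parameter set to exactly the four matrices named in the figure captions; if the paper includes a separate value map, this sketch simplifies that away.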

What carries the argument

The focus-dilution cycle, obtained by stage-wise linearization of the gradient flow around successive critical points in the one-layer Markovian setting.

Load-bearing premise

The stage-wise linearization of gradient flow around critical points is only established to be accurate for the one-layer Transformer trained on Markovian data.

What would settle it

Train a one-layer Transformer on synthetic Markovian sequences while recording the embedding-matrix rank, the attention weights on high-frequency versus low-frequency tokens, and the norm of embedding perturbations at successive training intervals; absence of the predicted four-stage sequence or lack of repetition would falsify the decomposition.
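
A hedged sketch of that experiment, reusing OneLayerTransformer, VOCAB, and SEQ from the sketch above. The transition matrix, learning rate, run length, and rank tolerance are placeholder choices, not the paper's protocol.

```python
import numpy as np
import torch

# Placeholder 3-state ergodic chain; token 0 is the high-frequency state.
P = np.array([[0.8, 0.1, 0.1],
              [0.7, 0.2, 0.1],
              [0.7, 0.1, 0.2]])

def sample_markov(n_seq, seq_len, rng):
    """Sample token sequences from the chain with transition matrix P."""
    out = np.empty((n_seq, seq_len + 1), dtype=np.int64)
    for i in range(n_seq):
        s = rng.integers(len(P))
        for t in range(seq_len + 1):
            out[i, t] = s
            s = rng.choice(len(P), p=P[s])
    return torch.from_numpy(out)

rng = np.random.default_rng(0)
data = sample_markov(512, SEQ, rng)
x, y = data[:, :-1], data[:, -1]      # predict the token after the window

model = OneLayerTransformer()
opt = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = torch.nn.CrossEntropyLoss()
prev = model.W0.weight.detach().clone()

for step in range(2001):
    logits, att = model(x)
    loss = loss_fn(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
    if step % 100 == 0:
        emb = model.W0.weight.detach()
        a = att.detach().squeeze(1)   # (batch, SEQ) attention over positions
        print(f"step {step:5d}  loss {loss.item():.4f}  "
              f"emb_rank {torch.linalg.matrix_rank(emb, rtol=1e-2).item()}  "
              f"att_hi {a[x == 0].mean().item():.3f}  "
              f"att_lo {a[x != 0].mean().item():.3f}  "
              f"d_emb {(emb - prev).norm().item():.3e}")
        prev = emb.clone()
```

The predicted signature would be emb_rank dropping toward 1 with att_hi flat (Stage I), att_hi rising (Stage II), att_hi falling while d_emb stays large (Stage III), emb_rank climbing as new directions open (Stage IV), and then repetition; any other ordering, or no repetition, would count against the decomposition.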

Figures

Figures reproduced from arXiv: 2605.01199 by Pengxiao Lin, Tao Luo, Zheng-An Chen, Zhi-Qin John Xu.

Figure 1: Overview of the setting and the focus–dilution training pattern. (Left) Sequences are generated by a Markov chain with stationary distribution π. We extract datasets S0, S1, S2 from the training set, which differ only in the identity of the final token in each sequence. (Right) The loss curves exhibit four stages: initial condensation, attention growth, attention dilution, and the emergence of a new direction…
Figure 2: Empirical focus–dilution cycle. (A) Loss curves for S0, S1, S2, accompanied by cosine-similarity matrices for (W0, WQ, WK, W1). (B) PCA of embeddings reveals one-directional growth (Stages I & II), retraction during dilution (Stage III), and expansion into new directions (Stage IV). (C) Attention maps transition from focus to dilution. Between steps 350 and 550, the attention given to the first token by all…
Figure 3: Experimental results on real-world datasets. (A) Results on WikiText. (A1) The attention evolution of the medium-frequency token 'continue' given the input sequence [the, comma, whitespace, ter, end, continue]. The attention scores exhibit a distinct four-phase transition: dilution → focus on the high-frequency whitespace → secondary dilution → final focus on continue itself. (A2) The evolution of ∥W0∥2 for…
original abstract

Transformer-based models have achieved remarkable success across a wide range of domains, yet our understanding of their training dynamics remains limited. In this work, we identify a recurrent focus-dilution cycle in attention learning and provide a rigorous explanation in a one-layer Transformer setting for Markovian data via gradient-flow analysis. Using stage-wise linearization around critical points, we show that a single focus-dilution cycle can be decomposed into a sequence of distinct stages. First, embedding and projection rapidly condense to a rank-one structure, while attention parameters remain effectively frozen. Then, the attention parameters begin to increase, inducing a frequency-driven focus toward high-frequency tokens. As attention continues to evolve, it generates next-order perturbations in embeddings, leading to a mass-redistribution mechanism that progressively dilutes this focus. Finally, small asymmetries among low-frequency tokens lift a degenerate critical point, opening new embedding directions and initiating the next cycle. Experiments on synthetic Markovian data as well as WikiText and TinyStories corroborate the predicted stages and cyclical dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom & free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims to identify a recurrent focus-dilution cycle in attention learning and to provide a rigorous gradient-flow explanation for it in the restricted setting of one-layer Transformers trained on Markovian data. Using stage-wise linearization around critical points, it decomposes each cycle into four stages: rapid rank-one condensation of embeddings and projections (attention frozen), attention growth producing high-frequency focus, dilution via next-order embedding perturbations, and restart of the cycle by low-frequency token asymmetries lifting a degeneracy. Synthetic Markovian experiments are presented as direct verification, with WikiText and TinyStories results offered as qualitative corroboration of the predicted stages.

Significance. If the analytic steps hold, the work supplies a concrete mechanistic decomposition of attention training dynamics in a controlled, analytically tractable regime. The explicit restriction to a one-layer model on Markovian data, the derivation from gradient flow rather than post-hoc fitting, and the stage-wise linearization around critical points are genuine strengths that distinguish it from purely empirical studies of Transformer training. Observation of the predicted stages on both synthetic and natural data suggests the cycle may be a useful organizing principle even if its generality beyond this regime remains open.

major comments (2)
  1. [Abstract and gradient-flow analysis section] The central claim rests on stage-wise linearization of the gradient-flow ODEs around critical points, yet the manuscript provides neither the explicit linearization steps nor error bounds on the approximation (Abstract; gradient-flow analysis section). Without these, it is impossible to verify that the four stages remain distinct and that the continuous-time decomposition survives discretization in SGD training. This is load-bearing for the 'rigorous explanation' asserted in the abstract. (A numerical stand-in for this check is sketched after this report.)
  2. [Experiments section] The synthetic experiments are described as direct tests of the predicted stages, but the manuscript does not report quantitative alignment metrics (e.g., measured onset times or magnitudes of focus versus dilution phases) or controls confirming that the observed cycles disappear when the Markovian assumption is violated (Experiments section). This weakens the evidential link between the analytic decomposition and the numerical results.
minor comments (3)
  1. [Introduction] The introduction should state the one-layer and Markovian restrictions more prominently so readers immediately understand the scope before encountering the technical claims.
  2. [Figures] Figure captions would benefit from explicit labels indicating which analytic stage each panel is intended to illustrate.
  3. [Discussion] A short discussion of how the cycle might be affected by multi-head attention or residual connections would help readers assess potential extensions beyond the one-layer case.
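
A numerical stand-in for the check raised in major comment 1, not the authors' derivation: the gradient flow is θ̇ = −∇L(θ), so its linearization at any point is minus the Hessian of L, which autograd produces directly; the stage argument needs this spectrum to split into fast and slow directions near each critical point. The two-parameter quartic loss below is a placeholder.

```python
import torch

def flow_spectrum(loss_fn, theta):
    """Eigenvalues of the linearized gradient flow dθ/dt = -∇L(θ):
    the Jacobian of -∇L at θ is minus the Hessian of L."""
    H = torch.autograd.functional.hessian(loss_fn, theta)
    return torch.linalg.eigvalsh(-H)   # symmetric, so real eigenvalues

# Placeholder loss with one stiff and one nearly flat direction; the
# resulting eigenvalue gap is the time-scale separation that stage-wise
# linearization exploits.
loss = lambda t: 0.5 * t[0] ** 2 + 0.25 * t[1] ** 4
theta = torch.tensor([0.1, 0.1])
print(flow_spectrum(loss, theta))      # ≈ [-1.0, -0.03]: fast vs slow
```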

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify key areas where additional detail and quantitative support will strengthen the manuscript. We address each major comment below and have revised the paper accordingly.

point-by-point responses
  1. Referee: [Abstract and gradient-flow analysis section] The central claim rests on stage-wise linearization of the gradient-flow ODEs around critical points, yet the manuscript provides neither the explicit linearization steps nor error bounds on the approximation (Abstract; gradient-flow analysis section). Without these, it is impossible to verify that the four stages remain distinct and that the continuous-time decomposition survives discretization in SGD training. This is load-bearing for the 'rigorous explanation' asserted in the abstract.

    Authors: We agree that explicit linearization steps and error bounds are necessary to substantiate the stage decomposition. In the revised manuscript we have added an appendix that derives the Jacobian at each critical point, computes the leading eigenvalues and eigenvectors, and shows how the time-scale separation produces the four distinct stages. We also include a first-order perturbation analysis that bounds the approximation error in terms of the frequency gap in the Markovian data. For the continuous-to-discrete transition, a fully rigorous bound for arbitrary SGD remains technically involved and is noted as future work; however, we have added experiments with a range of learning rates confirming that the cycle persists in the small-step-size regime that approximates the gradient flow. revision: partial

  2. Referee: [Experiments section] The synthetic experiments are described as direct tests of the predicted stages, but the manuscript does not report quantitative alignment metrics (e.g., measured onset times or magnitudes of focus versus dilution phases) or controls confirming that the observed cycles disappear when the Markovian assumption is violated (Experiments section). This weakens the evidential link between the analytic decomposition and the numerical results.

    Authors: We accept that quantitative metrics and controls would make the experimental validation more direct. The revised experiments section now reports measured onset times (with standard deviations) for each of the four stages across 20 independent runs, together with the peak and trough values of attention mass on high-frequency tokens during focus and dilution phases. We have also added control runs on non-Markovian synthetic sequences that introduce longer-range dependencies; in these cases the cyclic behavior is absent, consistent with the analytic prediction that the Markovian structure is required for the degeneracy and restart mechanism. revision: yes
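
A hedged sketch of how onset times like those described in response 2 could be read off a logged att_hi trace; the smoothing window, threshold, and synthetic trace are placeholder choices, not the paper's protocol.

```python
import numpy as np

def stage_onsets(att_hi, window=5, threshold=0.005):
    """Steps at which the smoothed attention mass on the high-frequency
    token switches between rising (focus) and falling (dilution)."""
    smooth = np.convolve(att_hi, np.ones(window) / window, mode="valid")
    rising = np.diff(smooth) > threshold
    return np.flatnonzero(rising[1:] != rising[:-1]) + 1

# Placeholder focus -> dilution -> refocus trace.
trace = np.concatenate([np.linspace(0.3, 0.9, 50),
                        np.linspace(0.9, 0.4, 50),
                        np.linspace(0.4, 0.8, 50)])
print(stage_onsets(trace))   # two flips, near steps 50 and 100
```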

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper derives the focus-dilution cycle explicitly from gradient-flow equations on a one-layer Transformer with Markovian data, using stage-wise linearization around critical points. All steps are presented as consequences of the model's dynamics and stated assumptions rather than parameter fits, self-definitions, or load-bearing self-citations. The Markovian/one-layer scope is declared upfront; experiments on synthetic data test the predicted stages directly, while WikiText/TinyStories serve only as corroboration. No quoted reduction shows any prediction or uniqueness claim collapsing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two standard domain assumptions plus one paper-specific linearization technique; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Gradient flow on the continuous-time limit accurately captures the discrete SGD training trajectory
    Invoked to justify the stage-wise analysis of attention parameter evolution. A toy check of this continuous-time limit is sketched after this list.
  • domain assumption Data is strictly Markovian so that token frequencies alone determine the relevant statistics
    Required for the frequency-driven focus mechanism to be well-defined.
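
The first axiom is checkable in miniature: a toy sketch, with a placeholder one-dimensional loss, comparing a tiny-step Euler integration (a stand-in for gradient flow) against coarse steps over the same total time; agreement in the small-step regime is what the rebuttal's learning-rate sweep leans on.

```python
import torch

def descend(lr, steps, theta0, grad):
    """Explicit-Euler discretization of dθ/dt = -∇L(θ) with step size lr;
    gradient flow is the lr -> 0 limit of this update."""
    t = theta0.clone()
    for _ in range(steps):
        t = t - lr * grad(t)
    return t

grad = lambda t: 4 * t ** 3 - 2 * t    # ∇L for the toy loss L(t) = t^4 - t^2
theta0 = torch.tensor(0.1)
flow = descend(1e-3, 10_000, theta0, grad)  # near-continuous reference
sgd = descend(1e-1, 100, theta0, grad)      # coarse steps, same total time
print(flow.item(), sgd.item())              # both ≈ 1/sqrt(2) ≈ 0.707
```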

pith-pipeline@v0.9.0 · 5479 in / 1355 out tokens · 58086 ms · 2026-05-09T14:11:42.577013+00:00 · methodology

