pith. machine review for the scientific record.

arxiv: 2603.28964 · v3 · submitted 2026-03-30 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 21:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords spectral gap · phase transitions · grokking · neural network training · Gram matrix · Dyson dynamics · optimizer dependence · adiabatic parameter

The pith

Phase transitions in neural network training are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that abrupt shifts in neural network behavior during training, such as grokking after long plateaus or sudden capability gains, are driven by changes in one specific spectral gap. In the extreme regime of far more parameters than the size of the rolling observation window, this gap sits at the position k* where consecutive singular values show their largest ratio. From three assumptions the analysis derives how the gap evolves over time, how loss contributions break down by mode, and why only the collapse at this privileged position halts progress. The size of recent matrix changes relative to learning rate and gradient strength then decides whether the system stays in a plateau, undergoes a transition, or begins to forget.
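The page gives the construction only in outline; a minimal sketch of the diagnostic, assuming the window stacks the last W flattened parameter-update vectors into a W × P matrix (the function name and the eigendecomposition route are illustrative, not the paper's code), might look like this:

```python
import numpy as np

def gap_position(updates, window=10):
    """Spectrum of the rolling-window Gram matrix and the gap position k*.

    Assumes `updates` is a list of flattened parameter-update vectors
    (length P each) and that the Gram matrix is the W x W matrix
    G = U U^T built from the last `window` updates, which stays
    tractable even when P >> W (the paper's extreme aspect-ratio regime).
    """
    U = np.stack(updates[-window:])                     # W x P update matrix
    G = U @ U.T                                         # W x W Gram matrix
    evals = np.linalg.eigvalsh(G)[::-1]                 # descending eigenvalues
    sigma = np.sqrt(np.clip(evals, 0.0, None))          # singular values of U
    ratios = sigma[:-1] / np.maximum(sigma[1:], 1e-12)  # sigma_j / sigma_{j+1}
    k_star = int(np.argmax(ratios)) + 1                 # 1-indexed argmax position
    return k_star, sigma, ratios

# Toy usage: P = 10_000 parameters, W = 10 window (pure noise, so k* is arbitrary).
rng = np.random.default_rng(0)
k_star, sigma, ratios = gap_position([rng.normal(size=10_000) for _ in range(10)])
```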

Core claim

In the extreme aspect-ratio regime, phase transitions are controlled by the intra-signal gap separating dominant from subdominant modes at k* = argmax σ_j/σ_{j+1}. From three assumptions the work derives gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; a spectral loss decomposition linking each mode's learning contribution to its Davis-Kahan stability coefficient; and the Gap Maximality Principle establishing that k* is the unique dynamically privileged position whose collapse alone disrupts learning, sustained by an α-feedback loop without optimizer assumptions. The adiabatic parameter A = ||ΔG||_F / (η g²) then classifies circuit stability: A ≪ 1 (plateau), A ~ 1 (phase transition), A ≫ 1 (forgetting).
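The regime classifier is easy to transcribe, with the caveat that the abstract states only qualitative thresholds (A ≪ 1, A ~ 1, A ≫ 1); the numeric cutoffs below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def adiabatic_regime(G_prev, G_curr, lr, grad_norm, lo=0.1, hi=10.0):
    """Classify the training regime via A = ||dG||_F / (eta * g^2).

    `lo` and `hi` are hypothetical cutoffs standing in for the paper's
    qualitative A << 1 / A ~ 1 / A >> 1 regimes.
    """
    A = np.linalg.norm(G_curr - G_prev, "fro") / (lr * grad_norm**2)
    if A < lo:
        return A, "plateau"
    if A > hi:
        return A, "forgetting"
    return A, "phase transition"
```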

What carries the argument

The spectral gap at position k* of the rolling-window Gram matrix of parameter updates, whose dynamics are governed by a Dyson-type ODE and whose maximality is enforced by an α-feedback loop.
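The paper's specific ODE is not reproduced on this page. For orientation, here is a sketch of the classical Dyson eigenvalue dynamics (Dyson 1962, reference [5]) that such an ODE would extend; the curvature-asymmetry, damping, and gradient-driving terms the paper adds are unspecified here and therefore omitted:

```python
import numpy as np

def dyson_step(lam, dt=1e-4, beta=1.0, rng=None):
    """One Euler-Maruyama step of classical Dyson Brownian motion:
    d lam_i = (beta/2) * sum_{j != i} dt / (lam_i - lam_j) + dB_i.
    Only the base repulsion-plus-noise dynamics are shown; the paper's
    extra terms are not given on this page.
    """
    rng = rng or np.random.default_rng()
    diff = lam[:, None] - lam[None, :]
    np.fill_diagonal(diff, np.inf)                   # drop the i == j term
    drift = (beta / 2.0) * (1.0 / diff).sum(axis=1)  # pairwise eigenvalue repulsion
    return lam + drift * dt + np.sqrt(dt) * rng.normal(size=lam.shape)

lam = np.sort(np.random.default_rng(1).normal(size=10))
for _ in range(1000):
    lam = dyson_step(lam)                            # eigenvalues spread apart over time
```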

Load-bearing premise

The gap dynamics, spectral loss decomposition, and Gap Maximality Principle all rest on three unspecified assumptions claimed to hold in the extreme aspect-ratio regime.

What would settle it

A training run in which a clear grokking or capability-gain event occurs without preceding gap dynamics or without collapse specifically at the predicted k* position.

Figures

Figures reproduced from arXiv: 2603.28964 by Yongzhong Xu.

Figure 1. The spectral edge framework. (A) Singular value spectrum of the Gram matrix G for TinyStories 51M, showing the intra-signal gap at k* = 2. All eigenvalues are far above the BBP noise detection threshold (dashed; see Theorem 4.1). (B) Ratio profile σ_k/σ_{k+1}: the maximum at k = 2 defines the signal rank. (C) Gap ratio σ_2/σ_3 over training for TinyStories 51M, showing the three-phase pattern: rise, plateau, c…

Figure 2. Spectral edge analysis for TinyStories 51M. Top: consecutive singular value ratios σ_k/σ_{k+1} over training. The σ_1/σ_2 gap (red) dominates during the plateau phase, while σ_2/σ_3 (blue) shows the three-phase rise–plateau–collapse pattern. Middle: eigenvalue gaps σ_k² − σ_{k+1}² over training, showing the same three-phase structure. Bottom: gap ratio σ_2/σ_3 (blue) overlaid with validation loss (red), confirming t…

Figure 3. Stability coefficient hierarchy for GPT-2 124M. Heatmap of α_j (Definition 16.1) across eigenvalue index j (vertical) and training step (horizontal). The dominant mode (j = 1, green) has α ≈ 1 throughout training. The gap mode (j = 2) fluctuates between stable (green) and unstable (red) as k* shifts. All subdominant modes (j ≥ 3, red) have α ≈ 0. The hierarchy α_1 > α_2 > ··· is visually obvious. Testing t…

Figure 4. Grokking as a spectral edge event (Dyck-1). Grokking runs (weight decay ω = 1.0, red/solid, 3 seeds) vs. control runs (ω = 0, blue/dashed, 3 seeds). Top: singular value ratio σ_1/σ_2 of W_Q. With weight decay, the ratio rises dramatically as σ_2 → 0 (gap opens); without weight decay, the ratio stays flat (no gap). Middle: Frobenius norm ‖W_Q‖_F. Weight decay compresses the weights; the control runs retain large…

Figure 5. Dependency graph of the spectral edge framework. Green = no dependence on [P, H] ≈ 0. Yellow (dashed) = qualitative conclusion survives; only exact coefficients need commutativity. Orange (thick dashed) = both structure and conclusion require [P, H] ≈ 0. The core diagnostic and learning-theoretic results (left column and NTK branch) are entirely clean. The flow equations are weakly dependent. Only the Kryl…
Original abstract

We develop the spectral edge analysis: phase transitions in neural network training -- grokking, capability gains, loss plateaus -- are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates. In the extreme aspect ratio regime (parameters $P \sim 10^8$, window $W \sim 10$), the classical BBP detection threshold is vacuous; the operative structure is the intra-signal gap separating dominant from subdominant modes at position $k^* = \mathrm{argmax}\, \sigma_j/\sigma_{j+1}$. From three assumptions we derive: (i) gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; (ii) a spectral loss decomposition linking each mode's learning contribution to its Davis--Kahan stability coefficient; (iii) the Gap Maximality Principle, showing that $k^*$ is the unique dynamically privileged position -- its collapse is the only one that disrupts learning, and it sustains itself through an $\alpha$-feedback loop requiring no assumption on the optimizer. The adiabatic parameter $\mathcal{A} = \|\Delta G\|_F / (\eta\, g^2)$ controls circuit stability: $\mathcal{A} \ll 1$ (plateau), $\mathcal{A} \sim 1$ (phase transition), $\mathcal{A} \gg 1$ (forgetting). Tested across six model families (150K--124M parameters): gap dynamics precede every grokking event (24/24 with weight decay, 1/24 without), the gap position is optimizer-dependent (Muon: $k^*=1$, AdamW: $k^*=2$ on the same model), and 19/20 quantitative predictions are confirmed. The framework is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws.
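Definition 16.1 (the stability coefficient α_j of Figure 3) is not reproduced on this page. As an illustrative stand-in only, one can track the eigenvector overlap between consecutive Gram matrices, a quantity the Davis-Kahan sin-θ theorem controls through the spectral gap around mode j and the perturbation size; this proxy is our assumption, not the paper's definition:

```python
import numpy as np

def mode_overlaps(G_prev, G_curr):
    """Per-mode eigenvector overlap |<v_j(t), v_j(t+1)>| between two
    consecutive Gram matrices; an illustrative proxy for the paper's
    alpha_j, which is Davis-Kahan based but not reproduced here.
    Davis-Kahan bounds the rotation of v_j by ||G_curr - G_prev||
    divided by the spectral gap around mode j.
    """
    _, V_prev = np.linalg.eigh(G_prev)                  # columns = eigenvectors
    _, V_curr = np.linalg.eigh(G_curr)
    V_prev, V_curr = V_prev[:, ::-1], V_curr[:, ::-1]   # descending eigenvalue order
    return np.abs((V_prev * V_curr).sum(axis=0))        # overlap per mode j, in [0, 1]
```

The absolute value absorbs the sign ambiguity of eigenvectors; near eigenvalue crossings the ordering of modes can swap, which is exactly the instability the gap is supposed to prevent.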

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that phase transitions in neural network training (grokking, capability gains, loss plateaus) are controlled by the intra-signal spectral gap at position k* = argmax σ_j/σ_{j+1} of the rolling-window Gram matrix of parameter updates. In the extreme aspect-ratio regime (P ~ 10^8, W ~ 10), three unspecified assumptions yield (i) gap dynamics via a Dyson-type ODE with curvature asymmetry, damping and gradient driving, (ii) a spectral loss decomposition via Davis-Kahan coefficients, and (iii) the Gap Maximality Principle asserting that k* is the unique dynamically privileged position sustained by an α-feedback loop. The adiabatic parameter A = ||ΔG||_F / (η g²) is introduced to classify regimes (A ≪ 1 for plateaus, A ~ 1 for transitions, A ≫ 1 for forgetting). Experiments across six model families (150K–124M parameters) report that gap dynamics precede every grokking event (24/24 with weight decay) and that 19/20 quantitative predictions hold, with k* shown to be optimizer-dependent.

Significance. If the derivations can be verified, the work supplies an analytical framework that links spectral properties of update matrices to training phase transitions and is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws. The reported temporal precedence of gap dynamics and optimizer-specific k* values would constitute a substantive empirical contribution to understanding grokking and related phenomena.

major comments (2)
  1. [Abstract and derivation sections] The three assumptions from which the Dyson-type ODE for gap dynamics, the spectral loss decomposition, and the Gap Maximality Principle are derived are never stated explicitly. Without their formulation it is impossible to verify whether these results follow in the P ≫ W regime or whether the Gap Maximality Principle is independent of the dynamics it explains.
  2. [Abstract] The adiabatic parameter A = ||ΔG||_F / (η g²) is constructed directly from the Frobenius norm of the update matrix and the gradient norm—quantities internal to the training process—creating a risk that the classification of stability regimes (plateau/transition/forgetting) is tautological rather than predictive.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

point-by-point responses
  1. Referee: [Abstract and derivation sections] The three assumptions from which the Dyson-type ODE for gap dynamics, the spectral loss decomposition, and the Gap Maximality Principle are derived are never stated explicitly. Without their formulation it is impossible to verify whether these results follow in the P ≫ W regime or whether the Gap Maximality Principle is independent of the dynamics it explains.

    Authors: We agree that the three assumptions must be stated explicitly to permit verification. In the revised manuscript we will add a dedicated subsection immediately preceding the derivation that enumerates them verbatim: (i) the extreme aspect-ratio regime P ≫ W with rolling-window Gram matrix of width W, (ii) the local linearity of the update map over the window, and (iii) the dominance of the leading singular vectors in the gradient projection. With these listed, it becomes straightforward to confirm that the Dyson-type ODE, the Davis–Kahan spectral loss decomposition, and the Gap Maximality Principle all follow directly in the stated regime and that the maximality principle is obtained without further reference to optimizer-specific dynamics. revision: yes

  2. Referee: [Abstract] The adiabatic parameter A = ||ΔG||_F / (η g²) is constructed directly from the Frobenius norm of the update matrix and the gradient norm—quantities internal to the training process—creating a risk that the classification of stability regimes (plateau/transition/forgetting) is tautological rather than predictive.

    Authors: While A is assembled from quantities observed during training, it is not tautological because its value is computed on a rolling basis and used prospectively: the regime classification at step t is determined by A evaluated at t − Δt for a fixed lag Δt. Our experiments demonstrate that this early value of A correctly anticipates the subsequent occurrence (or absence) of grokking and loss plateaus across model families. We will revise the abstract and the adiabatic-parameter section to stress this forward-looking usage and to include an explicit statement that A is evaluated before the transition it classifies. revision: partial
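The rebuttal's prospective-usage protocol is simple to state in code. A minimal sketch, in which the lag and cutoffs are hypothetical placeholders since the paper's values are not given on this page:

```python
def prospective_labels(A_series, lag=5, lo=0.1, hi=10.0):
    """Label step t from A evaluated at t - lag, as the rebuttal describes;
    `lag`, `lo`, and `hi` are hypothetical placeholders."""
    labels = {}
    for t in range(lag, len(A_series)):
        A = A_series[t - lag]  # only information available before step t is used
        labels[t] = "plateau" if A < lo else ("forgetting" if A > hi else "phase transition")
    return labels
```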

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper explicitly starts from three (unspecified in abstract) assumptions and derives the Dyson-type ODE for gap dynamics, the Davis-Kahan-based spectral loss decomposition, and the Gap Maximality Principle. The adiabatic parameter is introduced by definition as A = ||ΔG||_F / (η g^2) and then used to label stability regimes; this is a definitional mapping rather than a fitted quantity renamed as a prediction. No equations are shown reducing any claimed result to its inputs by construction, no self-citation chain is load-bearing, and no ansatz is smuggled via prior work. The 19/20 quantitative predictions and optimizer-dependent k* observations are presented as empirical tests external to the derivations, keeping the central claim independent of its starting assumptions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on three unspecified assumptions invoked for the derivations, plus the introduced adiabatic parameter A and the privileged position k* whose independent falsifiability is not shown in the abstract.

free parameters (1)
  • adiabatic parameter A
    Defined as ||ΔG||_F / (η g²) and used to demarcate plateau, transition, and forgetting regimes; its value is computed from training quantities rather than derived from first principles.
axioms (1)
  • [ad hoc to paper] Three assumptions sufficient to derive gap dynamics, spectral loss decomposition, and Gap Maximality Principle
    The abstract states that the derivations begin from these three assumptions without listing them.
invented entities (1)
  • spectral gap position k* [no independent evidence]
    purpose: Unique dynamically privileged mode whose collapse disrupts learning and is sustained by an α-feedback loop
    Defined as argmax σ_j/σ_{j+1}; no external falsifiable prediction for its location is given beyond the training runs themselves.

pith-pipeline@v0.9.0 · 5646 in / 1460 out tokens · 54079 ms · 2026-05-14T21:17:51.661533+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

    cs.LG · 2026-04 · unverdicted · novelty 7.0

    The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap f...

  2. Spectral Edge Dynamics Reveal Functional Modes of Learning

    cs.LG · 2026-04 · unverdicted · novelty 7.0

    Spectral edge dynamics during grokking reveal task-dependent low-dimensional functional modes over inputs, such as Fourier modes for modular addition and cross-term decompositions for x² + y².

  3. Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 3 Pith papers · 7 internal anchors

  1. [1] J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.

  2. [2] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal., 97(6):1382–1408, 2006.

  3. [3] F. Benaych-Georges and R. R. Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math., 227(1):494–521, 2011.

  4. [4] R. Couillet and Z. Liao. Random Matrix Methods for Machine Learning. Cambridge University Press, 2022.

  5. [5] F. Dyson. A Brownian-motion model for the eigenvalues of a random matrix. J. Math. Phys., 3(6):1191–1198, 1962.

  6. [6] N. El Karoui. A rate of convergence result for the largest eigenvalue of complex white Wishart matrices. Ann. Probab., 34(6):2077–2117, 2006.

  7. [7] V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sb., 1(4):457–483, 1967.

  8. [8] C. A. Tracy and H. Widom. Level-spacing distributions and the Airy kernel. Comm. Math. Phys., 159(1):151–174, 1994.

  9. [9] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7(1):1–46, 1970.

  10. [10] T. Kato. Perturbation Theory for Linear Operators. Springer, 1966.

  11. [11] G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.

  12. [12] J. von Neumann and E. P. Wigner. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. Phys. Z., 30:467–470, 1929.

  13. [13] T. D. Barrett and B. Dherin. Implicit gradient regularization. ICLR, 2021.

  14. [14] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. ICLR, 2021.

  15. [15] A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. ICLR, 2023.

  16. [16] B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. ICML, 2019.

  17. [17] G. Gur-Ari, D. A. Roberts, and E. Dyer. Gradient descent happens in a tiny subspace. arXiv:1812.04754, 2018.

  18. [18] C. H. Martin and M. W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for training. JMLR, 22(165):1–73, 2021.

  19. [19] V. Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet Hessians. ICML, 2019.

  20. [20] D. A. Roberts, S. Yaida, and B. Hanin. The Principles of Deep Learning Theory. Cambridge University Press, 2022.

  21. [21] L. Sagun, U. Evci, V. U. Güney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv:1706.04454, 2017.

  22. [22] G. Yang. Tensor programs I–V: Feature learning in infinite-width neural networks. arXiv:2006.14548 et seq., 2020–2023.

  23. [23] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.

  24. [24] J. Hoffmann, S. Borgeaud, A. Mensch, et al. Training compute-optimal large language models. NeurIPS, 2022.

  25. [25] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma. Explaining neural scaling laws. arXiv:2102.06701, 2021.

  26. [26] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177, 2022.

  27. [27] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. ICLR, 2023.

  28. [28] C. Olsson, N. Elhage, N. Nanda, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.

  29. [29] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. ICLR, 2023.

  30. [30] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ICLR, 2019.

  31. [31] K. Jordan. Muon: An optimizer for hidden layers. 2024.

  32. [32] Y. Xu. Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales. arXiv:2603.15678, 2026.

  33. [33] Y. Xu. Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training. arXiv:2602.23696, 2026.

  34. [34] Y. Xu. Global Low-Rank, Local Full-Rank: The Holographic Encoding of Learned Algorithms. arXiv:2602.18649, 2026.

  35. [35] Y. Xu. Low-dimensional and transversely curved optimization dynamics in grokking. arXiv:2602.16746, 2026.

  36. [36] Y. Xu. The geometry of multi-task grokking: Transverse instability, superposition, and weight decay phase structure. arXiv:2602.18523, 2026.

  37. [37] Y. Xu. Early-Warning Signals of Grokking via Loss-Landscape Geometry. arXiv:2602.16967, 2026.

  38. [38] Y. Xu. The spectral edge thesis: Detailed mathematical notes. Companion document, 2026. 196 pp.