pith. machine review for the scientific record.

arxiv: 2603.28964 · v3 · submitted 2026-03-30 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links · Lean Theorem

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 21:17 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords spectral gap · phase transitions · grokking · neural network training · Gram matrix · Dyson dynamics · optimizer dependence · adiabatic parameter

The pith

Phase transitions in neural network training are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that abrupt shifts in neural network behavior during training, such as grokking after long plateaus or sudden capability gains, are driven by changes in one specific spectral gap. In the extreme regime of far more parameters than the size of the rolling observation window, this gap sits at the position k* where consecutive singular values show their largest ratio. From three assumptions the analysis derives how the gap evolves over time, how loss contributions break down by mode, and why only the collapse at this privileged position halts progress. The size of recent matrix changes relative to learning rate and gradient strength then decides whether the system stays in a plateau, undergoes a transition, or begins to forget.
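The page gives the construction only in outline; a minimal sketch of the diagnostic, assuming the window stacks the last W flattened parameter-update vectors into a W × P matrix (the function name and the eigendecomposition route are illustrative, not the paper's code), might look like this:

```python
import numpy as np

def gap_position(updates, window=10):
    """Spectrum of the rolling-window Gram matrix and the gap position k*.

    Assumes `updates` is a list of flattened parameter-update vectors
    (length P each) and that the Gram matrix is the W x W matrix
    G = U U^T built from the last `window` updates, which stays
    tractable even when P >> W (the paper's extreme aspect-ratio regime).
    """
    U = np.stack(updates[-window:])                     # W x P update matrix
    G = U @ U.T                                         # W x W Gram matrix
    evals = np.linalg.eigvalsh(G)[::-1]                 # descending eigenvalues
    sigma = np.sqrt(np.clip(evals, 0.0, None))          # singular values of U
    ratios = sigma[:-1] / np.maximum(sigma[1:], 1e-12)  # sigma_j / sigma_{j+1}
    k_star = int(np.argmax(ratios)) + 1                 # 1-indexed argmax position
    return k_star, sigma, ratios

# Toy usage: P = 10_000 parameters, W = 10 window (pure noise, so k* is arbitrary).
rng = np.random.default_rng(0)
k_star, sigma, ratios = gap_position([rng.normal(size=10_000) for _ in range(10)])
```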

Core claim

In the extreme aspect-ratio regime, phase transitions are controlled by the intra-signal gap separating dominant from subdominant modes at k* = argmax σ_j/σ_{j+1}. From three assumptions the work derives gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; a spectral loss decomposition linking each mode's learning contribution to its Davis-Kahan stability coefficient; and the Gap Maximality Principle establishing that k* is the unique dynamically privileged position whose collapse alone disrupts learning, sustained by an α-feedback loop without optimizer assumptions. The adiabatic parameter A = ||ΔG||_F / (η g²) then classifies circuit stability: A ≪ 1 (plateau), A ~ 1 (phase transition), A ≫ 1 (forgetting).
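The regime classifier is easy to transcribe, with the caveat that the abstract states only qualitative thresholds (A ≪ 1, A ~ 1, A ≫ 1); the numeric cutoffs below are hypothetical placeholders, not values from the paper:

```python
import numpy as np

def adiabatic_regime(G_prev, G_curr, lr, grad_norm, lo=0.1, hi=10.0):
    """Classify the training regime via A = ||dG||_F / (eta * g^2).

    `lo` and `hi` are hypothetical cutoffs standing in for the paper's
    qualitative A << 1 / A ~ 1 / A >> 1 regimes.
    """
    A = np.linalg.norm(G_curr - G_prev, "fro") / (lr * grad_norm**2)
    if A < lo:
        return A, "plateau"
    if A > hi:
        return A, "forgetting"
    return A, "phase transition"
```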

What carries the argument

The spectral gap at position k* of the rolling-window Gram matrix of parameter updates, whose dynamics are governed by a Dyson-type ODE and whose maximality is enforced by an α-feedback loop.
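The paper's specific ODE is not reproduced on this page. For orientation, here is a sketch of the classical Dyson eigenvalue dynamics (Dyson 1962, reference [5]) that such an ODE would extend; the curvature-asymmetry, damping, and gradient-driving terms the paper adds are unspecified here and therefore omitted:

```python
import numpy as np

def dyson_step(lam, dt=1e-4, beta=1.0, rng=None):
    """One Euler-Maruyama step of classical Dyson Brownian motion:
    d lam_i = (beta/2) * sum_{j != i} dt / (lam_i - lam_j) + dB_i.
    Only the base repulsion-plus-noise dynamics are shown; the paper's
    extra terms are not given on this page.
    """
    rng = rng or np.random.default_rng()
    diff = lam[:, None] - lam[None, :]
    np.fill_diagonal(diff, np.inf)                   # drop the i == j term
    drift = (beta / 2.0) * (1.0 / diff).sum(axis=1)  # pairwise eigenvalue repulsion
    return lam + drift * dt + np.sqrt(dt) * rng.normal(size=lam.shape)

lam = np.sort(np.random.default_rng(1).normal(size=10))
for _ in range(1000):
    lam = dyson_step(lam)                            # eigenvalues spread apart over time
```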

Load-bearing premise

The gap dynamics, spectral loss decomposition, and Gap Maximality Principle all rest on three unspecified assumptions claimed to hold in the extreme aspect-ratio regime.

What would settle it

A training run in which a clear grokking or capability-gain event occurs without preceding gap dynamics or without collapse specifically at the predicted k* position.

Figures

Figures reproduced from arXiv: 2603.28964 by Yongzhong Xu.

Figure 1. The spectral edge framework. (A) Singular value spectrum of the Gram matrix G for TinyStories 51M, showing the intra-signal gap at k* = 2. All eigenvalues are far above the BBP noise detection threshold (dashed; see Theorem 4.1). (B) Ratio profile σ_k/σ_{k+1}: the maximum at k = 2 defines the signal rank. (C) Gap ratio σ_2/σ_3 over training for TinyStories 51M, showing the three-phase pattern: rise, plateau, c…

Figure 2. Spectral edge analysis for TinyStories 51M. Top: consecutive singular value ratios σ_k/σ_{k+1} over training. The σ_1/σ_2 gap (red) dominates during the plateau phase, while σ_2/σ_3 (blue) shows the three-phase rise–plateau–collapse pattern. Middle: eigenvalue gaps σ_k² − σ_{k+1}² over training, showing the same three-phase structure. Bottom: gap ratio σ_2/σ_3 (blue) overlaid with validation loss (red), confirming t…

Figure 3. Stability coefficient hierarchy for GPT-2 124M. Heatmap of α_j (Definition 16.1) across eigenvalue index j (vertical) and training step (horizontal). The dominant mode (j = 1, green) has α ≈ 1 throughout training. The gap mode (j = 2) fluctuates between stable (green) and unstable (red) as k* shifts. All subdominant modes (j ≥ 3, red) have α ≈ 0. The hierarchy α_1 > α_2 > ··· is visually obvious. Testing t…

Figure 4. Grokking as a spectral edge event (Dyck-1). Grokking runs (weight decay ω = 1.0, red/solid, 3 seeds) vs. control runs (ω = 0, blue/dashed, 3 seeds). Top: singular value ratio σ_1/σ_2 of W_Q. With weight decay, the ratio rises dramatically as σ_2 → 0 (gap opens); without weight decay, the ratio stays flat (no gap). Middle: Frobenius norm ‖W_Q‖_F. Weight decay compresses the weights; the control runs retain large…

Figure 5. Dependency graph of the spectral edge framework. Green = no dependence on [P, H] ≈ 0. Yellow (dashed) = qualitative conclusion survives; only exact coefficients need commutativity. Orange (thick dashed) = both structure and conclusion require [P, H] ≈ 0. The core diagnostic and learning-theoretic results (left column and NTK branch) are entirely clean. The flow equations are weakly dependent. Only the Kryl…
Original abstract

We develop the spectral edge analysis: phase transitions in neural network training -- grokking, capability gains, loss plateaus -- are controlled by the spectral gap of the rolling-window Gram matrix of parameter updates. In the extreme aspect ratio regime (parameters $P \sim 10^8$, window $W \sim 10$), the classical BBP detection threshold is vacuous; the operative structure is the intra-signal gap separating dominant from subdominant modes at position $k^* = \mathrm{argmax}\, \sigma_j/\sigma_{j+1}$. From three assumptions we derive: (i) gap dynamics governed by a Dyson-type ODE with curvature asymmetry, damping, and gradient driving; (ii) a spectral loss decomposition linking each mode's learning contribution to its Davis--Kahan stability coefficient; (iii) the Gap Maximality Principle, showing that $k^*$ is the unique dynamically privileged position -- its collapse is the only one that disrupts learning, and it sustains itself through an $\alpha$-feedback loop requiring no assumption on the optimizer. The adiabatic parameter $\mathcal{A} = \|\Delta G\|_F / (\eta\, g^2)$ controls circuit stability: $\mathcal{A} \ll 1$ (plateau), $\mathcal{A} \sim 1$ (phase transition), $\mathcal{A} \gg 1$ (forgetting). Tested across six model families (150K--124M parameters): gap dynamics precede every grokking event (24/24 with weight decay, 1/24 without), the gap position is optimizer-dependent (Muon: $k^*=1$, AdamW: $k^*=2$ on the same model), and 19/20 quantitative predictions are confirmed. The framework is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws.
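Definition 16.1 (the stability coefficient α_j of Figure 3) is not reproduced on this page. As an illustrative stand-in only, one can track the eigenvector overlap between consecutive Gram matrices, a quantity the Davis-Kahan sin-θ theorem controls through the spectral gap around mode j and the perturbation size; this proxy is our assumption, not the paper's definition:

```python
import numpy as np

def mode_overlaps(G_prev, G_curr):
    """Per-mode eigenvector overlap |<v_j(t), v_j(t+1)>| between two
    consecutive Gram matrices; an illustrative proxy for the paper's
    alpha_j, which is Davis-Kahan based but not reproduced here.
    Davis-Kahan bounds the rotation of v_j by ||G_curr - G_prev||
    divided by the spectral gap around mode j.
    """
    _, V_prev = np.linalg.eigh(G_prev)                  # columns = eigenvectors
    _, V_curr = np.linalg.eigh(G_curr)
    V_prev, V_curr = V_prev[:, ::-1], V_curr[:, ::-1]   # descending eigenvalue order
    return np.abs((V_prev * V_curr).sum(axis=0))        # overlap per mode j, in [0, 1]
```

The absolute value absorbs the sign ambiguity of eigenvectors; near eigenvalue crossings the ordering of modes can swap, which is exactly the instability the gap is supposed to prevent.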

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript claims that phase transitions in neural network training (grokking, capability gains, loss plateaus) are controlled by the intra-signal spectral gap at position k* = argmax σ_j/σ_{j+1} of the rolling-window Gram matrix of parameter updates. In the extreme aspect-ratio regime (P ~ 10^8, W ~ 10), three unspecified assumptions yield (i) gap dynamics via a Dyson-type ODE with curvature asymmetry, damping and gradient driving, (ii) a spectral loss decomposition via Davis-Kahan coefficients, and (iii) the Gap Maximality Principle asserting that k* is the unique dynamically privileged position sustained by an α-feedback loop. The adiabatic parameter A = ||ΔG||_F / (η g²) is introduced to classify regimes (A ≪ 1 for plateaus, A ~ 1 for transitions, A ≫ 1 for forgetting). Experiments across six model families (150K–124M parameters) report that gap dynamics precede every grokking event (24/24 with weight decay) and that 19/20 quantitative predictions hold, with k* shown to be optimizer-dependent.

Significance. If the derivations can be verified, the work supplies an analytical framework that links spectral properties of update matrices to training phase transitions and is consistent with the edge of stability, Tensor Programs, Dyson Brownian motion, the Lottery Ticket Hypothesis, and neural scaling laws. The reported temporal precedence of gap dynamics and optimizer-specific k* values would constitute a substantive empirical contribution to understanding grokking and related phenomena.

major comments (2)
  1. [Abstract and derivation sections] The three assumptions from which the Dyson-type ODE for gap dynamics, the spectral loss decomposition, and the Gap Maximality Principle are derived are never stated explicitly. Without their formulation it is impossible to verify whether these results follow in the P ≫ W regime or whether the Gap Maximality Principle is independent of the dynamics it explains.
  2. [Abstract] The adiabatic parameter A = ||ΔG||_F / (η g²) is constructed directly from the Frobenius norm of the update matrix and the gradient norm—quantities internal to the training process—creating a risk that the classification of stability regimes (plateau/transition/forgetting) is tautological rather than predictive.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

point-by-point responses
  1. Referee: [Abstract and derivation sections] The three assumptions from which the Dyson-type ODE for gap dynamics, the spectral loss decomposition, and the Gap Maximality Principle are derived are never stated explicitly. Without their formulation it is impossible to verify whether these results follow in the P ≫ W regime or whether the Gap Maximality Principle is independent of the dynamics it explains.

    Authors: We agree that the three assumptions must be stated explicitly to permit verification. In the revised manuscript we will add a dedicated subsection immediately preceding the derivation that enumerates them verbatim: (i) the extreme aspect-ratio regime P ≫ W with rolling-window Gram matrix of width W, (ii) the local linearity of the update map over the window, and (iii) the dominance of the leading singular vectors in the gradient projection. With these listed, it becomes straightforward to confirm that the Dyson-type ODE, the Davis–Kahan spectral loss decomposition, and the Gap Maximality Principle all follow directly in the stated regime and that the maximality principle is obtained without further reference to optimizer-specific dynamics. revision: yes

  2. Referee: [Abstract] The adiabatic parameter A = ||ΔG||_F / (η g²) is constructed directly from the Frobenius norm of the update matrix and the gradient norm—quantities internal to the training process—creating a risk that the classification of stability regimes (plateau/transition/forgetting) is tautological rather than predictive.

    Authors: While A is assembled from quantities observed during training, it is not tautological because its value is computed on a rolling basis and used prospectively: the regime classification at step t is determined by A evaluated at t − Δt for a fixed lag Δt. Our experiments demonstrate that this early value of A correctly anticipates the subsequent occurrence (or absence) of grokking and loss plateaus across model families. We will revise the abstract and the adiabatic-parameter section to stress this forward-looking usage and to include an explicit statement that A is evaluated before the transition it classifies. revision: partial
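The rebuttal's prospective-usage protocol is simple to state in code. A minimal sketch, in which the lag and cutoffs are hypothetical placeholders since the paper's values are not given on this page:

```python
def prospective_labels(A_series, lag=5, lo=0.1, hi=10.0):
    """Label step t from A evaluated at t - lag, as the rebuttal describes;
    `lag`, `lo`, and `hi` are hypothetical placeholders."""
    labels = {}
    for t in range(lag, len(A_series)):
        A = A_series[t - lag]  # only information available before step t is used
        labels[t] = "plateau" if A < lo else ("forgetting" if A > hi else "phase transition")
    return labels
```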

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper explicitly starts from three (unspecified in abstract) assumptions and derives the Dyson-type ODE for gap dynamics, the Davis-Kahan-based spectral loss decomposition, and the Gap Maximality Principle. The adiabatic parameter is introduced by definition as A = ||ΔG||_F / (η g^2) and then used to label stability regimes; this is a definitional mapping rather than a fitted quantity renamed as a prediction. No equations are shown reducing any claimed result to its inputs by construction, no self-citation chain is load-bearing, and no ansatz is smuggled via prior work. The 19/20 quantitative predictions and optimizer-dependent k* observations are presented as empirical tests external to the derivations, keeping the central claim independent of its starting assumptions.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 1 invented entities

The central claim rests on three unspecified assumptions invoked for the derivations, plus the introduced adiabatic parameter A and the privileged position k* whose independent falsifiability is not shown in the abstract.

free parameters (1)
  • adiabatic parameter A
    Defined as ||ΔG||_F / (η g²) and used to demarcate plateau, transition, and forgetting regimes; its value is computed from training quantities rather than derived from first principles.
axioms (1)
  • [ad hoc to paper] Three assumptions sufficient to derive gap dynamics, spectral loss decomposition, and Gap Maximality Principle
    The abstract states that the derivations begin from these three assumptions without listing them.
invented entities (1)
  • spectral gap position k* [no independent evidence]
    purpose: Unique dynamically privileged mode whose collapse disrupts learning and is sustained by an α-feedback loop
    Defined as argmax σ_j/σ_{j+1}; no external falsifiable prediction for its location is given beyond the training runs themselves.

pith-pipeline@v0.9.0 · 5646 in / 1460 out tokens · 54079 ms · 2026-05-14T21:17:51.661533+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

    cs.LG · 2026-04 · unverdicted · novelty 7.0

    The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap f...

  2. Spectral Edge Dynamics Reveal Functional Modes of Learning

    cs.LG · 2026-04 · unverdicted · novelty 7.0

    Spectral edge dynamics during grokking reveal task-dependent low-dimensional functional modes over inputs, such as Fourier modes for modular addition and cross-term decompositions for x² + y².

  3. Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories

    cs.LG · 2026-04 · unverdicted · novelty 5.0

    Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · cited by 3 Pith papers · 7 internal anchors

  1. [1] J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Ann. Probab., 33(5):1643–1697, 2005.

  2. [2] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models. J. Multivariate Anal., 97(6):1382–1408, 2006.

  3. [3] F. Benaych-Georges and R. R. Nadakuditi. The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices. Adv. Math., 227(1):494–521, 2011.

  4. [4] R. Couillet and Z. Liao. Random Matrix Methods for Machine Learning. Cambridge University Press, 2022.

  5. [5] F. Dyson. A Brownian-motion model for the eigenvalues of a random matrix. J. Math. Phys., 3(6):1191–1198, 1962.

  6. [6] N. El Karoui. A rate of convergence result for the largest eigenvalue of complex white Wishart matrices. Ann. Probab., 34(6):2077–2117, 2006.

  7. [7] V. A. Marčenko and L. A. Pastur. Distribution of eigenvalues for some sets of random matrices. Math. USSR-Sb., 1(4):457–483, 1967.

  8. [8] C. A. Tracy and H. Widom. Level-spacing distributions and the Airy kernel. Comm. Math. Phys., 159(1):151–174, 1994.

  9. [9] C. Davis and W. M. Kahan. The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7(1):1–46, 1970.

  10. [10] T. Kato. Perturbation Theory for Linear Operators. Springer, 1966.

  11. [11] G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.

  12. [12] J. von Neumann and E. P. Wigner. Über das Verhalten von Eigenwerten bei adiabatischen Prozessen. Phys. Z., 30:467–470, 1929.

  13. [13] T. D. Barrett and B. Dherin. Implicit gradient regularization. ICLR, 2021.

  14. [14] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. ICLR, 2021.

  15. [15] A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. ICLR, 2023.

  16. [16] B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. ICML, 2019.

  17. [17] G. Gur-Ari, D. A. Roberts, and E. Dyer. Gradient descent happens in a tiny subspace. arXiv:1812.04754, 2018.

  18. [18] C. H. Martin and M. W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for training. JMLR, 22(165):1–73, 2021.

  19. [19] V. Papyan. Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet Hessians. ICML, 2019.

  20. [20] D. A. Roberts, S. Yaida, and B. Hanin. The Principles of Deep Learning Theory. Cambridge University Press, 2022.

  21. [21] L. Sagun, U. Evci, V. U. Güney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv:1706.04454, 2017.

  22. [22] G. Yang. Tensor programs I–V: Feature learning in infinite-width neural networks. arXiv:2006.14548 et seq., 2020–2023.

  23. [23] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei. Scaling laws for neural language models. arXiv:2001.08361, 2020.

  24. [24] J. Hoffmann, S. Borgeaud, A. Mensch, et al. Training compute-optimal large language models. NeurIPS, 2022.

  25. [25] Y. Bahri, E. Dyer, J. Kaplan, J. Lee, and U. Sharma. Explaining neural scaling laws. arXiv:2102.06701, 2021.

  26. [26] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv:2201.02177, 2022.

  27. [27] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. ICLR, 2023.

  28. [28] C. Olsson, N. Elhage, N. Nanda, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.

  29. [29] K. Wang, A. Variengien, A. Conmy, B. Shlegeris, and J. Steinhardt. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. ICLR, 2023.

  30. [30] J. Frankle and M. Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. ICLR, 2019.

  31. [31] K. Jordan. Muon: An optimizer for hidden layers. 2024.

  32. [32] Y. Xu. Spectral Edge Dynamics of Training Trajectories: Signal–Noise Geometry Across Scales. arXiv:2603.15678, 2026.

  33. [33] Y. Xu. Optimizer-Induced Low-Dimensional Drift and Transverse Dynamics in Transformer Training. arXiv:2602.23696, 2026.

  34. [34] Y. Xu. Global Low-Rank, Local Full-Rank: The Holographic Encoding of Learned Algorithms. arXiv:2602.18649, 2026.

  35. [35] Y. Xu. Low-dimensional and transversely curved optimization dynamics in grokking. arXiv:2602.16746, 2026.

  36. [36] Y. Xu. The geometry of multi-task grokking: Transverse instability, superposition, and weight decay phase structure. arXiv:2602.18523, 2026.

  37. [37] Y. Xu. Early-Warning Signals of Grokking via Loss-Landscape Geometry. arXiv:2602.16967, 2026.

  38. [38] Y. Xu. The spectral edge thesis: Detailed mathematical notes. Companion document, 2026. 196 pp.