Recognition: 2 theorem links
· Lean TheoremThe Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
Pith reviewed 2026-05-10 18:55 UTC · model grok-4.3
The pith
The dominant direction of parameter updates in neural networks shifts from gradient-driven learning to weight-decay compression exactly at grokking.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We decompose the spectral edge—the dominant direction of the Gram matrix of parameter updates—into its gradient and weight-decay components on Dyck-1 and SCAN tasks. Before grokking the edge is gradient-driven and functionally active. At grokking the two components align, turning the edge into a compression axis that is flat under perturbations yet more than 4000 times more critical to performance than random directions when ablated. Nonlinear probes recover the original information with high accuracy where linear probes fail, and removing weight decay after grokking reverses compression while the algorithm remains intact. Three universality classes—functional, mixed, and compression—are in,
What carries the argument
The spectral edge, the dominant eigenvector of the Gram matrix of parameter updates, which decomposes into gradient and weight-decay parts and becomes the compression axis once the two forces align at grokking.
Load-bearing premise
The alignment of gradient and weight-decay components at grokking is the cause of the subsequent compression phase rather than a side effect, and the gap flow equation predicts the three classes from independent dynamics.
What would settle it
Train the same architectures on new sequence tasks, compute the spectral edge before and after grokking, and ablate it; if the post-grokking ablation impact stays comparable to random directions or the gap flow equation fails to assign the observed class, the two-phase lifecycle claim is false.
Figures
read the original abstract
We decompose the spectral edge -- the dominant direction of the Gram matrix of parameter updates -- into its gradient and weight-decay components during grokking in two sequence tasks (Dyck-1 and SCAN). We find a sharp two-phase lifecycle: before grokking the edge is gradient-driven and functionally active; at grokking, gradient and weight decay align, and the edge becomes a compression axis that is perturbation-flat yet ablation-critical (>4000x more impactful than random directions). Three universality classes emerge (functional, mixed, compression), predicted by the gap flow equation. Nonlinear probes show information is re-encoded, not lost (MLP $R^2=0.99$ where linear $R^2=0.86$), and removing weight decay post-grok reverses compression while preserving the algorithm.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper decomposes the spectral edge (dominant direction of the Gram matrix of parameter updates) into gradient and weight-decay components during grokking on Dyck-1 and SCAN sequence tasks. It reports a sharp two-phase lifecycle: pre-grokking the edge is gradient-driven and functionally active; at grokking, the components align and the edge becomes a perturbation-flat yet ablation-critical compression axis (>4000x more impactful than random directions). Three universality classes (functional, mixed, compression) are identified and claimed to be predicted by a gap flow equation. Supporting observations include nonlinear probes showing re-encoding of information (MLP R²=0.99 vs. linear R²=0.86) and reversal of compression upon post-grokking removal of weight decay while preserving the learned algorithm.
Significance. If the central claims hold after verification, the work offers a mechanistic account of grokking as a transition from gradient-driven functional learning to weight-decay-driven compression along the spectral edge, with potential to unify observations across regularization and generalization phenomena. The reported universality classes and gap flow equation, if shown to be independently derived rather than post-hoc, would constitute a notable contribution; the ablation-criticality result and nonlinear probe evidence are strengths that could be load-bearing for broader impact in understanding implicit regularization.
major comments (3)
- [Abstract and results on lifecycle] The causal claim that gradient-weight-decay alignment at grokking drives the compression phase (rather than being a correlated byproduct) rests primarily on temporal coincidence and one coarse intervention (post-grokking weight-decay removal). A targeted decoupling experiment isolating the directional alignment from the mere presence of a nonzero weight-decay term is needed to support the two-phase lifecycle as causal.
- [Abstract and gap flow equation presentation] The gap flow equation is asserted to predict the three universality classes, yet the available description provides no derivation or solution steps independent of the same training runs used to label the classes. This raises a circularity risk: if the equation parameters or form were fitted to the observed grokking trajectories, the predictive claim is undermined.
- [Results on ablation and probes] Quantitative claims such as >4000x ablation impact relative to random directions and specific R² values (MLP 0.99, linear 0.86) cannot be assessed without the corresponding methods, data exclusion rules, error bars, or full ablation tables. These numbers are load-bearing for the compression-axis claim and must be accompanied by reproducible details.
minor comments (2)
- [Methods] Clarify the precise definition of the spectral edge and the Gram matrix construction in the methods section to allow replication.
- [Gap flow equation section] Add explicit statements on whether the gap flow equation was solved analytically or numerically and on the training hyperparameters used for the universality class labeling.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the causal interpretation, theoretical grounding, and reproducibility of our results. We address each major comment point by point below and will incorporate the suggested revisions.
read point-by-point responses
-
Referee: [Abstract and results on lifecycle] The causal claim that gradient-weight-decay alignment at grokking drives the compression phase (rather than being a correlated byproduct) rests primarily on temporal coincidence and one coarse intervention (post-grokking weight-decay removal). A targeted decoupling experiment isolating the directional alignment from the mere presence of a nonzero weight-decay term is needed to support the two-phase lifecycle as causal.
Authors: We agree that temporal coincidence and the existing post-grokking weight-decay removal experiment provide correlational support but fall short of isolating the directional alignment as the causal driver. The reversal of compression upon weight-decay removal (while preserving the algorithm) indicates that weight decay is necessary to sustain the aligned compression axis, but a more targeted intervention is warranted. In the revised manuscript we will add a decoupling experiment that holds the weight-decay magnitude fixed while disrupting or enforcing alignment (e.g., via projected gradient steps or auxiliary loss terms that penalize misalignment). This will directly test whether alignment, rather than the mere presence of weight decay, triggers the compression phase. revision: yes
-
Referee: [Abstract and gap flow equation presentation] The gap flow equation is asserted to predict the three universality classes, yet the available description provides no derivation or solution steps independent of the same training runs used to label the classes. This raises a circularity risk: if the equation parameters or form were fitted to the observed grokking trajectories, the predictive claim is undermined.
Authors: The gap flow equation is obtained by taking the continuous-time limit of the spectral-gap dynamics under simultaneous gradient and weight-decay updates, yielding a low-dimensional ODE whose fixed points and basins directly classify the three observed regimes. No parameters were fitted to the empirical trajectories; the equation form follows from the inner-product structure of the Gram matrix and the linear action of weight decay. In the revision we will include a self-contained derivation in the appendix, showing the step-by-step reduction from the discrete update rules to the gap ODE, the analytic solution for gap evolution, and the resulting phase diagram that predicts the functional, mixed, and compression classes from initial gap size and regularization strength alone. revision: yes
-
Referee: [Results on ablation and probes] Quantitative claims such as >4000x ablation impact relative to random directions and specific R² values (MLP 0.99, linear 0.86) cannot be assessed without the corresponding methods, data exclusion rules, error bars, or full ablation tables. These numbers are load-bearing for the compression-axis claim and must be accompanied by reproducible details.
Authors: We acknowledge that the current manuscript does not supply sufficient methodological detail for independent verification of the quantitative claims. The >4000x figure is the ratio of test-loss increase after ablating the spectral-edge direction versus the mean increase over 100 random directions, averaged over five random seeds with standard-deviation error bars; the R² values are obtained from 10-fold cross-validated probes on held-out activations. In the revised version we will expand the Methods section with the precise ablation protocol, number of trials, exclusion criteria (none beyond standard checkpoint filtering), full ablation tables, and probe implementation details so that all reported numbers become fully reproducible. revision: yes
Circularity Check
No significant circularity; derivation remains self-contained
full rationale
The paper decomposes the spectral edge into gradient and weight-decay components via direct analysis of Gram matrices during training on Dyck-1 and SCAN, observes temporal alignment at grokking, and validates compression via ablation and weight-decay removal interventions. The gap flow equation is asserted to predict the three universality classes, but the provided text gives no indication that this equation is obtained by fitting parameters to the same class-labeling observations or by self-referential definition; instead it functions as an independent organizing model whose derivation is not shown to collapse into the experimental outputs. No self-citation chains, ansatz smuggling, or renaming of known results appear as load-bearing steps. The central claims rest on empirical measurements and interventions that are falsifiable outside any fitted equation, satisfying the criteria for an independent derivation chain.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
dg/dt ≈ −η(h_k* − h_k*+1) d̄ − η(h̄ + ω)g + ηW (|G_k*|²/d_k* − |G_k*+1|²/d_k*+1)
-
IndisputableMonolith/Foundation/AlexanderDuality.leanalexander_duality_circle_linking unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Three universality classes emerge (functional, mixed, compression), predicted by the gap flow equation.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
M., Kaur, S., Li, Y., Kolter, J
Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. Gradient descent on neural networks typically occurs at the edge of stability. In ICLR, 2021
2021
-
[2]
and Kahan, W
Davis, C. and Kahan, W. M. The rotation of eigenvectors by a perturbation. III . SIAM J. Numer. Anal., 7(1):1--46, 1970
1970
-
[3]
Gradient Descent Happens in a Tiny Subspace
Gur-Ari, G., Roberts, D. A., and Dyer, E. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018
work page Pith review arXiv 2018
-
[4]
and Schmidhuber, J
Hochreiter, S. and Schmidhuber, J. Flat minima. Neural Computation, 9(1):1--42, 1997
1997
-
[5]
Lake, B. M. and Baroni, M. Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In ICML, 2018
2018
-
[6]
J., and Tegmark, M
Liu, Z., Michaud, E. J., and Tegmark, M. Omnigrok: Grokking beyond algorithmic data. In ICLR, 2023
2023
-
[7]
J., Tegmark, M., and Williams, M
Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. In NeurIPS, 2023
2023
-
[8]
and Hutter, F
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In ICLR, 2019
2019
-
[9]
and Li, J
Lyu, K. and Li, J. Gradient descent maximizes the margin of homogeneous neural networks. In ICLR, 2020
2020
-
[10]
S., Lee, J
Lyu, K., Jin, J., Li, Z., Du, S. S., Lee, J. D., and Hu, W. Dichotomy of early and late phase implicit biases can provably induce grokking. In ICLR, 2024
2024
-
[11]
Progress measures for grokking via mechanistic interpretability
Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In ICLR, 2023
2023
-
[12]
A toy model of interference weights
Olah, C., Batson, J., Templeton, A., Conerly, T., Henighan, T., and Carter, S. A toy model of interference weights. Transformer Circuits Thread, 2025. https://transformer-circuits.pub/2025/interference-weights/index.html
2025
-
[13]
Grokking: Generalization beyond overfitting on small algorithmic datasets
Power, A., Burda, Y., Edwards, H., Babuschkin, I., and Misra, V. Grokking: Generalization beyond overfitting on small algorithmic datasets. In MATH-AI Workshop, ICLR, 2022
2022
-
[14]
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
Sagun, L., Evci, U., G\"uney, V. U., Dauphin, Y., and Bottou, L. Empirical analysis of the H essian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017
work page Pith review arXiv 2017
-
[15]
Xu, Y. Low-dimensional execution manifolds in transformer learning dynamics: Evidence from modular arithmetic tasks. arXiv preprint arXiv:2602.10496, 2026
-
[16]
Low-Dimensional and Transversely Curved Optimization Dynamics in Grokking
Xu, Y. Low-dimensional and transversely curved optimization dynamics in grokking. arXiv preprint arXiv:2602.16746, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[17]
Early-Warning Signals of Grokking via Loss-Landscape Geometry
Xu, Y. Early-warning signals of grokking via loss-landscape geometry. arXiv preprint arXiv:2602.16967, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[18]
Xu, Y. The geometry of multi-task grokking: Transverse instability, superposition, and weight decay phase structure. arXiv preprint arXiv:2602.18523, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[19]
Holographic encoding and spectral edge events in neural network training
Xu, Y. Holographic encoding and spectral edge events in neural network training. arXiv preprint arXiv:2602.18649, 2026
-
[20]
Backbone drift and phase transitions in transformer pretraining
Xu, Y. Backbone drift and phase transitions in transformer pretraining. arXiv preprint arXiv:2602.23696, 2026
-
[21]
Xu, Y. The spectral edge thesis: A mathematical framework for intra-signal phase transitions in neural network training. arXiv preprint arXiv:2603.28964, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[22]
Spectral edge dynamics reveal functional modes of learning
Xu, Y. Spectral edge dynamics reveal functional modes of learning. arXiv preprint, 2026
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.