The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization
Pith reviewed 2026-05-15 16:14 UTC · model grok-4.3
The pith
Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated across 293 runs with R² up to 0.97 and Pearson r = 0.91.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimiser's effective contraction rate.
Load-bearing premise
That grokking is driven by norm separation between competing interpolating representations and that the discrete Lyapunov contraction argument plus dynamical constraints of regularised first-order optimisation directly yield the stated delay law.
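The contraction intuition behind the premise can be checked numerically. The sketch below (illustrative only, not the paper's code) simulates a pure decoupled weight-decay update θ_{t+1} = (1 - ηλ)θ_t and counts the steps needed for the squared norm to fall from ‖θ_mem‖² to ‖θ_post‖², comparing against the closed-form log-ratio estimate; the hyperparameter values are hypothetical.

```python
# Under pure decoupled weight decay, the squared norm contracts geometrically,
# so the step count to shrink from ||theta_mem||^2 to ||theta_post||^2 scales
# as (eta*lam)^{-1} * log(norm ratio) -- the gamma_eff^{-1} log(.) dependence
# claimed by the delay law.
import math

def steps_to_contract(norm_sq_start: float, norm_sq_target: float,
                      eta: float, lam: float) -> int:
    """Count update steps until the squared norm falls below the target."""
    norm_sq = norm_sq_start
    steps = 0
    factor = (1.0 - eta * lam) ** 2  # per-step squared-norm contraction
    while norm_sq > norm_sq_target:
        norm_sq *= factor
        steps += 1
    return steps

eta, lam = 1e-3, 1.0   # hypothetical learning rate and weight-decay coefficient
ratio = 50.0           # hypothetical ||theta_mem||^2 / ||theta_post||^2
simulated = steps_to_contract(ratio, 1.0, eta, lam)
predicted = math.log(ratio) / (2 * eta * lam)  # closed-form estimate
print(simulated, round(predicted))
```

The simulated step count and the closed-form estimate agree to within a fraction of a percent, which is the sense in which the Θ(·) scaling is a direct consequence of geometric norm contraction.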
Original abstract
Grokking -- the sudden generalisation that appears long after a model has perfectly memorised its training data -- has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta(\gamma_{\mathrm{eff}}^{-1} \log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2))$, where $\gamma_{\mathrm{eff}}$ is the optimiser's effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that grokking arises as a norm-driven representational phase transition under regularized training. It establishes the Norm-Separation Delay Law T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimizer's effective contraction rate (ηλ for SGD, ≥ηλ for AdamW). The upper bound is derived from a discrete Lyapunov contraction argument on the quadratic norm penalty; the matching lower bound follows from dynamical constraints of regularized first-order optimization. Across 293 runs on modular addition, multiplication, and sparse parity, the work reports inverse scaling of delay with weight decay (R²=0.97) and learning rate (R²=0.92), logarithmic dependence on the norm ratio (Pearson r=0.91), failure of SGD to grok at hyperparameters where AdamW succeeds, and a three-input predictor achieving 34.6% MAE at memorization time.
Significance. If the derivation is completed, the result supplies the first quantitative, falsifiable scaling law for grokking delay grounded in optimization dynamics rather than phenomenology. The high R² fits, the explicit contrast between SGD and AdamW, and the practical early-stopping algorithm constitute clear strengths that could be directly useful for training analysis. The work reframes delayed generalization as a predictable consequence of norm separation between competing interpolators.
major comments (1)
- [Abstract / Norm-Separation Delay Law statement] The central claim asserts both an upper and a matching lower bound for the Θ expression. The upper bound is attributed to a discrete Lyapunov contraction argument, yet the manuscript supplies only a high-level summary without the explicit sequence of inequalities, the precise Lyapunov function, or error terms. The lower bound is ascribed to 'dynamical constraints of regularised first-order optimisation' without a derivation showing that any trajectory must require at least Ω(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) steps before the smaller-norm solution can dominate the loss landscape. Until these steps are written out, the quantitative law reduces to an empirically supported scaling plus an unproven lower bound.
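One plausible shape for the missing upper-bound step, assuming decoupled weight decay and a gradient that is small along the memorising manifold near interpolation (both are assumptions supplied here, not steps from the paper):

```latex
% Sketch of a discrete Lyapunov contraction step (illustrative, not the paper's proof).
% Update rule: $\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t) - \eta\lambda\,\theta_t$.
% Take $V_t = \|\theta_t\|^2$ and suppose $\|\nabla L(\theta_t)\| = O(\epsilon)$ near interpolation:
\begin{align*}
V_{t+1} &\le (1 - \eta\lambda)^2 V_t + O(\eta\epsilon\,\|\theta_t\|)
         \le e^{-2\eta\lambda} V_t + O(\eta\epsilon\,\|\theta_t\|).
\end{align*}
% Ignoring the error term, $V_t \le e^{-2\eta\lambda t}\,\|\theta_{\mathrm{mem}}\|^2$,
% so $V_t$ falls below $\|\theta_{\mathrm{post}}\|^2$ by
\begin{align*}
t = \frac{1}{2\eta\lambda}\,
    \log\frac{\|\theta_{\mathrm{mem}}\|^2}{\|\theta_{\mathrm{post}}\|^2},
\end{align*}
% matching the $\Theta(\gamma_{\mathrm{eff}}^{-1}\log(\cdot))$ upper-bound scaling
% with $\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD.
```

A complete proof would have to control the error term and, separately, establish the matching lower bound; this sketch only shows where the log-ratio and γ_eff^{-1} factors would come from.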
minor comments (2)
- The practical three-input predictor is announced with a 34.6% MAE but its exact inputs, training procedure, and bootstrap details are not fully specified in the provided text; a short algorithmic box or pseudocode would improve reproducibility.
- The definition of γ_eff for AdamW is given as ≥ηλ; an explicit expression or bound in terms of β1, β2, and ε would remove ambiguity when comparing optimizers.
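A minimal sketch of what the three-input predictor could look like. The provided text does not specify the inputs; here they are assumed to be the learning rate, the weight-decay coefficient, and the norm ratio ‖θ_mem‖² / ‖θ_post‖² measured at memorisation time, with a fitted constant C absorbing the hidden factor in the Θ(·) bound. The function name and default C are hypothetical.

```python
# Hypothetical three-input delay predictor based on the stated law
#   T_grok - T_mem ~ gamma_eff^{-1} * log(||theta_mem||^2 / ||theta_post||^2).
import math

def predict_grokking_delay(eta: float, weight_decay: float,
                           norm_ratio: float, C: float = 0.5) -> float:
    """Predicted steps between memorisation and grokking.

    C absorbs the constant hidden in the Theta(.) bound and would be fitted
    on held-out runs; 0.5 matches the naive squared-norm contraction rate.
    """
    gamma_eff = eta * weight_decay  # SGD case; a lower bound for AdamW
    return C * math.log(norm_ratio) / gamma_eff

# Example: eta = 1e-3, weight decay 1.0, norm ratio 50.
print(round(predict_grokking_delay(1e-3, 1.0, 50.0)))
```

Doubling either η or λ halves the predicted delay, which is exactly the inverse-scaling behaviour the paper reports (R² = 0.97 and 0.92 respectively).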
Axiom & Free-Parameter Ledger
free parameters (1)
- γ_eff (the optimiser's effective contraction rate: ηλ for SGD, only bounded below by ηλ for AdamW)
axioms (2)
- domain assumption: Discrete Lyapunov contraction governs the upper bound on delay under regularized first-order optimization.
- domain assumption: Grokking arises as a representational phase transition driven by norm separation between memorizing and generalizing interpolators.
Forward citations
Cited by 1 Pith paper
- The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
  Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...
discussion (0)