pith. machine review for the scientific record.

arxiv: 2603.13331 · v2 · submitted 2026-03-05 · 💻 cs.AI · cs.LG

Recognition: unknown

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:14 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords grokking · delay · γ · contraction · θ · training · AdamW

The pith

Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural networks sometimes memorize their training data perfectly and only much later suddenly generalize to new examples, a behavior called grokking. The authors argue the delay arises because regularization must contract the weights from the large-norm memorizing solution down toward a smaller-norm generalizing one. They derive a formula in which the delay scales inversely with the effective contraction rate (set by the learning rate and weight decay) and linearly with the logarithm of the norm ratio between the two solutions. Experiments on modular arithmetic and parity tasks confirm the predicted inverse scaling with weight decay and learning rate, as well as the logarithmic dependence on the norm ratio. The work also shows that AdamW enables this separation while SGD does not at the same settings. Finally, a simple predictor built from norms measured at memorization time forecasts the delay with moderate accuracy.

Core claim

We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimiser's effective contraction rate.
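A minimal numeric sketch of what the law predicts, assuming the SGD case γ_eff = ηλ and folding the Θ constant into a prefactor C fitted per task; the function and the example values are illustrative, not taken from the paper.

    import math

    def predicted_grok_delay(eta, lam, norm_sq_mem, norm_sq_post, C=1.0):
        """Norm-Separation Delay Law: T_grok - T_mem ~ C * log(norm ratio) / gamma_eff."""
        gamma_eff = eta * lam                                # effective contraction rate (SGD case)
        return C * math.log(norm_sq_mem / norm_sq_post) / gamma_eff

    # Halving the weight decay doubles the predicted delay; quadrupling the norm ratio only adds log(4).
    print(predicted_grok_delay(1e-3, 1.0, 400.0, 100.0))     # ~1386 steps (times C)
    print(predicted_grok_delay(1e-3, 0.5, 400.0, 100.0))     # ~2773 steps (times C)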

Load-bearing premise

That grokking is driven by norm separation between competing interpolating representations and that the discrete Lyapunov contraction argument plus dynamical constraints of regularised first-order optimisation directly yield the stated delay law.

Original abstract

Grokking -- the sudden generalisation that appears long after a model has perfectly memorised its training data -- has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta(\gamma_{\mathrm{eff}}^{-1} \log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2))$, where $\gamma_{\mathrm{eff}}$ is the optimiser's effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
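A minimal sketch of the two update rules behind the SGD-versus-AdamW finding, written from the textbook definitions (SGD with an L2 penalty, AdamW with decoupled weight decay); it is context for the γ_eff definitions above, not code from the paper.

    import numpy as np

    def sgd_l2_step(theta, grad, eta, lam):
        # L2 penalty enters through the gradient, so shrinkage and the loss gradient share
        # one step: theta <- (1 - eta*lam) * theta - eta * grad, i.e. gamma_eff = eta*lam.
        return theta - eta * (grad + lam * theta)

    def adamw_step(theta, grad, m, v, t, eta, lam, b1=0.9, b2=0.999, eps=1e-8):
        # Decoupled weight decay: the decay alone contracts theta by a factor (1 - eta*lam)
        # each step, independently of the adaptively rescaled gradient term; the abstract's
        # gamma_eff >= eta*lam for AdamW says contraction is at least this fast.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta = theta - eta * lam * theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v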

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that grokking arises as a norm-driven representational phase transition under regularized training. It establishes the Norm-Separation Delay Law T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimizer's effective contraction rate (ηλ for SGD, ≥ηλ for AdamW). The upper bound is derived from a discrete Lyapunov contraction argument on the quadratic norm penalty; the matching lower bound follows from dynamical constraints of regularized first-order optimization. Across 293 runs on modular addition, multiplication, and sparse parity, the work reports inverse scaling of delay with weight decay (R²=0.97) and learning rate (R²=0.92), logarithmic dependence on the norm ratio (Pearson r=0.91), failure of SGD to grok at hyperparameters where AdamW succeeds, and a three-input predictor achieving 34.6% MAE at memorization time.

Significance. If the derivation is completed, the result supplies the first quantitative, falsifiable scaling law for grokking delay grounded in optimization dynamics rather than phenomenology. The high R² fits, the explicit contrast between SGD and AdamW, and the practical early-stopping algorithm constitute clear strengths that could be directly useful for training analysis. The work reframes delayed generalization as a predictable consequence of norm separation between competing interpolators.

major comments (1)
  1. [Abstract / Norm-Separation Delay Law statement] The central claim asserts both an upper and a matching lower bound for the Θ expression. The upper bound is attributed to a discrete Lyapunov contraction argument, yet the manuscript supplies only a high-level summary without the explicit sequence of inequalities, the precise Lyapunov function, or error terms. The lower bound is ascribed to 'dynamical constraints of regularised first-order optimisation' without a derivation showing that any trajectory must require at least Ω(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) steps before the smaller-norm solution can dominate the loss landscape. Until these steps are written out, the quantitative law reduces to an empirically supported scaling plus an unproven lower bound.
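One plausible reconstruction of the missing contraction step, keeping only the weight-decay part of the update (gradient contributions dropped); this is a guess at the intended argument under that assumption, not the authors' proof, and it gives only the upper-bound direction.

    $\|\theta_{t+1}\|^2 \le (1 - \eta\lambda)^2 \|\theta_t\|^2
        \;\Rightarrow\; \|\theta_{T_{\mathrm{mem}}+t}\|^2 \le (1 - \eta\lambda)^{2t} \|\theta_{\mathrm{mem}}\|^2$
    $\text{Setting the bound equal to } \|\theta_{\mathrm{post}}\|^2:\quad
        t^{*} = \frac{\log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2)}{-2\log(1 - \eta\lambda)}
        \approx \frac{1}{2\eta\lambda}\,\log\frac{\|\theta_{\mathrm{mem}}\|^2}{\|\theta_{\mathrm{post}}\|^2}$

This matches the claimed upper bound up to constants; a matching lower bound would still need the separate argument the report requests.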
minor comments (2)
  1. The practical three-input predictor is announced with a 34.6% MAE, but its exact inputs, training procedure, and bootstrap details are not fully specified in the provided text; a short algorithmic box or pseudocode would improve reproducibility (a hedged sketch of the kind of evaluation intended appears after this list).
  2. The definition of γ_eff for AdamW is given as ≥ηλ; an explicit expression or bound in terms of β1, β2, and ε would remove ambiguity when comparing optimizers.
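A hedged sketch of the evaluation box minor comment 1 asks for, assuming a standard percentile bootstrap over seeds for the relative-error metric; the paper's actual inputs, fitting procedure, and bootstrap choices are not specified in the text above, so every detail here is a placeholder.

    import numpy as np

    def bootstrap_delay_mae(pred_delays, true_delays, n_boot=10_000, seed=0):
        """Mean absolute relative error of predicted grokking delays, with a
        percentile-bootstrap 95% CI over seeds (one entry per training run)."""
        rng = np.random.default_rng(seed)
        pred = np.asarray(pred_delays, dtype=float)
        true = np.asarray(true_delays, dtype=float)
        rel_err = np.abs(pred - true) / true
        boots = np.array([rng.choice(rel_err, size=rel_err.size, replace=True).mean()
                          for _ in range(n_boot)])
        return rel_err.mean(), np.percentile(boots, [2.5, 97.5])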

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on optimization-dynamics assumptions identifying norm separation as the driver of the phase transition; no new physical entities are introduced and the norm ratio is treated as an observable rather than a fitted constant.

free parameters (1)
  • γ_eff
    Effective contraction rate defined as ηλ for SGD and bounded for AdamW; its precise value for AdamW may require empirical calibration.
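One way that calibration could be done empirically, assuming the post-memorisation norm trajectory decays roughly as ‖θ_t‖² ∝ (1 − γ_eff)^{2t}; a least-squares slope fit, offered as an illustration rather than the authors' procedure.

    import numpy as np

    def estimate_gamma_eff(norm_sq_history, t_mem, t_grok):
        """Fit the empirical contraction rate from log ||theta_t||^2 between
        memorisation and grokking: slope ~ 2*log(1 - gamma_eff) ~ -2*gamma_eff."""
        t = np.arange(t_mem, t_grok)
        log_norm_sq = np.log(np.asarray(norm_sq_history[t_mem:t_grok], dtype=float))
        slope = np.polyfit(t, log_norm_sq, 1)[0]
        return -slope / 2.0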
axioms (2)
  • domain assumption: Discrete Lyapunov contraction governs the upper bound on delay under regularized first-order optimization.
    Invoked to establish the Θ upper bound on T_grok - T_mem.
  • domain assumption: Grokking arises as a representational phase transition driven by norm separation between memorizing and generalizing interpolators.
    Core framing that converts the delay into a norm-ratio problem.

pith-pipeline@v0.9.0 · 5666 in / 1437 out tokens · 81458 ms · 2026-05-15T16:14:07.811515+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

    cs.LG · 2026-04 · unverdicted · novelty 8.0

    Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...