pith. machine review for the scientific record.

arxiv: 2603.13331 · v2 · submitted 2026-03-05 · 💻 cs.AI · cs.LG

Recognition: unknown

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 16:14 UTC · model grok-4.3

classification 💻 cs.AI cs.LG
keywords grokking · delay · γ · contraction · θ · training · AdamW

The pith

Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural networks sometimes memorize their training data perfectly and only much later suddenly generalize to new examples, a behavior called grokking. The authors argue the delay arises because regularization must contract the weights from the large-norm memorizing solution down toward a smaller-norm generalizing one. They derive a formula in which the delay scales inversely with the effective contraction rate (set by the learning rate and weight decay) and linearly with the logarithm of the norm ratio between the two solutions. Experiments on modular arithmetic and parity tasks confirm the predicted inverse scaling with weight decay and learning rate, as well as the logarithmic dependence on the norm ratio. The work also shows that AdamW enables this separation while SGD does not at the same settings. Finally, a simple predictor built from norms measured at memorization time forecasts the delay with moderate accuracy.

Core claim

We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimiser's effective contraction rate.
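A minimal numeric sketch of what the law predicts, assuming the SGD case γ_eff = ηλ and folding the Θ constant into a prefactor C fitted per task; the function and the example values are illustrative, not taken from the paper.

    import math

    def predicted_grok_delay(eta, lam, norm_sq_mem, norm_sq_post, C=1.0):
        """Norm-Separation Delay Law: T_grok - T_mem ~ C * log(norm ratio) / gamma_eff."""
        gamma_eff = eta * lam                                # effective contraction rate (SGD case)
        return C * math.log(norm_sq_mem / norm_sq_post) / gamma_eff

    # Halving the weight decay doubles the predicted delay; quadrupling the norm ratio only adds log(4).
    print(predicted_grok_delay(1e-3, 1.0, 400.0, 100.0))     # ~1386 steps (times C)
    print(predicted_grok_delay(1e-3, 0.5, 400.0, 100.0))     # ~2773 steps (times C)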

Load-bearing premise

That grokking is driven by norm separation between competing interpolating representations and that the discrete Lyapunov contraction argument plus dynamical constraints of regularised first-order optimisation directly yield the stated delay law.

Original abstract

Grokking -- the sudden generalisation that appears long after a model has perfectly memorised its training data -- has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta(\gamma_{\mathrm{eff}}^{-1} \log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2))$, where $\gamma_{\mathrm{eff}}$ is the optimiser's effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
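A minimal sketch of the two update rules behind the SGD-versus-AdamW finding, written from the textbook definitions (SGD with an L2 penalty, AdamW with decoupled weight decay); it is context for the γ_eff definitions above, not code from the paper.

    import numpy as np

    def sgd_l2_step(theta, grad, eta, lam):
        # L2 penalty enters through the gradient, so shrinkage and the loss gradient share
        # one step: theta <- (1 - eta*lam) * theta - eta * grad, i.e. gamma_eff = eta*lam.
        return theta - eta * (grad + lam * theta)

    def adamw_step(theta, grad, m, v, t, eta, lam, b1=0.9, b2=0.999, eps=1e-8):
        # Decoupled weight decay: the decay alone contracts theta by a factor (1 - eta*lam)
        # each step, independently of the adaptively rescaled gradient term; the abstract's
        # gamma_eff >= eta*lam for AdamW says contraction is at least this fast.
        m = b1 * m + (1 - b1) * grad
        v = b2 * v + (1 - b2) * grad ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        theta = theta - eta * lam * theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        return theta, m, v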

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that grokking arises as a norm-driven representational phase transition under regularized training. It establishes the Norm-Separation Delay Law T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimizer's effective contraction rate (ηλ for SGD, ≥ηλ for AdamW). The upper bound is derived from a discrete Lyapunov contraction argument on the quadratic norm penalty; the matching lower bound follows from dynamical constraints of regularized first-order optimization. Across 293 runs on modular addition, multiplication, and sparse parity, the work reports inverse scaling of delay with weight decay (R²=0.97) and learning rate (R²=0.92), logarithmic dependence on the norm ratio (Pearson r=0.91), failure of SGD to grok at hyperparameters where AdamW succeeds, and a three-input predictor achieving 34.6% MAE at memorization time.

Significance. If the derivation is completed, the result supplies the first quantitative, falsifiable scaling law for grokking delay grounded in optimization dynamics rather than phenomenology. The high R² fits, the explicit contrast between SGD and AdamW, and the practical early-stopping algorithm constitute clear strengths that could be directly useful for training analysis. The work reframes delayed generalization as a predictable consequence of norm separation between competing interpolators.

major comments (1)
  1. [Abstract / Norm-Separation Delay Law statement] The central claim asserts both an upper and a matching lower bound for the Θ expression. The upper bound is attributed to a discrete Lyapunov contraction argument, yet the manuscript supplies only a high-level summary without the explicit sequence of inequalities, the precise Lyapunov function, or error terms. The lower bound is ascribed to 'dynamical constraints of regularised first-order optimisation' without a derivation showing that any trajectory must require at least Ω(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) steps before the smaller-norm solution can dominate the loss landscape. Until these steps are written out, the quantitative law reduces to an empirically supported scaling plus an unproven lower bound.
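One plausible reconstruction of the missing contraction step, keeping only the weight-decay part of the update (gradient contributions dropped); this is a guess at the intended argument under that assumption, not the authors' proof, and it gives only the upper-bound direction.

    $\|\theta_{t+1}\|^2 \le (1 - \eta\lambda)^2 \|\theta_t\|^2
        \;\Rightarrow\; \|\theta_{T_{\mathrm{mem}}+t}\|^2 \le (1 - \eta\lambda)^{2t} \|\theta_{\mathrm{mem}}\|^2$
    $\text{Setting the bound equal to } \|\theta_{\mathrm{post}}\|^2:\quad
        t^{*} = \frac{\log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2)}{-2\log(1 - \eta\lambda)}
        \approx \frac{1}{2\eta\lambda}\,\log\frac{\|\theta_{\mathrm{mem}}\|^2}{\|\theta_{\mathrm{post}}\|^2}$

This matches the claimed upper bound up to constants; a matching lower bound would still need the separate argument the report requests.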
minor comments (2)
  1. The practical three-input predictor is announced with a 34.6% MAE, but its exact inputs, training procedure, and bootstrap details are not fully specified in the provided text; a short algorithmic box or pseudocode would improve reproducibility (a hedged sketch of the kind of evaluation intended appears after this list).
  2. The definition of γ_eff for AdamW is given as ≥ηλ; an explicit expression or bound in terms of β1, β2, and ε would remove ambiguity when comparing optimizers.
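A hedged sketch of the evaluation box minor comment 1 asks for, assuming a standard percentile bootstrap over seeds for the relative-error metric; the paper's actual inputs, fitting procedure, and bootstrap choices are not specified in the text above, so every detail here is a placeholder.

    import numpy as np

    def bootstrap_delay_mae(pred_delays, true_delays, n_boot=10_000, seed=0):
        """Mean absolute relative error of predicted grokking delays, with a
        percentile-bootstrap 95% CI over seeds (one entry per training run)."""
        rng = np.random.default_rng(seed)
        pred = np.asarray(pred_delays, dtype=float)
        true = np.asarray(true_delays, dtype=float)
        rel_err = np.abs(pred - true) / true
        boots = np.array([rng.choice(rel_err, size=rel_err.size, replace=True).mean()
                          for _ in range(n_boot)])
        return rel_err.mean(), np.percentile(boots, [2.5, 97.5])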

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The central claim rests on optimization-dynamics assumptions identifying norm separation as the driver of the phase transition; no new physical entities are introduced and the norm ratio is treated as an observable rather than a fitted constant.

free parameters (1)
  • γ_eff
    Effective contraction rate defined as ηλ for SGD and bounded for AdamW; its precise value for AdamW may require empirical calibration.
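One way that calibration could be done empirically, assuming the post-memorisation norm trajectory decays roughly as ‖θ_t‖² ∝ (1 − γ_eff)^{2t}; a least-squares slope fit, offered as an illustration rather than the authors' procedure.

    import numpy as np

    def estimate_gamma_eff(norm_sq_history, t_mem, t_grok):
        """Fit the empirical contraction rate from log ||theta_t||^2 between
        memorisation and grokking: slope ~ 2*log(1 - gamma_eff) ~ -2*gamma_eff."""
        t = np.arange(t_mem, t_grok)
        log_norm_sq = np.log(np.asarray(norm_sq_history[t_mem:t_grok], dtype=float))
        slope = np.polyfit(t, log_norm_sq, 1)[0]
        return -slope / 2.0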
axioms (2)
  • domain assumption: Discrete Lyapunov contraction governs the upper bound on delay under regularized first-order optimization.
    Invoked to establish the Θ upper bound on T_grok - T_mem.
  • domain assumption: Grokking arises as a representational phase transition driven by norm separation between memorizing and generalizing interpolators.
    Core framing that converts the delay into a norm-ratio problem.

pith-pipeline@v0.9.0 · 5666 in / 1437 out tokens · 81458 ms · 2026-05-15T16:14:07.811515+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry

    cs.LG · 2026-04 · unverdicted · novelty 8.0

    Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...