To Grok Grokking: Provable Grokking in Ridge Regression

Gal Vardi; Itay Safran; Mingyue Xu

arXiv: 2601.19791 · v3 · pith:XZ5TPSOOnew · submitted 2026-01-27 · cs.LG · stat.ML

classification: cs.LG, stat.ML

keywords: grokking, generalization, training, learning, regression, bounds, empirically, linear

We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay. Specifically, we prove that the following stages occur: (i) the model overfits the training data early during training; (ii) poor generalization persists long after overfitting has manifested; and (iii) the generalization error eventually becomes arbitrarily small. Moreover, we show, both theoretically and empirically, that grokking can be amplified or eliminated in a principled manner through proper hyperparameter tuning. To the best of our knowledge, these are the first rigorous quantitative bounds on the generalization delay (which we refer to as the "grokking time") in terms of training hyperparameters. Lastly, going beyond the linear setting, we empirically demonstrate that our quantitative bounds also capture the behavior of grokking on non-linear neural networks. Our results suggest that grokking is not an inherent failure mode of deep learning, but rather a consequence of specific training conditions, and thus does not require fundamental changes to the model architecture or learning algorithm to avoid.
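
The three stages described in the abstract can be illustrated with a minimal sketch of gradient descent with weight decay on an over-parameterized linear regression problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `simulate_ridge_gd`, the isotropic Gaussian features, the noiseless unit-norm target, and the values of `n`, `d`, `lam`, `lr`, and `init_scale`.

```python
import numpy as np


def simulate_ridge_gd(n=40, d=200, lam=0.01, lr=0.05, init_scale=1.0,
                      steps=10_000, seed=0):
    """Gradient descent with weight decay on over-parameterized linear regression.

    Hypothetical toy setup (not the paper's): d > n, isotropic Gaussian
    features, noiseless labels. Returns (train_mse, test_mse), each an
    array of length steps + 1 indexed by the gradient step.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))           # isotropic Gaussian features (assumed)
    w_star = rng.standard_normal(d)
    w_star /= np.linalg.norm(w_star)          # unit-norm ground truth (assumed)
    y = X @ w_star                            # noiseless labels (assumed)
    w = init_scale * rng.standard_normal(d)   # large random initialization

    train_mse = np.empty(steps + 1)
    test_mse = np.empty(steps + 1)
    for t in range(steps + 1):
        residual = X @ w - y
        train_mse[t] = np.mean(residual ** 2)
        # For x ~ N(0, I), the population risk E[(x.(w - w_star))^2]
        # equals ||w - w_star||^2, so no held-out sample is needed.
        test_mse[t] = np.sum((w - w_star) ** 2)
        # Gradient of (1/2n)||Xw - y||^2 + (lam/2)||w||^2.
        grad = X.T @ residual / n + lam * w
        w = w - lr * grad
    return train_mse, test_mse


if __name__ == "__main__":
    train, test = simulate_ridge_gd()
    for t in (0, 200, 2_000, 10_000):
        print(f"step {t:6d}  train MSE {train[t]:.3e}  test MSE {test[t]:.3e}")
```

In this toy, the part of the initialization orthogonal to the training inputs receives no data gradient and shrinks only by a factor of `1 - lr * lam` per step, so the delay between fitting the training set and the drop in test error grows roughly like `1 / (lr * lam)` and with `init_scale`. With isotropic features and `d > n` the test error settles at the error of the ridge solution rather than vanishing, so the sketch shows the delay mechanism only and does not reproduce the paper's assumptions or bounds.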

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Canonical Regularisation of Wide Feature-Learning Neural Networks
stat.ML · 2026-05 · unverdicted · novelty 8.0

Derives geodesic ridge regularization and Riemannian Gibbs Process prior for feature-learning wide neural networks, generalizing kernel-regime results via function-space axiomatization.

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
cs.LG · 2026-05 · conditional · novelty 6.0

Weight decay controls distinct learning regimes in grokking transformers on modular arithmetic, tracked by new cheap attention-based diagnostics with empirical critical value and exponent fits.