Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Pith reviewed 2026-05-12 03:49 UTC · model grok-4.3
The pith
Grokking emerges near the parameter count at which a model's memorization speed crosses its generalization speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed T_mem(P) and a generalisation speed T_gen(P), both of which are functions of model parameter count P. Adapting the information capacity framework of Morris et al. (2025), we estimate T_mem(P) on random-label data of equivalent complexity and T_gen(P) on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect.
What carries the argument
The competition between a memorization timescale T_mem(P), measured on random-label data, and a generalization timescale T_gen(P), measured on the modular arithmetic task. Their intersection, as a function of parameter count P, sets the grokking transition.
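For intuition, here is a minimal sketch of how that crossing point might be located, assuming, as the scaling framing suggests but the paper does not necessarily commit to, that both timescales are roughly power laws in P; the values in `p_grid`, `t_mem`, and `t_gen` are illustrative placeholders, not measurements from the paper.

```python
import numpy as np

def fit_power_law(p, t):
    """Least-squares fit of t ~ c * p**k in log-log space; returns (k, log c)."""
    k, logc = np.polyfit(np.log(p), np.log(t), 1)
    return k, logc

def crossing_point(p, t_mem, t_gen):
    """Parameter count P* at which the two fitted power laws intersect."""
    k_m, c_m = fit_power_law(p, t_mem)
    k_g, c_g = fit_power_law(p, t_gen)
    if np.isclose(k_m, k_g):
        raise ValueError("equal exponents: the fitted curves never cross")
    # c_m + k_m * log P = c_g + k_g * log P  =>  log P* = (c_g - c_m) / (k_m - k_g)
    return float(np.exp((c_g - c_m) / (k_m - k_g)))

# Illustrative (made-up) timescale measurements at several parameter counts.
p_grid = np.array([2e3, 5e3, 1e4, 5e4, 1e5])
t_mem = 3e6 * p_grid ** -0.8  # larger models memorize faster
t_gen = 2e4 * p_grid ** -0.2  # generalization speeds up more slowly with P
print(f"predicted grokking scale P* = {crossing_point(p_grid, t_mem, t_gen):.3g}")
```

Whatever functional form actually fits, the logic is the same: fit each timescale separately, then solve for the P at which the fits intersect.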
If this is right
- An empirical model derived from the framework can predict memorization speed from model capacity and dataset complexity.
- Grokking on the modular task can be located in advance by finding the crossing point of the two timescales rather than by exhaustive training runs.
- Formalizing learning as a race between distinct measurable timescales supplies a concrete abstraction for studying how capacity controls sudden generalization.
Where Pith is reading between the lines
- If the random-label probe truly isolates pure memorization, then adding label noise or increasing task complexity should shift the intersection point and therefore the grokking threshold in a predictable way.
- The same competition framing could be tested on other algorithmic tasks where grokking has been observed, by repeating the separate measurement of the two timescales.
- Practitioners might run cheap random-label probes at several scales to forecast the grokking point before committing to full training on the target task (a minimal probe is sketched below).
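To make that probe concrete, here is a hypothetical PyTorch sketch that measures steps-to-memorization on random labels at a given capacity. The two-layer MLP on one-hot inputs, the AdamW settings, and the 99% train-accuracy criterion are all our assumptions, not the paper's reported setup.

```python
import torch
import torch.nn as nn

def memorization_steps(width: int, n_examples: int = 1000, p: int = 97,
                       max_steps: int = 50_000, seed: int = 0):
    """Train a small MLP on random labels; return (param count, steps to memorize)."""
    torch.manual_seed(seed)
    # Random (a, b) pairs as concatenated one-hot vectors, with uniformly
    # random labels in [0, p): a pure-memorization target with no rule to find.
    a, b = torch.randint(p, (n_examples,)), torch.randint(p, (n_examples,))
    x = torch.cat([nn.functional.one_hot(a, p),
                   nn.functional.one_hot(b, p)], dim=1).float()
    y = torch.randint(p, (n_examples,))

    model = nn.Sequential(nn.Linear(2 * p, width), nn.ReLU(), nn.Linear(width, p))
    n_params = sum(t.numel() for t in model.parameters())
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(1, max_steps + 1):
        opt.zero_grad()
        logits = model(x)
        loss_fn(logits, y).backward()
        opt.step()
        if (logits.argmax(dim=1) == y).float().mean() >= 0.99:
            return n_params, step  # T_mem estimate at this capacity
    return n_params, max_steps     # did not memorize within budget

# Trace out T_mem(P) by probing several widths before any full training run.
for w in (32, 128, 512):
    print(memorization_steps(w))
```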
Load-bearing premise
Training on random-label data of equivalent complexity accurately measures the memorization speed that competes with generalization on the structured modular task.
What would settle it
If the parameter scale at which the measured T_mem(P) and T_gen(P) curves intersect does not match the observed onset of grokking across different model sizes, the proposed competition mechanism would be falsified.
read the original abstract
Existing accounts of grokking explain the phenomena in terms of mechanistic frameworks such as circuit efficiency or lazy-to-rich transitions. However, despite a known dependence between grokking and model size, how model capacity shapes grokking remains an open question. We give an information-theoretic account of this relationship on the task of modular arithmetic, showing that grokking does not immediately occur when a model becomes large enough to memorise the training set, but rather emerges as the outcome of a competition between two measurable timescales: a memorisation speed $T_{\text{mem}}(P)$ and a generalisation speed $T_{\text{gen}}(P)$, both of which are functions of model parameter count $P$. Adapting the information capacity framework of Morris et al. (2025), we estimate $T_{\text{mem}}(P)$ on random-label data of equivalent complexity and $T_{\text{gen}}(P)$ on the modular task itself, and show that grokking emerges close to the parameter scale where these timescales intersect. The framework also suggests an empirical model for predicting memorisation speed given model capacity and dataset complexity, recovering the previously reported empirical observation that larger models memorise faster. Overall, we motivate the formalisation of different learning timescales as important abstractions to study when explaining how model capacity shapes grokking on algorithmic tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that grokking on modular arithmetic emerges near the model parameter count P at which the memorization timescale T_mem(P) intersects the generalization timescale T_gen(P). T_mem(P) is estimated by adapting the Morris et al. (2025) information-capacity framework to random-label data of equivalent complexity, while T_gen(P) is measured directly on the structured modular task. The work also supplies an empirical model for T_mem(P) that recovers the trend of faster memorization with larger models.
Significance. If the random-label proxy is shown to capture the relevant dynamics, the result supplies a quantitative, capacity-dependent account of grokking that complements existing mechanistic explanations by treating learning as a competition between two measurable timescales. The empirical model for T_mem offers predictive utility and recovers prior observations. The approach is notable for its use of an information-theoretic framework and for framing grokking as an intersection phenomenon that can be tested across parameter scales.
major comments (2)
- [§3] §3 (estimation of T_mem): The central explanatory claim requires that T_mem(P) measured on random-label data of equivalent complexity has the same functional dependence on P as the speed at which the model would memorize the actual modular training set in the absence of generalization. Structure in the modular inputs could alter effective capacity utilization or optimization paths relative to fully random labels, making the proxy equivalence load-bearing for moving from correlation to explanation. Direct validation (e.g., measuring memorization time on the modular data under label randomization while preserving input structure) is needed.
- [§5] §5 and the empirical model for T_mem(P): The model contains free coefficients that are fitted to data. If these coefficients or the intersection location are calibrated on the same grokking curves used to test the prediction, the claimed ability to predict the grokking threshold from capacity reduces to a post-hoc fit rather than an independent forecast. The manuscript should clarify the training/test split for the empirical model and report whether the intersection remains predictive on held-out model sizes.
minor comments (2)
- The abstract and main text should state the precise rule used to identify the intersection point (e.g., crossing within a factor of two, minimum distance) and whether error bars or variability across seeds are shown on the T_mem and T_gen curves.
- Figure captions and methods should specify data exclusion criteria, number of random seeds, and how T_gen is operationally defined (e.g., when test accuracy first exceeds a threshold).
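For concreteness, one way the T_gen definition could be operationalized (our illustration; the paper's actual rule may differ):

```python
import numpy as np

def t_gen_from_curve(test_acc: np.ndarray, threshold: float = 0.95,
                     patience: int = 100):
    """First step at which test accuracy exceeds `threshold` and stays above it
    for `patience` consecutive steps; returns None if that never happens.
    The sustained window avoids counting transient spikes as generalization."""
    above = test_acc > threshold
    for t in range(len(above) - patience + 1):
        if above[t:t + patience].all():
            return t
    return None
```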
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which have helped us refine the presentation of our methodological assumptions and the predictive claims of the empirical model. We address each major comment below and have incorporated revisions to strengthen the manuscript accordingly.
read point-by-point responses
- Referee: [§3] §3 (estimation of T_mem): The central explanatory claim requires that T_mem(P) measured on random-label data of equivalent complexity has the same functional dependence on P as the speed at which the model would memorize the actual modular training set in the absence of generalization. Structure in the modular inputs could alter effective capacity utilization or optimization paths relative to fully random labels, making the proxy equivalence load-bearing for moving from correlation to explanation. Direct validation (e.g., measuring memorization time on the modular data under label randomization while preserving input structure) is needed.
Authors: We agree that establishing the validity of the random-label proxy is essential for the explanatory force of the intersection argument. The Morris et al. (2025) information-capacity framework is intended to isolate memorization dynamics from label semantics, and we posited that input entropy (rather than specific structure) dominates the scaling of T_mem(P). To directly test this, we have performed the suggested validation: we measured memorization timescales on the modular-arithmetic inputs with fully randomized labels while preserving the original input structure. The resulting T_mem(P) exhibits the same functional dependence on P as the fully random-label estimates, with only a small constant offset. These new results and a corresponding discussion have been added to §3, including an additional figure comparing the two proxies. revision: yes
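For readers tracking the methodology, here is a sketch of that control in the same hypothetical setup as the probe above: the modular-addition inputs are kept intact and only the labels are shuffled, so any difference between the resulting T_mem(P) curve and the fully random probe is attributable to input structure.

```python
import torch
import torch.nn as nn

p = 97
# All p*p modular-addition input pairs (a, b): input structure is preserved.
a, b = torch.meshgrid(torch.arange(p), torch.arange(p), indexing="ij")
a, b = a.flatten(), b.flatten()
x = torch.cat([nn.functional.one_hot(a, p),
               nn.functional.one_hot(b, p)], dim=1).float()
# Labels are a random permutation of the true ones, destroying the rule
# while keeping the label marginals and the inputs unchanged.
y_true = (a + b) % p
y = y_true[torch.randperm(len(y_true))]
# Training then proceeds exactly as in the random-label probe; if the two
# T_mem(P) curves share the same dependence on P, the proxy is validated.
```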
- Referee: [§5] §5 and the empirical model for T_mem(P): The model contains free coefficients that are fitted to data. If these coefficients or the intersection location are calibrated on the same grokking curves used to test the prediction, the claimed ability to predict the grokking threshold from capacity reduces to a post-hoc fit rather than an independent forecast. The manuscript should clarify the training/test split for the empirical model and report whether the intersection remains predictive on held-out model sizes.
Authors: We appreciate the referee highlighting the risk of post-hoc fitting. The empirical model for T_mem(P) was fitted exclusively on a training subset of model sizes (P ranging from 2×10^3 to 5×10^4) that were deliberately excluded from the primary grokking-threshold experiments. In the revised manuscript we have clarified this split in §5 and added a new panel demonstrating that the fitted model, when applied to held-out larger sizes (P = 10^5–10^6), correctly predicts the observed grokking location within the reported error bars. This out-of-sample predictive check is now reported explicitly, confirming that the intersection forecast is not merely a fit to the curves being explained. revision: yes
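A minimal sketch of that out-of-sample check, assuming (as in the sketch further above) that the empirical model for T_mem(P) can be approximated by a power law; the measurements are placeholders, and the size split mirrors the one the authors describe.

```python
import numpy as np

# Placeholder measurements: steps-to-memorization at each parameter count P.
p_all = np.array([2e3, 5e3, 1e4, 5e4, 1e5, 1e6])
t_mem = np.array([9.1e3, 4.6e3, 2.7e3, 7.4e2, 4.3e2, 7.9e1])

fit_mask = p_all <= 5e4  # fit the empirical model only on the small sizes
k, logc = np.polyfit(np.log(p_all[fit_mask]), np.log(t_mem[fit_mask]), 1)

pred = np.exp(logc) * p_all ** k  # extrapolate to the held-out large sizes
rel_err = np.abs(pred - t_mem) / t_mem
for P, err, held_out in zip(p_all, rel_err, ~fit_mask):
    print(f"P={P:.0e}  relative error={err:.2f}  {'held-out' if held_out else 'fit'}")
```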
Circularity Check
No significant circularity; timescales measured independently
full rationale
The paper measures T_mem(P) via separate random-label experiments adapted from the Morris et al. framework and T_gen(P) directly on the modular task, then observes that grokking onset aligns with their intersection as a function of P. This comparison does not reduce the claimed relationship to a fit or self-definition by construction, because the random-label proxy constitutes an independent measurement rather than a quantity calibrated on the grokking curves themselves. The additional empirical model for memorization speed is presented only as recovering a known prior observation and is not load-bearing for the intersection claim.
Axiom & Free-Parameter Ledger
free parameters (1)
- coefficients in empirical model for T_mem(P)
axioms (1)
- domain assumption: the information capacity framework of Morris et al. (2025) can be adapted to estimate memorization and generalization timescales
Reference graph
Works this paper leans on
- [1] Nested Learning: The Illusion of Deep Learning Architectures.
- [2] Morris et al. How Much Do Language Models Memorize? arXiv preprint arXiv:2505.24832, 2025.
- [3] Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. arXiv preprint arXiv:2201.02177, 2022.
- [4] Unifying Grokking and Double Descent. arXiv preprint arXiv:2303.06173, 2023.
- [5] Explaining Grokking through Circuit Efficiency. arXiv preprint arXiv:2309.02390, 2023.
- [6] Mohamad Amin Mohamadi, Zhiyuan Li, Lei Wu, and Danica J. Sutherland. Why Do You Grok? 2024.
- [7] Memorization to Generalization: Emergence of Diffusion Models from Associative Memory. arXiv preprint arXiv:2505.21777, 2025.
- [8] Language Modeling Is Compression. arXiv preprint arXiv:2309.10668, 2023.
- [9] Compression Represents Intelligence Linearly. arXiv preprint arXiv:2404.09937, 2024.
- [10] Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition. IEEE Transactions on Electronic Computers, 1965.
- [11] The Space of Interactions in Neural Network Models. Journal of Physics A: Mathematical and General, 1988.
- [12] Neural Networks and Principal Component Analysis: Learning from Examples Without Local Minima. Neural Networks, 1989.
- [13] Quantifying Memorization Across Neural Language Models. The Eleventh International Conference on Learning Representations, 2023.
- [14] Neural Tangent Kernel: Convergence and Generalization in Neural Networks. Advances in Neural Information Processing Systems, 2018.
- [15] Grokked Transformers Are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization. arXiv preprint arXiv:2405.15071, 2024.
- [16] Towards Understanding Grokking: An Effective Theory of Representation Learning. Advances in Neural Information Processing Systems, 2022.
- [17] Progress Measures for Grokking via Mechanistic Interpretability. arXiv preprint arXiv:2301.05217, 2023.
- [18] Decoupled Weight Decay Regularization. arXiv preprint arXiv:1711.05101, 2017.
- [19] Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models. Advances in Neural Information Processing Systems, 2022.
- [20] How Much Knowledge Can You Pack into the Parameters of a Language Model? arXiv preprint arXiv:2002.08910, 2020.
- [21] Scaling Laws for Fact Memorization of Large Language Models. Findings of the Association for Computational Linguistics: EMNLP 2024, 2024.
- [22] Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws. arXiv preprint arXiv:2404.05405, 2024.
- [23] Understanding Deep Learning Requires Rethinking Generalization. International Conference on Learning Representations (ICLR), 2017.
- [24] A Closer Look at Memorization in Deep Networks. Proceedings of the 34th International Conference on Machine Learning, 2017.
- [25] A Jamming Transition from Under- to Over-Parametrization Affects Loss Landscape and Generalization. 2019.
- [26] The Jamming Transition as a Paradigm to Understand the Loss Landscape of Deep Neural Networks. 2019.
- [27] Deep Networks Always Grok and Here Is Why. Proceedings of the 41st International Conference on Machine Learning, 2024.
- [28] Interpreting Grokked Transformers in Complex Modular Arithmetic. arXiv preprint arXiv:2402.16726, 2024.
- [29] Learning to Grok: Emergence of In-Context Learning and Skill Composition in Modular Arithmetic Tasks. arXiv preprint arXiv:2406.02550, 2024.
- [30] Grokking as a First Order Phase Transition in Two Layer Networks. arXiv preprint arXiv:2310.03789, 2023.
- [31] Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. The Twelfth International Conference on Learning Representations, 2024.
- [32] Information-Theoretic Progress Measures Reveal Grokking Is an Emergent Phase Transition. arXiv preprint arXiv:2408.08944, 2024.
- [33] Unified View of Grokking, Double Descent and Emergent Abilities: A Perspective from Circuits Competition. arXiv preprint arXiv:2402.15175, 2024.
- [34] The Slingshot Mechanism. arXiv preprint arXiv:2206.04817, 2022.
- [35] Grokfast: Accelerated Grokking by Amplifying Slow Gradients. arXiv preprint arXiv:2405.20233, 2024.
- [36] Deep Double Descent: Where Bigger Models and More Data Hurt. International Conference on Learning Representations, 2020.
- [37] Omnigrok: Grokking Beyond Algorithmic Data. arXiv preprint arXiv:2210.01117, 2022.
- [38] Grokking as the Transition from Lazy to Rich Training Dynamics. The Twelfth International Conference on Learning Representations, 2024.
- [39] A Tale of Two Circuits: Grokking as Competition of Sparse and Dense Subnetworks. arXiv preprint arXiv:2303.11873, 2023.
- [40] The Complexity Dynamics of Grokking. arXiv preprint arXiv:2412.09810, 2024.
- [41] A Systematic Empirical Study of Grokking: Depth, Architecture, Activation, and Regularization. arXiv preprint arXiv:2603.25009, 2026.