Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Slingshot loss spikes arise from floating-point precision limits that round correct-class gradients to zero, breaking zero-sum constraints and driving exponential parameter growth through numerical feature inflation.
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
Deep learning systems achieve systematic understanding through internal models tracking regularities but exhibit fractured understanding due to symbolic misalignment, lack of explicit reduction, and weak unification.
citing papers explorer
-
Progress measures for grokking via mechanistic interpretability
Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.
-
Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
Slingshot loss spikes arise from floating-point precision limits that round correct-class gradients to zero, breaking zero-sum constraints and driving exponential parameter growth through numerical feature inflation.
-
Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.
-
A Model of Understanding in Deep Learning Systems
Deep learning systems achieve systematic understanding through internal models tracking regularities but exhibit fractured understanding due to symbolic misalignment, lack of explicit reduction, and weak unification.