arXiv preprint arXiv:2206.04817

The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon · 2022 · arXiv 2206.04817

10 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 1

citation-polarity summary

contest 1

representative citing papers

Progress measures for grokking via mechanistic interpretability

cs.LG · 2023-01-12 · accept · novelty 8.0

Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

Learning as Observable Matrix Dynamics: Diffusive Relaxations versus Phase Transitions

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Observable Matrix Dynamics (OMD) is a new diagnostic framework that uses random matrix theory on distance matrices to distinguish diffusive relaxations from phase-transition-like reorganizations during neural network training.

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

stat.ML · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

Two steps of gradient descent on first-layer weights in linear-width two-layer networks produce a spiked random matrix with ⌊α₂/(1/2 − α₁)⌋ outliers, each a learned direction, and batch reuse allows capturing directions with information exponent exceeding one.

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes

cs.LG · 2026-05-07 · unverdicted · novelty 7.0 · 3 refs

Slingshot loss spikes are produced by low-precision arithmetic that breaks the zero-sum gradient constraint and drives exponential growth via Numerical Feature Inflation.

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

cs.AI · 2026-03-05 · conditional · novelty 7.0 · 2 refs

Grokking delay follows T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), derived from norm separation in regularized optimization and validated with high correlations across 293 runs.

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking

cs.LG · 2025-10-06 · unverdicted · novelty 7.0

EGD equalizes gradient speeds across singular directions, eliminating or shortening grokking plateaus on modular addition and sparse parity problems.

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

cs.CV · 2026-06-22 · unverdicted · novelty 5.0

SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds

cs.LG · 2026-05-10 · unverdicted · novelty 5.0

Grokking emerges near the model size where memorization timescale T_mem(P) intersects generalization timescale T_gen(P) on modular arithmetic.

A Model of Understanding in Deep Learning Systems

cs.AI · 2026-04-05 · unverdicted · novelty 5.0

Deep learning systems achieve systematic understanding through internal models tracking regularities but exhibit fractured understanding due to symbolic misalignment, lack of explicit reduction, and weak unification.

On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime

cs.LG · 2026-01-06 · unverdicted · novelty 5.0

Preconditioned gradient descent mitigates spectral bias and reduces grokking delays by enabling uniform parameter space exploration in the NTK regime, confirming grokking as a transition to the rich regime.

citing papers explorer

Showing 10 of 10 citing papers.

Progress measures for grokking via mechanistic interpretability
cs.LG · 2023-01-12 · accept · none · ref 51

Learning as Observable Matrix Dynamics: Diffusive Relaxations versus Phase Transitions
cs.LG · 2026-06-29 · unverdicted · none · ref 21

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
stat.ML · 2026-05-18 · unverdicted · none · ref 204 · 2 links

Grokking or Glitching? How Low-Precision Drives Slingshot Loss Spikes
cs.LG · 2026-05-07 · unverdicted · none · ref 1 · 3 links

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization
cs.AI · 2026-03-05 · conditional · none · ref 11 · 2 links

Egalitarian Gradient Descent: A Simple Approach to Accelerated Grokking
cs.LG · 2025-10-06 · unverdicted · none · ref 12

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
cs.CV · 2026-06-22 · unverdicted · none · ref 71

Model Capacity Determines Grokking through Competing Memorisation and Generalisation Speeds
cs.LG · 2026-05-10 · unverdicted · none · ref 34

A Model of Understanding in Deep Learning Systems
cs.AI · 2026-04-05 · unverdicted · none · ref 7

On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime
cs.LG · 2026-01-06 · unverdicted · none · ref 18
