Recognition: 2 theorem links
The Geometry of Multi-Task Grokking: Transverse Instability, Superposition, and Weight Decay Phase Structure
Pith reviewed 2026-05-15 20:36 UTC · model grok-4.3
The pith
Multi-task grokking builds a compact superposition subspace in parameter space where weight decay supplies compression pressure and excess parameters add geometric redundancy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways. The supporting observations are staggered grokking order, universal integrability on an invariant manifold, systematic weight-decay phase structure, holographic incompressibility of the final weights, and transverse fragility offset by redundancy.
What carries the argument
The low-dimensional execution manifold that confines all optimization trajectories, together with the orthogonal commutator defects that reliably precede generalization.
If this is right
- Grokking timescale, curvature depth, and defect lead covary systematically with weight decay, producing distinct dynamical regimes.
- Final solutions occupy only 4-8 principal trajectory directions yet are destroyed by SVD truncation, magnitude pruning, or uniform scaling.
- Removal of less than 10 percent of orthogonal gradient components eliminates grokking, although dual-task models show partial recovery.
- Multiplication generalizes before squaring, which precedes addition, with consistent delays across random seeds.
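The incompressibility bullet above can be made concrete with a minimal numpy sketch of an SVD-truncation probe. Everything here is illustrative: the random matrix `W` stands in for one trained layer of the shared-trunk Transformer, and sizes are arbitrary; the paper's actual test would measure task accuracy after truncation, not just reconstruction error.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-in for one trained weight matrix; in the paper's setting
# this would come from the shared-trunk Transformer.
W = rng.standard_normal((128, 128))

def svd_truncate(W, k):
    """Reconstruct W keeping only its top-k singular directions."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k]

# Relative reconstruction error versus retained rank. The holographic-
# incompressibility claim is that task performance collapses under this
# truncation even though only 4-8 trajectory directions carry the solution.
for k in (4, 8, 32, 128):
    err = np.linalg.norm(W - svd_truncate(W, k)) / np.linalg.norm(W)
    print(f"rank {k}: relative error {err:.3f}")
```

The point of the probe is the mismatch it exposes: small Frobenius error at modest rank would normally suggest compressibility, yet the paper reports that accuracy is destroyed anyway.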
Where Pith is reading between the lines
- The redundancy supplied by excess parameters may confer robustness against transverse instabilities when the number of tasks increases.
- The observed staggered order could reflect algebraic complexity differences among the modular operations rather than model architecture alone.
- If the superposition subspace scales sublinearly with task count, multi-task training could remain parameter-efficient even for larger task sets.
Load-bearing premise
The low-dimensional execution manifold stays invariant during training and commutator defects always precede generalization in the modular tasks examined.
What would settle it
An experiment in which grokking occurs without any detectable commutator defects preceding it or in which the observed manifold dimensionality changes measurably during the transition.
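The second settling experiment, tracking manifold dimensionality through the transition, can be sketched as a PCA over flattened checkpoints. The synthetic trajectory below is an assumption standing in for real training snapshots; `T`, `D`, the five latent directions, and the variance threshold are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in for flattened training checkpoints (T snapshots, D params).
# A trajectory confined to an invariant low-dimensional execution manifold keeps
# its variance concentrated in a few principal directions at every stage.
T, D = 200, 1000
basis = rng.standard_normal((5, D))                 # 5 hidden directions
coeffs = rng.standard_normal((T, 5))
checkpoints = coeffs @ basis + 0.001 * rng.standard_normal((T, D))

def effective_dim(X, var_frac=0.99):
    """Number of principal directions explaining var_frac of the variance."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    ratios = np.cumsum(s**2) / np.sum(s**2)
    return int(np.searchsorted(ratios, var_frac)) + 1

# The falsification test: does dimensionality stay fixed across training stages?
print(effective_dim(checkpoints[:100]), effective_dim(checkpoints[100:]))
```

A measurable jump in `effective_dim` between the pre- and post-grokking windows would falsify the invariance premise; constancy across windows, seeds, and widths would be the necessary (though still not sufficient) evidence the referee asks for below.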
Original abstract
Grokking -- the abrupt transition from memorization to generalization long after near-zero training loss -- has been studied mainly in single-task settings. We extend geometric analysis to multi-task modular arithmetic, training shared-trunk Transformers on dual-task (mod-add + mod-mul) and tri-task (mod-add + mod-mul + mod-sq) objectives across a systematic weight decay sweep. Five consistent phenomena emerge. (1) Staggered grokking order: multiplication generalizes first, followed by squaring, then addition, with consistent delays across seeds. (2) Universal integrability: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization. (3) Weight decay phase structure: grokking timescale, curvature depth, reconstruction threshold, and defect lead covary systematically with weight decay, revealing distinct dynamical regimes and a sharp no-decay failure mode. (4) Holographic incompressibility: final solutions occupy only 4--8 principal trajectory directions yet are distributed across full-rank weights and destroyed by minimal perturbations; SVD truncation, magnitude pruning, and uniform scaling all fail to preserve performance. (5) Transverse fragility and redundancy: removing less than 10% of orthogonal gradient components eliminates grokking, yet dual-task models exhibit partial recovery under extreme deletion, suggesting redundant center manifolds enabled by overparameterization. Together, these results support a dynamical picture in which multi-task grokking constructs a compact superposition subspace in parameter space, with weight decay acting as compression pressure and excess parameters supplying geometric redundancy in optimization pathways.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends single-task grokking analysis to multi-task modular arithmetic (dual-task mod-add + mod-mul, and tri-task mod-add + mod-mul + mod-sq) using shared-trunk Transformers. It reports five empirical phenomena across weight-decay sweeps: staggered generalization order, confinement to an invariant low-dimensional execution manifold with orthogonal commutator defects preceding generalization, systematic weight-decay phase structure in timescales and curvature, holographic incompressibility of final solutions (4-8 principal directions yet full-rank and fragile to perturbation), and transverse fragility with partial redundancy from overparameterization. These are interpreted as evidence for construction of a compact superposition subspace under weight-decay compression.
Significance. If the reported patterns are robust and the manifold invariance holds under controlled interventions, the work supplies a concrete geometric mechanism linking weight decay, parameter redundancy, and multi-task generalization. The phase-structure and holographic-incompressibility observations are particularly novel and could guide regularization design in overparameterized models.
major comments (2)
- [Abstract (2)] Abstract, phenomenon (2): the assertion that commutator defects 'reliably precede generalization' and support a mechanistic superposition-subspace picture rests on observational correlation across seeds and sweeps. No ablation, projection, or regularizer is described that selectively suppresses or amplifies these defects while holding other trajectory statistics fixed; without such tests the defects could be downstream consequences rather than load-bearing drivers of transverse instability.
- [Abstract (3),(5)] Abstract, phenomenon (3) and (5): the claimed 'universal integrability' and 'transverse fragility' require that the low-dimensional execution manifold remains invariant across the tested scales and tasks. The manuscript provides no quantitative test (e.g., distance to the manifold under controlled perturbations or across model widths) that would falsify invariance; the reported consistency across seeds is necessary but not sufficient for the geometric claim.
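The intervention requested in the first major comment could take the form of a projection ablation on gradients. The sketch below is one hypothetical shape such a test might take, not the paper's procedure: `Q`, the shapes, and `keep_frac` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical projection ablation: given an orthonormal basis Q for the
# execution manifold (e.g. top trajectory PCs), split each gradient into an
# in-manifold part and a transverse remainder, then attenuate the remainder
# while leaving the in-manifold component untouched.
D, k = 50, 4
Q, _ = np.linalg.qr(rng.standard_normal((D, k)))    # D x k orthonormal basis

def ablate_transverse(grad, Q, keep_frac=0.9):
    in_plane = Q @ (Q.T @ grad)       # component along the manifold
    transverse = grad - in_plane      # orthogonal 'defect' direction
    return in_plane + keep_frac * transverse

g = rng.standard_normal(D)
g_ablated = ablate_transverse(g, Q)
# In-manifold coordinates are preserved exactly; only the transverse norm shrinks.
print(np.allclose(Q.T @ g, Q.T @ g_ablated))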
minor comments (2)
- [Abstract] The abstract and described results omit error bars, exact seed counts, precise hyperparameter ranges, and data-exclusion criteria; these details are needed to evaluate whether fitting choices affect the reported phase boundaries and defect-lead times.
- [Methods] Notation for 'commutator defects' and 'principal trajectory directions' is introduced without an explicit definition or reference to the underlying SVD or Lie-bracket construction; a short methods subsection would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below with clarifications based on the presented evidence and indicate revisions where they strengthen the claims without misrepresenting the work.
Point-by-point responses
- Referee: [Abstract (2)] Abstract, phenomenon (2): the assertion that commutator defects 'reliably precede generalization' and support a mechanistic superposition-subspace picture rests on observational correlation across seeds and sweeps. No ablation, projection, or regularizer is described that selectively suppresses or amplifies these defects while holding other trajectory statistics fixed; without such tests the defects could be downstream consequences rather than load-bearing drivers of transverse instability.
Authors: We agree that the evidence for commutator defects as load-bearing drivers is observational, drawn from consistent patterns across multiple random seeds and weight-decay sweeps in the multi-task setting. The manuscript shows these defects appearing orthogonally to the execution manifold prior to generalization in all tested configurations, supporting the superposition-subspace interpretation. However, we have not performed selective ablations or interventions that hold other statistics fixed. We will revise the abstract and add a discussion section explicitly characterizing the evidence as correlational while outlining targeted future ablation experiments to test causality. revision: partial
- Referee: [Abstract (3),(5)] Abstract, phenomenon (3) and (5): the claimed 'universal integrability' and 'transverse fragility' require that the low-dimensional execution manifold remains invariant across the tested scales and tasks. The manuscript provides no quantitative test (e.g., distance to the manifold under controlled perturbations or across model widths) that would falsify invariance; the reported consistency across seeds is necessary but not sufficient for the geometric claim.
Authors: The referee is correct that invariance is supported by empirical consistency of trajectory confinement across seeds, tasks (dual- and tri-task), and scales rather than explicit falsification via perturbation distances or width variations. The full manuscript demonstrates this through PCA projections and manifold reconstruction metrics, but lacks quantitative tests such as measuring deviation under controlled perturbations. We will incorporate such quantitative invariance tests, including perturbation-based distance metrics and width-scaling experiments, into a revised version to provide stronger geometric validation. revision: yes
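The perturbation-based distance metric promised in this response admits a minimal form. The sizes and the synthetic basis below are illustrative assumptions; in practice `Q` would be fitted to real checkpoints by PCA.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sizes; in practice Q would be fitted to real trajectory data.
D, k = 400, 6
Q, _ = np.linalg.qr(rng.standard_normal((D, k)))    # stand-in manifold basis

def manifold_distance(theta, Q):
    """Norm of the component of theta orthogonal to span(Q)."""
    return np.linalg.norm(theta - Q @ (Q.T @ theta))

theta = Q @ rng.standard_normal(k)                  # a point on the manifold
for eps in (0.0, 0.01, 0.1):
    perturbed = theta + eps * rng.standard_normal(D)
    print(f"eps={eps}: distance {manifold_distance(perturbed, Q):.4f}")
```

Tracking this distance for perturbed checkpoints, and across model widths, is the kind of quantitative falsification test the referee requests: invariance predicts that unperturbed trajectories keep the distance near zero throughout training.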
Circularity Check
No circularity: results are direct empirical observations from training runs
Full rationale
The paper reports five consistent phenomena observed across training runs on shared-trunk Transformers for multi-task modular arithmetic, including staggered grokking order, confinement to an empirically invariant low-dimensional execution manifold, weight decay phase structure, holographic incompressibility, and transverse fragility. These are presented as patterns from systematic sweeps over seeds, tasks, and weight decay values, with no derivation chain, equations, or self-citations that reduce a claimed prediction or first-principles result to fitted inputs or prior author work by construction. The central dynamical picture is framed as supported by these observations rather than any tautological redefinition or imported uniqueness theorem.
Axiom & Free-Parameter Ledger
free parameters (1)
- weight decay coefficient
axioms (1)
- Domain assumption: optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold.
invented entities (1)
- compact superposition subspace (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "optimization trajectories remain confined to an empirically invariant low-dimensional execution manifold; commutator defects orthogonal to this manifold reliably precede generalization"
- IndisputableMonolith/Foundation/BranchSelection.lean: branch_selection (unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "weight decay acting as compression pressure and excess parameters supplying geometric redundancy"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression
  The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap f...
- Spectral Edge Dynamics Reveal Functional Modes of Learning
  Spectral edge dynamics during grokking reveal task-dependent low-dimensional functional modes over inputs, such as Fourier modes for modular addition and cross-term decompositions for x squared plus y squared.
- Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training
  Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.
- Gradient-Direction Sensitivity Reveals Linear-Centroid Coupling Hidden by Optimizer Trajectories
  Gradient-based SVD diagnostic uncovers hidden SED-LCH coupling in single and multitask settings and shows rank-3 subspace constraints speed up grokking by 2.3x.
Reference graph
Works this paper leans on
- [1] Xander Davies, Lauro Langosco, and David Krueger. Unifying grokking and double descent. arXiv preprint arXiv:2303.06173.
- [2] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. In Advances in Neural Information Processing Systems, volume 32.
- [3] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
- [4] Tanishq Kumar, Blake Bordelon, Samuel J. Gershman, and Cengiz Pehlevan. Grokking as the transition from lazy to rich training dynamics. arXiv preprint arXiv:2310.06110.
- [5] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsic dimension of objective landscapes. In International Conference on Learning Representations, 2018.
  Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Advances in Neural Information Processing Systems.
- [6] Kaifeng Lyu, Jikai Jin, Zhiyuan Li, Simon S. Du, Jason D. Lee, and Wei Hu. Dichotomy of early and late phase implicit biases can provably induce grokking. arXiv preprint arXiv:2311.18817.
- [7] William Merrill, Nikolaos Tsilivis, and Aman Shukla. A tale of two circuits: Grokking as competition of sparse and dense subnetworks. arXiv preprint arXiv:2303.11873.
- [9] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217.
- [10] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. In ICLR 2022 Workshop on MATH-AI.
- [11] Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817.
- [12] Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Neel Nanda. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.
- [13] Yongzhong Xu. Early-warning signals of grokking via loss-landscape geometry. arXiv preprint arXiv:2602.16967, 2026a.
  Yongzhong Xu. Low-dimensional and transversely curved optimization dynamics in grokking. arXiv preprint arXiv:2602.16746, 2026b.
  Yongzhong Xu. Low-dimensional execution...