pith. machine review for the scientific record.

arxiv: 2605.06352 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · stat.ML

Recognition: unknown

Topological Signatures of Grokking

Anthea Monod, Inés García-Redondo, Qiquan Wang, Yifan Tang

Pith reviewed 2026-05-08 13:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI stat.ML
keywords grokking · persistent homology · topological data analysis · modular arithmetic · embedding matrices · generalization · H1 persistence · representation learning

The pith

A sharp increase in first homology persistence signals grokking in neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors investigate grokking by examining the topology of point clouds formed from the embedding matrices of models learning modular arithmetic. They observe a consistent pattern: both the maximum and the total persistence of one-dimensional holes increase abruptly at the grokking point. Persistence diagrams show one dominant long-lived feature alongside increasingly organized secondary features that reflect the cyclic character of the task. The method gives a combined geometric and topological description of how representations form, integrating information across scales. Control experiments confirm that the pattern accompanies generalization rather than mere fitting of the training data.

Core claim

Using persistent homology on point clouds derived from the embedding matrices of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology (H1). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics, persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure.

What carries the argument

Persistent homology applied to point clouds from embedding matrices to track maximum and total persistence of H1 features during training.
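
A minimal sketch of this diagnostic, assuming the ripser.py implementation of Ripser (the Bauer software the paper cites) and a plain Euclidean point cloud with one row per token embedding; the paper's exact construction may differ.

    import numpy as np
    from ripser import ripser

    def h1_persistence_stats(embedding: np.ndarray):
        """Max and total persistence of H1 features in a Vietoris-Rips
        filtration built on the rows of an embedding matrix."""
        dgms = ripser(embedding, maxdim=1)["dgms"]
        h1 = dgms[1]                        # (birth, death) pairs of 1-cycles
        if len(h1) == 0:
            return 0.0, 0.0
        lifetimes = h1[:, 1] - h1[:, 0]     # persistence = death - birth
        return float(lifetimes.max()), float(lifetimes.sum())

Tracked across training checkpoints, a simultaneous sharp rise in both values is the proposed signature of grokking.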

Load-bearing premise

The observed increase in H1 persistence is caused by the shift to generalization rather than other training dynamics or the specific choice of point-cloud construction from embeddings.

What would settle it

Training models on the same modular arithmetic task while blocking generalization and observing whether the sharp increase in H1 persistence still occurs would test the claimed link.
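
A minimal sketch of such a control, assuming the modular-addition setup (inputs (a, b), labels (a + b) mod p) and a standard label-shuffling protocol; the paper's own permutation procedure (Figures 5, 11, and 13) may differ in detail.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 197

    # All p*p input pairs and their true modular-addition labels.
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    y = ((a + b) % p).ravel()

    # Shuffle labels across examples: the task stays memorizable, but the
    # (a, b) -> label map no longer carries the cyclic group structure.
    y_shuffled = rng.permutation(y)

Training on (a, b) with y_shuffled while tracking H1 max/total persistence then asks the decisive question: if the sharp rise still appears without generalization, the signature is not specific to it.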

Figures

Figures reproduced from arXiv: 2605.06352 by Anthea Monod, Inés García-Redondo, Qiquan Wang, Yifan Tang.

Figure 1. Vietoris–Rips persistence diagrams computed from the learned representations of the …
Figure 2. Fourier-based mechanistic analysis of grokking on modular addition (p …
Figure 3. Transformer on modular addition, p = 197, averaged across seeds (±1 std. shaded). Top row: train/test accuracy (left) and mean pointwise LID on the test set (right). Bottom row: H1 max persistence (left) and H1 total persistence (right), both computed on the token embedding matrix. Both H1 metrics rise while LID drops simultaneously at grokking, across all training fractions. Layer-wise behavior. At the th…
Figure 4. MLP on modular addition, p = 197. Top: train/test accuracy. Middle row: H1 max (left) and H1 total persistence (right) at the token embedding layer (layer 0) — both rise at grokking for all three fractions, with H1 total showing the most dramatic increase across all settings. Bottom row: same metrics at the third hidden layer (layer 3) — H1 max rises further, and H1 total also increases modestly, suggestin…
Figure 5. Training dynamics under label permutation for the Transformer model. The top panel …
Figure 6. MNIST results, layer 1 (after the first hidden layer). Top row: train/test accuracy. Bottom …
Figure 7. Transformer on modular addition, p = 113. Same layout as …
Figure 8. Transformer on modular addition, p = 149. Same layout as …
Figure 9. MLP on modular addition, p = 113. Same layout as …
Figure 10. MLP on modular addition, p = 149. Same layout as …
Figure 11. Training dynamics under label permutation for the Transformer model. The top panel …
Figure 12. Cross-correlation functions (CCFs) between the first-order differences of test accuracy …
Figure 13. Training dynamics under label permutation for the MLP model. The top panel shows test …
Figure 14. Cross-correlation functions (CCFs) between the first-order differences of test accuracy …
Original abstract

We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper applies persistent homology to point clouds derived from embedding matrices of neural networks trained on modular arithmetic tasks (varying primes). It claims to identify a consistent topological signature of grokking: a sharp increase in both maximum and total H1 persistence at the generalization transition, with persistence diagrams showing a dominant long-lived feature plus structured secondary features that reflect the underlying cyclic modular structure. The work positions persistent homology as superior to Fourier analysis and local intrinsic dimension for capturing multi-scale geometric and topological changes, and uses ablations to argue the signature is tied to generalization rather than memorization.

Significance. If the H1 persistence increase is shown to be specifically caused by internalization of the cyclic structure (rather than optimization dynamics), the result would supply a new, interpretable diagnostic for representation learning that unifies local and global structure. The reported consistency across data regimes is a positive empirical observation. However, the absence of quantitative controls on point-cloud construction and statistical reporting currently limits the strength of the central claim.

major comments (2)
  1. [§3] §3 (point-cloud construction): The procedure for turning embedding matrices into point clouds is described only at a high level. No details are given on centering, norm normalization, fixed sampling size, or handling of vector magnitudes. Because the Vietoris–Rips filtration is sensitive to these choices, the observed rise in max/total H1 persistence could be driven by SGD-induced changes in embedding scale or density rather than by the emergence of cyclic topology, directly threatening the claim that the rise is a signature of grokking.
  2. [§4] §4 (results and ablations): The manuscript asserts 'consistent' patterns and 'ablations across data regimes' but reports neither the number of independent runs, standard deviations on persistence values, nor any statistical test for the claimed sharp increase. Without these quantities it is impossible to assess whether the topological transition is robust or could be an artifact of particular seeds or hyper-parameters, weakening the generalization-versus-memorization conclusion.
minor comments (2)
  1. [Abstract] Abstract and §2: The homology notation alternates between 'first homology (H1)' and $H_1$; a single consistent notation would improve readability.
  2. [Figures] Figure captions: Persistence diagrams would benefit from explicit annotation of the birth-death coordinates of the dominant long-lived feature so readers can directly verify the reported increase in persistence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (point-cloud construction): The procedure for turning embedding matrices into point clouds is described only at a high level. No details are given on centering, norm normalization, fixed sampling size, or handling of vector magnitudes. Because the Vietoris–Rips filtration is sensitive to these choices, the observed rise in max/total H1 persistence could be driven by SGD-induced changes in embedding scale or density rather than by the emergence of cyclic topology, directly threatening the claim that the rise is a signature of grokking.

    Authors: We thank the referee for highlighting the need for greater methodological precision in §3. We agree that the original high-level description leaves room for ambiguity regarding potential artifacts from embedding scale or density. In the revised manuscript, we will expand the point-cloud construction subsection to explicitly detail: mean-centering of the embedding matrix, L2 normalization of each vector to unit length (to mitigate SGD-induced magnitude changes), fixed subsampling to 1000 points per cloud, and uniform application of these steps across epochs. We will further add a control ablation comparing normalized and unnormalized embeddings, showing that the H1 persistence rise occurs specifically under normalization and aligns with the generalization transition rather than optimization dynamics alone. These changes will directly address the sensitivity concern and strengthen the link to cyclic topology (this pipeline is sketched after the responses below). revision: yes

  2. Referee: [§4] §4 (results and ablations): The manuscript asserts 'consistent' patterns and 'ablations across data regimes' but reports neither the number of independent runs, standard deviations on persistence values, nor any statistical test for the claimed sharp increase. Without these quantities it is impossible to assess whether the topological transition is robust or could be an artifact of particular seeds or hyper-parameters, weakening the generalization-versus-memorization conclusion.

    Authors: We appreciate the referee's emphasis on quantitative statistical reporting in §4. While our experiments were conducted across multiple independent trials to support the consistency claims, these details were not included in the original submission. In the revised version, we will specify that all results are averaged over 5 independent random seeds per prime and data regime, with the sharp H1 persistence increase observed in every run. Standard deviations will be reported in the text and added as error bars to the relevant figures (typically <15% relative variation at the transition). We will also describe the alignment of the topological transition with generalization metrics across regimes and note its absence in memorization-only controls. Although no formal hypothesis tests were performed initially, we will include a robustness discussion; this will allow clearer evaluation of the generalization-versus-memorization distinction (one such alignment check is sketched below). revision: yes
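
A minimal sketch of the preprocessing the first response commits to, assuming exactly the stated steps (mean-centering, unit L2 norms, a fixed subsample of 1000 points applied uniformly across epochs); the function name and defaults are ours, not the authors'.

    import numpy as np

    def build_point_cloud(E: np.ndarray, n_points: int = 1000,
                          seed: int = 0) -> np.ndarray:
        """Embedding matrix (one row per token) -> normalized point cloud."""
        X = E - E.mean(axis=0, keepdims=True)    # mean-center
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / (norms + 1e-12)                  # unit L2 norm removes
                                                 # SGD-driven scale growth
        if len(X) > n_points:                    # fixed-size subsample keeps
            rng = np.random.default_rng(seed)    # filtrations comparable
            X = X[rng.choice(len(X), size=n_points, replace=False)]
        return X

The promised ablation then amounts to comparing persistence curves computed from build_point_cloud(E) against those from raw E over the course of training.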
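A minimal sketch of the alignment check the second response refers to, assuming per-epoch series of test accuracy and H1 persistence and the cross-correlation-of-first-differences analysis shown in Figures 12 and 14; names and normalization choices are ours.

    import numpy as np

    def ccf_first_differences(acc: np.ndarray, pers: np.ndarray) -> np.ndarray:
        """Normalized cross-correlation between per-epoch changes in test
        accuracy and in H1 persistence; a peak near lag zero indicates the
        topological transition co-occurs with generalization."""
        da, dp = np.diff(acc), np.diff(pers)
        da = (da - da.mean()) / (da.std() + 1e-12)
        dp = (dp - dp.mean()) / (dp.std() + 1e-12)
        return np.correlate(da, dp, mode="full") / len(da)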

Circularity Check

0 steps flagged

No circularity: purely observational empirical study with external benchmarks

Full rationale

The manuscript presents an empirical analysis applying persistent homology to point clouds constructed from model embedding matrices on modular arithmetic tasks. It reports an observed sharp rise in max/total H1 persistence at the grokking transition, with persistence diagrams showing long-lived features, and compares this to Fourier analysis and local intrinsic dimension as independent diagnostics. Ablations across data regimes are invoked to link the topological change to generalization. No derivation, equation, or 'prediction' is claimed that reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation. The work contains no uniqueness theorems, ansatzes smuggled via prior author papers, or renaming of known results as new unification. The analysis is self-contained against external topological and geometric benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that persistent homology applied to embedding-derived point clouds yields a reliable marker of generalization. No free parameters are introduced or fitted in the abstract description. The main domain assumption is that the chosen point-cloud representation preserves the relevant topological features of the learned representations.

axioms (1)
  • domain assumption Persistent homology on point clouds derived from embedding matrices captures meaningful multi-scale topological features of the learned representations.
    Invoked when the authors interpret increases in H1 persistence as reflecting the cyclic structure of the modular arithmetic task.

pith-pipeline@v0.9.0 · 5466 in / 1320 out tokens · 41890 ms · 2026-05-08T13:00:36.034394+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1] R. Ballester, C. Casacuberta, and S. Escalera. Topological data analysis for neural network analysis: A comprehensive survey. arXiv preprint arXiv:2312.05840, 2023.

  2. [2] U. Bauer. Ripser: Efficient computation of Vietoris–Rips persistence barcodes. Journal of Applied and Computational Topology, 5(3):391–423, 2021.

  3. [3] B. C. Brown, J. Juravsky, A. L. Caterini, and G. Loaiza-Ganem. Relating regularization and generalization through the intrinsic dimension of activations. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).

  4. [4] B. W. Carvalho, A. S. Garcez, L. C. Lamb, and E. V. Brazil. Grokking explained: A statistical phenomenon. arXiv preprint arXiv:2502.01774, 2025.

  5. [5] K. Clauw, S. Stramaglia, and D. Marinazzo. Information-theoretic progress measures reveal grokking is an emergent phase transition. arXiv preprint arXiv:2408.08944, 2024.

  6. [6] B. DeMoss, S. Sapora, J. Foerster, N. Hawes, and I. Posner. The complexity dynamics of grokking. Physica D: Nonlinear Phenomena, page 134859, 2025.

  7. [7] E. Facco, M. d'Errico, A. Rodriguez, and A. Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, 2017.

  8. [8] A. I. Humayun, R. Balestriero, and R. Baraniuk. Deep networks always grok and here is why. arXiv preprint arXiv:2402.15555, 2024.

  9. [9] L. Kushnareva, D. Cherniavskii, V. Mikhailov, E. Artemova, S. Barannikov, A. Bernstein, I. Piontkovskaya, D. Piontkovski, and E. Burnaev. Artificial text detection via examining the topology of attention maps. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 635–649, 2021.

  10. [10] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

  11. [11] J. H. Lee, T. Jiralerspong, L. Yu, Y. Bengio, and E. Cheng. Geometric signatures of compositionality across a language model's lifetime. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5292–5320, 2025.

  12. [12] Z. Liu, E. J. Michaud, and M. Tegmark. Omnigrok: Grokking beyond algorithmic data. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zDiHoIWa0q1.

  13. [13] W. Merrill, N. Tsilivis, and A. Shukla. A tale of two circuits: Grokking as competition of sparse and dense subnetworks. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023. URL https://openreview.net/forum?id=8GZxtu46Kx.

  14. [14] M. A. Mohamadi, Z. Li, L. Wu, and D. J. Sutherland. Why do you grok? A theoretical analysis of grokking modular addition. arXiv preprint arXiv:2407.12332, 2024.

  15. [15] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.

  16. [16] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.

  17. [17] B. M. Ruppik, J. von Rohrscheidt, C. van Niekerk, M. Heck, R. Vukovic, S. Feng, H. chin Lin, N. Lubis, B. Rieck, M. Zibrowius, and M. Gasic. Less is more: Local intrinsic dimensions of contextual language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=dXqqFte3KT.

  18. [18] C. Tan, I. García-Redondo, Q. Wang, M. M. Bronstein, and A. Monod. On the limitations of fractal dimension as a measure of generalization. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 60309–60334. Curran Associates, Inc., 2024. doi: 10.52...

  19. [19] A. Uchendu and T. Le. Unveiling topological structures from language: A survey of topological data analysis applications in NLP. arXiv preprint arXiv:2411.10298, 2024.

  20. [20] S. Watanabe and H. Yamana. Topological measurement of deep neural networks using persistent homology. Annals of Mathematics and Artificial Intelligence, 90(1):75–92, 2022.

  21. [21] A. Yıldırım. The geometric inductive bias of grokking: Bypassing phase transitions via architectural topology. arXiv preprint arXiv:2603.05228, 2026.

  22. [22] X. Zheng, K. Daruwalla, A. S. Benjamin, and D. Klindt. Delays in generalization match delayed changes in representational geometry. In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models, 2024. URL https://openreview.net/forum?id=1ae108kHk2.