pith. machine review for the scientific record.

arxiv: 2605.06352 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · stat.ML

Recognition: unknown

Topological Signatures of Grokking

Anthea Monod, Inés García-Redondo, Qiquan Wang, Yifan Tang

Pith reviewed 2026-05-08 13:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI stat.ML
keywords grokking · persistent homology · topological data analysis · modular arithmetic · embedding matrices · generalization · H1 persistence · representation learning

The pith

A sharp increase in first homology persistence signals grokking in neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors investigate grokking by examining the topology of point clouds formed from the embedding matrices of models learning modular arithmetic. They observe a consistent pattern: both the maximum and the total persistence of one-dimensional holes increase abruptly at the grokking point. Persistence diagrams show one dominant long-lived feature alongside increasingly organized secondary features that reflect the cyclic character of the task. The method gives a combined geometric and topological description of how representations form, integrating information across scales. Control experiments confirm that the pattern accompanies generalization rather than mere fitting of the training data.

Core claim

Using persistent homology on point clouds derived from the embedding matrices of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology (H1). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics, persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure.

What carries the argument

Persistent homology applied to point clouds from embedding matrices to track maximum and total persistence of H1 features during training.
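
A minimal sketch of this diagnostic, assuming the ripser.py implementation of Ripser (the Bauer software the paper cites) and a plain Euclidean point cloud with one row per token embedding; the paper's exact construction may differ.

    import numpy as np
    from ripser import ripser

    def h1_persistence_stats(embedding: np.ndarray):
        """Max and total persistence of H1 features in a Vietoris-Rips
        filtration built on the rows of an embedding matrix."""
        dgms = ripser(embedding, maxdim=1)["dgms"]
        h1 = dgms[1]                        # (birth, death) pairs of 1-cycles
        if len(h1) == 0:
            return 0.0, 0.0
        lifetimes = h1[:, 1] - h1[:, 0]     # persistence = death - birth
        return float(lifetimes.max()), float(lifetimes.sum())

Tracked across training checkpoints, a simultaneous sharp rise in both values is the proposed signature of grokking.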

Load-bearing premise

The observed increase in H1 persistence is caused by the shift to generalization rather than other training dynamics or the specific choice of point-cloud construction from embeddings.

What would settle it

Training models on the same modular arithmetic task while blocking generalization and observing whether the sharp increase in H1 persistence still occurs would test the claimed link.
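
A minimal sketch of such a control, assuming the modular-addition setup (inputs (a, b), labels (a + b) mod p) and a standard label-shuffling protocol; the paper's own permutation procedure (Figures 5, 11, and 13) may differ in detail.

    import numpy as np

    rng = np.random.default_rng(0)
    p = 197

    # All p*p input pairs and their true modular-addition labels.
    a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")
    y = ((a + b) % p).ravel()

    # Shuffle labels across examples: the task stays memorizable, but the
    # (a, b) -> label map no longer carries the cyclic group structure.
    y_shuffled = rng.permutation(y)

Training on (a, b) with y_shuffled while tracking H1 max/total persistence then asks the decisive question: if the sharp rise still appears without generalization, the signature is not specific to it.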

Figures

Figures reproduced from arXiv: 2605.06352 by Anthea Monod, Inés García-Redondo, Qiquan Wang, Yifan Tang.

Figure 1. Vietoris–Rips persistence diagrams computed from the learned representations of the …
Figure 2. Fourier-based mechanistic analysis of grokking on modular addition (p …
Figure 3. Transformer on modular addition, p = 197, averaged across seeds (±1 std. shaded). Top row: train/test accuracy (left) and mean pointwise LID on the test set (right). Bottom row: H1 max persistence (left) and H1 total persistence (right), both computed on the token embedding matrix. Both H1 metrics rise while LID drops simultaneously at grokking, across all training fractions. Layer-wise behavior. At the th…
Figure 4. MLP on modular addition, p = 197. Top: train/test accuracy. Middle row: H1 max (left) and H1 total persistence (right) at the token embedding layer (layer 0) — both rise at grokking for all three fractions, with H1 total showing the most dramatic increase across all settings. Bottom row: same metrics at the third hidden layer (layer 3) — H1 max rises further, and H1 total also increases modestly, suggestin…
Figure 5. Training dynamics under label permutation for the Transformer model. The top panel …
Figure 6. MNIST results, layer 1 (after the first hidden layer). Top row: train/test accuracy. Bottom …
Figure 7. Transformer on modular addition, p = 113. Same layout as …
Figure 8. Transformer on modular addition, p = 149. Same layout as …
Figure 9. MLP on modular addition, p = 113. Same layout as …
Figure 10. MLP on modular addition, p = 149. Same layout as …
Figure 11. Training dynamics under label permutation for the Transformer model. The top panel …
Figure 12. Cross-correlation functions (CCFs) between the first-order differences of test accuracy …
Figure 13. Training dynamics under label permutation for the MLP model. The top panel shows test …
Figure 14. Cross-correlation functions (CCFs) between the first-order differences of test accuracy …
Original abstract

We study the grokking phenomenon through the lens of topology. Using persistent homology on point clouds derived from the embedding matrices of a range of models trained on modular arithmetic with varying primes, we identify a clear and consistent topological signature of grokking: a sharp increase in both the maximum and total persistence of first homology ($H_1$). Persistence diagrams reveal the emergence of a dominant long-lived topological feature together with increasingly structured secondary features, reflecting the underlying cyclic structure of the task. Compared to existing spectral and geometric diagnostics -- specifically, Fourier analysis and local intrinsic dimension -- persistent homology provides a unified geometric and topological characterization of representation learning, capturing both local and global multi-scale structure. Ablations across data regimes and control settings show that these topological transitions are tied to generalization rather than memorization. Our results suggest that persistent homology offers a principled and interpretable framework for analyzing how neural networks internalize latent structure during training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper applies persistent homology to point clouds derived from embedding matrices of neural networks trained on modular arithmetic tasks (varying primes). It claims to identify a consistent topological signature of grokking: a sharp increase in both maximum and total H1 persistence at the generalization transition, with persistence diagrams showing a dominant long-lived feature plus structured secondary features that reflect the underlying cyclic modular structure. The work positions persistent homology as superior to Fourier analysis and local intrinsic dimension for capturing multi-scale geometric and topological changes, and uses ablations to argue the signature is tied to generalization rather than memorization.

Significance. If the H1 persistence increase is shown to be specifically caused by internalization of the cyclic structure (rather than optimization dynamics), the result would supply a new, interpretable diagnostic for representation learning that unifies local and global structure. The reported consistency across data regimes is a positive empirical observation. However, the absence of quantitative controls on point-cloud construction and statistical reporting currently limits the strength of the central claim.

major comments (2)
  1. [§3] §3 (point-cloud construction): The procedure for turning embedding matrices into point clouds is described only at a high level. No details are given on centering, norm normalization, fixed sampling size, or handling of vector magnitudes. Because the Vietoris–Rips filtration is sensitive to these choices, the observed rise in max/total H1 persistence could be driven by SGD-induced changes in embedding scale or density rather than by the emergence of cyclic topology, directly threatening the claim that the rise is a signature of grokking.
  2. [§4] §4 (results and ablations): The manuscript asserts 'consistent' patterns and 'ablations across data regimes' but reports neither the number of independent runs, standard deviations on persistence values, nor any statistical test for the claimed sharp increase. Without these quantities it is impossible to assess whether the topological transition is robust or could be an artifact of particular seeds or hyper-parameters, weakening the generalization-versus-memorization conclusion.
minor comments (2)
  1. [Abstract] Abstract and §2: The homology notation alternates between 'first homology (H1)' and $H_1$; a single consistent notation would improve readability.
  2. [Figures] Figure captions: Persistence diagrams would benefit from explicit annotation of the birth-death coordinates of the dominant long-lived feature so readers can directly verify the reported increase in persistence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. Below, we provide point-by-point responses to the major comments and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (point-cloud construction): The procedure for turning embedding matrices into point clouds is described only at a high level. No details are given on centering, norm normalization, fixed sampling size, or handling of vector magnitudes. Because the Vietoris–Rips filtration is sensitive to these choices, the observed rise in max/total H1 persistence could be driven by SGD-induced changes in embedding scale or density rather than by the emergence of cyclic topology, directly threatening the claim that the rise is a signature of grokking.

    Authors: We thank the referee for highlighting the need for greater methodological precision in §3. We agree that the original high-level description leaves room for ambiguity regarding potential artifacts from embedding scale or density. In the revised manuscript, we will expand the point-cloud construction subsection to explicitly detail: mean-centering of the embedding matrix, L2 normalization of each vector to unit length (to mitigate SGD-induced magnitude changes), fixed subsampling to 1000 points per cloud, and uniform application of these steps across epochs. We will further add a control ablation comparing normalized and unnormalized embeddings, showing that the H1 persistence rise occurs specifically under normalization and aligns with the generalization transition rather than optimization dynamics alone. These changes will directly address the sensitivity concern and strengthen the link to cyclic topology (this pipeline is sketched after the responses below). revision: yes

  2. Referee: [§4] §4 (results and ablations): The manuscript asserts 'consistent' patterns and 'ablations across data regimes' but reports neither the number of independent runs, standard deviations on persistence values, nor any statistical test for the claimed sharp increase. Without these quantities it is impossible to assess whether the topological transition is robust or could be an artifact of particular seeds or hyper-parameters, weakening the generalization-versus-memorization conclusion.

    Authors: We appreciate the referee's emphasis on quantitative statistical reporting in §4. While our experiments were conducted across multiple independent trials to support the consistency claims, these details were not included in the original submission. In the revised version, we will specify that all results are averaged over 5 independent random seeds per prime and data regime, with the sharp H1 persistence increase observed in every run. Standard deviations will be reported in the text and added as error bars to the relevant figures (typically <15% relative variation at the transition). We will also describe the alignment of the topological transition with generalization metrics across regimes and note its absence in memorization-only controls. Although no formal hypothesis tests were performed initially, we will include a robustness discussion; this will allow clearer evaluation of the generalization-versus-memorization distinction (one such alignment check is sketched below). revision: yes
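
A minimal sketch of the preprocessing the first response commits to, assuming exactly the stated steps (mean-centering, unit L2 norms, a fixed subsample of 1000 points applied uniformly across epochs); the function name and defaults are ours, not the authors'.

    import numpy as np

    def build_point_cloud(E: np.ndarray, n_points: int = 1000,
                          seed: int = 0) -> np.ndarray:
        """Embedding matrix (one row per token) -> normalized point cloud."""
        X = E - E.mean(axis=0, keepdims=True)    # mean-center
        norms = np.linalg.norm(X, axis=1, keepdims=True)
        X = X / (norms + 1e-12)                  # unit L2 norm removes
                                                 # SGD-driven scale growth
        if len(X) > n_points:                    # fixed-size subsample keeps
            rng = np.random.default_rng(seed)    # filtrations comparable
            X = X[rng.choice(len(X), size=n_points, replace=False)]
        return X

The promised ablation then amounts to comparing persistence curves computed from build_point_cloud(E) against those from raw E over the course of training.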
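A minimal sketch of the alignment check the second response refers to, assuming per-epoch series of test accuracy and H1 persistence and the cross-correlation-of-first-differences analysis shown in Figures 12 and 14; names and normalization choices are ours.

    import numpy as np

    def ccf_first_differences(acc: np.ndarray, pers: np.ndarray) -> np.ndarray:
        """Normalized cross-correlation between per-epoch changes in test
        accuracy and in H1 persistence; a peak near lag zero indicates the
        topological transition co-occurs with generalization."""
        da, dp = np.diff(acc), np.diff(pers)
        da = (da - da.mean()) / (da.std() + 1e-12)
        dp = (dp - dp.mean()) / (dp.std() + 1e-12)
        return np.correlate(da, dp, mode="full") / len(da)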

Circularity Check

0 steps flagged

No circularity: purely observational empirical study with external benchmarks

Full rationale

The manuscript presents an empirical analysis applying persistent homology to point clouds constructed from model embedding matrices on modular arithmetic tasks. It reports an observed sharp rise in max/total H1 persistence at the grokking transition, with persistence diagrams showing long-lived features, and compares this to Fourier analysis and local intrinsic dimension as independent diagnostics. Ablations across data regimes are invoked to link the topological change to generalization. No derivation, equation, or 'prediction' is claimed that reduces by construction to a fitted parameter, self-defined quantity, or load-bearing self-citation. The work contains no uniqueness theorems, ansatzes smuggled via prior author papers, or renaming of known results as new unification. The analysis is self-contained against external topological and geometric benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that persistent homology applied to embedding-derived point clouds yields a reliable marker of generalization. No free parameters are introduced or fitted in the abstract description. The main domain assumption is that the chosen point-cloud representation preserves the relevant topological features of the learned representations.

axioms (1)
  • domain assumption Persistent homology on point clouds derived from embedding matrices captures meaningful multi-scale topological features of the learned representations.
    Invoked when the authors interpret increases in H1 persistence as reflecting the cyclic structure of the modular arithmetic task.

pith-pipeline@v0.9.0 · 5466 in / 1320 out tokens · 41890 ms · 2026-05-08T13:00:36.034394+00:00 · methodology


Reference graph

Works this paper leans on

22 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1] R. Ballester, C. Casacuberta, and S. Escalera. Topological data analysis for neural network analysis: A comprehensive survey. arXiv preprint arXiv:2312.05840, 2023.

  2. [2] U. Bauer. Ripser: Efficient computation of Vietoris–Rips persistence barcodes. Journal of Applied and Computational Topology, 5(3):391–423, 2021.

  3. [3] B. C. Brown, J. Juravsky, A. L. Caterini, and G. Loaiza-Ganem. Relating regularization and generalization through the intrinsic dimension of activations. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).

  4. [4] B. W. Carvalho, A. S. Garcez, L. C. Lamb, and E. V. Brazil. Grokking explained: A statistical phenomenon. arXiv preprint arXiv:2502.01774, 2025.

  5. [5] K. Clauw, S. Stramaglia, and D. Marinazzo. Information-theoretic progress measures reveal grokking is an emergent phase transition. arXiv preprint arXiv:2408.08944, 2024.

  6. [6] B. DeMoss, S. Sapora, J. Foerster, N. Hawes, and I. Posner. The complexity dynamics of grokking. Physica D: Nonlinear Phenomena, page 134859, 2025.

  7. [7] E. Facco, M. d'Errico, A. Rodriguez, and A. Laio. Estimating the intrinsic dimension of datasets by a minimal neighborhood information. Scientific Reports, 7(1):12140, 2017.

  8. [8] A. I. Humayun, R. Balestriero, and R. Baraniuk. Deep networks always grok and here is why. arXiv preprint arXiv:2402.15555, 2024.

  9. [9] L. Kushnareva, D. Cherniavskii, V. Mikhailov, E. Artemova, S. Barannikov, A. Bernstein, I. Piontkovskaya, D. Piontkovski, and E. Burnaev. Artificial text detection via examining the topology of attention maps. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 635–649, 2021.

  10. [10] Y. LeCun and C. Cortes. MNIST handwritten digit database. 2010. URL http://yann.lecun.com/exdb/mnist/.

  11. [11] J. H. Lee, T. Jiralerspong, L. Yu, Y. Bengio, and E. Cheng. Geometric signatures of compositionality across a language model's lifetime. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5292–5320, 2025.

  12. [12] Z. Liu, E. J. Michaud, and M. Tegmark. Omnigrok: Grokking beyond algorithmic data. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zDiHoIWa0q1.

  13. [13] W. Merrill, N. Tsilivis, and A. Shukla. A tale of two circuits: Grokking as competition of sparse and dense subnetworks. In ICLR 2023 Workshop on Mathematical and Empirical Understanding of Foundation Models, 2023. URL https://openreview.net/forum?id=8GZxtu46Kx.

  14. [14] M. A. Mohamadi, Z. Li, L. Wu, and D. J. Sutherland. Why do you grok? A theoretical analysis of grokking modular addition. arXiv preprint arXiv:2407.12332, 2024.

  15. [15] N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW.

  16. [16] A. Power, Y. Burda, H. Edwards, I. Babuschkin, and V. Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022.

  17. [17] B. M. Ruppik, J. von Rohrscheidt, C. van Niekerk, M. Heck, R. Vukovic, S. Feng, H. chin Lin, N. Lubis, B. Rieck, M. Zibrowius, and M. Gasic. Less is more: Local intrinsic dimensions of contextual language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=dXqqFte3KT.

  18. [18] C. Tan, I. García-Redondo, Q. Wang, M. M. Bronstein, and A. Monod. On the limitations of fractal dimension as a measure of generalization. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 60309–60334. Curran Associates, Inc., 2024. doi: 10.52...

  19. [19] A. Uchendu and T. Le. Unveiling topological structures from language: A survey of topological data analysis applications in NLP. arXiv preprint arXiv:2411.10298, 2024.

  20. [20] S. Watanabe and H. Yamana. Topological measurement of deep neural networks using persistent homology. Annals of Mathematics and Artificial Intelligence, 90(1):75–92, 2022.

  21. [21] A. Yıldırım. The geometric inductive bias of grokking: Bypassing phase transitions via architectural topology. arXiv preprint arXiv:2603.05228, 2026.

  22. [22] X. Zheng, K. Daruwalla, A. S. Benjamin, and D. Klindt. Delays in generalization match delayed changes in representational geometry. In UniReps: 2nd Edition of the Workshop on Unifying Representations in Neural Models, 2024. URL https://openreview.net/forum?id=1ae108kHk2.