
arxiv: 2606.08985 · v1 · pith:VXQLT3YInew · submitted 2026-06-08 · 💻 cs.LG

Beyond Neural Collapse: Task-Intrinsic Geometry Governs Neural Representations in Modular Arithmetic

Pith reviewed 2026-06-27 17:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords: neural collapse, modular arithmetic, cyclic geometry, neural representations, phase alignment, grokking, task-intrinsic geometry, rank-2 embeddings

The pith

Modular arithmetic tasks organize neural representations into cyclic circles on a plane rather than the simplex of neural collapse.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Neural collapse predicts that balanced K-class problems end with representations forming a (K-1)-dimensional simplex equiangular tight frame. This paper instead shows that modular addition produces a two-dimensional cyclic geometry in which both classifier weights and token embeddings sit on circles. The process begins with downstream classifier weights settling into a rank-2 equiangular plane under cross-entropy gradients. Once that plane forms, backpropagated gradients and weight decay lock upstream embeddings inside it. Within the plane, the dynamics reduce to entropy-regularized transport on the circle, whose solutions are the single-frequency characters of the modular group and therefore place points at equal angles. The cyclic arrangement gains a Theta(K) advantage under Schatten or weight-decay regularizers while the simplex gains only O(1) in cross-entropy, so the cyclic solution wins once the regularization strength exceeds a critical threshold of order 1/K.

Core claim

The paper claims that task-intrinsic geometry in modular addition produces a cyclic rank-2 configuration on circles instead of neural collapse. Classifier weights first reach a rank-2 equiangular plane driven by dense cross-entropy gradients. Backpropagated feature gradients then confine embeddings to the same plane while weight decay removes orthogonal components. The resulting in-plane dynamics admit an entropy-regularized transport interpretation on the circle that, combined with modular-addition labels, reduces to phase alignment whose minimizers are single-frequency characters of Z/PZ. This cyclic solution prevails because it enjoys a Theta(K) advantage under Schatten or weight-decay su

What carries the argument

The entropy-regularized transport on S^1 that reduces embedding formation to phase alignment with single-frequency characters of Z/PZ after subspace locking into a rank-2 plane.
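
As a minimal illustration of what phase alignment with a single-frequency character means, the sketch below places each residue a in Z/PZ at angle 2πka/P on the unit circle and checks that adding angles reproduces modular addition. The modulus `P = 7`, the frequency `k = 1`, and the helper names are assumptions chosen for illustration. The decoding rule (argmax of the inner product with the class points) is a hypothetical stand-in for a trained classifier, not the paper's architecture.

```python
import numpy as np

def character_points(P, k=1):
    """Return the P points exp(2*pi*i*k*a/P), a = 0..P-1, as complex numbers."""
    a = np.arange(P)
    return np.exp(2j * np.pi * k * a / P)

def decode_sum(P, k=1):
    """Predict (a + b) mod P for every pair (a, b) by multiplying characters.

    Hypothetical decoder: the feature for (a, b) is chi(a) * chi(b), and the
    predicted class c maximizes Re(feature * conj(chi(c))), which equals
    cos(2*pi*k*(a + b - c)/P).
    """
    chi = character_points(P, k)
    feats = chi[:, None] * chi[None, :]              # shape (P, P); angles add
    logits = np.real(feats[:, :, None] * np.conj(chi)[None, None, :])
    return logits.argmax(axis=-1)                    # shape (P, P)

P = 7
pred = decode_sum(P, k=1)
a, b = np.meshgrid(np.arange(P), np.arange(P), indexing="ij")
target = (a + b) % P
print((pred == target).all())
```

The decoder only has a unique maximizer when `k` is coprime to `P`; for other frequencies several residues share the same point on the circle.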

If this is right

  • Classifier weights reach a rank-2 equiangular configuration before upstream embeddings fully reorganize.
  • Once the plane forms, backpropagated gradients constrain embeddings to it and weight decay suppresses orthogonal components.
  • The cyclic solution gains a Theta(K) advantage under weight-decay or Schatten surrogates while the simplex gains only O(1) in cross-entropy.
  • Grokking on modular arithmetic is governed by a task-structured trade-off between separation, symmetry, and complexity rather than maximal separation alone.
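
One way to get a feel for the Theta(K) gap in the third bullet is to compare the nuclear (Schatten-1) norm of a simplex ETF with that of K equal-angle points on a circle. This is a rough sketch under assumptions that may differ from the paper's setup: both frames are normalized to unit-norm columns, the number of classes equals the number of points, and the helper names are invented here. With unit-norm columns the two frames have the same Frobenius norm, so the sketch illustrates only the Schatten-1 case.

```python
import numpy as np

def simplex_etf(K):
    """K unit-norm columns in R^K forming a simplex ETF (rank K-1)."""
    M = np.eye(K) - np.ones((K, K)) / K
    return np.sqrt(K / (K - 1)) * M

def cyclic_frame(K):
    """K unit-norm columns in R^2 at equal angles on the circle (rank 2)."""
    theta = 2 * np.pi * np.arange(K) / K
    return np.stack([np.cos(theta), np.sin(theta)])

def nuclear_norm(W):
    """Sum of singular values (Schatten-1 norm)."""
    return np.linalg.svd(W, compute_uv=False).sum()

for K in (8, 32, 128):
    gap = nuclear_norm(simplex_etf(K)) - nuclear_norm(cyclic_frame(K))
    print(K, round(gap / K, 3))   # gap per class; approaches 1 as K grows
```

In closed form the two norms are sqrt(K(K-1)) and sqrt(2K), so the difference grows linearly in K under this normalization.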

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layerwise locking and phase-alignment mechanism may appear in other group-structured tasks whose labels possess modular or cyclic symmetry.
  • Varying the modulus P while holding regularization fixed could test whether the number of equal-angle points on the circle scales directly with group order.
  • Experiments that deliberately strengthen or weaken the entropy-regularization term inside the plane could switch the learned geometry between cyclic and simplex regimes.

Load-bearing premise

The induced in-plane dynamics admit an entropy-regularized transport interpretation on the circle whose minimizers are exactly the single-frequency characters of the modular group.

What would settle it

Train networks on modular addition under weight decay, then measure (a) whether the embeddings lie at equal angles on a circle and (b) whether the claimed Theta(K) versus O(1) advantage between cyclic and simplex solutions holds as the regularization strength crosses Theta(1/K).
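
The first of these measurements could be sketched as follows. The function `circle_diagnostics` and its three summary statistics are hypothetical choices made here, not quantities defined in the paper, and the embedding matrix is synthetic rather than taken from a trained network.

```python
import numpy as np

def circle_diagnostics(E):
    """Summarize how close the rows of E (one per token) are to equal-angle
    points on a circle in their top-2 singular plane.

    Returns (energy_in_plane, radius_cv, gap_cv), where "cv" is the
    coefficient of variation (std / mean). A perfect cyclic geometry gives
    values close to (1, 0, 0).
    """
    E = E - E.mean(axis=0, keepdims=True)
    U, S, Vt = np.linalg.svd(E, full_matrices=False)
    energy = (S[:2] ** 2).sum() / (S ** 2).sum()
    coords = E @ Vt[:2].T                            # (P, 2) in-plane coordinates
    radii = np.linalg.norm(coords, axis=1)
    angles = np.sort(np.arctan2(coords[:, 1], coords[:, 0]))
    # Gaps between consecutive sorted angles, including the wrap-around gap.
    gaps = np.diff(np.concatenate([angles, [angles[0] + 2 * np.pi]]))
    return energy, radii.std() / radii.mean(), gaps.std() / gaps.mean()

# Synthetic check: P tokens on a circle inside a random plane of R^d.
rng = np.random.default_rng(0)
P, d, k = 11, 16, 3
theta = 2 * np.pi * k * np.arange(P) / P
Q, _ = np.linalg.qr(rng.normal(size=(d, 2)))         # orthonormal basis of the plane
E_cyclic = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ Q.T
print(circle_diagnostics(E_cyclic))
```

Sorting the angles makes the gap statistic insensitive to the frequency `k`, so it tests equal spacing without identifying which character was learned.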

Figures

Figures reproduced from arXiv: 2606.08985 by Hu Tan, Kuo Gai, Shihua Zhang.

Figure 1. One-hot coding (panels A, B, C; image not reproduced here).
Original abstract

While neural collapse (NC) predicts that a $K$-class-balanced classifier should organize terminal representations as a $(K-1)$-dimensional simplex equiangular tight frame (ETF), modular addition consistently enters a different regime: networks compress to a two-dimensional cyclic geometry in which both classifier weights and token embeddings lie on circles. We refine the explanation of this phenomenon in three directions. First, we formalize a layerwise non-uniform training mechanism: downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize, and once this classifier plane forms, backpropagated feature gradients constrain embedding motion to the same plane while weight decay suppresses orthogonal components. Second, after this subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on $S^1$; combined with modular-addition labels, this reduces embedding formation to phase alignment, whose minimizers are single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and hence equal-angle points on a circle. Third, we quantify why this solution prevails over NC: a simplex ETF gains only an $O(1)$ advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a $\Theta(K)$ advantage under Schatten or weight-decay surrogates, yielding a critical threshold $\lambda_{\mathrm{crit}} = \Theta(1/K)$. Our results explain both why classifier weights move first and why embeddings subsequently align with them, showing that grokking on modular arithmetic is governed not by maximal separation alone but by a task-structured trade-off between separation, symmetry, and complexity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that modular arithmetic induces cyclic rank-2 geometries (embeddings and weights on circles) rather than the (K-1)-simplex ETF of neural collapse. It formalizes a layerwise process in which cross-entropy first drives classifier weights into a rank-2 equiangular plane; back-propagated gradients then lock embeddings into the same plane while weight decay suppresses orthogonal components. After locking, in-plane dynamics reduce to entropy-regularized transport on S^1 whose minimizers are single-frequency characters of Z/PZ. The cyclic solution is shown to enjoy a Θ(K) advantage under Schatten/weight-decay surrogates versus an O(1) cross-entropy advantage for the ETF, producing a critical threshold λ_crit = Θ(1/K).

Significance. If the reductions and uniqueness arguments are rigorous, the work supplies a task-intrinsic account of geometry selection that incorporates symmetry and regularization trade-offs, explaining both the ordering of classifier versus embedding reorganization and the prevalence of cyclic solutions in grokking. The explicit scaling comparison and transport interpretation constitute clear strengths that could extend to other algebraic tasks; the manuscript also supplies a mechanistic account of subspace locking.

major comments (2)
  1. [Abstract / in-plane dynamics section] Abstract and the section formalizing in-plane dynamics: the reduction of embedding formation to phase alignment whose minimizers are uniquely the single-frequency characters of Z/PZ is load-bearing for the entire advantage comparison. The manuscript states the entropy-regularized transport interpretation on S^1 but supplies no Euler-Lagrange equation, convexity argument, or explicit verification that the modular-addition label measure forces single-frequency minimizers exclusively; without this, higher harmonics or other phase configurations could achieve comparable costs, so the claimed uniqueness and the subsequent Θ(K) advantage do not follow.
  2. [Quantification of advantage / critical threshold section] Section quantifying the advantage and critical threshold: the derivation of λ_crit = Θ(1/K) compares an O(1) cross-entropy advantage for the ETF against a Θ(K) advantage for the cyclic solution under Schatten or weight-decay surrogates. This scaling is only decisive once the cyclic geometry is established as the actual attractor of the transport problem; if uniqueness fails, the threshold comparison is conditional on an unproven step and cannot be used to explain prevalence.
minor comments (2)
  1. Notation: clarify the relationship between the modulus P and the number of classes K throughout; it is unclear whether they are independent or whether P is composite in the experiments.
  2. [References] References: add citations to prior analyses of grokking on modular arithmetic and to existing work on entropy-regularized transport or Fourier analysis on cyclic groups.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The two major comments correctly identify that the uniqueness of the single-frequency minimizers in the in-plane transport problem is central to the claimed advantage and threshold, and that the manuscript as submitted provides only a high-level statement of the entropy-regularized transport interpretation without the supporting derivations. We respond to each point below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract / in-plane dynamics section] Abstract and the section formalizing in-plane dynamics: the reduction of embedding formation to phase alignment whose minimizers are uniquely the single-frequency characters of Z/PZ is load-bearing for the entire advantage comparison. The manuscript states the entropy-regularized transport interpretation on S^1 but supplies no Euler-Lagrange equation, convexity argument, or explicit verification that the modular-addition label measure forces single-frequency minimizers exclusively; without this, higher harmonics or other phase configurations could achieve comparable costs, so the claimed uniqueness and the subsequent Θ(K) advantage do not follow.

    Authors: We agree that the manuscript states the transport interpretation and the resulting single-frequency minimizers but does not supply the requested Euler-Lagrange derivation or uniqueness proof. In the revised version we will add an explicit stationarity condition for the entropy-regularized functional on S^1 together with a Fourier-series argument showing that the modular-addition label measure on the circle selects only the fundamental characters; the higher harmonics are strictly suboptimal. This addition will make the uniqueness rigorous and thereby support the subsequent scaling comparison. revision: yes

  2. Referee: [Quantification of advantage / critical threshold section] Section quantifying the advantage and critical threshold: the derivation of λ_crit = Θ(1/K) compares an O(1) cross-entropy advantage for the ETF against a Θ(K) advantage for the cyclic solution under Schatten or weight-decay surrogates. This scaling is only decisive once the cyclic geometry is established as the actual attractor of the transport problem; if uniqueness fails, the threshold comparison is conditional on an unproven step and cannot be used to explain prevalence.

    Authors: The referee is correct that the λ_crit derivation presupposes uniqueness of the cyclic attractor. Once the uniqueness proof is inserted (as outlined in the response to the first comment), the O(1) versus Θ(K) comparison becomes unconditional. We will also add a short clarifying sentence noting the logical dependence on uniqueness, but the added analysis removes the conditionality. No change to the reported scaling itself is required. revision: yes

Circularity Check

1 step flagged

λ_crit = Θ(1/K) obtained algebraically by balancing the paper's posited O(1) vs Θ(K) advantage scalings

specific steps
  1. self-definitional [Abstract (third direction)]
    "a simplex ETF gains only an O(1) advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a Θ(K) advantage under Schatten or weight-decay surrogates, yielding a critical threshold λ_crit = Θ(1/K)"

    The critical threshold is defined as the regularization strength at which the stated O(1) cross-entropy advantage equals λ times the Θ(K) surrogate advantage. The scaling λ_crit = Θ(1/K) is therefore the immediate algebraic rearrangement of those posited scalings; no further derivation or external data is needed to obtain it.

full rationale

The paper's third direction quantifies prevalence of the cyclic geometry over NC by comparing advantages and then states the resulting threshold. This threshold scaling follows directly from equating the two advantage terms inside the same regularization model; the derivation therefore reduces to the inputs by algebraic construction rather than an independent external benchmark or additional theorem.
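
The algebra behind this flag can be written out in a few lines. The constants $c_1, c_2 > 0$ and the symbols $\Delta_{\mathrm{CE}}$ and $\Delta_{\mathrm{reg}}$ are placeholders introduced here for the two posited advantages; they are not the paper's notation.

```latex
\begin{align*}
\Delta_{\mathrm{CE}} &= c_1 = O(1)
  && \text{(cross-entropy advantage of the simplex ETF)} \\
\Delta_{\mathrm{reg}} &= c_2 K = \Theta(K)
  && \text{(surrogate advantage of the cyclic rank-2 solution)} \\
\lambda \, \Delta_{\mathrm{reg}} > \Delta_{\mathrm{CE}}
  &\iff \lambda > \frac{c_1}{c_2 K}
  && \text{(cyclic solution preferred)} \\
\lambda_{\mathrm{crit}} &= \frac{c_1}{c_2 K} = \Theta(1/K)
\end{align*}
```

Strictly, an $O(1)$ upper bound on $\Delta_{\mathrm{CE}}$ alone yields only $\lambda_{\mathrm{crit}} = O(1/K)$; the $\Theta(1/K)$ statement additionally needs the cross-entropy advantage to be bounded away from zero.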

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions about training dynamics and transport that are introduced in the abstract; no free parameters are explicitly fitted to data in the summary, and no new entities are postulated.

axioms (2)
  • domain assumption Downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize
    Invoked as the first direction to explain subspace locking and subsequent embedding constraint.
  • domain assumption After subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on S^1 whose minimizers are single-frequency characters of Z/PZ
    Used in the second direction to reduce embedding formation to phase alignment.

pith-pipeline@v0.9.1-grok · 5828 in / 1665 out tokens · 33678 ms · 2026-06-27T17:37:19.965864+00:00 · methodology

