Beyond Neural Collapse: Task-Intrinsic Geometry Governs Neural Representations in Modular Arithmetic
Pith reviewed 2026-06-27 17:37 UTC · model grok-4.3
The pith
Modular arithmetic tasks organize neural representations into cyclic circles on a plane rather than the simplex of neural collapse.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that task-intrinsic geometry in modular addition produces a cyclic rank-2 configuration on circles instead of neural collapse. Classifier weights first reach a rank-2 equiangular plane driven by dense cross-entropy gradients. Backpropagated feature gradients then confine embeddings to the same plane while weight decay removes orthogonal components. The resulting in-plane dynamics admit an entropy-regularized transport interpretation on the circle that, combined with modular-addition labels, reduces to phase alignment whose minimizers are single-frequency characters of Z/PZ. This cyclic solution prevails because it enjoys a Theta(K) advantage under Schatten or weight-decay su
What carries the argument
The entropy-regularized transport on S^1 that reduces embedding formation to phase alignment with single-frequency characters of Z/PZ after subspace locking into a rank-2 plane.
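A small sketch may make the phase-alignment picture concrete. It is not the paper's network: it assumes embeddings are exactly the single-frequency characters of Z/PZ, combines two tokens by complex multiplication (so their angles add), and reads out with a rank-2 classifier built from the same characters. The function names `character` and `predict_sum` are hypothetical.

```python
import numpy as np

def character(P: int, freq: int) -> np.ndarray:
    """Single-frequency character of Z/PZ: a -> exp(2*pi*i*freq*a/P)."""
    a = np.arange(P)
    return np.exp(2j * np.pi * freq * a / P)

def predict_sum(P: int, freq: int = 1) -> np.ndarray:
    """Return a (P, P) table of predicted labels for every pair (a, b)."""
    chi = character(P, freq)
    # Phase alignment: multiplying characters adds their angles on the circle.
    z = chi[:, None] * chi[None, :]                      # shape (P, P)
    # Rank-2 readout: the logit for class c is
    # Re(z * conj(chi[c])) = cos(2*pi*freq*(a + b - c) / P).
    logits = np.real(z[:, :, None] * np.conj(chi)[None, None, :])
    return logits.argmax(axis=-1)

P = 7
expected = (np.arange(P)[:, None] + np.arange(P)[None, :]) % P
table = predict_sum(P, freq=1)
print(np.array_equal(table, expected))
```

The logit peaks uniquely at c = (a + b) mod P whenever `freq` is coprime to P, which is why any such frequency gives equal-angle points on a circle that solve the task in this idealized setting.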
If this is right
- Classifier weights reach a rank-2 equiangular configuration before upstream embeddings fully reorganize.
- Once the plane forms, backpropagated gradients constrain embeddings to it and weight decay suppresses orthogonal components.
- The cyclic solution gains a Theta(K) advantage under weight-decay or Schatten surrogates while the simplex gains only O(1) in cross-entropy.
- Grokking on modular arithmetic is governed by a task-structured trade-off between separation, symmetry, and complexity rather than maximal separation alone.
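The subspace-locking step in the list above can be checked numerically under simplifying assumptions. The sketch below fixes a hypothetical rank-2 classifier `W` whose rows are equal-angle points in a random plane of R^d, treats a single feature vector `h` as a free parameter, and applies plain gradient descent with weight decay. None of the names or hyperparameters come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d = 12, 16

# Hypothetical rank-2 classifier: equal-angle points on a circle,
# embedded in a random 2-D plane of R^d.
theta = 2 * np.pi * np.arange(K) / K
basis, _ = np.linalg.qr(rng.normal(size=(d, 2)))       # orthonormal plane basis, (d, 2)
W = np.stack([np.cos(theta), np.sin(theta)], axis=1) @ basis.T   # (K, d)

def feature_grad(h: np.ndarray, y: int) -> np.ndarray:
    """Cross-entropy gradient with respect to the feature h for logits W @ h."""
    logits = W @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0
    return W.T @ p          # a linear combination of the rows of W

def off_plane_norm(v: np.ndarray) -> float:
    """Norm of the component of v orthogonal to the classifier plane."""
    return float(np.linalg.norm(v - basis @ (basis.T @ v)))

h = rng.normal(size=d)
lr, wd = 0.1, 0.1
start = off_plane_norm(h)
for _ in range(200):
    # The gradient has no off-plane part, so only weight decay acts there.
    h = h - lr * (feature_grad(h, y=3) + wd * h)
end = off_plane_norm(h)
print(start, end)
```

Because the backpropagated gradient is W^T(p - y), it cannot leave the row space of `W`; the orthogonal component of `h` therefore shrinks by a factor (1 - lr * wd) per step in this toy setup.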
Where Pith is reading between the lines
- The same layerwise locking and phase-alignment mechanism may appear in other group-structured tasks whose labels possess modular or cyclic symmetry.
- Varying the modulus P while holding regularization fixed could test whether the number of equal-angle points on the circle scales directly with group order.
- Experiments that deliberately strengthen or weaken the entropy-regularization term inside the plane could switch the learned geometry between cyclic and simplex regimes.
Load-bearing premise
The induced in-plane dynamics admit an entropy-regularized transport interpretation on the circle whose minimizers are exactly the single-frequency characters of the modular group.
What would settle it
Train networks on modular addition under weight decay and measure two things: whether the embeddings lie at equal angles on a circle, and whether the claimed Theta(K) versus O(1) advantage between cyclic and simplex solutions holds as the regularization strength crosses Theta(1/K).
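One way the first measurement could be set up is sketched below. The function `circle_diagnostics` and its three statistics are assumptions chosen for illustration, not metrics defined by the paper; it expects a trained embedding matrix with one row per residue.

```python
import numpy as np

def circle_diagnostics(E: np.ndarray) -> dict:
    """E: (P, d) embedding matrix, one row per residue 0..P-1."""
    X = E - E.mean(axis=0, keepdims=True)
    _, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Share of variance captured by the best-fitting 2-D plane.
    plane_energy = float((S[:2] ** 2).sum() / (S ** 2).sum())
    coords = X @ Vt[:2].T                                 # (P, 2) in-plane coordinates
    radii = np.linalg.norm(coords, axis=1)
    # Coefficient of variation of the radii: 0 for a perfect circle.
    radius_cv = float(radii.std() / radii.mean())
    angles = np.sort(np.arctan2(coords[:, 1], coords[:, 0]))
    gaps = np.diff(np.concatenate([angles, [angles[0] + 2 * np.pi]]))
    # Coefficient of variation of angular gaps: 0 for equal-angle points.
    gap_cv = float(gaps.std() / gaps.mean())
    return {"plane_energy": plane_energy, "radius_cv": radius_cv, "gap_cv": gap_cv}

# Synthetic check on an ideal cyclic embedding (frequency 5, modulus 23).
P = 23
ang = 2 * np.pi * 5 * np.arange(P) / P
ideal = np.stack([np.cos(ang), np.sin(ang)], axis=1)
print(circle_diagnostics(ideal))
```

A plane energy near 1 together with radius and gap variation near 0 would be consistent with the cyclic geometry; a simplex-like solution should instead spread its variance over many directions.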
Original abstract
While neural collapse (NC) predicts that a $K$-class-balanced classifier should organize terminal representations as a $(K-1)$-dimensional simplex equiangular tight frame (ETF), modular addition consistently enters a different regime: networks compress to a two-dimensional cyclic geometry in which both classifier weights and token embeddings lie on circles. We refine the explanation of this phenomenon in three directions. First, we formalize a layerwise non-uniform training mechanism: downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize, and once this classifier plane forms, backpropagated feature gradients constrain embedding motion to the same plane while weight decay suppresses orthogonal components. Second, after this subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on $S^1$; combined with modular-addition labels, this reduces embedding formation to phase alignment, whose minimizers are single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and hence equal-angle points on a circle. Third, we quantify why this solution prevails over NC: a simplex ETF gains only an $O(1)$ advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a $\Theta(K)$ advantage under Schatten or weight-decay surrogates, yielding a critical threshold $\lambda_{\mathrm{crit}} = \Theta(1/K)$. Our results explain both why classifier weights move first and why embeddings subsequently align with them, showing that grokking on modular arithmetic is governed not by maximal separation alone but by a task-structured trade-off between separation, symmetry, and complexity.
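The complexity gap between the two geometries can be illustrated for one specific surrogate. The sketch below assumes K unit-norm classifier rows in both cases and uses the nuclear norm (Schatten-1); the abstract does not state the paper's normalization, so this is only an illustration of how a gap linear in K can arise. With unit-norm rows the Frobenius norms of the two configurations coincide, so this particular comparison does not cover the weight-decay case.

```python
import numpy as np

def simplex_etf(K: int) -> np.ndarray:
    """K unit-norm rows forming a simplex ETF (rank K - 1)."""
    return np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)

def cyclic(K: int) -> np.ndarray:
    """K unit-norm rows at equal angles on a circle (rank 2)."""
    theta = 2 * np.pi * np.arange(K) / K
    return np.stack([np.cos(theta), np.sin(theta)], axis=1)

def nuclear_norm(M: np.ndarray) -> float:
    """Sum of singular values (Schatten-1 norm)."""
    return float(np.linalg.svd(M, compute_uv=False).sum())

for K in (8, 32, 128):
    print(K, nuclear_norm(simplex_etf(K)), nuclear_norm(cyclic(K)))
```

Under these assumptions the ETF has nuclear norm sqrt(K(K-1)) and the cyclic configuration has sqrt(2K) for K >= 3, so the difference grows linearly in K.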
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that modular arithmetic induces cyclic rank-2 geometries (embeddings and weights on circles) rather than the (K-1)-simplex ETF of neural collapse. It formalizes a layerwise process in which cross-entropy first drives classifier weights into a rank-2 equiangular plane; back-propagated gradients then lock embeddings into the same plane while weight decay suppresses orthogonal components. After locking, in-plane dynamics reduce to entropy-regularized transport on S^1 whose minimizers are single-frequency characters of Z/PZ. The cyclic solution is shown to enjoy a Θ(K) advantage under Schatten/weight-decay surrogates versus an O(1) cross-entropy advantage for the ETF, producing a critical threshold λ_crit = Θ(1/K).
Significance. If the reductions and uniqueness arguments are rigorous, the work supplies a task-intrinsic account of geometry selection that incorporates symmetry and regularization trade-offs, explaining both the ordering of classifier versus embedding reorganization and the prevalence of cyclic solutions in grokking. The explicit scaling comparison and transport interpretation constitute clear strengths that could extend to other algebraic tasks; the manuscript also supplies a mechanistic account of subspace locking.
major comments (2)
- [Abstract / in-plane dynamics section] Abstract and the section formalizing in-plane dynamics: the reduction of embedding formation to phase alignment whose minimizers are uniquely the single-frequency characters of Z/PZ is load-bearing for the entire advantage comparison. The manuscript states the entropy-regularized transport interpretation on S^1 but supplies no Euler-Lagrange equation, convexity argument, or explicit verification that the modular-addition label measure forces single-frequency minimizers exclusively; without this, higher harmonics or other phase configurations could achieve comparable costs, so the claimed uniqueness and the subsequent Θ(K) advantage do not follow.
- [Quantification of advantage / critical threshold section] Section quantifying the advantage and critical threshold: the derivation of λ_crit = Θ(1/K) compares an O(1) cross-entropy advantage for the ETF against a Θ(K) advantage for the cyclic solution under Schatten or weight-decay surrogates. This scaling is only decisive once the cyclic geometry is established as the actual attractor of the transport problem; if uniqueness fails, the threshold comparison is conditional on an unproven step and cannot be used to explain prevalence.
minor comments (2)
- Notation: clarify the relationship between the modulus P and the number of classes K throughout; it is unclear whether they are independent or whether P is composite in the experiments.
- [References] References: add citations to prior analyses of grokking on modular arithmetic and to existing work on entropy-regularized transport or Fourier analysis on cyclic groups.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The two major comments correctly identify that the uniqueness of the single-frequency minimizers in the in-plane transport problem is central to the claimed advantage and threshold, and that the manuscript as submitted provides only a high-level statement of the entropy-regularized transport interpretation without the supporting derivations. We respond to each point below and will revise the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract / in-plane dynamics section] Abstract and the section formalizing in-plane dynamics: the reduction of embedding formation to phase alignment whose minimizers are uniquely the single-frequency characters of Z/PZ is load-bearing for the entire advantage comparison. The manuscript states the entropy-regularized transport interpretation on S^1 but supplies no Euler-Lagrange equation, convexity argument, or explicit verification that the modular-addition label measure forces single-frequency minimizers exclusively; without this, higher harmonics or other phase configurations could achieve comparable costs, so the claimed uniqueness and the subsequent Θ(K) advantage do not follow.
Authors: We agree that the manuscript states the transport interpretation and the resulting single-frequency minimizers but does not supply the requested Euler-Lagrange derivation or uniqueness proof. In the revised version we will add an explicit stationarity condition for the entropy-regularized functional on S^1 together with a Fourier-series argument showing that the modular-addition label measure on the circle selects only the fundamental characters; the higher harmonics are strictly suboptimal. This addition will make the uniqueness rigorous and thereby support the subsequent scaling comparison. revision: yes
-
Referee: [Quantification of advantage / critical threshold section] Section quantifying the advantage and critical threshold: the derivation of λ_crit = Θ(1/K) compares an O(1) cross-entropy advantage for the ETF against a Θ(K) advantage for the cyclic solution under Schatten or weight-decay surrogates. This scaling is only decisive once the cyclic geometry is established as the actual attractor of the transport problem; if uniqueness fails, the threshold comparison is conditional on an unproven step and cannot be used to explain prevalence.
Authors: The referee is correct that the λ_crit derivation presupposes uniqueness of the cyclic attractor. Once the uniqueness proof is inserted (as outlined in the response to the first comment), the O(1) versus Θ(K) comparison becomes unconditional. We will also add a short clarifying sentence noting the logical dependence on uniqueness, but the added analysis removes the conditionality. No change to the reported scaling itself is required. revision: yes
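Until the promised uniqueness proof is available, the single-frequency claim could at least be probed empirically. The sketch below is one hypothetical diagnostic, not something described in the manuscript: it takes a discrete Fourier transform of an embedding matrix over the residue index and reports how the non-constant energy is split across frequencies, folding each conjugate pair f and P - f together.

```python
import numpy as np

def frequency_energy(E: np.ndarray) -> np.ndarray:
    """E: (P, d) embeddings. Returns energy shares for f = 1..P//2 (index 0 is f = 1)."""
    P = E.shape[0]
    spectrum = np.fft.fft(E, axis=0)                  # DFT over the residue index
    power = (np.abs(spectrum) ** 2).sum(axis=1)       # total power per frequency
    folded = []
    for f in range(1, P // 2 + 1):
        partner = P - f                               # conjugate frequency
        folded.append(power[f] if partner == f else power[f] + power[partner])
    folded = np.asarray(folded)
    return folded / folded.sum()                      # frequency 0 (the mean) is excluded

# Synthetic example: a pure frequency-3 embedding for modulus 13.
P = 13
a = np.arange(P)
E = np.stack([np.cos(2 * np.pi * 3 * a / P), np.sin(2 * np.pi * 3 * a / P)], axis=1)
shares = frequency_energy(E)
print(int(shares.argmax()) + 1, shares.round(3))
```

If trained embeddings put nearly all of their energy on one frequency, that supports the single-frequency picture; comparable shares across several frequencies would be the "higher harmonics" scenario raised by the referee. The function assumes the embeddings are not constant across residues, since otherwise the normalization divides by zero.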
Circularity Check
λ_crit = Θ(1/K) obtained algebraically by balancing the paper's posited O(1) vs Θ(K) advantage scalings
specific steps
- self-definitional [Abstract (third direction)]
"a simplex ETF gains only an O(1) advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a Θ(K) advantage under Schatten or weight-decay surrogates, yielding a critical threshold λ_crit = Θ(1/K)"
The critical threshold is defined as the regularization strength at which the stated O(1) cross-entropy advantage equals λ times the Θ(K) surrogate advantage. The scaling λ_crit = Θ(1/K) is therefore the immediate algebraic rearrangement of those posited scalings; no further derivation or external data is needed to obtain it.
full rationale
The paper's third direction quantifies prevalence of the cyclic geometry over NC by comparing advantages and then states the resulting threshold. This threshold scaling follows directly from equating the two advantage terms inside the same regularization model; the derivation therefore reduces to the inputs by algebraic construction rather than an independent external benchmark or additional theorem.
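The rearrangement described above can be written out in a few lines. The symbols J, R, c_1 and c_2 are introduced here for illustration and are not the paper's notation; the sketch assumes a regularized objective J = L_CE + λR and positive constants c_1, c_2 independent of K.

```latex
\begin{align*}
  \Delta_{\mathrm{CE}} &:= \mathcal{L}_{\mathrm{CE}}(\text{cyclic}) - \mathcal{L}_{\mathrm{CE}}(\text{ETF}) = c_1
      && \text{(ETF advantage in cross-entropy)} \\
  \Delta_{R} &:= R(\text{ETF}) - R(\text{cyclic}) = c_2 K
      && \text{(cyclic advantage under the surrogate)} \\
  J(\text{cyclic}) < J(\text{ETF})
      &\iff \lambda\,\Delta_{R} > \Delta_{\mathrm{CE}}
       \iff \lambda > \frac{c_1}{c_2 K} =: \lambda_{\mathrm{crit}}
\end{align*}
```

Note that obtaining λ_crit = Θ(1/K), rather than only O(1/K), requires the cross-entropy gap c_1 to be bounded below by a positive constant as well as above.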
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize.
- domain assumption: After subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on S^1 whose minimizers are single-frequency characters of Z/PZ.