Pith · machine review for the scientific record

arXiv: 2604.03068 · v1 · submitted 2026-04-03 · ❄️ cond-mat.dis-nn · cond-mat.stat-mech · stat.ML

Recognition: 2 theorem links · Lean Theorem

Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:20 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn · cond-mat.stat-mech · stat.ML
keywords stochastic gradient descent · overparameterization · quadratic activations · teacher-student model · implicit bias · overlap dynamics · escape from plateau

The pith

In quadratic teacher-student networks, overparameterization modestly accelerates escape from poor-generalization plateaus via prefactor changes in the loss decay, while conserved overlap quantities select the zero-loss solution closest to the random initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives low-dimensional ODEs for overlap matrices that govern one-pass SGD in high-dimensional quadratic networks at fixed sample-to-dimension ratio α = M/N. Overparameterization (student width p > teacher width p*) only modestly speeds escape from the plateau of poor generalization, by changing the prefactor of the exponential loss decay. Rotational symmetry of the unconstrained weights creates a manifold of zero-loss solutions, and the dynamics reliably reaches the member of that manifold nearest the random initialization, because a conserved quantity in the overlap ODEs prevents farther solutions from being reached. A Hessian analysis of the population loss confirms that the plateau consists of saddles and that the manifold consists of marginal minima.

Core claim

In the high-dimensional regime with fixed α = M/N and finite hidden widths p and p*, the SGD dynamics governed by the closed ODEs for the student-teacher and student-student overlaps escapes the poor-generalization plateau at an exponential rate whose prefactor is modestly improved by overparameterization (p > p*). From the manifold of zero-loss solutions created by the continuous rotational symmetry, the dynamics selects the solution closest to the random initialization, as enforced by a conserved quantity in the overlap evolution equations.

What carries the argument

Low-dimensional ODEs for the student-teacher and student-student overlap matrices that close in the high-dimensional limit and govern the evolution of the loss and generalization error.
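As a concrete illustration of why the overlaps can close, here is a minimal derivation sketch for the population loss, assuming standard Gaussian inputs and the conventions ŷ = xᵀWᵀWx/N for the student and y = xᵀW*ᵀW*x/N for the teacher (the paper notes that quadratic-activation outputs are determined by G_W = WᵀW; the normalizations here are our assumption, not the paper's).

```latex
% Sketch: the population loss closes in the overlaps for quadratic activations.
% Assumed conventions (not the paper's): x ~ N(0, I_N), student output
% \hat{y} = x^T W^T W x / N, teacher output y = x^T W_*^T W_* x / N, and
\[
  Q = \frac{WW^{\top}}{N} \in \mathbb{R}^{p \times p}, \qquad
  M = \frac{WW_*^{\top}}{N} \in \mathbb{R}^{p \times p_*}, \qquad
  P = \frac{W_*W_*^{\top}}{N} \in \mathbb{R}^{p_* \times p_*}.
\]
% For symmetric A, B and x ~ N(0, I_N), Isserlis' theorem gives
% E[x^T A x] = Tr A and E[(x^T A x)(x^T B x)] = Tr A Tr B + 2 Tr(AB).
% With \Delta = W^T W - W_*^T W_*, the population loss therefore reads
\[
  \mathcal{L}
  = \tfrac{1}{2}\,\mathbb{E}\!\left[(\hat{y}-y)^{2}\right]
  = \tfrac{1}{2}\left(\operatorname{Tr}Q-\operatorname{Tr}P\right)^{2}
    + \operatorname{Tr}(Q^{2}) - 2\operatorname{Tr}(MM^{\top})
    + \operatorname{Tr}(P^{2}),
\]
% which depends on W only through Q and M: the overlaps are a sufficient
% statistic for the population loss, the first step toward closed ODEs.
```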

If this is right

  • Overparameterization (p > p*) changes only the prefactor of the exponential escape from the plateau, not the exponential rate itself (a finite-N simulation sketch follows this list).
  • A conserved quantity in the overlap ODEs forces selection of the zero-loss solution closest to initialization on the manifold.
  • The plateau corresponds to saddles with at least one negative eigenvalue of the population-loss Hessian.
  • The manifold of zero-loss solutions corresponds to marginal minima of the population loss.
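To make the plateau-and-escape picture concrete, here is a minimal finite-N simulation sketch of one-pass SGD on a quadratic teacher-student network, tracking the population loss through the overlap expression sketched above. All hyperparameters (N, widths, learning rate, initialization scale) are illustrative assumptions, not the paper's settings.

```python
# Minimal finite-N sketch of one-pass SGD in a quadratic teacher-student
# network. Conventions and hyperparameters are illustrative assumptions,
# not the paper's: y_hat = ||W x||^2 / N, squared loss, fresh sample per step.
import numpy as np

rng = np.random.default_rng(0)
N, p_star, p = 200, 2, 4            # input dim, teacher width, student width
lr, steps = 0.02, 300_000

W_star = rng.standard_normal((p_star, N))   # fixed random teacher
W = 1e-2 * rng.standard_normal((p, N))      # small random student init

def pop_loss(W, W_star, N):
    # Population loss written in the overlaps (see the closure sketch above).
    Q = W @ W.T / N
    M = W @ W_star.T / N
    P = W_star @ W_star.T / N
    return (0.5 * (np.trace(Q) - np.trace(P)) ** 2
            + np.trace(Q @ Q) - 2 * np.trace(M @ M.T) + np.trace(P @ P.T))

history = []
for t in range(steps):
    x = rng.standard_normal(N)               # fresh sample: one-pass SGD
    err = (np.sum((W @ x) ** 2) - np.sum((W_star @ x) ** 2)) / N
    W -= lr * err * 2.0 * np.outer(W @ x, x) / N   # gradient of 0.5 * err^2
    if t % 1000 == 0:
        history.append(pop_loss(W, W_star, N))
# Expected picture: a long plateau near the initial loss, then roughly
# exponential decay; rerunning with larger p should mainly shift the escape
# time (a prefactor effect), not change the decay rate in the exponent.
```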

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The conserved quantity supplies an explicit mechanism for the implicit bias toward solutions near initialization that is often observed in SGD.
  • The modest effect of extra parameters suggests that, for quadratic activations, increasing width beyond the teacher width yields diminishing returns for generalization speed.
  • The same overlap-closure technique may apply to other activations or architectures where similar low-dimensional conserved quantities could be identified.

Load-bearing premise

The overlap matrices close into a finite set of ODEs in the high-dimensional limit, at fixed α and finite p and p*, for quadratic activations in the teacher-student setup.
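A quick numerical sanity check of this premise (our sketch, same assumed conventions as above, not the paper's code): for any fixed weights, a Monte Carlo estimate of the population loss over fresh Gaussian inputs should match the closed-form overlap expression, confirming that the overlaps carry all the information the population dynamics needs.

```python
# Monte Carlo check (sketch) that the population loss depends on the weights
# only through the overlaps. Conventions assumed: y_hat = ||W x||^2 / N.
import numpy as np

rng = np.random.default_rng(1)
N, p, p_star, n_samples = 100, 3, 2, 50_000
W = rng.standard_normal((p, N))
W_star = rng.standard_normal((p_star, N))

def out(V, X):                      # batched quadratic-network output
    return np.sum((X @ V.T) ** 2, axis=1) / N

X = rng.standard_normal((n_samples, N))
mc = 0.5 * np.mean((out(W, X) - out(W_star, X)) ** 2)

Q = W @ W.T / N
M = W @ W_star.T / N
P = W_star @ W_star.T / N
closed = (0.5 * (np.trace(Q) - np.trace(P)) ** 2
          + np.trace(Q @ Q) - 2 * np.trace(M @ M.T) + np.trace(P @ P.T))
print(mc, closed)                   # should agree up to Monte Carlo error
```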

What would settle it

A direct numerical integration of the finite-N network showing that the escape time from the plateau scales exponentially, rather than linearly, with the overparameterization gap p − p* would falsify the modest-acceleration claim.
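A minimal sketch of that test, under the same assumed conventions as the simulations above: sweep the student width p at fixed p* and record when the population loss first falls well below its plateau value. The modest-acceleration claim predicts a mild, prefactor-level decrease of the escape time with the gap p − p*.

```python
# Sketch of the escape-time scaling test versus the overparameterization gap.
# All conventions and hyperparameters are assumptions, as in the sketch above.
import numpy as np

def escape_time(p, p_star=2, N=200, lr=0.02, steps=300_000, seed=0):
    rng = np.random.default_rng(seed)
    W_star = rng.standard_normal((p_star, N))
    W = 1e-2 * rng.standard_normal((p, N))
    P = W_star @ W_star.T / N

    def loss(W):
        Q = W @ W.T / N
        M = W @ W_star.T / N
        return (0.5 * (np.trace(Q) - np.trace(P)) ** 2
                + np.trace(Q @ Q) - 2 * np.trace(M @ M.T) + np.trace(P @ P.T))

    plateau = loss(W)
    for t in range(steps):
        x = rng.standard_normal(N)
        err = (np.sum((W @ x) ** 2) - np.sum((W_star @ x) ** 2)) / N
        W -= lr * err * 2.0 * np.outer(W @ x, x) / N
        if t % 200 == 0 and loss(W) < 0.1 * plateau:
            return t                # first time well below the plateau
    return steps                    # did not escape within the step budget

for p in range(2, 7):
    print(p, escape_time(p))        # modest decrease expected as p grows
```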

Figures

Figures reproduced from arXiv: 2604.03068 by Carlo Lucibello, Chiara Cammarota, Dario Bocchi, Luca Saglietti, Theotime Regimbeau.

Figure 1: Comparison between the numerical solution of the ODEs (solid lines) and the average over … [PITH_FULL_IMAGE:figures/full_fig_p005_1.png]

Figure 2: Numerical analysis of the optimal learning rate … [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

Figure 3: Analysis using the numerical solution of the ODEs for … [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]

Figure 4: Evolution of the elements of the matrix S(t) over time for a random orthogonal initialization, illustrating their numerical conservation during the learning dynamics.

Figure 5: Degrees of freedom of the student network … [PITH_FULL_IMAGE:figures/full_fig_p029_5.png]

Figure 6: Dimension of the solution manifold as a function of the student network's number of hidden … [PITH_FULL_IMAGE:figures/full_fig_p030_6.png]
Original abstract

We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at fixed ratio $\alpha = M/N$, and for finite hidden widths $(p,p^*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p^*$) only modestly accelerates escape from a plateau of poor generalization by modifying the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold the dynamics consistently selects the closest solution to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau and the solution manifold correspond to saddles with at least one negative eigenvalue and to marginal minima in the population-loss geometry, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes one-pass SGD dynamics for two-layer networks with quadratic activations in the teacher-student framework. In the high-dimensional limit (N, M → ∞ at fixed α = M/N) with finite hidden widths p and p*, it derives a closed system of low-dimensional ODEs for the student-teacher and student-student overlap matrices. The central claims are that overparameterization (p > p*) modestly accelerates escape from a poor-generalization plateau solely by modifying the prefactor of the exponential loss decay, that a conserved quantity in the ODEs enforces selection of the closest zero-loss solution on a manifold induced by rotational symmetry of unconstrained weights, and that Hessian analysis of the population loss identifies the plateau as a saddle (negative eigenvalue) and the solution manifold as marginal minima.

Significance. If the ODE closure holds exactly, the work supplies a parameter-free analytical characterization of escape dynamics and implicit bias for this solvable model class. The exact reduction to overlaps (enabled by quadratic activations), the identification of a conserved quantity, and the resulting falsifiable predictions for prefactor scaling and solution selection constitute clear strengths. These results sharpen understanding of why overparameterization yields only modest dynamical benefits in the high-dimensional regime and provide a benchmark for more general activation functions.

major comments (2)
  1. [Section 3 (ODE derivation)] Derivation of the overlap ODEs (Section 3): The reduction to a closed finite-dimensional system for the overlaps is load-bearing for every subsequent claim (conserved quantity, prefactor modification, Hessian classification). The manuscript asserts exact closure in the N→∞ limit because quadratic activations are degree-2 polynomials, yet it must explicitly verify that the expectation of the SGD update on the population loss contains no residual higher-moment contributions that would prevent the overlaps from forming a sufficient statistic. Without this step-by-step reduction, the conserved quantity and the exponential-decay analysis rest on an unverified assumption. (The Gaussian moment identities such a verification would rest on are sketched after these comments.)
  2. [Hessian analysis section] Hessian analysis of the population-loss landscape: The classification of the plateau as a saddle with at least one negative eigenvalue and the solution manifold as marginal minima is used to interpret the dynamics. The eigenvalues should be expressed explicitly in terms of the overlap matrices at the fixed points; any dependence on the particular choice of initialization or on finite-N corrections would weaken the geometric interpretation.
minor comments (2)
  1. [Introduction / Setup] Notation: Define the overlap matrices (student-student Q and student-teacher M) and the scaling parameters α, p, p* at their first appearance with explicit dimensions.
  2. [Escape dynamics paragraph] The statement that overparameterization 'only modestly accelerates escape' should be accompanied by the explicit prefactor ratio derived from the ODEs rather than left as a qualitative claim.
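For reference, the closure verification requested in major comment 1 would rest on standard Gaussian (Isserlis/Wick) identities like the ones below; this is our sketch, not the paper's derivation. Because both network outputs are quadratic forms in x, every expectation in the SGD update reduces to traces of products of WᵀW and W*ᵀW*, i.e., to functions of the overlaps.

```latex
% Gaussian moment identities behind the closure argument (sketch).
% For x ~ N(0, I_N) and symmetric matrices A, B:
\[
  \mathbb{E}\!\left[x^{\top}Ax\right] = \operatorname{Tr}A, \qquad
  \mathbb{E}\!\left[(x^{\top}Ax)(x^{\top}Bx)\right]
  = \operatorname{Tr}A\,\operatorname{Tr}B + 2\operatorname{Tr}(AB),
\]
\[
  \mathbb{E}\!\left[(x^{\top}Ax)\,xx^{\top}\right]
  = (\operatorname{Tr}A)\,I_{N} + 2A.
\]
% The last identity is the one the expected SGD drift uses: with
% A = W^T W - W_*^T W_*, the update's expectation is a polynomial in
% W, W_*, and traces of their products, hence a function of the overlaps.
```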

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and positive assessment of our work's contributions to understanding SGD dynamics in overparameterized quadratic networks. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Point-by-point responses
  1. Referee: Derivation of the overlap ODEs (Section 3): The reduction to a closed finite-dimensional system for the overlaps is load-bearing for every subsequent claim (conserved quantity, prefactor modification, Hessian classification). The manuscript asserts exact closure in the N→∞ limit because quadratic activations are degree-2 polynomials, yet it must explicitly verify that the expectation of the SGD update on the population loss contains no residual higher-moment contributions that would prevent the overlaps from forming a sufficient statistic. Without this step-by-step reduction, the conserved quantity and the exponential-decay analysis rest on an unverified assumption.

    Authors: We thank the referee for highlighting the need for a more explicit verification of the ODE closure. In the revised version, we will provide a detailed, step-by-step derivation in Section 3 demonstrating that the quadratic activations ensure the SGD update's expectation depends solely on the overlap matrices, with higher-moment terms vanishing in the thermodynamic limit. This will rigorously establish the overlaps as a sufficient statistic, thereby supporting the conserved quantity and the exponential decay analysis. revision: yes

  2. Referee: Hessian analysis of the population-loss landscape: The classification of the plateau as a saddle with at least one negative eigenvalue and the solution manifold as marginal minima is used to interpret the dynamics. The eigenvalues should be expressed explicitly in terms of the overlap matrices at the fixed points; any dependence on the particular choice of initialization or on finite-N corrections would weaken the geometric interpretation.

    Authors: We agree that explicit expressions for the eigenvalues will enhance the clarity of the geometric interpretation. In the revision, we will derive and present the eigenvalues of the Hessian explicitly as functions of the overlap matrices at the relevant fixed points. The fixed points themselves are characterized in the overlap space independently of specific initializations, and our analysis is conducted in the N→∞ limit, so finite-N effects are not considered; we will add a clarifying remark on this point. revision: yes

Circularity Check

0 steps flagged

No significant circularity: the overlap ODEs and the conserved quantity are derived from the high-dimensional limit.

full rationale

The paper derives closed low-dimensional ODEs for student-teacher and student-student overlap matrices from the N→∞ limit at fixed α=M/N with quadratic activations (degree-2 polynomials ensure closure under expectations over inputs). The conserved quantity selecting the closest zero-loss solution is a direct algebraic consequence of these dynamical equations, not an input redefined as output. Overparameterization effects on escape prefactors and the Hessian classification of saddles versus marginal minima follow from the same system and independent landscape analysis. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the high-dimensional limit closing the overlap dynamics and on the existence of a rotational symmetry from unconstrained norms; no free parameters are fitted to data and no new entities are postulated.

axioms (2)
  • domain assumption In the high-dimensional limit with fixed α=M/N the evolution of student-teacher and student-student overlaps closes into a finite set of ODEs for quadratic activations.
    Invoked to reduce the high-dimensional SGD to low-dimensional dynamics.
  • domain assumption Unconstrained weight norms produce a continuous rotational symmetry yielding a manifold of zero-loss solutions when p > 1 (a short symmetry computation follows this list).
    Used to explain the existence of the solution manifold and the conserved quantity.
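The symmetry itself is quick to exhibit; the following is our computation, built on the paper's observation that quadratic-activation outputs depend on W only through G_W = WᵀW (the dimension count is standard and carries the usual stabilizer caveat).

```latex
% Rotational symmetry of the student (sketch). For any O in O(p):
\[
  W \mapsto OW
  \quad\Longrightarrow\quad
  G_{W} = W^{\top}O^{\top}OW = W^{\top}W,
\]
% so the network output, which depends on W only through G_W, is unchanged.
% Acting with O(p) on a zero-loss W sweeps out a manifold of zero-loss
% solutions of dimension up to dim O(p) = p(p-1)/2 (reduced by the
% stabilizer of W), which is nontrivial precisely when p > 1.
```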

pith-pipeline@v0.9.0 · 5530 in / 1462 out tokens · 24395 ms · 2026-05-13T18:20:13.990119+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages
