Pith · machine review for the scientific record

arXiv: 2604.03068 · v1 · submitted 2026-04-03 · ❄️ cond-mat.dis-nn · cond-mat.stat-mech · stat.ML

Recognition: 2 theorem links · Lean Theorem

Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 18:20 UTC · model grok-4.3

classification ❄️ cond-mat.dis-nn · cond-mat.stat-mech · stat.ML
keywords stochastic gradient descent · overparameterization · quadratic activations · teacher-student model · implicit bias · overlap dynamics · escape from plateau

The pith

In quadratic teacher-student networks, overparameterization modestly accelerates escape from poor-generalization plateaus via prefactor changes in the loss decay, while conserved overlap quantities select the zero-loss solution closest to the random initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives low-dimensional ODEs for overlap matrices that govern one-pass SGD in high-dimensional quadratic networks at fixed sample-to-dimension ratio α = M/N. Overparameterization (student width p > teacher width p*) only modestly speeds escape from the plateau of poor generalization, by changing the prefactor of the exponential loss decay. Rotational symmetry of the unconstrained weights creates a manifold of zero-loss solutions, and the dynamics reliably reaches the member of that manifold nearest the random initialization, because a conserved quantity in the overlap ODEs prevents farther solutions from being reached. A Hessian analysis of the population loss confirms that the plateau consists of saddles and that the manifold consists of marginal minima.

Core claim

In the high-dimensional regime with fixed α = M/N and finite hidden widths p and p*, the SGD dynamics governed by the closed ODEs for the student-teacher and student-student overlaps escapes the poor-generalization plateau at an exponential rate whose prefactor is modestly improved by overparameterization (p > p*). From the manifold of zero-loss solutions created by the continuous rotational symmetry, the dynamics selects the solution closest to the random initialization, as enforced by a conserved quantity in the overlap evolution equations.

What carries the argument

Low-dimensional ODEs for the student-teacher and student-student overlap matrices that close in the high-dimensional limit and govern the evolution of the loss and generalization error.
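As a concrete illustration of why the overlaps can close, here is a minimal derivation sketch for the population loss, assuming standard Gaussian inputs and the conventions ŷ = xᵀWᵀWx/N for the student and y = xᵀW*ᵀW*x/N for the teacher (the paper notes that quadratic-activation outputs are determined by G_W = WᵀW; the normalizations here are our assumption, not the paper's).

```latex
% Sketch: the population loss closes in the overlaps for quadratic activations.
% Assumed conventions (not the paper's): x ~ N(0, I_N), student output
% \hat{y} = x^T W^T W x / N, teacher output y = x^T W_*^T W_* x / N, and
\[
  Q = \frac{WW^{\top}}{N} \in \mathbb{R}^{p \times p}, \qquad
  M = \frac{WW_*^{\top}}{N} \in \mathbb{R}^{p \times p_*}, \qquad
  P = \frac{W_*W_*^{\top}}{N} \in \mathbb{R}^{p_* \times p_*}.
\]
% For symmetric A, B and x ~ N(0, I_N), Isserlis' theorem gives
% E[x^T A x] = Tr A and E[(x^T A x)(x^T B x)] = Tr A Tr B + 2 Tr(AB).
% With \Delta = W^T W - W_*^T W_*, the population loss therefore reads
\[
  \mathcal{L}
  = \tfrac{1}{2}\,\mathbb{E}\!\left[(\hat{y}-y)^{2}\right]
  = \tfrac{1}{2}\left(\operatorname{Tr}Q-\operatorname{Tr}P\right)^{2}
    + \operatorname{Tr}(Q^{2}) - 2\operatorname{Tr}(MM^{\top})
    + \operatorname{Tr}(P^{2}),
\]
% which depends on W only through Q and M: the overlaps are a sufficient
% statistic for the population loss, the first step toward closed ODEs.
```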

If this is right

  • Overparameterization (p > p*) changes only the prefactor of the exponential escape from the plateau, not the exponential rate itself (a finite-N simulation sketch follows this list).
  • A conserved quantity in the overlap ODEs forces selection of the zero-loss solution closest to initialization on the manifold.
  • The plateau corresponds to saddles with at least one negative eigenvalue of the population-loss Hessian.
  • The manifold of zero-loss solutions corresponds to marginal minima of the population loss.
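To make the plateau-and-escape picture concrete, here is a minimal finite-N simulation sketch of one-pass SGD on a quadratic teacher-student network, tracking the population loss through the overlap expression sketched above. All hyperparameters (N, widths, learning rate, initialization scale) are illustrative assumptions, not the paper's settings.

```python
# Minimal finite-N sketch of one-pass SGD in a quadratic teacher-student
# network. Conventions and hyperparameters are illustrative assumptions,
# not the paper's: y_hat = ||W x||^2 / N, squared loss, fresh sample per step.
import numpy as np

rng = np.random.default_rng(0)
N, p_star, p = 200, 2, 4            # input dim, teacher width, student width
lr, steps = 0.02, 300_000

W_star = rng.standard_normal((p_star, N))   # fixed random teacher
W = 1e-2 * rng.standard_normal((p, N))      # small random student init

def pop_loss(W, W_star, N):
    # Population loss written in the overlaps (see the closure sketch above).
    Q = W @ W.T / N
    M = W @ W_star.T / N
    P = W_star @ W_star.T / N
    return (0.5 * (np.trace(Q) - np.trace(P)) ** 2
            + np.trace(Q @ Q) - 2 * np.trace(M @ M.T) + np.trace(P @ P.T))

history = []
for t in range(steps):
    x = rng.standard_normal(N)               # fresh sample: one-pass SGD
    err = (np.sum((W @ x) ** 2) - np.sum((W_star @ x) ** 2)) / N
    W -= lr * err * 2.0 * np.outer(W @ x, x) / N   # gradient of 0.5 * err^2
    if t % 1000 == 0:
        history.append(pop_loss(W, W_star, N))
# Expected picture: a long plateau near the initial loss, then roughly
# exponential decay; rerunning with larger p should mainly shift the escape
# time (a prefactor effect), not change the decay rate in the exponent.
```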

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The conserved quantity supplies an explicit mechanism for the implicit bias toward solutions near initialization that is often observed in SGD.
  • The modest effect of extra parameters suggests that, for quadratic activations, increasing width beyond the teacher width yields diminishing returns for generalization speed.
  • The same overlap-closure technique may apply to other activations or architectures where similar low-dimensional conserved quantities could be identified.

Load-bearing premise

The overlap matrices close into a finite set of ODEs in the high-dimensional limit, at fixed α and finite p and p*, for quadratic activations in the teacher-student setup.
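A quick numerical sanity check of this premise (our sketch, same assumed conventions as above, not the paper's code): for any fixed weights, a Monte Carlo estimate of the population loss over fresh Gaussian inputs should match the closed-form overlap expression, confirming that the overlaps carry all the information the population dynamics needs.

```python
# Monte Carlo check (sketch) that the population loss depends on the weights
# only through the overlaps. Conventions assumed: y_hat = ||W x||^2 / N.
import numpy as np

rng = np.random.default_rng(1)
N, p, p_star, n_samples = 100, 3, 2, 50_000
W = rng.standard_normal((p, N))
W_star = rng.standard_normal((p_star, N))

def out(V, X):                      # batched quadratic-network output
    return np.sum((X @ V.T) ** 2, axis=1) / N

X = rng.standard_normal((n_samples, N))
mc = 0.5 * np.mean((out(W, X) - out(W_star, X)) ** 2)

Q = W @ W.T / N
M = W @ W_star.T / N
P = W_star @ W_star.T / N
closed = (0.5 * (np.trace(Q) - np.trace(P)) ** 2
          + np.trace(Q @ Q) - 2 * np.trace(M @ M.T) + np.trace(P @ P.T))
print(mc, closed)                   # should agree up to Monte Carlo error
```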

What would settle it

A direct numerical integration of the finite-N network showing that the escape time from the plateau scales exponentially, rather than linearly, with the overparameterization gap p − p* would falsify the modest-acceleration claim.
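A minimal sketch of that test, under the same assumed conventions as the simulations above: sweep the student width p at fixed p* and record when the population loss first falls well below its plateau value. The modest-acceleration claim predicts a mild, prefactor-level decrease of the escape time with the gap p − p*.

```python
# Sketch of the escape-time scaling test versus the overparameterization gap.
# All conventions and hyperparameters are assumptions, as in the sketch above.
import numpy as np

def escape_time(p, p_star=2, N=200, lr=0.02, steps=300_000, seed=0):
    rng = np.random.default_rng(seed)
    W_star = rng.standard_normal((p_star, N))
    W = 1e-2 * rng.standard_normal((p, N))
    P = W_star @ W_star.T / N

    def loss(W):
        Q = W @ W.T / N
        M = W @ W_star.T / N
        return (0.5 * (np.trace(Q) - np.trace(P)) ** 2
                + np.trace(Q @ Q) - 2 * np.trace(M @ M.T) + np.trace(P @ P.T))

    plateau = loss(W)
    for t in range(steps):
        x = rng.standard_normal(N)
        err = (np.sum((W @ x) ** 2) - np.sum((W_star @ x) ** 2)) / N
        W -= lr * err * 2.0 * np.outer(W @ x, x) / N
        if t % 200 == 0 and loss(W) < 0.1 * plateau:
            return t                # first time well below the plateau
    return steps                    # did not escape within the step budget

for p in range(2, 7):
    print(p, escape_time(p))        # modest decrease expected as p grows
```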

Figures

Figures reproduced from arXiv: 2604.03068 by Carlo Lucibello, Chiara Cammarota, Dario Bocchi, Luca Saglietti, Theotime Regimbeau.

Figure 1: Comparison between the numerical solution of the ODEs (solid lines) and the average over … [PITH_FULL_IMAGE:figures/full_fig_p005_1.png]

Figure 2: Numerical analysis of the optimal learning rate … [PITH_FULL_IMAGE:figures/full_fig_p007_2.png]

Figure 3: Analysis using the numerical solution of the ODEs for … [PITH_FULL_IMAGE:figures/full_fig_p009_3.png]

Figure 4: Evolution of the elements of the matrix S(t) over time for a random orthogonal initialization, illustrating their numerical conservation during the learning dynamics.

Figure 5: Degrees of freedom of the student network … [PITH_FULL_IMAGE:figures/full_fig_p029_5.png]

Figure 6: Dimension of the solution manifold as a function of the student network's number of hidden … [PITH_FULL_IMAGE:figures/full_fig_p030_6.png]
Original abstract

We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at fixed ratio $\alpha = M/N$, and for finite hidden widths $(p,p^*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p^*$) only modestly accelerates escape from a plateau of poor generalization by modifying the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold the dynamics consistently selects the closest solution to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau and the solution manifold correspond to saddles with at least one negative eigenvalue and to marginal minima in the population-loss geometry, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes one-pass SGD dynamics for two-layer networks with quadratic activations in the teacher-student framework. In the high-dimensional limit (N, M → ∞ at fixed α = M/N) with finite hidden widths p and p*, it derives a closed system of low-dimensional ODEs for the student-teacher and student-student overlap matrices. The central claims are that overparameterization (p > p*) modestly accelerates escape from a poor-generalization plateau solely by modifying the prefactor of the exponential loss decay, that a conserved quantity in the ODEs enforces selection of the closest zero-loss solution on a manifold induced by rotational symmetry of unconstrained weights, and that Hessian analysis of the population loss identifies the plateau as a saddle (negative eigenvalue) and the solution manifold as marginal minima.

Significance. If the ODE closure holds exactly, the work supplies a parameter-free analytical characterization of escape dynamics and implicit bias for this solvable model class. The exact reduction to overlaps (enabled by quadratic activations), the identification of a conserved quantity, and the resulting falsifiable predictions for prefactor scaling and solution selection constitute clear strengths. These results sharpen understanding of why overparameterization yields only modest dynamical benefits in the high-dimensional regime and provide a benchmark for more general activation functions.

major comments (2)
  1. [Section 3 (ODE derivation)] Derivation of the overlap ODEs (Section 3): The reduction to a closed finite-dimensional system for the overlaps is load-bearing for every subsequent claim (conserved quantity, prefactor modification, Hessian classification). The manuscript asserts exact closure in the N→∞ limit because quadratic activations are degree-2 polynomials, yet it must explicitly verify that the expectation of the SGD update on the population loss contains no residual higher-moment contributions that would prevent the overlaps from forming a sufficient statistic. Without this step-by-step reduction, the conserved quantity and the exponential-decay analysis rest on an unverified assumption. (The Gaussian moment identities such a verification would rest on are sketched after these comments.)
  2. [Hessian analysis section] Hessian analysis of the population-loss landscape: The classification of the plateau as a saddle with at least one negative eigenvalue and the solution manifold as marginal minima is used to interpret the dynamics. The eigenvalues should be expressed explicitly in terms of the overlap matrices at the fixed points; any dependence on the particular choice of initialization or on finite-N corrections would weaken the geometric interpretation.
minor comments (2)
  1. [Introduction / Setup] Notation: Define the overlap matrices (student-student Q and student-teacher M) and the scaling parameters α, p, p* at their first appearance with explicit dimensions.
  2. [Escape dynamics paragraph] The statement that overparameterization 'only modestly accelerates escape' should be accompanied by the explicit prefactor ratio derived from the ODEs rather than left as a qualitative claim.
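For reference, the closure verification requested in major comment 1 would rest on standard Gaussian (Isserlis/Wick) identities like the ones below; this is our sketch, not the paper's derivation. Because both network outputs are quadratic forms in x, every expectation in the SGD update reduces to traces of products of WᵀW and W*ᵀW*, i.e., to functions of the overlaps.

```latex
% Gaussian moment identities behind the closure argument (sketch).
% For x ~ N(0, I_N) and symmetric matrices A, B:
\[
  \mathbb{E}\!\left[x^{\top}Ax\right] = \operatorname{Tr}A, \qquad
  \mathbb{E}\!\left[(x^{\top}Ax)(x^{\top}Bx)\right]
  = \operatorname{Tr}A\,\operatorname{Tr}B + 2\operatorname{Tr}(AB),
\]
\[
  \mathbb{E}\!\left[(x^{\top}Ax)\,xx^{\top}\right]
  = (\operatorname{Tr}A)\,I_{N} + 2A.
\]
% The last identity is the one the expected SGD drift uses: with
% A = W^T W - W_*^T W_*, the update's expectation is a polynomial in
% W, W_*, and traces of their products, hence a function of the overlaps.
```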

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and positive assessment of our work's contributions to understanding SGD dynamics in overparameterized quadratic networks. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.

Point-by-point responses
  1. Referee: Derivation of the overlap ODEs (Section 3): The reduction to a closed finite-dimensional system for the overlaps is load-bearing for every subsequent claim (conserved quantity, prefactor modification, Hessian classification). The manuscript asserts exact closure in the N→∞ limit because quadratic activations are degree-2 polynomials, yet it must explicitly verify that the expectation of the SGD update on the population loss contains no residual higher-moment contributions that would prevent the overlaps from forming a sufficient statistic. Without this step-by-step reduction, the conserved quantity and the exponential-decay analysis rest on an unverified assumption.

    Authors: We thank the referee for highlighting the need for a more explicit verification of the ODE closure. In the revised version, we will provide a detailed, step-by-step derivation in Section 3 demonstrating that the quadratic activations ensure the SGD update's expectation depends solely on the overlap matrices, with higher-moment terms vanishing in the thermodynamic limit. This will rigorously establish the overlaps as a sufficient statistic, thereby supporting the conserved quantity and the exponential decay analysis. revision: yes

  2. Referee: Hessian analysis of the population-loss landscape: The classification of the plateau as a saddle with at least one negative eigenvalue and the solution manifold as marginal minima is used to interpret the dynamics. The eigenvalues should be expressed explicitly in terms of the overlap matrices at the fixed points; any dependence on the particular choice of initialization or on finite-N corrections would weaken the geometric interpretation.

    Authors: We agree that explicit expressions for the eigenvalues will enhance the clarity of the geometric interpretation. In the revision, we will derive and present the eigenvalues of the Hessian explicitly as functions of the overlap matrices at the relevant fixed points. The fixed points themselves are characterized in the overlap space independently of specific initializations, and our analysis is conducted in the N→∞ limit, so finite-N effects are not considered; we will add a clarifying remark on this point. revision: yes

Circularity Check

0 steps flagged

No significant circularity: the overlap ODEs and the conserved quantity are derived from the high-dimensional limit.

full rationale

The paper derives closed low-dimensional ODEs for student-teacher and student-student overlap matrices from the N→∞ limit at fixed α=M/N with quadratic activations (degree-2 polynomials ensure closure under expectations over inputs). The conserved quantity selecting the closest zero-loss solution is a direct algebraic consequence of these dynamical equations, not an input redefined as output. Overparameterization effects on escape prefactors and the Hessian classification of saddles versus marginal minima follow from the same system and independent landscape analysis. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the high-dimensional limit closing the overlap dynamics and on the existence of a rotational symmetry from unconstrained norms; no free parameters are fitted to data and no new entities are postulated.

axioms (2)
  • domain assumption In the high-dimensional limit with fixed α=M/N the evolution of student-teacher and student-student overlaps closes into a finite set of ODEs for quadratic activations.
    Invoked to reduce the high-dimensional SGD to low-dimensional dynamics.
  • domain assumption Unconstrained weight norms produce a continuous rotational symmetry yielding a manifold of zero-loss solutions when p > 1 (a short symmetry computation follows this list).
    Used to explain the existence of the solution manifold and the conserved quantity.
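The symmetry itself is quick to exhibit; the following is our computation, built on the paper's observation that quadratic-activation outputs depend on W only through G_W = WᵀW (the dimension count is standard and carries the usual stabilizer caveat).

```latex
% Rotational symmetry of the student (sketch). For any O in O(p):
\[
  W \mapsto OW
  \quad\Longrightarrow\quad
  G_{W} = W^{\top}O^{\top}OW = W^{\top}W,
\]
% so the network output, which depends on W only through G_W, is unchanged.
% Acting with O(p) on a zero-loss W sweeps out a manifold of zero-loss
% solutions of dimension up to dim O(p) = p(p-1)/2 (reduced by the
% stabilizer of W), which is nontrivial precisely when p > 1.
```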

pith-pipeline@v0.9.0 · 5530 in / 1462 out tokens · 24395 ms · 2026-05-13T18:20:13.990119+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages
