Recognition: 2 Lean theorem links
Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks
Pith reviewed 2026-05-13 18:20 UTC · model grok-4.3
The pith
In quadratic teacher-student networks, overparameterization modestly accelerates escape from poor generalization plateaus via prefactor changes in loss decay, while conserved overlap quantities select the zero-loss solution closest to random initialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the high-dimensional regime with fixed α = M/N and finite hidden widths p and p*, the SGD dynamics governed by the closed ODEs for the student-teacher and student-student overlaps escapes the poor-generalization plateau at an exponential rate whose prefactor is modestly improved by overparameterization (p > p*). From the manifold of zero-loss solutions created by the continuous rotational symmetry, the dynamics selects the solution closest to random initialization, as enforced by a conserved quantity in the overlap evolution equations.
What carries the argument
Low-dimensional ODEs for the student-teacher and student-student overlap matrices that close in the high-dimensional limit and govern the evolution of the loss and generalization error.
If this is right
- Overparameterization (p > p*) changes only the prefactor of the exponential escape from the plateau, not the exponential rate itself.
- A conserved quantity in the overlap ODEs forces selection of the zero-loss solution closest to initialization on the manifold.
- The plateau corresponds to saddles with at least one negative eigenvalue of the population-loss Hessian.
- The manifold of zero-loss solutions corresponds to marginal minima of the population loss.
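The last two bullets can be probed numerically in small dimensions. For Gaussian inputs and quadratic activations, the population loss has a closed form via the identity E[(xᵀDx)²] = (tr D)² + 2‖D‖²_F for symmetric D; at a rotated copy of the teacher, a finite-difference Hessian should then show no negative eigenvalues and exactly p(p−1)/2 flat directions along the rotational orbit, i.e. a marginal minimum. The normalization and sizes below are illustrative assumptions, not the paper's stated conventions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, N = 2, 6
W_t = rng.standard_normal((p, N))        # teacher weights (illustrative scale)

def pop_loss(w_flat):
    # Closed-form population loss for Gaussian inputs (overall normalization
    # assumed): with D = Ws^T Ws - Wt^T Wt,
    # E[(x^T D x)^2] = (tr D)^2 + 2 ||D||_F^2.
    W = w_flat.reshape(p, N)
    D = W.T @ W - W_t.T @ W_t
    return 0.5 * (np.trace(D) ** 2 + 2 * np.sum(D * D))

# A point on the zero-loss manifold: a rotated copy of the teacher
O, _ = np.linalg.qr(rng.standard_normal((p, p)))
w0 = (O @ W_t).ravel()

# Finite-difference Hessian of the population loss at w0
n, h = w0.size, 1e-4
H = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        e_i, e_j = np.eye(n)[i], np.eye(n)[j]
        H[i, j] = (pop_loss(w0 + h*e_i + h*e_j) - pop_loss(w0 + h*e_i - h*e_j)
                   - pop_loss(w0 - h*e_i + h*e_j) + pop_loss(w0 - h*e_i - h*e_j)) / (4*h*h)

eigs = np.sort(np.linalg.eigvalsh(H))
# Marginal minimum: no negative directions, and p(p-1)/2 = 1 flat direction
# along the rotational orbit of O(p).
assert eigs[0] > -1e-4 and abs(eigs[0]) < 1e-4 and eigs[1] > 1e-2
```

The single near-zero eigenvalue is the tangent direction of the O(p) orbit; all remaining eigenvalues are strictly positive, matching the "marginal minima" classification.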
Where Pith is reading between the lines
- The conserved quantity supplies an explicit mechanism for the implicit bias toward solutions near initialization that is often observed in SGD.
- The modest effect of extra parameters suggests that, for quadratic activations, increasing width beyond the teacher width yields diminishing returns for generalization speed.
- The same overlap-closure technique may apply to other activations or architectures where similar low-dimensional conserved quantities could be identified.
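The conserved quantity quoted later on this page, S(t) = ρ(t)[ρ(t)ᵀρ(t)]^{-1/2}, has a natural reading (our gloss, assuming ρ has full column rank, not the paper's wording): it is the orthogonal factor of the polar decomposition of the student-teacher overlap,

```latex
\rho(t) = S(t)\,P(t), \qquad
S(t) = \rho(t)\left[\rho(t)^{\top}\rho(t)\right]^{-1/2}, \qquad
P(t) = \left[\rho(t)^{\top}\rho(t)\right]^{1/2}.
```

With S conserved, only the positive-semidefinite "radial" factor P evolves; the angular coordinate on the rotational orbit stays frozen at its initial value, which is exactly the mechanism by which the dynamics lands on the zero-loss solution closest to initialization.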
Load-bearing premise
The overlap matrices close into a finite set of ODEs in the high-dimensional limit with fixed alpha and finite p and p* for quadratic activations under the teacher-student setup.
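For Gaussian inputs this premise can be made concrete without the full derivation. With quadratic activations the population loss admits a closed form via E[(xᵀDx)²] = (tr D)² + 2‖D‖²_F for symmetric D, and that form can be rewritten purely in terms of the overlap matrices Q = W_sW_sᵀ, R = W_sW_tᵀ, T = W_tW_tᵀ. The sketch below checks the overlap-only expression against a Monte Carlo estimate from the raw weights; the normalizations are ours, chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, p_star = 20, 3, 2
W_s = rng.standard_normal((p, N)) / np.sqrt(N)
W_t = rng.standard_normal((p_star, N)) / np.sqrt(N)

# Overlap matrices: the claimed sufficient statistics
Q, R, T = W_s @ W_s.T, W_s @ W_t.T, W_t @ W_t.T

# Population loss written *only* in terms of Q, R, T:
# with D = Ws^T Ws - Wt^T Wt,  E[(x^T D x)^2] = (tr D)^2 + 2 ||D||_F^2,
# where tr D = tr Q - tr T and ||D||_F^2 = tr(QQ) - 2 ||R||_F^2 + tr(TT).
trD = np.trace(Q) - np.trace(T)
frob2 = np.trace(Q @ Q) - 2 * np.sum(R * R) + np.trace(T @ T)
loss_overlap = 0.5 * (trD ** 2 + 2 * frob2)

# Monte Carlo estimate of the same population loss from the raw weights
X = rng.standard_normal((200_000, N))
f_s = np.sum((X @ W_s.T) ** 2, axis=1)   # x^T Ws^T Ws x, per sample
f_t = np.sum((X @ W_t.T) ** 2, axis=1)
loss_mc = 0.5 * np.mean((f_s - f_t) ** 2)

assert np.isclose(loss_overlap, loss_mc, rtol=0.1)
```

This shows the loss (and hence its population gradient) is a function of the overlaps alone at any fixed weights; the paper's additional step is that the overlaps' own SGD evolution closes in the N → ∞ limit.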
What would settle it
A direct numerical integration of the finite-N network showing that escape time from the plateau scales exponentially rather than linearly with the overparameterization gap p − p* would falsify the modest-acceleration claim.
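A finite-N experiment of this kind can be sketched directly. The scalings below (1/√N weight normalization, 1/N learning rate, mean-over-hidden-units readout, small random init) are assumptions for illustration, not the paper's stated conventions; the sketch only demonstrates one-pass SGD escaping the initial plateau, and measuring escape times across p − p* would follow the same pattern.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, p_star = 100, 3, 2              # input dim, student / teacher widths (illustrative)
lr, steps = 0.5 / N, 50_000           # one fresh Gaussian sample per step (one-pass SGD)

W_t = rng.standard_normal((p_star, N)) / np.sqrt(N)    # fixed teacher
W_s = 1e-2 * rng.standard_normal((p, N)) / np.sqrt(N)  # small random init

def out(W, x):
    # two-layer net with quadratic activations; second layer fixed to 1/width
    return np.mean((W @ x) ** 2)

X_test = rng.standard_normal((2000, N))                # held-out set, population-loss proxy

def pop_loss(W):
    return 0.5 * np.mean([(out(W, x) - out(W_t, x)) ** 2 for x in X_test])

loss0 = pop_loss(W_s)
for _ in range(steps):
    x = rng.standard_normal(N)
    err = out(W_s, x) - out(W_t, x)
    W_s -= lr * err * (2.0 / p) * np.outer(W_s @ x, x)  # gradient of 0.5 * err^2

assert pop_loss(W_s) < loss0                            # escaped the initial plateau
```

Logging `pop_loss` every few hundred steps exposes the plateau-then-exponential-decay shape; repeating over p = p*, p* + 1, … gives the escape-time comparison the falsification test calls for.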
Figures
Original abstract
We analyze the one-pass stochastic gradient descent dynamics of a two-layer neural network with quadratic activations in a teacher--student framework. In the high-dimensional regime, where the input dimension $N$ and the number of samples $M$ diverge at fixed ratio $\alpha = M/N$, and for finite hidden widths $(p,p^*)$ of the student and teacher, respectively, we study the low-dimensional ordinary differential equations that govern the evolution of the student--teacher and student--student overlap matrices. We show that overparameterization ($p>p^*$) only modestly accelerates escape from a plateau of poor generalization by modifying the prefactor of the exponential decay of the loss. We then examine how unconstrained weight norms introduce a continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions for $p>1$. From this manifold the dynamics consistently selects the closest solution to the random initialization, as enforced by a conserved quantity in the ODEs governing the evolution of the overlaps. Finally, a Hessian analysis of the population-loss landscape confirms that the plateau and the solution manifold correspond to saddles with at least one negative eigenvalue and to marginal minima in the population-loss geometry, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper analyzes one-pass SGD dynamics for two-layer networks with quadratic activations in the teacher-student framework. In the high-dimensional limit (N, M → ∞ at fixed α = M/N) with finite hidden widths p and p*, it derives a closed system of low-dimensional ODEs for the student-teacher and student-student overlap matrices. The central claims are that overparameterization (p > p*) modestly accelerates escape from a poor-generalization plateau solely by modifying the prefactor of the exponential loss decay, that a conserved quantity in the ODEs enforces selection of the closest zero-loss solution on a manifold induced by rotational symmetry of unconstrained weights, and that Hessian analysis of the population loss identifies the plateau as a saddle (negative eigenvalue) and the solution manifold as marginal minima.
Significance. If the ODE closure holds exactly, the work supplies a parameter-free analytical characterization of escape dynamics and implicit bias for this solvable model class. The exact reduction to overlaps (enabled by quadratic activations), the identification of a conserved quantity, and the resulting falsifiable predictions for prefactor scaling and solution selection constitute clear strengths. These results sharpen understanding of why overparameterization yields only modest dynamical benefits in the high-dimensional regime and provide a benchmark for more general activation functions.
major comments (2)
- [Section 3 (ODE derivation)] Derivation of the overlap ODEs (Section 3): The reduction to a closed finite-dimensional system for the overlaps is load-bearing for every subsequent claim (conserved quantity, prefactor modification, Hessian classification). The manuscript asserts exact closure in the N→∞ limit because quadratic activations are degree-2 polynomials, yet it must explicitly verify that the expectation of the SGD update on the population loss contains no residual higher-moment contributions that would prevent the overlaps from forming a sufficient statistic. Without this step-by-step reduction, the conserved quantity and the exponential-decay analysis rest on an unverified assumption.
- [Hessian analysis section] Hessian analysis of the population-loss landscape: The classification of the plateau as a saddle with at least one negative eigenvalue and the solution manifold as marginal minima is used to interpret the dynamics. The eigenvalues should be expressed explicitly in terms of the overlap matrices at the fixed points; any dependence on the particular choice of initialization or on finite-N corrections would weaken the geometric interpretation.
minor comments (2)
- [Introduction / Setup] Notation: Define the overlap matrices (student-student Q and student-teacher M) and the scaling parameters α, p, p* at their first appearance with explicit dimensions.
- [Escape dynamics paragraph] The statement that overparameterization 'only modestly accelerates escape' should be accompanied by the explicit prefactor ratio derived from the ODEs rather than left as a qualitative claim.
Simulated Author's Rebuttal
We are grateful to the referee for their thorough review and positive assessment of our work's contributions to understanding SGD dynamics in overparameterized quadratic networks. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation.
Point-by-point responses
-
Referee: Derivation of the overlap ODEs (Section 3): The reduction to a closed finite-dimensional system for the overlaps is load-bearing for every subsequent claim (conserved quantity, prefactor modification, Hessian classification). The manuscript asserts exact closure in the N→∞ limit because quadratic activations are degree-2 polynomials, yet it must explicitly verify that the expectation of the SGD update on the population loss contains no residual higher-moment contributions that would prevent the overlaps from forming a sufficient statistic. Without this step-by-step reduction, the conserved quantity and the exponential-decay analysis rest on an unverified assumption.
Authors: We thank the referee for highlighting the need for a more explicit verification of the ODE closure. In the revised version, we will provide a detailed, step-by-step derivation in Section 3 demonstrating that the quadratic activations ensure the SGD update's expectation depends solely on the overlap matrices, with higher-moment terms vanishing in the thermodynamic limit. This will rigorously establish the overlaps as a sufficient statistic, thereby supporting the conserved quantity and the exponential decay analysis. revision: yes
-
Referee: Hessian analysis of the population-loss landscape: The classification of the plateau as a saddle with at least one negative eigenvalue and the solution manifold as marginal minima is used to interpret the dynamics. The eigenvalues should be expressed explicitly in terms of the overlap matrices at the fixed points; any dependence on the particular choice of initialization or on finite-N corrections would weaken the geometric interpretation.
Authors: We agree that explicit expressions for the eigenvalues will enhance the clarity of the geometric interpretation. In the revision, we will derive and present the eigenvalues of the Hessian explicitly as functions of the overlap matrices at the relevant fixed points. The fixed points themselves are characterized in the overlap space independently of specific initializations, and our analysis is conducted in the N→∞ limit, so finite-N effects are not considered; we will add a clarifying remark on this point. revision: yes
Circularity Check
No significant circularity; overlap ODEs and conserved quantity derived from high-dimensional limit
full rationale
The paper derives closed low-dimensional ODEs for student-teacher and student-student overlap matrices from the N→∞ limit at fixed α=M/N with quadratic activations (degree-2 polynomials ensure closure under expectations over inputs). The conserved quantity selecting the closest zero-loss solution is a direct algebraic consequence of these dynamical equations, not an input redefined as output. Overparameterization effects on escape prefactors and the Hessian classification of saddles versus marginal minima follow from the same system and independent landscape analysis. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption In the high-dimensional limit with fixed α=M/N the evolution of student-teacher and student-student overlaps closes into a finite set of ODEs for quadratic activations.
- domain assumption Unconstrained weight norms produce a continuous rotational symmetry yielding a manifold of zero-loss solutions when p>1.
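The second axiom can be checked in a few lines: with quadratic activations the network output depends on the weights only through WᵀW, so any orthogonal rotation acting on the hidden index leaves the output, and hence the loss, unchanged. Names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 50, 4
W = rng.standard_normal((p, N))           # student weights, p > 1
x = rng.standard_normal(N)

# random orthogonal matrix acting on the hidden (row) index
O, _ = np.linalg.qr(rng.standard_normal((p, p)))

f  = np.sum((W @ x) ** 2)                 # quadratic output: x^T W^T W x
fO = np.sum((O @ W @ x) ** 2)             # rotated student, identical output
assert np.isclose(f, fO)
```

This is why, for p > 1, every zero-loss solution comes with a whole O(p) orbit of equivalent solutions, i.e. a nontrivial manifold.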
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness, Aczél classification) · washburn_uniqueness_aczel · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
we study the low-dimensional ordinary differential equations that govern the evolution of the student–teacher and student–student overlap matrices... conserved quantity in the ODEs governing the evolution of the overlaps
-
IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
continuous rotational symmetry that results in a nontrivial manifold of zero-loss solutions... S(t) = ρ(t)[ρ(t)ᵀρ(t)]^{-1/2} remains constant
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] A. Engel, Statistical mechanics of learning. Cambridge University Press, 2001.
- [2] J. Dong, L. Valzania, A. Maillard, T.-a. Pham, S. Gigan, and M. Unser, "Phase retrieval: From computational imaging to machine learning: A tutorial," IEEE Signal Processing Magazine, vol. 40, no. 1, pp. 45–57, 2023.
- [3] E. J. Candès, X. Li, and M. Soltanolkotabi, "Phase retrieval via Wirtinger flow: Theory and algorithms," IEEE Transactions on Information Theory, vol. 61, no. 4, pp. 1985–2007, 2015.
- [4] P. Netrapalli, P. Jain, and S. Sanghavi, "Phase retrieval using alternating minimization," Advances in Neural Information Processing Systems, vol. 26, 2013.
- [5] I. Waldspurger, A. d'Aspremont, and S. Mallat, "Phase recovery, MaxCut and complex semidefinite programming," Mathematical Programming, vol. 149, pp. 47–81, 2015.
- [6] Y. Chen and E. J. Candès, "Solving random quadratic systems of equations is nearly as easy as solving linear systems," Communications on Pure and Applied Mathematics, vol. 70, no. 5, pp. 822–883, 2017.
- [7] H. Zhang, Y. Liang, and Y. Chi, "A nonconvex approach for phase retrieval: Reshaped Wirtinger flow and incremental algorithms," Journal of Machine Learning Research, vol. 18, no. 141, pp. 1–35, 2017.
- [8] G. Wang, G. B. Giannakis, and J. Chen, "Solving large-scale systems of random quadratic equations via stochastic truncated amplitude flow," in 2017 25th European Signal Processing Conference (EUSIPCO), pp. 1420–1424, IEEE, 2017.
- [9] G. Wang, G. Giannakis, Y. Saad, and J. Chen, "Solving most systems of random quadratic equations," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [10] C. Zhang, M. Wang, Q. Chen, D. Wang, and S. Wei, "Two-step phase retrieval algorithm using single-intensity measurement," International Journal of Optics, vol. 2018, no. 1, p. 8643819, 2018.
- [11] J. Barbier, F. Krzakala, N. Macris, L. Miolane, and L. Zdeborová, "Optimal errors and phase transitions in high-dimensional generalized linear models," Proceedings of the National Academy of Sciences, vol. 116, no. 12, pp. 5451–5460, 2019.
- [12] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
- [13] R. Novak, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein, "Sensitivity and generalization in neural networks: an empirical study," arXiv preprint arXiv:1802.08760, 2018.
- [14] M. Belkin, D. Hsu, S. Ma, and S. Mandal, "Reconciling modern machine-learning practice and the classical bias–variance trade-off," Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15849–15854, 2019.
- [15] J. B. Simon, D. Karkada, N. Ghosh, and M. Belkin, "More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory," arXiv preprint arXiv:2311.14646, 2023.
- [16] S. Sarao Mannelli, E. Vanden-Eijnden, and L. Zdeborová, "Optimization and generalization of shallow neural networks with quadratic activation functions," Advances in Neural Information Processing Systems, vol. 33, pp. 13445–13455, 2020.
- [17] L. Arnaboldi, F. Krzakala, B. Loureiro, and L. Stephan, "Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD," arXiv preprint arXiv:2305.18502, 2023.
- [18] Y. S. Tan and R. Vershynin, "Online stochastic gradient descent with arbitrary initialization solves non-smooth, non-convex phase retrieval," Journal of Machine Learning Research, vol. 24, no. 58, pp. 1–47, 2023.
- [19] G. B. Arous, R. Gheissari, and A. Jagannath, "Online stochastic gradient descent on non-convex losses from high-dimensional inference," Journal of Machine Learning Research, vol. 22, no. 106, pp. 1–51, 2021.
- [20] A. Bietti, J. Bruna, C. Sanford, and M. J. Song, "Learning single-index models with shallow neural networks," Advances in Neural Information Processing Systems, vol. 35, pp. 9768–9783, 2022.
- [21] A. Damian, J. Lee, and M. Soltanolkotabi, "Neural networks can learn representations with gradient descent," in Conference on Learning Theory, pp. 5413–5452, PMLR, 2022.
- [22] R. Berthier, A. Montanari, and K. Zhou, "Learning time-scales in two-layers neural networks," Foundations of Computational Mathematics, pp. 1–84, 2024.
- [23] A. Damian, L. Pillaud-Vivien, J. D. Lee, and J. Bruna, "Computational-statistical gaps in Gaussian single-index models," arXiv preprint arXiv:2403.05529, 2024.
- [24] A. Montanari and P. Urbani, "Dynamical decoupling of generalization and overfitting in large two-layer networks," arXiv preprint arXiv:2502.21269, 2025.
- [25] L. Chizat and F. Bach, "On the global convergence of gradient descent for over-parameterized models using optimal transport," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [26] G. M. Rotskoff and E. Vanden-Eijnden, "Neural networks as interacting particle systems: Asymptotic convexity of the loss landscape and universal scaling of the approximation error," stat, vol. 1050, p. 22, 2018.
- [27] S. Mei, A. Montanari, and P.-M. Nguyen, "A mean field view of the landscape of two-layer neural networks," Proceedings of the National Academy of Sciences, vol. 115, no. 33, pp. E7665–E7671, 2018.
- [28] S. Mei, T. Misiakiewicz, and A. Montanari, "Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit," in Conference on Learning Theory, pp. 2388–2464, PMLR, 2019.
- [29] J. Sirignano and K. Spiliopoulos, "Mean field analysis of neural networks: A central limit theorem," Stochastic Processes and their Applications, vol. 130, no. 3, pp. 1820–1852, 2020.
- [30] B. Aubin, A. Maillard, F. Krzakala, N. Macris, L. Zdeborová, et al., "The committee machine: Computational to statistical gaps in learning a two-layers neural network," Advances in Neural Information Processing Systems, vol. 31, 2018.
- [31] H. Cui, F. Krzakala, and L. Zdeborová, "Bayes-optimal learning of deep random networks of extensive-width," in International Conference on Machine Learning, pp. 6468–6521, PMLR, 2023.
- [32] A. Maillard, E. Troiani, S. Martin, F. Krzakala, and L. Zdeborová, "Bayes-optimal learning of an extensive-width neural network from quadratically many samples," arXiv preprint arXiv:2408.03733, 2024.
- [33] E. Collins-Woodfin, C. Paquette, E. Paquette, and I. Seroussi, "Hitting the high-dimensional notes: An ODE for SGD learning dynamics on GLMs and multi-index models," Information and Inference: A Journal of the IMA, vol. 13, no. 4, p. iaae028, 2024.
- [34] F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht, "Essentially no barriers in neural network energy landscape," in International Conference on Machine Learning, pp. 1309–1318, PMLR, 2018.
- [35] M. Baity-Jesi, L. Sagun, M. Geiger, S. Spigler, G. B. Arous, C. Cammarota, Y. LeCun, M. Wyart, and G. Biroli, "Comparing dynamics: Deep neural networks versus glassy systems," in International Conference on Machine Learning, pp. 314–323, PMLR, 2018.
- [36] C. Baldassi, F. Pittorino, and R. Zecchina, "Shaping the learning landscape in neural networks around wide flat minima," Proceedings of the National Academy of Sciences, vol. 117, no. 1, pp. 161–170, 2020.
- [37] S. Gunasekar, B. E. Woodworth, S. Bhojanapalli, B. Neyshabur, and N. Srebro, "Implicit regularization in matrix factorization," Advances in Neural Information Processing Systems, vol. 30, 2017.
- [38] S. Gunasekar, J. Lee, D. Soudry, and N. Srebro, "Characterizing implicit bias in terms of optimization geometry," in International Conference on Machine Learning, pp. 1832–1841, PMLR, 2018.
- [39] D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, "The implicit bias of gradient descent on separable data," Journal of Machine Learning Research, vol. 19, no. 70, pp. 1–57, 2018.
- [40] L. Chizat and F. Bach, "Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss," in Conference on Learning Theory, pp. 1305–1338, PMLR, 2020.
- [41] D. Saad and S. A. Solla, "On-line learning in soft committee machines," Physical Review E, vol. 52, no. 4, p. 4225, 1995.
- [42] S. Goldt, M. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová, "Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup," Advances in Neural Information Processing Systems, vol. 32, 2019.
- [43] R. Veiga, L. Stephan, B. Loureiro, F. Krzakala, and L. Zdeborová, "Phase diagram of stochastic gradient descent in high-dimensional two-layer neural networks," Advances in Neural Information Processing Systems, vol. 35, pp. 23244–23255, 2022.
- [44] B. Zhao, I. Ganev, R. Walters, R. Yu, and N. Dehmamy, "Symmetries, flat minima, and the conserved quantities of gradient flow," arXiv preprint arXiv:2210.17216, 2022.
- [45] B. Zhao, R. Walters, and R. Yu, "Symmetry in neural network parameter spaces," arXiv preprint arXiv:2506.13018, 2025.
- [46] Y. Cooper, "Global minima of overparameterized neural networks," SIAM Journal on Mathematics of Data Science, vol. 3, no. 2, pp. 676–691, 2021.
- [47] B. Simsek, F. Ged, A. Jacot, F. Spadaro, C. Hongler, W. Gerstner, and J. Brea, "Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances," in International Conference on Machine Learning, pp. 9722–9732, PMLR, 2021.
- [48] S. Hochreiter and J. Schmidhuber, "Flat minima," Neural Computation, vol. 9, no. 1, pp. 1–42, 1997.
- [49] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou, "Empirical analysis of the Hessian of over-parametrized neural networks," arXiv preprint arXiv:1706.04454, 2017.
- [50] H. Tanaka and D. Kunin, "Noether's learning dynamics: Role of symmetry breaking in neural networks," Advances in Neural Information Processing Systems, vol. 34, pp. 25646–25660, 2021.
- [51] S. d'Ascoli, M. Refinetti, G. Biroli, and F. Krzakala, "Double trouble in double descent: Bias and variance(s) in the lazy regime," in International Conference on Machine Learning, pp. 2280–2290, PMLR, 2020.
- [52] M. Geiger, A. Jacot, S. Spigler, F. Gabriel, L. Sagun, S. d'Ascoli, G. Biroli, C. Hongler, and M. Wyart, "Scaling description of generalization with number of parameters in deep learning," Journal of Statistical Mechanics: Theory and Experiment, vol. 2020, no. 2, p. 023401, 2020.
- [53] S. Goldt, M. S. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová, "Generalisation dynamics of online learning in over-parameterised neural networks," arXiv preprint arXiv:1901.09085, 2019.
- [54] Y. Tian, "Student specialization in deep rectified networks with finite width and input dimension," in International Conference on Machine Learning, pp. 9470–9480, PMLR, 2020.
discussion (0)