A Theory of Saddle Escape in Deep Nonlinear Networks
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-11 02:19 UTC · model grok-4.3
The pith
An exact identity on Frobenius-norm imbalances in layer weights reduces deep nonlinear training to a scalar ODE whose escape time scales as ε^{-(r-2)}, where r is the number of bottleneck layers.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law τ★ = Θ(ε^{-(r-2)}) governed by the number r of layers at the bottleneck scale rather than the total depth L. We find that this same r-2 exponent is recovered under He-normal initialization with r bottleneck layers rescaled by ε, where the symmetry manifold is preserved by the flow.
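A quick way to see where the exponent comes from: if the reduced dynamics near the saddle take the balanced power-law form da/dt = a^{r−1} with a(0) = ε, the time to reach a = O(1) is (ε^{−(r−2)} − 1)/(r − 2) for r > 2. The snippet below is a minimal numerical sketch of that toy scalar flow, not the paper's code; the specific form a^{r−1} is an assumption motivated by deep-linear balance, and only the exponent is being checked.

```python
# Minimal numerical sketch (not the paper's code): assume the reduced scalar ODE
# near the saddle is  da/dt = a**(r-1)  with  a(0) = eps, and measure the time
# for a to reach O(1). The fitted log-log slope should be close to -(r-2).
import numpy as np

def escape_time(r, eps, a_exit=1.0, rel_step=0.01):
    """Forward-Euler integration of da/dt = a**(r-1), stepping so that a grows
    by ~1% per step (a fixed dt would need ~eps**-(r-2) steps near the saddle)."""
    a, t = eps, 0.0
    while a < a_exit:
        dt = rel_step * a / a ** (r - 1)
        a += dt * a ** (r - 1)
        t += dt
    return t

for r in (3, 4, 5):
    eps_grid = np.array([0.05, 0.02, 0.01])
    taus = np.array([escape_time(r, e) for e in eps_grid])
    slope = np.polyfit(np.log(eps_grid), np.log(taus), 1)[0]
    print(f"r={r}: fitted exponent {slope:.2f} (predicted {-(r - 2)})")
```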
What carries the argument
The exact identity on Frobenius-norm imbalances of layer weights, which combines with an approximate balance law on the permutation-symmetric submanifold to collapse the matrix flow onto a scalar ODE.
If this is right
- Escape time depends only on the critical depth r at the bottleneck scale, not on total network depth L.
- Activation functions are partitioned into four universality classes according to the sign and magnitude of a constant appearing in the exact norm-imbalance identity.
- The same r-2 scaling law holds when He-normal initialization is applied to the r bottleneck layers after rescaling by ε.
- The permutation-symmetric submanifold is invariant under the flow for the rescaled He-normal case even though it is not attracting.
Where Pith is reading between the lines
- Increasing total depth beyond the bottleneck count should not lengthen the initial plateau provided the narrow layers remain at scale ε.
- The four universality classes imply that only activations with a particular leading Taylor coefficient near zero will exhibit the slowest escape for a given r (see the small expansion sketch after this list).
- Testing the identity directly on networks with non-permutation-symmetric initial conditions would reveal how far the approximate balance law can be stretched.
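A small, hedged illustration of that reading: the quoted Theorem-1 material (see the Lean-theorem links below) defines φ_σ(z) := zσ′(z) − σ(z), and the leading Taylor term of φ_σ near z = 0 is presumably the quantity that separates the classes. The sympy sketch below only tabulates that leading term for a few common activations; assigning them to the paper's four classes is not something this sketch can decide.

```python
# Tabulate the leading small-z behaviour of phi_sigma(z) := z*sigma'(z) - sigma(z)
# for a few activations. Which universality class each belongs to is the paper's
# call; this only computes the quantity the classification reportedly keys on.
import sympy as sp

z = sp.symbols('z')
activations = {
    "tanh":     sp.tanh(z),
    "sigmoid":  1 / (1 + sp.exp(-z)),
    "softplus": sp.log(1 + sp.exp(z)),
    "cubic":    z + z**3,
}

for name, sigma in activations.items():
    phi = sp.simplify(z * sp.diff(sigma, z) - sigma)
    for k in range(0, 8):                    # lowest nonvanishing Taylor order at 0
        c = sp.limit(phi / z**k, z, 0)
        if c != 0:
            print(f"{name:9s}: phi_sigma(z) ~ {c * z**k} + higher order near z = 0")
            break
```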
Load-bearing premise
The reduction to a scalar ODE relies on an approximate balance law on the permutation-symmetric submanifold whose accuracy and range of validity are not derived from first principles.
What would settle it
A simulation in which the measured escape time fails to scale as Θ(ε^{-(r-2)}) when the number of bottleneck layers r is varied while total depth and initialization scale are held fixed.
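A rough sketch of such a falsification experiment, assuming the setup described in the abstract (He-normal initialization with r bottleneck layers rescaled by ε, total depth held fixed). Everything below, the widths, learning rate, tanh teacher task, and the 10%-drop criterion for leaving the plateau, is an illustrative choice rather than the paper's protocol, and whether a clean plateau appears at all depends on those choices.

```python
# Illustrative falsification sketch (PyTorch), not the paper's experimental code:
# train depth-L tanh networks whose first r layers are He-normal weights rescaled
# by eps, and record a crude escape time from the initial loss plateau.
import torch

torch.manual_seed(0)

def make_net(L=6, width=32, r=3, eps=0.05):
    """Depth-L tanh MLP; the first r ("bottleneck") layers are rescaled by eps."""
    layers = []
    for i in range(L):
        lin = torch.nn.Linear(width, width, bias=False)
        torch.nn.init.kaiming_normal_(lin.weight, nonlinearity='relu')  # He-normal
        if i < r:
            with torch.no_grad():
                lin.weight.mul_(eps)
        layers.append(lin)
    return layers

def escape_time(r, eps, width=32, lr=1e-2, max_steps=200_000):
    x = torch.randn(256, width)
    teacher = torch.randn(width, width) / width ** 0.5
    y = torch.tanh(x @ teacher)                    # fixed nonlinear teacher target
    layers = make_net(width=width, r=r, eps=eps)
    opt = torch.optim.SGD([lin.weight for lin in layers], lr=lr)
    loss0 = None
    for step in range(max_steps):
        h = x
        for lin in layers[:-1]:
            h = torch.tanh(lin(h))
        loss = ((layers[-1](h) - y) ** 2).mean()
        if loss0 is None:
            loss0 = loss.item()
        if loss.item() < 0.9 * loss0:              # left the initial plateau
            return step * lr                       # crude gradient-flow time
        opt.zero_grad()
        loss.backward()
        opt.step()
    return float('inf')

for eps in (0.10, 0.05):
    for r in (2, 3, 4):
        print(f"eps={eps:.2f}, r={r}: escape time ~ {escape_time(r, eps):.1f}")
```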
Original abstract
In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law $\tau_\star = \Theta(\varepsilon^{-(r-2)})$ governed by the number $r$ of layers at the bottleneck scale rather than the total depth $L$. We find that this same $r-2$ exponent is recovered under He-normal initialization with $r$ bottleneck layers rescaled by $\varepsilon$, where the symmetry manifold is preserved by the flow but not attracting. We find close agreement between our theory and numerical simulations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss, and uses it to classify activation functions into four universality classes. On the permutation-symmetric submanifold this identity is combined with an approximate balance law to reduce the full matrix gradient flow to a scalar ODE, yielding the critical-depth escape-time scaling law τ★ = Θ(ε^{-(r-2)}) governed by the number r of bottleneck layers rather than total depth L. The same scaling is recovered under He-normal initialization with rescaled bottlenecks, and close agreement with numerical simulations is reported.
Significance. If the approximate balance law can be justified with controlled errors, the work would constitute a significant contribution by supplying a general, parameter-free exact identity applicable to broad classes of networks and losses, together with a predictive scaling law that isolates the role of bottleneck depth in saddle escape. The exact identity itself is a robust technical result that could serve as a foundation for further analyses of nonlinear training dynamics.
major comments (2)
- [permutation-symmetric submanifold reduction] The reduction to the scalar ODE and the resulting escape-time law τ★ = Θ(ε^{-(r-2)}) (abstract and the section on the permutation-symmetric submanifold) rests on an approximate balance law whose validity is asserted but neither derived from first principles nor equipped with an explicit error bound that remains controlled as ε → 0 near the saddle. This approximation is load-bearing for the central scaling claim.
- [universality classes] The classification of activations into universality classes follows directly from the exact identity, yet the manuscript does not demonstrate that the approximate balance law remains uniformly valid across all four classes or quantify any class-dependent error that could affect the escape-time exponent.
minor comments (2)
- The abstract introduces the scaling law before defining the bottleneck scale r and the initialization parameter ε; a brief parenthetical clarification would improve readability.
- [numerical simulations] In the numerical validation, reporting fitted exponents with confidence intervals or the number of independent runs would make the agreement with the predicted r−2 scaling more quantitative (a sketch of such reporting follows).
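For concreteness, a sketch of the kind of reporting the comment asks for: fit the log-log slope of escape time against ε and bootstrap over independent runs for a confidence interval. The `taus` array below is synthetic stand-in data (an exact ε^{-2} law with multiplicative noise); real measurements would replace it.

```python
# Fitted exponent with a 95% bootstrap confidence interval over independent runs.
import numpy as np

rng = np.random.default_rng(0)
eps = np.array([0.05, 0.02, 0.01])
# synthetic stand-in: 20 "runs" per eps following an exact eps**-2 law with noise
taus = eps ** -2 * np.exp(0.05 * rng.standard_normal((20, eps.size)))

def fitted_exponent(rows):
    """Least-squares slope of log(mean escape time) against log(eps)."""
    return np.polyfit(np.log(eps), np.log(rows.mean(axis=0)), 1)[0]

point = fitted_exponent(taus)
boot = [fitted_exponent(taus[rng.integers(0, len(taus), len(taus))])
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"fitted exponent {point:.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```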
Simulated Author's Rebuttal
We are grateful to the referee for the thorough review and for identifying key points that merit further clarification in our work on saddle escape in deep nonlinear networks. Below we provide point-by-point responses to the major comments.
Point-by-point responses
- Referee: [permutation-symmetric submanifold reduction] The reduction to the scalar ODE and the resulting escape-time law τ★ = Θ(ε^{-(r-2)}) (abstract and the section on the permutation-symmetric submanifold) rests on an approximate balance law whose validity is asserted but neither derived from first principles nor equipped with an explicit error bound that remains controlled as ε → 0 near the saddle. This approximation is load-bearing for the central scaling claim.
Authors: We agree that the approximate balance law is pivotal for the reduction to the scalar ODE and the escape-time scaling. This law is motivated by the invariance of the permutation-symmetric submanifold under the dynamics for symmetric initializations, combined with the exact imbalance identity. Although a complete first-principles derivation with explicit error bounds is not provided in the current version, the scaling is robustly confirmed by simulations. In the revision, we will expand the relevant section to include a more detailed justification of the balance law, deriving it heuristically from the symmetry constraints and providing an informal argument that the error does not alter the leading-order scaling as ε approaches zero. revision: partial
- Referee: [universality classes] The classification of activations into universality classes follows directly from the exact identity, yet the manuscript does not demonstrate that the approximate balance law remains uniformly valid across all four classes or quantify any class-dependent error that could affect the escape-time exponent.
Authors: The four universality classes are classified exclusively using the exact norm-imbalance identity, which applies universally for any smooth activation and differentiable loss. The approximate balance law, however, is a consequence of the permutation symmetry on the submanifold and does not depend on the particular class of the activation. We will revise the manuscript to explicitly state this independence and to include numerical results demonstrating the escape-time scaling for representative activations from each of the four classes, thereby confirming the uniformity of the exponent. revision: yes
Circularity Check
No significant circularity: the exact identity plus the separately stated approximation yields the scaling law, and the conclusion is not equivalent to its inputs by construction.
full rationale
The derivation begins with an exact identity for Frobenius-norm imbalance that holds independently for any smooth activation and differentiable loss. This identity is combined with a separately stated approximate balance law on the permutation-symmetric submanifold to obtain the scalar ODE and the resulting escape-time scaling τ★ = Θ(ε^{-(r-2)}). The balance law is presented as holding approximately rather than derived from first principles or fitted to the target scaling, so the final law is not equivalent to the inputs by construction. No self-citations, uniqueness theorems, or parameter fits are invoked to force the result. The paper is therefore self-contained on its own terms; any concerns about the approximation's validity or error bounds fall under correctness rather than circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- [domain assumption] Activations are smooth and losses are differentiable.
- [ad hoc to paper] An approximate balance law holds on the permutation-symmetric submanifold.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss [...] giving a critical-depth escape time law τ★ = Θ(ε^{-(r-2)})"
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "Define φ_σ(z) := zσ′(z) − σ(z) [...] classify activation functions into four universality classes"
Reference graph
Works this paper leans on
- [1] Emmanuel Abbe, Enric Boix-Adserà, and Theodor Misiakiewicz. SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics. ArXiv, abs/2302.11055, 2023. URL https://api.semanticscholar.org/CorpusID:257078637
- [2] Emmanuel Abbe, Enric Boix-Adsera, and Theodor Misiakiewicz. The merged-staircase property: a necessary and nearly sufficient condition for SGD learning of sparse functions on two-layer neural networks, 2024. URL https://arxiv.org/abs/2202.08658
- [3] Madhu S. Advani and Andrew M. Saxe. High-dimensional dynamics of generalization error in neural networks, 2017. URL https://arxiv.org/abs/1710.03667
- [4] Luca Arnaboldi, Florent Krzakala, Bruno Loureiro, and Ludovic Stephan. Escaping mediocrity: how two-layer networks learn hard generalized linear models with SGD, 2024. URL https://arxiv.org/abs/2305.18502
- [5] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization, 2018. URL https://arxiv.org/abs/1802.06509
- [6] Sanjeev Arora, Nadav Cohen, Wei Hu, and Yuping Luo. Implicit regularization in deep matrix factorization, 2019. URL https://arxiv.org/abs/1905.13655
- [7] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. Online stochastic gradient descent on non-convex losses from high-dimensional inference, 2021. URL https://arxiv.org/abs/2003.10409
- [8] Gerard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for SGD: Effective dynamics and critical scaling, 2023. URL https://arxiv.org/abs/2206.04030
- [9] Alexander Atanasov, Alexandru Meterez, James B. Simon, and Cengiz Pehlevan. The optimization landscape of SGD across the feature learning strength, 2025. URL https://arxiv.org/abs/2410.04642
- [10] Ioannis Bantzis, James B. Simon, and Arthur Jacot. Saddle-to-saddle dynamics in deep ReLU networks: Low-rank bias in the first saddle escape, 2026. URL https://arxiv.org/abs/2505.21722
- [11] Michael Biehl and H. Schwarze. Learning by on-line gradient descent. Journal of Physics A: Mathematical and General, 28:643–656, 02 1995. doi: 10.1088/0305-4470/28/3/018
- [12] Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape, 2019. URL https://arxiv.org/abs/1907.02911
- [13] Craig Calcaterra and Axel Boldt. Lipschitz flow-box theorem, 2006. URL https://arxiv.org/abs/math/0305207
- [14] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming, 2020. URL https://arxiv.org/abs/1812.07956
- [15] Simon S. Du, Wei Hu, and Jason D. Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Advances in Neural Information Processing Systems, 2018-December:384–395, 2018. ISSN 1049-5258.
- [16] Kenji Fukumizu. Effect of batch learning in multilayer neural networks. In International Conference on Neural Information Processing, 1998. URL https://api.semanticscholar.org/CorpusID:605683
- [17] Sebastian Goldt, Madhu S. Advani, Andrew M. Saxe, Florent Krzakala, and Lenka Zdeborová. Dynamics of stochastic gradient descent for two-layer neural networks in the teacher–student setup. Journal of Statistical Mechanics: Theory and Experiment, 2020(12):124010, December 2020. ISSN 1742-5468. doi: 10.1088/1742-5468/abc61e. URL http://dx.doi.org/10.1088/1742-5468/abc61e
- [18]
- [19]
- [20]
- [21] Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 1990. ISBN 0521386322.
- [22] Arthur Jacot, François Ged, Berfin Şimşek, Clément Hongler, and Franck Gabriel. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity.
- [23]
- [24] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks, 2019. URL https://arxiv.org/abs/1810.02032
- [25] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently, 2017. URL https://arxiv.org/abs/1703.00887
- [26] Akshay Kumar and Jarvis Haupt. Directional convergence near small initializations and saddles in two-homogeneous neural networks, 2024. URL https://arxiv.org/abs/2402.09226
- [27]
- [28] Daniel Kunin, Allan Raventós, Clémentine Dominé, Feng Chen, David Klindt, Andrew Saxe, and Surya Ganguli. Get rich quick: exact solutions reveal how unbalanced initializations promote rapid feature learning. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA, 2024. Curran Associates Inc.
- [29] Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, and Nina Miolane. Alternating gradient flows: A theory of feature learning in two-layer neural networks, 2025. URL https://arxiv.org/abs/2506.06489
- [30] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33), 2018. ISSN 1091-6490. doi: 10.1073/pnas.1806579115. URL http://dx.doi.org/10.1073/pnas.1806579115
- [31] Hancheng Min, Salma Tarmoun, René Vidal, and Enrique Mallada. Convergence and implicit bias of gradient flow on overparametrized linear networks, 2022. URL https://arxiv.org/abs/2105.06351
- [32] Scott Pesme and Nicolas Flammarion. Saddle-to-saddle dynamics in diagonal linear networks.
- [33]
- [34] Nasim Rahaman, Aristide Baratin, Devansh Arpit, Felix Draxler, Min Lin, Fred A. Hamprecht, Yoshua Bengio, and Aaron Courville. On the spectral bias of neural networks, 2019. URL https://arxiv.org/abs/1806.08734
- [35] David Saad and Sara Solla. Dynamics of on-line gradient descent learning for multilayer neural networks. 04 1999.
- [36] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2014. URL https://arxiv.org/abs/1312.6120
- [37] Andrew M. Saxe, Shagun Sodhani, and Sam Lewallen. The neural race reduction: Dynamics of abstraction in gated networks, 2022. URL https://arxiv.org/abs/2207.10430
- [38] James B. Simon, Maksis Knutins, Liu Ziyin, Daniel Geisz, Abraham J. Fetterman, and Joshua Albrecht. On the stepwise nature of self-supervised learning, 2023. URL https://arxiv.org/abs/2303.15438
- [39] Berfin Şimşek, François Ged, Arthur Jacot, Francesco Spadaro, Clément Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss landscape in overparameterized neural networks: Symmetries and invariances, 2021. URL https://arxiv.org/abs/2105.12221
- [40] Hidenori Tanaka and Daniel Kunin. Noether’s learning dynamics: Role of symmetry breaking in neural networks, 2021. URL https://arxiv.org/abs/2105.02716
- [41]
- [42]