pith. machine review for the scientific record.

arxiv: 2605.01288 · v2 · submitted 2026-05-02 · 💻 cs.LG · cond-mat.dis-nn · stat.ML

Recognition: 2 theorem links · Lean Theorem

A Theory of Saddle Escape in Deep Nonlinear Networks

Divit Rawal, Michael R. DeWeese

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 02:19 UTC · model grok-4.3

classification 💻 cs.LG · cond-mat.dis-nn · stat.ML
keywords saddle escape · deep nonlinear networks · Frobenius norm imbalance · critical depth · universality classes · permutation-symmetric submanifold · scalar ODE reduction · small initialization

The pith

An exact identity on Frobenius norm imbalances in layer weights reduces deep nonlinear training to a scalar ODE whose escape time scales as ε^{-(r-2)}, where r is the number of bottleneck layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep networks initialized with small weights linger in long plateaus before acquiring features through sharp transitions. The authors derive an exact identity relating the imbalances in Frobenius norms across layers that holds for arbitrary smooth activations and differentiable losses. This identity, together with an approximate balance on the permutation-symmetric submanifold, collapses the high-dimensional gradient flow to a single ordinary differential equation. Solving that equation produces an escape-time law governed solely by the critical number r of layers at the small scale, independent of total depth L, and the same exponent appears under rescaled He-normal initialization. Numerical experiments confirm the predicted scaling for multiple activations.
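
The exponent itself follows from a one-line integration of a schematic version of the reduced dynamics (a sketch under a simplifying assumption, not the paper's exact reduced ODE): if the bottleneck scale $x$ grows as $\dot{x} = c\,x^{r-1}$ from $x(0) = \varepsilon$, the time to reach an order-one scale $x_\dagger$ is

$$\tau_\star \approx \int_{\varepsilon}^{x_\dagger} \frac{dx}{c\,x^{r-1}} = \frac{\varepsilon^{-(r-2)} - x_\dagger^{-(r-2)}}{c\,(r-2)} = \Theta\!\big(\varepsilon^{-(r-2)}\big) \quad (\varepsilon \to 0,\ r > 2),$$

which is dominated by the $\varepsilon^{-(r-2)}$ term and is independent of the total depth $L$.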

Core claim

We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law τ★ = Θ(ε^{-(r-2)}) governed by the number r of layers at the bottleneck scale rather than the total depth L. We find that this same r-2 exponent is recovered under He-normal initialization with r bottleneck layers rescaled by ε, where the symmetry manifold is preserved by the flow but not attracting.

What carries the argument

The exact identity on Frobenius-norm imbalances of layer weights, which combines with an approximate balance law on the permutation-symmetric submanifold to collapse the matrix flow onto a scalar ODE.

If this is right

  • Escape time depends only on the critical depth r at the bottleneck scale, not on total network depth L.
  • Activation functions are partitioned into four universality classes according to the sign and magnitude of a constant appearing in the exact norm-imbalance identity.
  • The same r-2 scaling law holds when He-normal initialization is applied to the r bottleneck layers after rescaling by ε (see the initialization sketch after this list).
  • The permutation-symmetric submanifold is invariant under the flow for the rescaled He-normal case even though it is not attracting.
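
A minimal numpy sketch of the initialization named in the third bullet (the widths, depth, and layer indexing here are illustrative assumptions, not the paper's experimental setup): every layer is drawn He-normal, then the r bottleneck layers are multiplied by ε.

```python
import numpy as np

def he_init_with_bottleneck(widths, bottleneck, eps, seed=0):
    """He-normal weights with the listed layers rescaled by eps.

    Hypothetical construction of the described setup: each layer is drawn
    N(0, 2/fan_in); the r layers indexed by `bottleneck` are then scaled by eps.
    """
    rng = np.random.default_rng(seed)
    Ws = []
    for l, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:])):
        W = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))  # He-normal
        if l in bottleneck:  # the r bottleneck layers sit at the small scale eps
            W *= eps
        Ws.append(W)
    return Ws

# e.g. L = 8 layers of width 64 with r = 3 bottleneck layers rescaled by eps = 1e-3
weights = he_init_with_bottleneck([64] * 9, bottleneck={2, 3, 4}, eps=1e-3)
```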

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Increasing total depth beyond the bottleneck count should not lengthen the initial plateau provided the narrow layers remain at scale ε.
  • The four universality classes imply that only activations with a particular leading Taylor coefficient near zero will exhibit the slowest escape for given r.
  • Testing the identity directly on networks with non-permutation-symmetric initial conditions would reveal how far the approximate balance law can be stretched.

Load-bearing premise

The reduction to a scalar ODE relies on an approximate balance law on the permutation-symmetric submanifold whose accuracy and range of validity are not derived from first principles.
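
One cheap empirical probe of that premise is to log how far layer norms drift from balance along training. The quantity below is the familiar deep-linear balance gap, used here only as a hedged proxy diagnostic; it is not the paper's exact norm-imbalance identity.

```python
import numpy as np

def balance_gaps(Ws):
    """Gaps ||W_{l+1}||_F^2 - ||W_l||_F^2 between consecutive layers.

    A proxy diagnostic for approximate balance, not the paper's identity;
    tracking these gaps during training shows how well balance holds away
    from the permutation-symmetric submanifold.
    """
    sq = [float(np.sum(W ** 2)) for W in Ws]
    return [b - a for a, b in zip(sq, sq[1:])]
```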

What would settle it

A simulation in which the measured escape time fails to scale as Θ(ε^{-(r-2)}) when the number of bottleneck layers r is varied while total depth and initialization scale are held fixed.
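
A sketch of such a test in plain numpy (the task, width, learning rate, and escape criterion are illustrative assumptions, not the authors' protocol): train a depth-L tanh network from the rescaled He-normal initialization, record the step at which the loss leaves its plateau, and fit the slope of log t_esc against log ε for each r.

```python
import numpy as np

def escape_time(L=6, r=3, eps=1e-2, width=32, lr=0.05, max_steps=500_000, seed=0):
    """Steps of full-batch gradient descent until the loss leaves its plateau
    (here: falls below half of its initial value).

    Illustrative sketch, not the authors' setup: a depth-L tanh network on a
    synthetic single-index regression task, He-normal init with the first r
    layers rescaled by eps.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(256, width))
    y = np.tanh(X @ rng.normal(size=width) / np.sqrt(width))   # synthetic teacher
    Ws = [rng.normal(0, np.sqrt(2 / width), (width, width)) for _ in range(L)]
    for l in range(r):                                          # r bottleneck layers
        Ws[l] *= eps
    w_out = rng.normal(0, 1 / np.sqrt(width), width)

    def loss_and_grads():
        hs, a = [X], X
        for W in Ws:                                            # forward pass
            a = np.tanh(a @ W.T)
            hs.append(a)
        err = hs[-1] @ w_out - y
        loss = 0.5 * np.mean(err ** 2)
        g_out = hs[-1].T @ err / len(y)
        delta = np.outer(err, w_out) * (1 - hs[-1] ** 2)        # backprop through tanh
        gWs = [None] * L
        for l in reversed(range(L)):
            gWs[l] = delta.T @ hs[l] / len(y)
            if l > 0:
                delta = (delta @ Ws[l]) * (1 - hs[l] ** 2)
        return loss, gWs, g_out

    loss0 = loss_and_grads()[0]
    for t in range(max_steps):
        loss, gWs, g_out = loss_and_grads()
        if loss < 0.5 * loss0:
            return t
        for l in range(L):
            Ws[l] -= lr * gWs[l]
        w_out -= lr * g_out
    return np.inf

# The fitted slope of log t_esc vs log eps should approach -(r - 2) if the law holds.
eps_grid = [3e-2, 1e-2, 3e-3]
times = [escape_time(eps=e, r=3) for e in eps_grid]
slope = np.polyfit(np.log(eps_grid), np.log(times), 1)[0]
print(f"fitted exponent {slope:.2f}, predicted {-(3 - 2)}")
```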

Figures

Figures reproduced from arXiv: 2605.01288 by Divit Rawal, Michael R. DeWeese.

Figure 1
Figure 1: Empirical confirmation of ansatz reduction. (a) Loss vs. time: reduced ODE (solid line) overlaid with empirical NL-parameter gradient descent (circles). (b) Layer scales X_ℓ vs. time t: scalar ODE trajectories match the full-parameter dynamics. view at source ↗
Figure 2
Figure 2: Escape time obeys critical-depth law on the manifold. (a) Escape time t_esc vs. initialization scale ε for balanced init: closed form (solid) and reduced ODE (diamonds); the polynomial scaling ε^{-(L-2)} steepens with depth. (b) Same at fixed L = 6 with r layers at bottleneck scale: diamonds track the ε^{-(r-2)} law of Theorem 6. Theoretical prediction and experiment diverge at large ε. view at source ↗
Figure 3
Figure 3: Universality across activations. (a) Raw escape time t_esc vs. ε for three Class B activations (tanh, erf, sin; solid) and two Class C (GELU, Swish; dashed). (b) After rescaling by K(σ): Class B curves collapse onto the master curve of Corollary 7; Class C deviates by O(γ_C ε) per Section C. view at source ↗
Figure 4
Figure 4: Off-manifold critical-depth exponent. t_esc vs. ε for L = 8 tanh with r ∈ {3, 5, 8} bottleneck layers, He-normal init, SGD: slopes track the ε^{-(r-2)} law of Theorem 11. view at source ↗
Figure 5
Figure 5: Three-mode tanh cascade and escape-time decomposition. Black: training loss; blue: mode-1 alignment ∥W_1 v_1∥_2 / √N. Light purple: leading-order single-mode prediction of Theorem 5. Dark purple: homotopy identity T(1) = T(0) + ∫_0^1 A(ν) dν on the homotopy from decoupled single-mode (ν = 0) to augmented block-mean (ν = 1). view at source ↗
read the original abstract

In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law $\tau_\star = \Theta(\varepsilon^{-(r-2)})$ governed by the number $r$ of layers at the bottleneck scale rather than the total depth $L$. We find that this same $r-2$ exponent is recovered under He-normal initialization with $r$ bottleneck layers rescaled by $\varepsilon$, where the symmetry manifold is preserved by the flow but not attracting. We find close agreement between our theory and numerical simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript derives an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss, and uses it to classify activation functions into four universality classes. On the permutation-symmetric submanifold this identity is combined with an approximate balance law to reduce the full matrix gradient flow to a scalar ODE, yielding the critical-depth escape-time scaling law τ★ = Θ(ε^{-(r-2)}) governed by the number r of bottleneck layers rather than total depth L. The same scaling is recovered under He-normal initialization with rescaled bottlenecks, and close agreement with numerical simulations is reported.

Significance. If the approximate balance law can be justified with controlled errors, the work would constitute a significant contribution by supplying a general, parameter-free exact identity applicable to broad classes of networks and losses, together with a predictive scaling law that isolates the role of bottleneck depth in saddle escape. The exact identity itself is a robust technical result that could serve as a foundation for further analyses of nonlinear training dynamics.

major comments (2)
  1. [permutation-symmetric submanifold reduction] The reduction to the scalar ODE and the resulting escape-time law τ★ = Θ(ε^{-(r-2)}) (abstract and the section on the permutation-symmetric submanifold) rests on an approximate balance law whose validity is asserted but neither derived from first principles nor equipped with an explicit error bound that remains controlled as ε → 0 near the saddle. This approximation is load-bearing for the central scaling claim.
  2. [universality classes] The classification of activations into universality classes follows directly from the exact identity, yet the manuscript does not demonstrate that the approximate balance law remains uniformly valid across all four classes or quantify any class-dependent error that could affect the escape-time exponent.
minor comments (2)
  1. The abstract introduces the scaling law before defining the bottleneck-layer count r and the initialization scale ε; a brief parenthetical clarification would improve readability.
  2. [numerical simulations] In the numerical validation, reporting fitted exponents with confidence intervals or the number of independent runs would make the agreement with the predicted r-2 scaling more quantitative.
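
A minimal recipe for that kind of report (a sketch with placeholder data; `runs` stands in for escape times measured over independent training runs, which this review does not have):

```python
import numpy as np

rng = np.random.default_rng(0)
eps_grid = np.array([3e-2, 1e-2, 3e-3, 1e-3])
r = 3
# Placeholder escape times following the predicted law plus noise, standing in
# for measurements from independent runs (shape: n_runs x n_eps).
runs = np.exp(-(r - 2) * np.log(eps_grid) + rng.normal(0, 0.05, (20, eps_grid.size)))

def fitted_exponent(t):
    return np.polyfit(np.log(eps_grid), np.log(t), 1)[0]

point = fitted_exponent(runs.mean(axis=0))
# Bootstrap over runs for a 95% confidence interval on the exponent.
boot = [fitted_exponent(runs[rng.integers(0, len(runs), len(runs))].mean(axis=0))
        for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"fitted exponent {point:.2f} (95% CI [{lo:.2f}, {hi:.2f}]); predicted {-(r - 2)}")
```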

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thorough review and for identifying key points that merit further clarification in our work on saddle escape in deep nonlinear networks. Below we provide point-by-point responses to the major comments.

read point-by-point responses
  1. Referee: [permutation-symmetric submanifold reduction] The reduction to the scalar ODE and the resulting escape-time law τ★ = Θ(ε^{-(r-2)}) (abstract and the section on the permutation-symmetric submanifold) rests on an approximate balance law whose validity is asserted but neither derived from first principles nor equipped with an explicit error bound that remains controlled as ε → 0 near the saddle. This approximation is load-bearing for the central scaling claim.

    Authors: We agree that the approximate balance law is pivotal for the reduction to the scalar ODE and the escape-time scaling. This law is motivated by the invariance of the permutation-symmetric submanifold under the dynamics for symmetric initializations, combined with the exact imbalance identity. Although a complete first-principles derivation with explicit error bounds is not provided in the current version, the scaling is robustly confirmed by simulations. In the revision, we will expand the relevant section to include a more detailed justification of the balance law, deriving it heuristically from the symmetry constraints and providing an informal argument that the error does not alter the leading-order scaling as ε approaches zero. revision: partial

  2. Referee: [universality classes] The classification of activations into universality classes follows directly from the exact identity, yet the manuscript does not demonstrate that the approximate balance law remains uniformly valid across all four classes or quantify any class-dependent error that could affect the escape-time exponent.

    Authors: The four universality classes are determined exclusively by the exact norm-imbalance identity, which applies universally for any smooth activation and differentiable loss. The approximate balance law, however, is a consequence of the permutation symmetry on the submanifold and does not depend on the particular class of the activation. We will revise the manuscript to state this independence explicitly and to include numerical results demonstrating the escape-time scaling for representative activations from each of the four classes, thereby confirming the uniformity of the exponent. revision: yes

Circularity Check

0 steps flagged

No significant circularity: the exact identity plus the separately stated approximation yield the scaling law, and the conclusion does not reduce to the inputs by construction

full rationale

The derivation begins with an exact identity for Frobenius-norm imbalance that holds independently for any smooth activation and differentiable loss. This identity is combined with a separately stated approximate balance law on the permutation-symmetric submanifold to obtain the scalar ODE and the resulting escape-time scaling τ★ = Θ(ε^{-(r-2)}). The balance law is presented as holding approximately rather than derived from first principles or fitted to the target scaling, so the final law is not equivalent to the inputs by construction. No self-citations, uniqueness theorems, or parameter fits are invoked to force the result. The paper is therefore self-contained on its own terms; any concerns about the approximation's validity or error bounds fall under correctness rather than circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The derivation rests on standard smoothness and differentiability assumptions plus one approximate balance law whose grounding is not detailed beyond the statement that it holds approximately.

axioms (2)
  • domain assumption Activations are smooth and losses are differentiable.
    Invoked to obtain the exact identity for norm imbalance.
  • ad hoc to paper An approximate balance law holds on the permutation-symmetric submanifold.
    Used to reduce the matrix flow to a scalar ODE; its validity range is not derived.

pith-pipeline@v0.9.0 · 5481 in / 1429 out tokens · 38874 ms · 2026-05-11T02:19:07.595331+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor
