A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks

Dawei Li; Ruoyu Sun; Tian Ding

arXiv: 2606.04327 · v1 · pith:UEG34TH3new · submitted 2026-06-03 · cs.LG · cs.AI · math.OC

Pith reviewed 2026-06-28 07:23 UTC · model grok-4.3

keywords: two-layer neural networks, loss landscape, neuron splitting, stationary plateaus, inner Hessian, local minima, saddle points, width expansion

The pith

The definiteness of the inner Hessian and the splitting coefficients jointly determine whether a neuron-split plateau consists of local minima or only saddles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper classifies every stationary point that appears when a hidden neuron is duplicated, creating an affine plateau of points in a wider two-layer network. It establishes that the local geometry of this plateau is controlled by the definiteness properties of a per-neuron inner Hessian matrix together with the numerical coefficients chosen for the split. A sympathetic reader would care because the result explains how increasing network width can preserve good solutions or turn them into saddles, directly addressing the effect of reparameterization during model expansion. The analysis shows that splitting a minimum can produce mixed minima-and-saddles or an all-saddle plateau, while splitting a saddle always yields only saddles. This supplies a geometric account of when width growth leaves the nature of stationary points unchanged or altered.
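The duplication step can be made concrete with a small sketch. It assumes a network of the form f(x) = sum_i v_i * tanh(<w_i, x>), where copy j of the split neuron keeps the input weights and receives output weight lambda_j * v_r with the lambda_j summing to one. The function names `forward` and `split_neuron`, the tanh activation, and the toy numbers are illustrative assumptions, not taken from the paper; whether the paper restricts the range of the coefficients is not stated in this summary.

```python
import math

def forward(W, v, x):
    """Two-layer net f(x) = sum_i v[i] * tanh(<W[i], x>)."""
    return sum(vi * math.tanh(sum(wk * xk for wk, xk in zip(wi, x)))
               for wi, vi in zip(W, v))

def split_neuron(W, v, r, lambdas):
    """Replace neuron r by len(lambdas) copies that share W[r];
    copy j receives output weight lambdas[j] * v[r]."""
    if abs(sum(lambdas) - 1.0) > 1e-12:
        raise ValueError("splitting coefficients must sum to 1")
    W_wide = W[:r] + W[r + 1:] + [list(W[r]) for _ in lambdas]
    v_wide = v[:r] + v[r + 1:] + [lam * v[r] for lam in lambdas]
    return W_wide, v_wide

# Hypothetical toy network and inputs, chosen only for illustration.
W = [[0.5, -1.0], [1.5, 0.3]]
v = [0.8, -0.4]
xs = [[1.0, 2.0], [-0.5, 0.7], [0.0, -1.2]]

for lambdas in ([0.3, 0.7], [1.6, -0.6], [0.2, 0.3, 0.5]):
    W_wide, v_wide = split_neuron(W, v, r=1, lambdas=lambdas)
    gap = max(abs(forward(W, v, x) - forward(W_wide, v_wide, x)) for x in xs)
    print(lambdas, "-> width", len(v_wide), "max output gap", gap)
```

Every admissible choice of coefficients gives a wider network that computes the same function, and the set of such choices is an affine subspace of the wide parameter space. That is the sense in which the plateau is affine.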

Core claim

What carries the argument

The inner Hessian, a per-neuron curvature matrix whose definiteness, combined with the splitting coefficients, determines the type of every stationary point on the affine plateau generated by duplicating a hidden neuron.

If this is right

Splitting a local minimum can produce either a mixture of local minima and saddles or an all-saddle plateau.
A concrete sure-saddle region exists on the plateau under mild assumptions when a minimum is split.
Splitting a saddle point always yields a plateau consisting entirely of saddle points.
Width expansion can either preserve or change the character of stationary points depending on the inner Hessian and splitting choice.
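A worked second-order expansion shows how the coefficients and the curvature object could interact in the simplest case of a two-copy split. It is a sketch under assumptions that this summary does not state: squared loss, residuals $e_k = f(x_k) - y_k$, and an inner Hessian of the form $H_r = v_r \sum_k e_k \sigma''(w_r^\top x_k)\, x_k x_k^\top$. The paper's own definition may differ.

```latex
% Assumed setup: f(x) = \sum_i v_i \sigma(w_i^\top x),
%                L = \tfrac12 \sum_k (f(x_k) - y_k)^2, \quad e_k = f(x_k) - y_k.
% Neuron r is split into copies j = 0, 1 with output weights \lambda_j v_r,
% \lambda_0 + \lambda_1 = 1, and the copies' input weights are perturbed by
%   u_0 = (\beta / \lambda_0)\, b, \qquad u_1 = -(\beta / \lambda_1)\, b,
% so that \lambda_0 u_0 + \lambda_1 u_1 = 0.
\begin{align*}
\Delta f_k
  &= v_r \sum_{j=0}^{1} \lambda_j
     \bigl[\sigma\bigl((w_r + u_j)^\top x_k\bigr) - \sigma(w_r^\top x_k)\bigr] \\
  &= v_r\, \sigma'(w_r^\top x_k)\,
     \underbrace{\Bigl(\sum_j \lambda_j u_j\Bigr)^{\!\top}}_{=\,0} x_k
     + \tfrac12\, v_r\, \sigma''(w_r^\top x_k) \sum_j \lambda_j (u_j^\top x_k)^2
     + o(\beta^2), \\[4pt]
\Delta L
  &= \sum_k e_k\, \Delta f_k + \tfrac12 \sum_k (\Delta f_k)^2
   = \tfrac12 \sum_j \lambda_j\, u_j^\top H_r\, u_j + o(\beta^2) \\
  &= \frac{\beta^2}{2} \Bigl(\frac{1}{\lambda_0} + \frac{1}{\lambda_1}\Bigr)
     b^\top H_r\, b + o(\beta^2)
   = \frac{\beta^2}{2\,\lambda_0 \lambda_1}\, b^\top H_r\, b + o(\beta^2).
\end{align*}
```

Under these assumptions the sign of the leading term is the sign of $b^\top H_r b / (\lambda_0 \lambda_1)$. With both coefficients positive, a descent direction of this type exists when $H_r$ has a negative eigenvalue; with coefficients of opposite sign, it exists when $H_r$ has a positive eigenvalue. The expansion covers only one family of perturbations, so it can rule out local minimality but cannot certify it.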

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same inner-Hessian test might be used to decide whether to split neurons during training in order to escape undesired plateaus.
The classification supplies a concrete test that could be checked numerically on trained two-layer models to verify predicted saddle regions.
If the inner-Hessian condition holds, then reparameterization schemes that control splitting coefficients become a tool for shaping the loss landscape geometry.
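The numerical check suggested above can be sketched on a toy problem. Everything here is an assumption made for illustration: scalar inputs, a single tanh neuron, squared loss, targets constructed so that the narrow point is exactly stationary, and a scalar inner Hessian of the form v * sum_k e_k * tanh''(w x_k) * x_k^2. The helper names are hypothetical. The script compares the measured loss change along a direction that cancels the first-order output change with the predicted value 0.5 * beta^2 * (1/l0 + 1/l1) * h.

```python
import math

def sech2(z):
    return 1.0 - math.tanh(z) ** 2

def d2tanh(z):
    """Second derivative of tanh."""
    return -2.0 * math.tanh(z) * sech2(z)

# Hypothetical toy problem: scalar inputs, narrow net f(x) = v * tanh(w * x).
xs = [0.5, 1.0, 2.0]
w, v = 1.0, 1.0

# Residual direction n orthogonal to both gradient features, so any residual
# vector e = c * n makes (w, v) a stationary point of the narrow loss.
a = [math.tanh(w * x) for x in xs]        # d f / d v
b = [v * sech2(w * x) * x for x in xs]    # d f / d w
n = [a[1] * b[2] - a[2] * b[1],
     a[2] * b[0] - a[0] * b[2],
     a[0] * b[1] - a[1] * b[0]]

def targets(c):
    """Targets y_k = f(x_k) - c * n_k, i.e. residuals e_k = c * n_k."""
    return [v * ak - c * nk for ak, nk in zip(a, n)]

def inner_hessian(c):
    """Assumed scalar inner Hessian: v * sum_k e_k * tanh''(w x_k) * x_k**2."""
    return v * sum(c * nk * d2tanh(w * x) * x * x for nk, x in zip(n, xs))

def wide_loss(ys, lambdas, us):
    """Squared loss after a two-copy split; copy j has input weight w + us[j]."""
    loss = 0.0
    for x, y in zip(xs, ys):
        f = sum(lam * v * math.tanh((w + u) * x) for lam, u in zip(lambdas, us))
        loss += 0.5 * (f - y) ** 2
    return loss

def second_order_check(c, lambdas, beta=1e-3):
    """Return (measured, predicted) loss change along the cancelling direction."""
    l0, l1 = lambdas
    us = [beta / l0, -beta / l1]   # l0 * us[0] + l1 * us[1] == 0
    ys = targets(c)
    measured = wide_loss(ys, lambdas, us) - wide_loss(ys, lambdas, [0.0, 0.0])
    predicted = 0.5 * beta ** 2 * (1 / l0 + 1 / l1) * inner_hessian(c)
    return measured, predicted

for c, lambdas in [(0.1, (0.5, 0.5)), (-0.1, (0.5, 0.5)), (-0.1, (1.5, -0.5))]:
    measured, predicted = second_order_check(c, lambdas)
    print(c, lambdas, inner_hessian(c), measured, predicted)
```

Flipping the sign of `c` flips the sign of the assumed inner Hessian, and moving from `(0.5, 0.5)` to `(1.5, -0.5)` flips the sign of `1/l0 + 1/l1`, so the three cases exercise both signs of each factor. A negative measured change shows that the split point is not a local minimum; a positive change along this single direction proves nothing by itself.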

Load-bearing premise

Duplicating a hidden neuron produces an affine set of stationary points whose local geometry is fully captured by the per-neuron inner Hessian.
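The first half of this premise, that duplication produces stationary points, follows from the chain rule. The derivation below is a sketch in assumed notation (a generic differentiable per-sample loss $\ell$, copies indexed by $j$, coefficients $\lambda_j$ summing to one); it does not address the second half, namely whether the inner Hessian captures all of the local geometry.

```latex
% Narrow network: f(x) = \sum_i v_i \sigma(w_i^\top x), \quad
%                 L(\theta) = \sum_k \ell\bigl(f(x_k), y_k\bigr), \quad
%                 \ell'_k = \partial \ell / \partial f \text{ at } (f(x_k), y_k).
% Wide network: neuron r is replaced by copies j = 0, \dots, p with
%   \tilde w_j = w_r, \qquad \tilde v_j = \lambda_j v_r, \qquad \sum_j \lambda_j = 1.
\begin{align*}
\tilde f(x)
  &= \sum_{i \neq r} v_i\, \sigma(w_i^\top x)
     + \sum_{j} \lambda_j v_r\, \sigma(w_r^\top x) = f(x), \\
\frac{\partial \tilde L}{\partial \tilde v_j}
  &= \sum_k \ell'_k\, \sigma(w_r^\top x_k)
   = \frac{\partial L}{\partial v_r}, \\
\frac{\partial \tilde L}{\partial \tilde w_j}
  &= \lambda_j v_r \sum_k \ell'_k\, \sigma'(w_r^\top x_k)\, x_k
   = \lambda_j\, \frac{\partial L}{\partial w_r}.
\end{align*}
```

If the narrow point is stationary, every gradient block of the wide network vanishes for every admissible $\lambda$, and the constraint $\sum_j \lambda_j = 1$ describes an affine set. Conversely, at least one $\lambda_j$ is nonzero, so stationarity of the wide point forces $\partial L / \partial w_r = 0$ in the narrow network.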

What would settle it

Finding a stationary point on such a plateau that is a local minimum when the inner Hessian is indefinite, or a saddle when the inner Hessian is positive definite, would falsify the classification.

Figures

Figures reproduced from arXiv: 2606.04327 by Dawei Li, Ruoyu Sun, Tian Ding.

**Figure 2.** Relationship among theorems on neuron splitting at local minima.

Original abstract

We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of "neuron splitting" where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the "inner Hessian" matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geometry of the plateau. We show that "splitting" a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau, with a concrete sure-saddle region identified under mild assumptions. In contrast, splitting a saddle point always produces a plateau of saddle points. Our results unify and extend prior landscape analyses, elucidating when and how model expansion preserves or alters the nature of stationary points. These findings offer new geometric insights into the effects of width expansion and reparameterization in neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper classifies stationary points on neuron-splitting plateaus in two-layer networks by the definiteness of the inner Hessian and splitting coefficients, with splitting minima able to produce mixed or all-saddle outcomes and splitting saddles always producing saddles.

The paper's main result is a classification of the local geometry on affine stationary plateaus that appear when a hidden neuron is duplicated in a two-layer network with smooth activations. The definiteness of the per-neuron inner Hessian together with the splitting coefficients determines whether the plateau contains local minima, a mixture, or only saddles. The authors identify a concrete sure-saddle region when splitting a minimum and show that splitting a saddle always yields saddles. This extends earlier landscape work by giving explicit geometric rules for how width expansion affects stationary points.

The useful part is the unification of prior analyses under one inner-Hessian object and the mild-assumption sure-saddle claim. That gives a clearer picture of when reparameterization preserves or alters the nature of critical points.

The soft spot is the modeling assumption that the plateau is affine and that its curvature is fully captured by the per-neuron inner Hessian with no leftover cross-neuron coupling in the restricted second derivative. The abstract states that the characterization hinges on this object, so if the derivations do not rule out coupling terms or higher-order contributions on the plateau, the classification would not follow as stated. The stress-test concern lands directly on the central claim.

The work is aimed at researchers who track loss-landscape geometry and width-scaling effects. A reader already following two-layer analyses would get concrete conditions to think with. It deserves a serious referee to check whether the inner-Hessian reduction actually holds in the restricted Hessian and whether the sure-saddle region survives the full calculation.

Referee Report

1 major / 0 minor

Summary. The paper investigates the geometric structure of stationary plateaus in the loss landscape of two-layer neural networks with smooth activation functions, arising from neuron splitting that produces an affine set of stationary points in a wider network. It introduces a per-neuron 'inner Hessian' curvature object and claims that its definiteness, together with the choice of splitting coefficients, determines the local geometry: splitting a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau (with a concrete sure-saddle region under mild assumptions), while splitting a saddle point always produces a plateau consisting entirely of saddle points. The results are positioned as unifying and extending prior landscape analyses for width expansion and reparameterization.

Significance. If the reduction to the per-neuron inner Hessian is rigorous and the classification holds without unaccounted cross terms, the work would provide concrete geometric criteria for when model expansion preserves or alters the nature of stationary points, offering falsifiable predictions about plateau composition that could inform optimization dynamics in overparameterized networks. The explicit identification of a sure-saddle region under mild assumptions is a potential strength.

major comments (1)

[Abstract / main derivation of plateau Hessian] The central classification (splitting minima yields mixture/all-saddle; splitting saddles always yields saddles) is stated to hinge on the definiteness of the inner Hessian. However, the modeling choice that neuron splitting produces an affine stationary set whose local geometry is fully captured by the per-neuron inner Hessian (without cross-neuron coupling terms or higher-order contributions in the restricted second-derivative operator) is load-bearing; if the Hessian on the plateau contains such terms, the sure-saddle region and 'always saddle' claim would not follow from the per-neuron object alone. This assumption is invoked directly in the abstract but requires explicit verification in the derivation.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for identifying the central modeling assumption underlying our classification. We address the concern regarding explicit verification of the plateau Hessian derivation below.

Referee: [Abstract / main derivation of plateau Hessian] The central classification (splitting minima yields mixture/all-saddle; splitting saddles always yields saddles) is stated to hinge on the definiteness of the inner Hessian. However, the modeling choice that neuron splitting produces an affine stationary set whose local geometry is fully captured by the per-neuron inner Hessian (without cross-neuron coupling terms or higher-order contributions in the restricted second-derivative operator) is load-bearing; if the Hessian on the plateau contains such terms, the sure-saddle region and 'always saddle' claim would not follow from the per-neuron object alone. This assumption is invoked directly in the abstract but requires explicit verification in the derivation.

Authors: We agree that the absence of cross-neuron coupling terms in the restricted Hessian is load-bearing for the classification and must be verified explicitly. In Section 3.2 we derive the second-derivative operator on the affine stationary set obtained by neuron splitting. The calculation proceeds by restricting the full Hessian to directions tangent to the plateau (i.e., variations that preserve the affine relation among the duplicated neurons). Direct differentiation shows that all mixed partials between distinct neurons on the plateau vanish identically: the stationarity condition at the original point together with the chain-rule structure of the two-layer loss implies that the cross terms are identically zero on the entire affine set. Consequently the restricted Hessian is block-diagonal, with each block precisely the inner Hessian of the corresponding neuron (scaled by the splitting coefficients). Higher-order contributions are likewise ruled out because the loss is quadratic in the output weights and, although the activation is merely smooth, the restriction to the plateau is linear. We will insert an additional paragraph immediately after the block-diagonal claim in the revised manuscript that spells out this vanishing argument, together with a short appendix lemma that isolates the cross-term calculation, to make the verification fully self-contained.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper defines the inner Hessian as a per-neuron curvature object and derives the local geometry of neuron-splitting plateaus from its definiteness together with splitting coefficients. This is a standard definitional step in a geometric analysis rather than a self-referential reduction; the classification of minima versus saddles follows from the stated assumptions on the affine stationary set without any fitted parameter being relabeled as a prediction, without load-bearing self-citations, and without an ansatz or uniqueness claim imported from prior author work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities beyond the newly introduced inner Hessian concept; full text would be required to audit these.

pith-pipeline@v0.9.1-grok · 5732 in / 1085 out tokens · 27369 ms · 2026-06-28T07:23:06.511087+00:00 · methodology


[1] [1]

Do we really need a new theory to understand over-parameterization?Neurocomputing, 543:126227, 2023

Luca Oneto, Sandro Ridella, and Davide Anguita. Do we really need a new theory to understand over-parameterization?Neurocomputing, 543:126227, 2023

2023

[2] [2]

Suboptimal local minima exist for wide neural networks with smooth activations.Mathematics of Operations Research, 47(4):2784–2814, 2022

Tian Ding, Dawei Li, and Ruoyu Sun. Suboptimal local minima exist for wide neural networks with smooth activations.Mathematics of Operations Research, 47(4):2784–2814, 2022

2022

[3] [3]

Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks.Neu- ral Networks, 144:75–89, 2021

Bo Liu, Zhaoying Liu, Ting Zhang, and Tongtong Yuan. Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks.Neu- ral Networks, 144:75–89, 2021

2021

[4] [4]

Neural networks with finite intrinsic dimension have no spurious valleys.arXiv preprint arXiv:1802.06384, 15, 2018

Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys.arXiv preprint arXiv:1802.06384, 15, 2018

arXiv 2018

[5] [5]

On the benefit of width for neural networks: Disappearance of bad basins.arXiv preprint arXiv:1812.11039, 2018

Dawei Li, Tian Ding, and Ruoyu Sun. On the benefit of width for neural networks: Disappearance of bad basins.arXiv preprint arXiv:1812.11039, 2018

arXiv 2018

[6] [6]

Loss surfaces, mode connectivity, and fast ensembling of DNNs

Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. InAdvances in Neural Information Processing Systems, pages 8789– 8798, 2018

2018

[7] [7]

Exploring neural network landscapes: Star-shaped and geodesic connectivity.arXiv preprint arXiv:2404.06391, 2024

Zhanran Lin, Puheng Li, and Lei Wu. Exploring neural network landscapes: Star-shaped and geodesic connectivity.arXiv preprint arXiv:2404.06391, 2024

arXiv 2024

[8] [8]

Local minima and plateaus in hierar- chical structures of multilayer perceptrons.Neural networks, 13(3):317–327, 2000

Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierar- chical structures of multilayer perceptrons.Neural networks, 13(3):317–327, 2000. 27

2000

[9] [9]

Embedding principle of loss landscape of deep neural networks.Advances in Neural Information Processing Systems, 34:14848–14859, 2021

Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks.Advances in Neural Information Processing Systems, 34:14848–14859, 2021

2021

[10] [10]

Embedding principle: a hierarchical structure of loss landscape of deep neural networks.Journal of Machine Learning Research, 1:60–113, 2022

Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks.Journal of Machine Learning Research, 1:60–113, 2022

2022

[11] [11]

Semi-flat minima and saddle points by embedding neural networks to overparameterization.Advances in neural information processing systems, 32, 2019

Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, and Mirai Tanaka. Semi-flat minima and saddle points by embedding neural networks to overparameterization.Advances in neural information processing systems, 32, 2019

2019

[12] [12]

Geometry of the loss land- scape in overparameterized neural networks: Symmetries and invariances

Berfin Simsek, Franc ¸ois Ged, Arthur Jacot, Francesco Spadaro, Cl ´ement Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss land- scape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR, 2021

2021

[13] [13]

Spurious local minima are common in two- layer relu neural networks

Itay Safran and Ohad Shamir. Spurious local minima are common in two- layer relu neural networks. InInternational Conference on Machine Learn- ing, pages 4433–4441. PMLR, 2018

2018

[14] [14]

Piecewise linear activations substantially shape the loss surfaces of neural networks.International Con- ference on Learning Representations, 2020

Fengxiang He, Bohan Wang, and Dacheng Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks.International Con- ference on Learning Representations, 2020

2020

[15] [15]

Y . N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y . Bengio. Identifying and attacking the saddle point problem in high-dimensional non- convex optimization. InNIPS, pages 2933–2941, 2014

2014

[16] [16]

The loss landscape of deep linear neural networks: a second-order analysis.Jour- nal of Machine Learning Research, 25(242):1–76, 2024

El Mehdi Achour, Franc ¸ois Malgouyres, and S ´ebastien Gerchinovitz. The loss landscape of deep linear neural networks: a second-order analysis.Jour- nal of Machine Learning Research, 25(242):1–76, 2024

2024

[17] [17]

Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima.arXiv preprint arXiv:1712.00779, 2017

Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima.arXiv preprint arXiv:1712.00779, 2017

Pith/arXiv arXiv 2017

[18] [18]

Geometry of critical sets and existence of saddle branches for two-layer neural networks.arXiv preprint arXiv:2405.17501, 2024

Leyang Zhang, Yaoyu Zhang, and Tao Luo. Geometry of critical sets and existence of saddle branches for two-layer neural networks.arXiv preprint arXiv:2405.17501, 2024. 28

arXiv 2024

[19] Frank Zhengqing Wu, Berfin Simsek, and Francois Gaston Ged. Loss landscape of shallow ReLU-like neural networks: Stationary points, saddle escape, and network embedding. arXiv preprint arXiv:2402.05626, 2024.

[20] Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape. arXiv preprint arXiv:1907.02911, 2019.

[21] Yaim Cooper. Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021.

[22] Eitan Levin, Joe Kileel, and Nicolas Boumal. The effect of smooth parametrizations on nonconvex optimization landscapes. Mathematical Programming, 209(1):63–111, 2025.

[23] Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017.

[24] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 6391–6401, 2018.

[25] K. Kawaguchi. Deep learning without poor local minima. In Proceedings of the 30th International Conference on Neural Information Processing Systems, pages 586–594, 2016.

[26] Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017.

[27] Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global. In International Conference on Machine Learning, pages 2908–2913. PMLR, 2018.

[28] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[29] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[30] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.

[31] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. arXiv preprint arXiv:1906.04688, 2019.

[32] Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, and René Vidal. Linear convergence of gradient descent for finite width over-parametrized linear networks with general initialization. In International Conference on Artificial Intelligence and Statistics, pages 2262–2284. PMLR, 2023.

[33] Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024.

[34] Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for growing neural architectures. Advances in Neural Information Processing Systems, 32, 2019.

A Proof of Lemmas and Propositions

This appendix provides proofs of all the propositions and lemmas presented in the main body of this paper.

A.1 Proof of Proposition 1

By Definition 1, we have f(...


From (76a),∂L /∂w i =0for alli∈[r−1]. Further, noting that ∑m−r j=0 λ j =1 for anyλλλ∈Λ, there existsi 0 ∈[r:m]such thatλ i0−r >0. Then, from (76b) we have ∂L ∂w r = 1 λi0−r · ∂L ∂w i0 =0,(77) yielding∂L /∂W =0. We conclude that∂L/∂W=0if and only if∂L /∂W =0. Therefore, for anyλλλ∈Λ,θis a stationary point of wide-net if and only ifθ is a stationary point ...


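Then, combined with (88) we have

$$
|e_k| = \left| \sum_{i=r}^{m} \lambda'_{i-r} v'_r \int_0^1 \int_0^t F''_{i,k}(\tau)\, d\tau\, dt \right|
\le \sum_{i=r}^{m} \lambda'_{i-r} |v'_r| \int_0^1 \int_0^t \|\Delta \tilde{w}_i\|_2^2\, M(\delta)\, d\tau\, dt
\le \frac{1}{2} |v'_r| \cdot M(\delta) \sum_{i=r}^{m} \lambda'_{i-r} \|\Delta \tilde{w}_i\|_2^2. \tag{92}
$$

There exists sufficiently small $\delta_{1,1} > 0$ such that $\lambda'_{i-r} \le 2\lambda_{i-r}$ and $|v'_r| \le 2|v_r|$ for all $\theta' \in B(\theta, \delta_{1,1})$. Thus, for any $0 < \delta \le \delta_{1,1}$, we have $|e_k| \le 2|v_r| \cdot M(\delta) \sum^{m}$...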


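By (33) and (37) we have

$$
\begin{aligned}
\langle \Delta y, e \rangle
&= \left\langle \Delta y,\ \sum_{i=r}^{m} v'_i\, \sigma\!\left( \sum_{k=1}^{d} w'_{i,k} x_k \right) - v'_r\, \sigma\!\left( \sum_{k=1}^{d} w'_{r,k} x_k \right) \right\rangle \\
&= \sum_{i=r}^{m} \lambda'_{i-r} v'_r \left\langle \Delta y,\ \sigma\!\left( \sum_{k=1}^{d} (w_{r,k} + \Delta w_{i,k}) x_k \right) \right\rangle - v'_r \left\langle \Delta y,\ \sigma\!\left( \sum_{k=1}^{d} (w_{r,k} + \Delta w_{r,k}) x_k \right) \right\rangle
\end{aligned}
\tag{97}
$$

where $x_k \in \mathbb{R}^{n \times 1}$ denotes (the transpose of) the $k$-th row of $X$. We define a function $G_a : \mathbb{R}^d \to \mathbb{R}$ parameterized by $a \in \mathbb{R}$ as

$$
G_a(u) = a \left\langle \Delta y,\ \sigma\!\left( \sum_{k=1}^{d} (w_{r,k} + u_k) x_k \right) \right\rangle
$$

...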


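Then, from (111) we have

$$
|e_k| = \left| \sum_{j=0}^{1} \beta'_j v'_r \int_0^1 \int_0^t F''_{j,k}(\tau)\, d\tau\, dt \right|
\le \sum_{j=0}^{1} \beta'_j |v'_r| \int_0^1 \int_0^t \|u_j\|_2^2\, M(\delta)\, d\tau\, dt
\le \frac{1}{2} |v'_r| \cdot M(\delta) \sum_{j=0}^{1} \beta'_j \|u_j\|_2^2. \tag{115}
$$

Note that $\beta'_j < 4\beta_j$. Further, there exists sufficiently small $\delta_{1,1} > 0$ such that $|v'_r| \le 2|v_r|$. Thus, for any $0 < \delta \le \delta_{1,1}$, we have $|e_k| \le 4|v_r| \cdot M(\delta) \left( \beta_0 \|u_0\|_2^2 + \beta_1 \|u_1\|_2^2 \right)$, $\forall\, \theta' \in B(\theta,$...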



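$$
\begin{aligned}
&\quad + o\!\left( \|\Delta w_{r+1}\|_2^2 \right) && \text{(147a)} \\
&= \left( \frac{\beta^2}{2\lambda_0} + \frac{\beta^2}{2\lambda_1} \right) b^\top H^{\mathrm{in}}_r b + o\!\left( \|\beta b\|_2^2 \right) && \text{(147b)} \\
&= \frac{\beta^2}{2} \left( \frac{1}{\lambda_0} + \frac{1}{\lambda_1} \right) \lambda_{\min}(H^{\mathrm{in}}_r) + o(\beta^2) && \text{(147c)}
\end{aligned}
$$

where (147b) follows from (143). Finally, combining (145), (147), and the decomposition (20), the difference of the empirical loss is given by

$$
L(\theta') - L(\theta) = \frac{1}{2} \left\| \sum_{i=1}^{m} \Delta t_i \right\|_2^2 + \left\langle \Delta y,\ \sum_{i=1}^{m} \Delta t_i \right\rangle = \frac{\beta^2}{2} \left( \frac{1}{\lambda_0} + \frac{1}{\lambda_1} \right) \lambda_{\min}(H^{\mathrm{in}}_r) + o(\beta^2 ...
$$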