A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks
Pith reviewed 2026-06-28 07:23 UTC · model grok-4.3
The pith
The definiteness of the inner Hessian and the splitting coefficients jointly determine whether a neuron-split plateau consists of local minima or only saddles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of neuron splitting where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the inner Hessian matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geomet
What carries the argument
The inner Hessian, a per-neuron curvature matrix whose definiteness, combined with the splitting coefficients, determines the type of every stationary point on the affine plateau generated by duplicating a hidden neuron.
If this is right
- Splitting a local minimum can produce either a mixture of local minima and saddles or an all-saddle plateau.
- A concrete sure-saddle region exists on the plateau under mild assumptions when a minimum is split.
- Splitting a saddle point always yields a plateau consisting entirely of saddle points.
- Width expansion can either preserve or change the character of stationary points depending on the inner Hessian and splitting choice.
Where Pith is reading between the lines
- The same inner-Hessian test might be used to decide whether to split neurons during training in order to escape undesired plateaus.
- The classification supplies a concrete test that could be checked numerically on trained two-layer models to verify predicted saddle regions.
- If the inner-Hessian condition holds, then reparameterization schemes that control splitting coefficients become a tool for shaping the loss landscape geometry.
Load-bearing premise
Duplicating a hidden neuron produces an affine set of stationary points whose local geometry is fully captured by the per-neuron inner Hessian.
What would settle it
Finding a stationary point on such a plateau that is a local minimum when the inner Hessian is indefinite, or a saddle when the inner Hessian is positive definite, would falsify the classification.
Figures
read the original abstract
We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of "neuron splitting" where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the "inner Hessian" matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geometry of the plateau. We show that "splitting" a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau, with a concrete sure-saddle region identified under mild assumptions. In contrast, splitting a saddle point always produces a plateau of saddle points. Our results unify and extend prior landscape analyses, elucidating when and how model expansion preserves or alters the nature of stationary points. These findings offer new geometric insights into the effects of width expansion and reparameterization in neural networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the geometric structure of stationary plateaus in the loss landscape of two-layer neural networks with smooth activation functions, arising from neuron splitting that produces an affine set of stationary points in a wider network. It introduces a per-neuron 'inner Hessian' curvature object and claims that its definiteness, together with the choice of splitting coefficients, determines the local geometry: splitting a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau (with a concrete sure-saddle region under mild assumptions), while splitting a saddle point always produces a plateau consisting entirely of saddle points. The results are positioned as unifying and extending prior landscape analyses for width expansion and reparameterization.
Significance. If the reduction to the per-neuron inner Hessian is rigorous and the classification holds without unaccounted cross terms, the work would provide concrete geometric criteria for when model expansion preserves or alters the nature of stationary points, offering falsifiable predictions about plateau composition that could inform optimization dynamics in overparameterized networks. The explicit identification of a sure-saddle region under mild assumptions is a potential strength.
major comments (1)
- [Abstract / main derivation of plateau Hessian] The central classification (splitting minima yields mixture/all-saddle; splitting saddles always yields saddles) is stated to hinge on the definiteness of the inner Hessian. However, the modeling choice that neuron splitting produces an affine stationary set whose local geometry is fully captured by the per-neuron inner Hessian (without cross-neuron coupling terms or higher-order contributions in the restricted second-derivative operator) is load-bearing; if the Hessian on the plateau contains such terms, the sure-saddle region and 'always saddle' claim would not follow from the per-neuron object alone. This assumption is invoked directly in the abstract but requires explicit verification in the derivation.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for identifying the central modeling assumption underlying our classification. We address the concern regarding explicit verification of the plateau Hessian derivation below.
read point-by-point responses
-
Referee: [Abstract / main derivation of plateau Hessian] The central classification (splitting minima yields mixture/all-saddle; splitting saddles always yields saddles) is stated to hinge on the definiteness of the inner Hessian. However, the modeling choice that neuron splitting produces an affine stationary set whose local geometry is fully captured by the per-neuron inner Hessian (without cross-neuron coupling terms or higher-order contributions in the restricted second-derivative operator) is load-bearing; if the Hessian on the plateau contains such terms, the sure-saddle region and 'always saddle' claim would not follow from the per-neuron object alone. This assumption is invoked directly in the abstract but requires explicit verification in the derivation.
Authors: We agree that the absence of cross-neuron coupling terms in the restricted Hessian is load-bearing for the classification and must be verified explicitly. In Section 3.2 we derive the second-derivative operator on the affine stationary set obtained by neuron splitting. The calculation proceeds by restricting the full Hessian to directions tangent to the plateau (i.e., variations that preserve the affine relation among the duplicated neurons). Direct differentiation shows that all mixed partials between distinct neurons on the plateau vanish identically: the stationarity condition at the original point together with the chain-rule structure of the two-layer loss implies that the cross terms are identically zero on the entire affine set. Consequently the restricted Hessian is block-diagonal, with each block precisely the inner Hessian of the corresponding neuron (scaled by the splitting coefficients). Higher-order contributions are likewise ruled out because the loss is quadratic in the output weights and the activation is smooth but the restriction is linear. We will insert an additional paragraph immediately after the block-diagonal claim in the revised manuscript that spells out this vanishing argument, together with a short appendix lemma that isolates the cross-term calculation, to make the verification fully self-contained. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper defines the inner Hessian as a per-neuron curvature object and derives the local geometry of neuron-splitting plateaus from its definiteness together with splitting coefficients. This is a standard definitional step in a geometric analysis rather than a self-referential reduction; the classification of minima versus saddles follows from the stated assumptions on the affine stationary set without any fitted parameter being relabeled as a prediction, without load-bearing self-citations, and without an ansatz or uniqueness claim imported from prior author work. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Do we really need a new theory to understand over-parameterization?Neurocomputing, 543:126227, 2023
Luca Oneto, Sandro Ridella, and Davide Anguita. Do we really need a new theory to understand over-parameterization?Neurocomputing, 543:126227, 2023
2023
-
[2]
Suboptimal local minima exist for wide neural networks with smooth activations.Mathematics of Operations Research, 47(4):2784–2814, 2022
Tian Ding, Dawei Li, and Ruoyu Sun. Suboptimal local minima exist for wide neural networks with smooth activations.Mathematics of Operations Research, 47(4):2784–2814, 2022
2022
-
[3]
Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks.Neu- ral Networks, 144:75–89, 2021
Bo Liu, Zhaoying Liu, Ting Zhang, and Tongtong Yuan. Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks.Neu- ral Networks, 144:75–89, 2021
2021
-
[4]
Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys.arXiv preprint arXiv:1802.06384, 15, 2018
arXiv 2018
-
[5]
Dawei Li, Tian Ding, and Ruoyu Sun. On the benefit of width for neural networks: Disappearance of bad basins.arXiv preprint arXiv:1812.11039, 2018
arXiv 2018
-
[6]
Loss surfaces, mode connectivity, and fast ensembling of DNNs
Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. InAdvances in Neural Information Processing Systems, pages 8789– 8798, 2018
2018
-
[7]
Zhanran Lin, Puheng Li, and Lei Wu. Exploring neural network landscapes: Star-shaped and geodesic connectivity.arXiv preprint arXiv:2404.06391, 2024
arXiv 2024
-
[8]
Local minima and plateaus in hierar- chical structures of multilayer perceptrons.Neural networks, 13(3):317–327, 2000
Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierar- chical structures of multilayer perceptrons.Neural networks, 13(3):317–327, 2000. 27
2000
-
[9]
Embedding principle of loss landscape of deep neural networks.Advances in Neural Information Processing Systems, 34:14848–14859, 2021
Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks.Advances in Neural Information Processing Systems, 34:14848–14859, 2021
2021
-
[10]
Embedding principle: a hierarchical structure of loss landscape of deep neural networks.Journal of Machine Learning Research, 1:60–113, 2022
Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks.Journal of Machine Learning Research, 1:60–113, 2022
2022
-
[11]
Semi-flat minima and saddle points by embedding neural networks to overparameterization.Advances in neural information processing systems, 32, 2019
Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, and Mirai Tanaka. Semi-flat minima and saddle points by embedding neural networks to overparameterization.Advances in neural information processing systems, 32, 2019
2019
-
[12]
Geometry of the loss land- scape in overparameterized neural networks: Symmetries and invariances
Berfin Simsek, Franc ¸ois Ged, Arthur Jacot, Francesco Spadaro, Cl ´ement Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss land- scape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR, 2021
2021
-
[13]
Spurious local minima are common in two- layer relu neural networks
Itay Safran and Ohad Shamir. Spurious local minima are common in two- layer relu neural networks. InInternational Conference on Machine Learn- ing, pages 4433–4441. PMLR, 2018
2018
-
[14]
Piecewise linear activations substantially shape the loss surfaces of neural networks.International Con- ference on Learning Representations, 2020
Fengxiang He, Bohan Wang, and Dacheng Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks.International Con- ference on Learning Representations, 2020
2020
-
[15]
Y . N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y . Bengio. Identifying and attacking the saddle point problem in high-dimensional non- convex optimization. InNIPS, pages 2933–2941, 2014
2014
-
[16]
The loss landscape of deep linear neural networks: a second-order analysis.Jour- nal of Machine Learning Research, 25(242):1–76, 2024
El Mehdi Achour, Franc ¸ois Malgouyres, and S ´ebastien Gerchinovitz. The loss landscape of deep linear neural networks: a second-order analysis.Jour- nal of Machine Learning Research, 25(242):1–76, 2024
2024
-
[17]
Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima.arXiv preprint arXiv:1712.00779, 2017
Pith/arXiv arXiv 2017
-
[18]
Leyang Zhang, Yaoyu Zhang, and Tao Luo. Geometry of critical sets and existence of saddle branches for two-layer neural networks.arXiv preprint arXiv:2405.17501, 2024. 28
arXiv 2024
-
[19]
Frank Zhengqing Wu, Berfin Simsek, and Francois Gaston Ged. Loss land- scape of shallow relu-like neural networks: Stationary points, saddle escape, and network embedding.arXiv preprint arXiv:2402.05626, 2024
arXiv 2024
-
[20]
Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight- space symmetry in deep networks gives rise to permutation saddles, con- nected by equal-loss valleys across the loss landscape.arXiv preprint arXiv:1907.02911, 2019
Pith/arXiv arXiv 1907
-
[21]
Global minima of overparameterized neural networks.SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021
Yaim Cooper. Global minima of overparameterized neural networks.SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021
2021
-
[22]
The effect of smooth parametrizations on nonconvex optimization landscapes.Mathematical Pro- gramming, 209(1):63–111, 2025
Eitan Levin, Joe Kileel, and Nicolas Boumal. The effect of smooth parametrizations on nonconvex optimization landscapes.Mathematical Pro- gramming, 209(1):63–111, 2025
2025
-
[23]
Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks.arXiv preprint arXiv:1706.04454, 2017
Pith/arXiv arXiv 2017
-
[24]
Vi- sualizing the loss landscape of neural nets
Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Vi- sualizing the loss landscape of neural nets. InProceedings of the 32nd Inter- national Conference on Neural Information Processing Systems, pages 6391– 6401, 2018
2018
-
[25]
Kawaguchi
K. Kawaguchi. Deep learning without poor local minima. InProceedings of the 30th International Conference on Neural Information Processing Sys- tems, pages 586–594, 2016
2016
-
[26]
Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017
Pith/arXiv arXiv 2017
-
[27]
Deep linear networks with arbitrary loss: All local minima are global
Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global. InInternational Conference on Machine Learn- ing, pages 2908–2913. PMLR, 2018
2018
-
[28]
Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient de- scent provably optimizes over-parameterized neural networks.arXiv preprint arXiv:1810.02054, 2018
Pith/arXiv arXiv 2018
-
[29]
Gra- dient descent finds global minima of deep neural networks.arXiv preprint arXiv:1811.03804, 2018
Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gra- dient descent finds global minima of deep neural networks.arXiv preprint arXiv:1811.03804, 2018. 29
Pith/arXiv arXiv 2018
-
[30]
A convergence theory for deep learning via over-parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. InInternational Conference on Ma- chine Learning, pages 242–252. PMLR, 2019
2019
-
[31]
Difan Zou and Quanquan Gu. An improved analysis of training over- parameterized deep neural networks.arXiv preprint arXiv:1906.04688, 2019
Pith/arXiv arXiv 1906
-
[32]
Linear convergence of gradient descent for finite width over-parametrized linear networks with general initialization
Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, and Ren ´e Vi- dal. Linear convergence of gradient descent for finite width over-parametrized linear networks with general initialization. InInternational Conference on Artificial Intelligence and Statistics, pages 2262–2284. PMLR, 2023
2023
-
[33]
Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective.arXiv preprint arXiv:2410.05192, 2024
arXiv 2024
-
[34]
Splitting steepest descent for grow- ing neural architectures.Advances in neural information processing systems, 32, 2019
Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for grow- ing neural architectures.Advances in neural information processing systems, 32, 2019. A Proof of Lemmas and Propositions This appendix provides proofs of all the propositions and lemmas presented in the main body of this paper. A.1 Proof of Proposition 1 By Definition 1, we have f(...
2019
-
[35]
Further, noting that ∑m−r j=0 λ j =1 for anyλλλ∈Λ, there existsi 0 ∈[r:m]such thatλ i0−r >0
From (76a),∂L /∂w i =0for alli∈[r−1]. Further, noting that ∑m−r j=0 λ j =1 for anyλλλ∈Λ, there existsi 0 ∈[r:m]such thatλ i0−r >0. Then, from (76b) we have ∂L ∂w r = 1 λi0−r · ∂L ∂w i0 =0,(77) yielding∂L /∂W =0. We conclude that∂L/∂W=0if and only if∂L /∂W =0. Therefore, for anyλλλ∈Λ,θis a stationary point of wide-net if and only ifθ is a stationary point ...
-
[36]
Then, combined with (88) we have |ek|= m ∑ i=r λ ′ i−rv′ r Z 1 0 Z t 0 F ′′ i,k(τ)dτdt ≤ m ∑ i=r λ ′ i−r|v′ r| Z 1 0 Z t 0 ∥∆ ˜wi∥2 2M(δ)dτdt ≤ 1 2 |v′ r| ·M(δ) m ∑ i=r λ ′ i−r∥∆ ˜wi∥2 2.(92) 34 There exists sufficiently smallδ 1,1 >0 such thatλ ′ i−r ≤2λ i−r and|v ′ r| ≤2|v r|for allθ ′ ∈B(θ,δ 1,1). Thus, for any 0<δ≤δ 1,1, we have |ek| ≤2|v r| ·M(δ) m ∑...
-
[37]
perturbed
By (33) and (37) we have ⟨∆y,e⟩= * ∆y, m ∑ i=r v′ iσ d ∑ k=1 w′ i,kxk ! −v ′ rσ d ∑ k=1 w′ r,kxk !+ = m ∑ i=r λ ′ i−rv′ r * ∆y,σ d ∑ k=1 (wr,k +∆w i,k)xk !+ −v ′ r * ∆y,σ d ∑ k=1 (wr,k +∆w r,k)xk !+ (97) wherex k ∈R n×1 denotes (the transpose of) thek-th row ofX. We define a function Ga :R d →Rparameterized bya∈Ras Ga(u) =a * ∆y,σ d ∑ k=1 (wr,k +u k)xk !+...
-
[38]
Further, there exists sufficiently smallδ 1,1 >0 such that|v ′ r| ≤ 2|vr|
Then, from (111) we have |ek|= 1 ∑ j=0 β ′ jv′ r Z 1 0 Z t 0 F ′′ j,k(τ)dτdt ≤ 1 ∑ j=0 β ′ j|v′ r| Z 1 0 Z t 0 ∥u j∥2 2M(δ)dτdt ≤ 1 2 |v′ r| ·M(δ) 1 ∑ j=0 β ′ j∥u j∥2 2.(115) Note thatβ ′ j <4β j. Further, there exists sufficiently smallδ 1,1 >0 such that|v ′ r| ≤ 2|vr|. Thus, for any 0<δ≤δ 1,1, we have |ek| ≤4|v r| ·M(δ) β0∥u0∥2 2 +β 1∥u1∥2 2 ,∀θ ′ ∈B(θ,...
-
[39]
vi ·σ d ∑ l=1 wi,l ·X l,k ! −y k # ·
+o(∥∆wr+1∥2 2) (147a) = β 2 2λ0 + β 2 2λ1 b⊤Hin r b+o(∥βb∥ 2 2)(147b) = β 2 2 1 λ0 + 1 λ1 λmin(Hin r ) +o(β 2)(147c) where (147b) follows from (143). Finally, combining (145), (147), and the decom- position (20), the difference of the empirical loss is given by L(θ ′)−L(θ) = 1 2 m ∑ i=1 ∆ti 2 2 + * ∆y, m ∑ i=1 ∆ti + = β 2 2 1 λ0 + 1 λ1 λmin(Hin r ) +o(β 2...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.