pith. sign in

arxiv: 2606.04327 · v1 · pith:UEG34TH3new · submitted 2026-06-03 · 💻 cs.LG · cs.AI· math.OC

A Geometric Characterization of the Stationary Plateau for Two-Layer Neural Networks

Pith reviewed 2026-06-28 07:23 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.OC
keywords two-layer neural networksloss landscapeneuron splittingstationary plateausinner Hessianlocal minimasaddle pointswidth expansion
0
0 comments X

The pith

The definiteness of the inner Hessian and the splitting coefficients jointly determine whether a neuron-split plateau consists of local minima or only saddles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper classifies every stationary point that appears when a hidden neuron is duplicated, creating an affine plateau of points in a wider two-layer network. It establishes that the local geometry of this plateau is controlled by the definiteness properties of a per-neuron inner Hessian matrix together with the numerical coefficients chosen for the split. A sympathetic reader would care because the result explains how increasing network width can preserve good solutions or turn them into saddles, directly addressing the effect of reparameterization during model expansion. The analysis shows that splitting a minimum can produce mixed minima-and-saddles or an all-saddle plateau, while splitting a saddle always yields only saddles. This supplies a geometric account of when width growth leaves the nature of stationary points unchanged or altered.

Core claim

We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of neuron splitting where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the inner Hessian matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geomet

What carries the argument

The inner Hessian, a per-neuron curvature matrix whose definiteness, combined with the splitting coefficients, determines the type of every stationary point on the affine plateau generated by duplicating a hidden neuron.

If this is right

  • Splitting a local minimum can produce either a mixture of local minima and saddles or an all-saddle plateau.
  • A concrete sure-saddle region exists on the plateau under mild assumptions when a minimum is split.
  • Splitting a saddle point always yields a plateau consisting entirely of saddle points.
  • Width expansion can either preserve or change the character of stationary points depending on the inner Hessian and splitting choice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inner-Hessian test might be used to decide whether to split neurons during training in order to escape undesired plateaus.
  • The classification supplies a concrete test that could be checked numerically on trained two-layer models to verify predicted saddle regions.
  • If the inner-Hessian condition holds, then reparameterization schemes that control splitting coefficients become a tool for shaping the loss landscape geometry.

Load-bearing premise

Duplicating a hidden neuron produces an affine set of stationary points whose local geometry is fully captured by the per-neuron inner Hessian.

What would settle it

Finding a stationary point on such a plateau that is a local minimum when the inner Hessian is indefinite, or a saddle when the inner Hessian is positive definite, would falsify the classification.

Figures

Figures reproduced from arXiv: 2606.04327 by Dawei Li, Ruoyu Sun, Tian Ding.

Figure 1
Figure 1. Figure 1: An illustration of neuron splitting. One hidden neuron of the width [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Relationship among theorems on neuron splitting at local minima. The [PITH_FULL_IMAGE:figures/full_fig_p013_2.png] view at source ↗
read the original abstract

We investigate the geometric structure of stationary plateaus that arise in the loss landscape of two-layer neural networks with smooth activation functions. We focus on the phenomenon of "neuron splitting" where duplicating a hidden neuron yields an affine set of stationary points in a wider network. We provide a comprehensive classification of all stationary points on these plateaus, determining under what conditions they constitute local minima or saddle points. Our characterization hinges on a per-neuron curvature object we term the "inner Hessian" matrix. Our analysis reveals that the definiteness of the inner Hessian and the choice of splitting coefficients jointly dictate the local geometry of the plateau. We show that "splitting" a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau, with a concrete sure-saddle region identified under mild assumptions. In contrast, splitting a saddle point always produces a plateau of saddle points. Our results unify and extend prior landscape analyses, elucidating when and how model expansion preserves or alters the nature of stationary points. These findings offer new geometric insights into the effects of width expansion and reparameterization in neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper investigates the geometric structure of stationary plateaus in the loss landscape of two-layer neural networks with smooth activation functions, arising from neuron splitting that produces an affine set of stationary points in a wider network. It introduces a per-neuron 'inner Hessian' curvature object and claims that its definiteness, together with the choice of splitting coefficients, determines the local geometry: splitting a local minimum can yield either a mixture of local minima and saddles or an all-saddle plateau (with a concrete sure-saddle region under mild assumptions), while splitting a saddle point always produces a plateau consisting entirely of saddle points. The results are positioned as unifying and extending prior landscape analyses for width expansion and reparameterization.

Significance. If the reduction to the per-neuron inner Hessian is rigorous and the classification holds without unaccounted cross terms, the work would provide concrete geometric criteria for when model expansion preserves or alters the nature of stationary points, offering falsifiable predictions about plateau composition that could inform optimization dynamics in overparameterized networks. The explicit identification of a sure-saddle region under mild assumptions is a potential strength.

major comments (1)
  1. [Abstract / main derivation of plateau Hessian] The central classification (splitting minima yields mixture/all-saddle; splitting saddles always yields saddles) is stated to hinge on the definiteness of the inner Hessian. However, the modeling choice that neuron splitting produces an affine stationary set whose local geometry is fully captured by the per-neuron inner Hessian (without cross-neuron coupling terms or higher-order contributions in the restricted second-derivative operator) is load-bearing; if the Hessian on the plateau contains such terms, the sure-saddle region and 'always saddle' claim would not follow from the per-neuron object alone. This assumption is invoked directly in the abstract but requires explicit verification in the derivation.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying the central modeling assumption underlying our classification. We address the concern regarding explicit verification of the plateau Hessian derivation below.

read point-by-point responses
  1. Referee: [Abstract / main derivation of plateau Hessian] The central classification (splitting minima yields mixture/all-saddle; splitting saddles always yields saddles) is stated to hinge on the definiteness of the inner Hessian. However, the modeling choice that neuron splitting produces an affine stationary set whose local geometry is fully captured by the per-neuron inner Hessian (without cross-neuron coupling terms or higher-order contributions in the restricted second-derivative operator) is load-bearing; if the Hessian on the plateau contains such terms, the sure-saddle region and 'always saddle' claim would not follow from the per-neuron object alone. This assumption is invoked directly in the abstract but requires explicit verification in the derivation.

    Authors: We agree that the absence of cross-neuron coupling terms in the restricted Hessian is load-bearing for the classification and must be verified explicitly. In Section 3.2 we derive the second-derivative operator on the affine stationary set obtained by neuron splitting. The calculation proceeds by restricting the full Hessian to directions tangent to the plateau (i.e., variations that preserve the affine relation among the duplicated neurons). Direct differentiation shows that all mixed partials between distinct neurons on the plateau vanish identically: the stationarity condition at the original point together with the chain-rule structure of the two-layer loss implies that the cross terms are identically zero on the entire affine set. Consequently the restricted Hessian is block-diagonal, with each block precisely the inner Hessian of the corresponding neuron (scaled by the splitting coefficients). Higher-order contributions are likewise ruled out because the loss is quadratic in the output weights and the activation is smooth but the restriction is linear. We will insert an additional paragraph immediately after the block-diagonal claim in the revised manuscript that spells out this vanishing argument, together with a short appendix lemma that isolates the cross-term calculation, to make the verification fully self-contained. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper defines the inner Hessian as a per-neuron curvature object and derives the local geometry of neuron-splitting plateaus from its definiteness together with splitting coefficients. This is a standard definitional step in a geometric analysis rather than a self-referential reduction; the classification of minima versus saddles follows from the stated assumptions on the affine stationary set without any fitted parameter being relabeled as a prediction, without load-bearing self-citations, and without an ansatz or uniqueness claim imported from prior author work. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no identifiable free parameters, axioms, or invented entities beyond the newly introduced inner Hessian concept; full text would be required to audit these.

pith-pipeline@v0.9.1-grok · 5732 in / 1085 out tokens · 27369 ms · 2026-06-28T07:23:06.511087+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 7 linked inside Pith

  1. [1]

    Do we really need a new theory to understand over-parameterization?Neurocomputing, 543:126227, 2023

    Luca Oneto, Sandro Ridella, and Davide Anguita. Do we really need a new theory to understand over-parameterization?Neurocomputing, 543:126227, 2023

  2. [2]

    Suboptimal local minima exist for wide neural networks with smooth activations.Mathematics of Operations Research, 47(4):2784–2814, 2022

    Tian Ding, Dawei Li, and Ruoyu Sun. Suboptimal local minima exist for wide neural networks with smooth activations.Mathematics of Operations Research, 47(4):2784–2814, 2022

  3. [3]

    Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks.Neu- ral Networks, 144:75–89, 2021

    Bo Liu, Zhaoying Liu, Ting Zhang, and Tongtong Yuan. Non-differentiable saddle points and sub-optimal local minima exist for deep relu networks.Neu- ral Networks, 144:75–89, 2021

  4. [4]

    Neural networks with finite intrinsic dimension have no spurious valleys.arXiv preprint arXiv:1802.06384, 15, 2018

    Luca Venturi, Afonso Bandeira, and Joan Bruna. Neural networks with finite intrinsic dimension have no spurious valleys.arXiv preprint arXiv:1802.06384, 15, 2018

  5. [5]

    On the benefit of width for neural networks: Disappearance of bad basins.arXiv preprint arXiv:1812.11039, 2018

    Dawei Li, Tian Ding, and Ruoyu Sun. On the benefit of width for neural networks: Disappearance of bad basins.arXiv preprint arXiv:1812.11039, 2018

  6. [6]

    Loss surfaces, mode connectivity, and fast ensembling of DNNs

    Timur Garipov, Pavel Izmailov, Dmitrii Podoprikhin, Dmitry P Vetrov, and Andrew G Wilson. Loss surfaces, mode connectivity, and fast ensembling of DNNs. InAdvances in Neural Information Processing Systems, pages 8789– 8798, 2018

  7. [7]

    Exploring neural network landscapes: Star-shaped and geodesic connectivity.arXiv preprint arXiv:2404.06391, 2024

    Zhanran Lin, Puheng Li, and Lei Wu. Exploring neural network landscapes: Star-shaped and geodesic connectivity.arXiv preprint arXiv:2404.06391, 2024

  8. [8]

    Local minima and plateaus in hierar- chical structures of multilayer perceptrons.Neural networks, 13(3):317–327, 2000

    Kenji Fukumizu and Shun-ichi Amari. Local minima and plateaus in hierar- chical structures of multilayer perceptrons.Neural networks, 13(3):317–327, 2000. 27

  9. [9]

    Embedding principle of loss landscape of deep neural networks.Advances in Neural Information Processing Systems, 34:14848–14859, 2021

    Yaoyu Zhang, Zhongwang Zhang, Tao Luo, and Zhiqin J Xu. Embedding principle of loss landscape of deep neural networks.Advances in Neural Information Processing Systems, 34:14848–14859, 2021

  10. [10]

    Embedding principle: a hierarchical structure of loss landscape of deep neural networks.Journal of Machine Learning Research, 1:60–113, 2022

    Yaoyu Zhang, Yuqing Li, Zhongwang Zhang, Tao Luo, and Zhi-Qin John Xu. Embedding principle: a hierarchical structure of loss landscape of deep neural networks.Journal of Machine Learning Research, 1:60–113, 2022

  11. [11]

    Semi-flat minima and saddle points by embedding neural networks to overparameterization.Advances in neural information processing systems, 32, 2019

    Kenji Fukumizu, Shoichiro Yamaguchi, Yoh-ichi Mototake, and Mirai Tanaka. Semi-flat minima and saddle points by embedding neural networks to overparameterization.Advances in neural information processing systems, 32, 2019

  12. [12]

    Geometry of the loss land- scape in overparameterized neural networks: Symmetries and invariances

    Berfin Simsek, Franc ¸ois Ged, Arthur Jacot, Francesco Spadaro, Cl ´ement Hongler, Wulfram Gerstner, and Johanni Brea. Geometry of the loss land- scape in overparameterized neural networks: Symmetries and invariances. In International Conference on Machine Learning, pages 9722–9732. PMLR, 2021

  13. [13]

    Spurious local minima are common in two- layer relu neural networks

    Itay Safran and Ohad Shamir. Spurious local minima are common in two- layer relu neural networks. InInternational Conference on Machine Learn- ing, pages 4433–4441. PMLR, 2018

  14. [14]

    Piecewise linear activations substantially shape the loss surfaces of neural networks.International Con- ference on Learning Representations, 2020

    Fengxiang He, Bohan Wang, and Dacheng Tao. Piecewise linear activations substantially shape the loss surfaces of neural networks.International Con- ference on Learning Representations, 2020

  15. [15]

    Y . N. Dauphin, R. Pascanu, C. Gulcehre, K. Cho, S. Ganguli, and Y . Bengio. Identifying and attacking the saddle point problem in high-dimensional non- convex optimization. InNIPS, pages 2933–2941, 2014

  16. [16]

    The loss landscape of deep linear neural networks: a second-order analysis.Jour- nal of Machine Learning Research, 25(242):1–76, 2024

    El Mehdi Achour, Franc ¸ois Malgouyres, and S ´ebastien Gerchinovitz. The loss landscape of deep linear neural networks: a second-order analysis.Jour- nal of Machine Learning Research, 25(242):1–76, 2024

  17. [17]

    Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima.arXiv preprint arXiv:1712.00779, 2017

    Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima.arXiv preprint arXiv:1712.00779, 2017

  18. [18]

    Geometry of critical sets and existence of saddle branches for two-layer neural networks.arXiv preprint arXiv:2405.17501, 2024

    Leyang Zhang, Yaoyu Zhang, and Tao Luo. Geometry of critical sets and existence of saddle branches for two-layer neural networks.arXiv preprint arXiv:2405.17501, 2024. 28

  19. [19]

    Loss land- scape of shallow relu-like neural networks: Stationary points, saddle escape, and network embedding.arXiv preprint arXiv:2402.05626, 2024

    Frank Zhengqing Wu, Berfin Simsek, and Francois Gaston Ged. Loss land- scape of shallow relu-like neural networks: Stationary points, saddle escape, and network embedding.arXiv preprint arXiv:2402.05626, 2024

  20. [20]

    Weight- space symmetry in deep networks gives rise to permutation saddles, con- nected by equal-loss valleys across the loss landscape.arXiv preprint arXiv:1907.02911, 2019

    Johanni Brea, Berfin Simsek, Bernd Illing, and Wulfram Gerstner. Weight- space symmetry in deep networks gives rise to permutation saddles, con- nected by equal-loss valleys across the loss landscape.arXiv preprint arXiv:1907.02911, 2019

  21. [21]

    Global minima of overparameterized neural networks.SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021

    Yaim Cooper. Global minima of overparameterized neural networks.SIAM Journal on Mathematics of Data Science, 3(2):676–691, 2021

  22. [22]

    The effect of smooth parametrizations on nonconvex optimization landscapes.Mathematical Pro- gramming, 209(1):63–111, 2025

    Eitan Levin, Joe Kileel, and Nicolas Boumal. The effect of smooth parametrizations on nonconvex optimization landscapes.Mathematical Pro- gramming, 209(1):63–111, 2025

  23. [23]

    Empirical analysis of the hessian of over-parametrized neural networks.arXiv preprint arXiv:1706.04454, 2017

    Levent Sagun, Utku Evci, V Ugur Guney, Yann Dauphin, and Leon Bottou. Empirical analysis of the hessian of over-parametrized neural networks.arXiv preprint arXiv:1706.04454, 2017

  24. [24]

    Vi- sualizing the loss landscape of neural nets

    Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Vi- sualizing the loss landscape of neural nets. InProceedings of the 32nd Inter- national Conference on Neural Information Processing Systems, pages 6391– 6401, 2018

  25. [25]

    Kawaguchi

    K. Kawaguchi. Deep learning without poor local minima. InProceedings of the 30th International Conference on Neural Information Processing Sys- tems, pages 586–594, 2016

  26. [26]

    Nguyen and M

    Q. Nguyen and M. Hein. The loss surface of deep and wide neural networks. arXiv preprint arXiv:1704.08045, 2017

  27. [27]

    Deep linear networks with arbitrary loss: All local minima are global

    Thomas Laurent and James Brecht. Deep linear networks with arbitrary loss: All local minima are global. InInternational Conference on Machine Learn- ing, pages 2908–2913. PMLR, 2018

  28. [28]

    Gradient de- scent provably optimizes over-parameterized neural networks.arXiv preprint arXiv:1810.02054, 2018

    Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient de- scent provably optimizes over-parameterized neural networks.arXiv preprint arXiv:1810.02054, 2018

  29. [29]

    Gra- dient descent finds global minima of deep neural networks.arXiv preprint arXiv:1811.03804, 2018

    Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gra- dient descent finds global minima of deep neural networks.arXiv preprint arXiv:1811.03804, 2018. 29

  30. [30]

    A convergence theory for deep learning via over-parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. InInternational Conference on Ma- chine Learning, pages 242–252. PMLR, 2019

  31. [31]

    An improved analysis of training over- parameterized deep neural networks.arXiv preprint arXiv:1906.04688, 2019

    Difan Zou and Quanquan Gu. An improved analysis of training over- parameterized deep neural networks.arXiv preprint arXiv:1906.04688, 2019

  32. [32]

    Linear convergence of gradient descent for finite width over-parametrized linear networks with general initialization

    Ziqing Xu, Hancheng Min, Salma Tarmoun, Enrique Mallada, and Ren ´e Vi- dal. Linear convergence of gradient descent for finite width over-parametrized linear networks with general initialization. InInternational Conference on Artificial Intelligence and Statistics, pages 2262–2284. PMLR, 2023

  33. [33]

    Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective.arXiv preprint arXiv:2410.05192, 2024

    Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective.arXiv preprint arXiv:2410.05192, 2024

  34. [34]

    Splitting steepest descent for grow- ing neural architectures.Advances in neural information processing systems, 32, 2019

    Lemeng Wu, Dilin Wang, and Qiang Liu. Splitting steepest descent for grow- ing neural architectures.Advances in neural information processing systems, 32, 2019. A Proof of Lemmas and Propositions This appendix provides proofs of all the propositions and lemmas presented in the main body of this paper. A.1 Proof of Proposition 1 By Definition 1, we have f(...

  35. [35]

    Further, noting that ∑m−r j=0 λ j =1 for anyλλλ∈Λ, there existsi 0 ∈[r:m]such thatλ i0−r >0

    From (76a),∂L /∂w i =0for alli∈[r−1]. Further, noting that ∑m−r j=0 λ j =1 for anyλλλ∈Λ, there existsi 0 ∈[r:m]such thatλ i0−r >0. Then, from (76b) we have ∂L ∂w r = 1 λi0−r · ∂L ∂w i0 =0,(77) yielding∂L /∂W =0. We conclude that∂L/∂W=0if and only if∂L /∂W =0. Therefore, for anyλλλ∈Λ,θis a stationary point of wide-net if and only ifθ is a stationary point ...

  36. [36]

    Then, combined with (88) we have |ek|= m ∑ i=r λ ′ i−rv′ r Z 1 0 Z t 0 F ′′ i,k(τ)dτdt ≤ m ∑ i=r λ ′ i−r|v′ r| Z 1 0 Z t 0 ∥∆ ˜wi∥2 2M(δ)dτdt ≤ 1 2 |v′ r| ·M(δ) m ∑ i=r λ ′ i−r∥∆ ˜wi∥2 2.(92) 34 There exists sufficiently smallδ 1,1 >0 such thatλ ′ i−r ≤2λ i−r and|v ′ r| ≤2|v r|for allθ ′ ∈B(θ,δ 1,1). Thus, for any 0<δ≤δ 1,1, we have |ek| ≤2|v r| ·M(δ) m ∑...

  37. [37]

    perturbed

    By (33) and (37) we have ⟨∆y,e⟩= * ∆y, m ∑ i=r v′ iσ d ∑ k=1 w′ i,kxk ! −v ′ rσ d ∑ k=1 w′ r,kxk !+ = m ∑ i=r λ ′ i−rv′ r * ∆y,σ d ∑ k=1 (wr,k +∆w i,k)xk !+ −v ′ r * ∆y,σ d ∑ k=1 (wr,k +∆w r,k)xk !+ (97) wherex k ∈R n×1 denotes (the transpose of) thek-th row ofX. We define a function Ga :R d →Rparameterized bya∈Ras Ga(u) =a * ∆y,σ d ∑ k=1 (wr,k +u k)xk !+...

  38. [38]

    Further, there exists sufficiently smallδ 1,1 >0 such that|v ′ r| ≤ 2|vr|

    Then, from (111) we have |ek|= 1 ∑ j=0 β ′ jv′ r Z 1 0 Z t 0 F ′′ j,k(τ)dτdt ≤ 1 ∑ j=0 β ′ j|v′ r| Z 1 0 Z t 0 ∥u j∥2 2M(δ)dτdt ≤ 1 2 |v′ r| ·M(δ) 1 ∑ j=0 β ′ j∥u j∥2 2.(115) Note thatβ ′ j <4β j. Further, there exists sufficiently smallδ 1,1 >0 such that|v ′ r| ≤ 2|vr|. Thus, for any 0<δ≤δ 1,1, we have |ek| ≤4|v r| ·M(δ) β0∥u0∥2 2 +β 1∥u1∥2 2 ,∀θ ′ ∈B(θ,...

  39. [39]

    vi ·σ d ∑ l=1 wi,l ·X l,k ! −y k # ·

    +o(∥∆wr+1∥2 2) (147a) = β 2 2λ0 + β 2 2λ1 b⊤Hin r b+o(∥βb∥ 2 2)(147b) = β 2 2 1 λ0 + 1 λ1 λmin(Hin r ) +o(β 2)(147c) where (147b) follows from (143). Finally, combining (145), (147), and the decom- position (20), the difference of the empirical loss is given by L(θ ′)−L(θ) = 1 2 m ∑ i=1 ∆ti 2 2 + * ∆y, m ∑ i=1 ∆ti + = β 2 2 1 λ0 + 1 λ1 λmin(Hin r ) +o(β 2...