
arxiv: 2604.09412 · v2 · pith:5BLQ2JLInew · submitted 2026-04-10 · stat.ML · cond-mat.dis-nn · cs.LG

Sharp description of local minima in the loss landscape of high-dimensional two-layer ReLU neural networks

Pith reviewed 2026-05-10 16:40 UTC · model grok-4.3

classification: stat.ML, cond-mat.dis-nn, cs.LG
keywords: loss landscape, local minima, ReLU networks, summary statistics, teacher-student setting, stochastic gradient descent, overparameterization, population loss

The pith

Local minima in two-layer ReLU networks admit an exact low-dimensional representation via summary statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines the population loss landscape of two-layer ReLU networks of the form ∑_{k=1}^K ReLU(w_k^T x) in a realizable teacher-student setting with Gaussian covariates. It shows that every local minimum can be exactly captured by a small collection of summary statistics instead of the full high-dimensional weight vectors. This reduction supplies a sharp, interpretable map of the landscape and directly ties each minimum to an attractive fixed point of the one-pass SGD dynamics written in the same summary-statistics coordinates. The resulting picture explains why minima remain isolated when the model is well-specified yet acquire flat connecting directions once the network width exceeds the teacher width, thereby making global solutions more reachable.
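To make the reduction concrete, here is a minimal sketch of how the population loss can be evaluated from overlaps alone. It assumes isotropic inputs x ~ N(0, I), a loss normalised as one half of the mean squared error, and the standard closed form for E[ReLU(u) ReLU(v)] under a bivariate Gaussian (the degree-1 arc-cosine kernel). The names `relu_kernel`, `gram`, and `population_loss` are hypothetical, and Q and R follow the figure captions' names for student-student and student-teacher overlaps; this is not the paper's code.

```python
import math

def relu_kernel(q_uu, q_vv, q_uv):
    """E[ReLU(u) ReLU(v)] for zero-mean jointly Gaussian (u, v) with
    Var(u) = q_uu, Var(v) = q_vv, Cov(u, v) = q_uv."""
    norm = math.sqrt(q_uu * q_vv)
    if norm == 0.0:
        return 0.0  # a zero-norm unit contributes nothing
    cos_t = max(-1.0, min(1.0, q_uv / norm))  # clip rounding noise
    theta = math.acos(cos_t)
    return norm * (math.sin(theta) + (math.pi - theta) * cos_t) / (2.0 * math.pi)

def gram(A, B):
    """Matrix of inner products between the rows of A and the rows of B."""
    return [[sum(a * b for a, b in zip(ra, rb)) for rb in B] for ra in A]

def population_loss(W, W_star):
    """0.5 * E[(student(x) - teacher(x))^2], computed from overlaps only.
    The 0.5 prefactor is an assumed convention."""
    Q = gram(W, W)            # student-student overlaps
    R = gram(W, W_star)       # student-teacher overlaps
    T = gram(W_star, W_star)  # teacher-teacher overlaps (name assumed here)
    K, M = len(W), len(W_star)
    ss = sum(relu_kernel(Q[i][i], Q[j][j], Q[i][j]) for i in range(K) for j in range(K))
    st = sum(relu_kernel(Q[i][i], T[n][n], R[i][n]) for i in range(K) for n in range(M))
    tt = sum(relu_kernel(T[m][m], T[n][n], T[m][n]) for m in range(M) for n in range(M))
    return 0.5 * (ss - 2.0 * st + tt)

# Usage: a student that copies an orthonormal teacher has vanishing loss.
W_star = [[1.0, 0.0], [0.0, 1.0]]
loss_at_teacher = population_loss(W_star, W_star)
```

The weight vectors enter only through `gram`, so the input dimension never appears explicitly: two configurations with the same overlaps have the same loss under these assumptions.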

Core claim

Local minima admit an exact low-dimensional representation in terms of summary statistics, yielding a sharp and interpretable characterisation of the landscape. Local minima correspond to attractive fixed points of the dynamics in summary statistics space. This perspective reveals a hierarchical structure of minima: they are typically isolated in the well-specified regime, but become connected by flat directions as network width increases. In the overparameterised regime, global minima become increasingly accessible, attracting the dynamics and reducing convergence to spurious solutions.

What carries the argument

The exact low-dimensional summary-statistics representation of local minima, which collapses the high-dimensional loss surface to a tractable reduced space and identifies the minima as fixed points of the corresponding SGD flow.

If this is right

  • Local minima correspond to attractive fixed points of one-pass SGD dynamics in summary statistics space.
  • Minima remain isolated when the teacher-student model is well-specified.
  • Flat directions appear and connect minima once network width exceeds the teacher width.
  • Global minima attract the reduced dynamics more strongly in the overparameterized regime, limiting trapping at spurious solutions.
  • Common simplifying assumptions about the loss landscape miss these connectivity features even for minimal two-layer models.
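The simplest flat direction of this kind can be checked by hand, and the sketch below does so numerically. It is an illustration for global minima only, not the paper's construction for families of local minima: with one extra student unit, a teacher unit w can be split into a·w and (1 − a)·w for any a in [0, 1], and positive homogeneity of ReLU leaves the network function unchanged. All function names are hypothetical.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

def network(W, x):
    """Two-layer network sum_k ReLU(w_k . x) with unit second-layer weights."""
    return sum(relu(sum(wi * xi for wi, xi in zip(w, x))) for w in W)

def split_first_unit(W_star, a):
    """Student with K = M + 1 units: the first teacher unit w is replaced
    by the pair a*w and (1-a)*w; the other units are copied."""
    first = W_star[0]
    return ([[a * v for v in first], [(1.0 - a) * v for v in first]]
            + [list(w) for w in W_star[1:]])

def max_output_gap(W_star, a, n_samples=200, seed=0):
    """Largest |student(x) - teacher(x)| over Gaussian test inputs."""
    rng = random.Random(seed)
    d = len(W_star[0])
    student = split_first_unit(W_star, a)
    gap = 0.0
    for _ in range(n_samples):
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        gap = max(gap, abs(network(student, x) - network(W_star, x)))
    return gap

W_star = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
# Every a in [0, 1] reproduces the teacher, so the zero-loss set is a segment.
gaps_on_segment = [max_output_gap(W_star, a / 10.0) for a in range(11)]
# Outside [0, 1] one of the two copies flips sign and the identity fails.
gap_outside = max_output_gap(W_star, 1.5)
```

No such one-parameter family is available when K = M and every student unit must match a distinct teacher unit, which is consistent with the isolated minima described for the well-specified case.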

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same summary-statistics reduction might be used to track convergence speed of other first-order methods by writing their updates in the reduced coordinates.
  • If the data distribution deviates from Gaussian, the exact fixed-point equations would change, potentially altering the connectivity of minima.
  • Initialization schemes that place the summary statistics near the attractive fixed points of the reduced dynamics could be designed and tested directly.
  • The hierarchical connectivity pattern may appear in deeper ReLU networks, offering a route to analyze how depth and width jointly shape landscape structure.

Load-bearing premise

The derivation assumes a realizable teacher-student setting with Gaussian covariates and networks consisting of a finite sum of ReLUs.

What would settle it

Numerical optimization locates a local minimum whose full parameter vector fails to satisfy the closed set of equations that the summary statistics must obey at a stationary point.
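One way to run such a check is to evaluate the stationarity conditions using overlaps only. The sketch below is a hedged illustration, not the paper's equations: it assumes x ~ N(0, I), a loss equal to one half of the mean squared error, and the closed form E[1(w·x > 0) ReLU(v·x) x] = (‖v‖ sin θ · w/‖w‖ + (π − θ) v) / (2π), where θ is the angle between w and v. The gradient with respect to each student vector then lies in the span of the student and teacher vectors, so it vanishes exactly when all of its projections onto those vectors vanish. The names `gram`, `reduced_gradient`, and `stationarity_residual` are hypothetical.

```python
import math

def gram(V):
    """Gram matrix of a list of vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in V] for u in V]

def reduced_gradient(G, K):
    """G is the Gram matrix of the stacked vectors [w_1..w_K, w*_1..w*_M].
    Returns P with P[k][a] = u_a . grad_{w_k} L, using entries of G only."""
    n = len(G)
    P = [[0.0] * n for _ in range(K)]
    for k in range(K):
        for b in range(n):
            norm_k, norm_b = math.sqrt(G[k][k]), math.sqrt(G[b][b])
            if norm_k == 0.0 or norm_b == 0.0:
                continue  # the closed form is not defined for zero vectors
            cos_t = max(-1.0, min(1.0, G[k][b] / (norm_k * norm_b)))
            theta = math.acos(cos_t)
            sign = 1.0 if b < K else -1.0  # students enter with +, teachers with -
            for a in range(n):
                P[k][a] += sign * (norm_b * math.sin(theta) * G[k][a] / norm_k
                                   + (math.pi - theta) * G[b][a]) / (2.0 * math.pi)
    return P

def stationarity_residual(W, W_star):
    """Largest absolute projected gradient; zero at a stationary point."""
    G = gram([list(w) for w in W] + [list(w) for w in W_star])
    return max(abs(p) for row in reduced_gradient(G, len(W)) for p in row)

W_star = [[1.0, 0.0], [0.0, 1.0]]
residual_at_teacher = stationarity_residual(W_star, W_star)
residual_rescaled = stationarity_residual([[2.0, 0.0], [0.0, 1.0]], W_star)
```

A candidate minimum returned by an optimiser in weight space could be passed through `stationarity_residual`; a residual well above the optimiser's tolerance at a point that is nevertheless a local minimum would be the kind of counterexample described above.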

Figures

Figures reproduced from arXiv: 2604.09412 by Bruno Loureiro, Jie Huang, Stefano Sarao Mannelli.

Figure 1: Geometry and statistics of local minima. (a) Schematic comparison of the loss landscape. In the well-specified regime (left), minima are isolated points (marked ‘x’), whereas in the over-parameterised regime (right), they form continuous connected manifolds (red segments). (b) Validation of the theoretical predictions. The histogram shows the distribution of population risk reached by gradient flow ($10^4$ ru…
Figure 2: Loss families and theoretical value. Histogram of final loss values obtained from ODE dynamics at M = 17 for K = M (left), K = M + 1 (center), and K = M + 2 (right), starting from $10^4$ initialisations in the orthonormal teacher configuration. Dashed vertical lines indicate the corresponding theoretical loss values obtained from Result 2. For K = M and K = M + 1, the theoretical values capture the locations of…
Figure 3: Left: Loss along the string path. We show the loss along the string path connecting two local minima with different weights in the K = M case and the K = M + 1 case, obtained from the string method in GD. For K = M, the path crosses several different loss barriers, while for K = M + 1 the string remains at constant loss, indicating a flat direction connecting the minima. Right: Perturbative analysis of fixed po…
Figure 4: Connectivity of solution manifolds in the over-parameterised regime (K = M + 1). Loss profiles along the minimum loss paths (computed via the string method) connecting two symmetric realisations of the same local minimum. Different colours correspond to different families indexed by k1 (the number of anti-aligned units). The paths are perfectly flat, indicating that the isolated fixed points of the well-sp…
Figure 5: Training dynamics and loss quantisation across parameterisation regimes. Evolution of population risk under gradient flow for M = 20 teacher units (1,000 random initialisations per condition, learning rate η = 0.1, orthonormal teacher configuration). All regimes exhibit entrapment in discrete high-loss plateaux. However, while the well-specified case (K=20, orange) is dominated by these suboptimal attract…
Figure 6: Comparison between mean-field ODE predictions and network simulations. Panel (a) shows the training loss, while panels (b) and (c) display representative entries of the order parameters R and Q as functions of training steps. Solid lines denote simulation results and markers indicate ODE predictions. A.1 Population Gradient. We provide additional details on the ODEs for our learning problems. The network is…
Figure 7: Gaussian mixture analysis of diagonal order parameters. The left panel shows the distribution of diagonal elements of the matrix Q, which is well described by a two-component Gaussian mixture model. The right panel shows the distribution of diagonal elements of the matrix T, which follows a single Gaussian distribution centred at 1. The data are obtained by grouping order parameters within the same minima f…
Figure 8: Different order parameters at K = 18 and M = 17. The plots show the values of R (first row) and Q (second row) for minima representative of the different families. From left to right, we see results for k1 = 0, 2, 3, 4, 5, 6. B.2 Solution of the Fixed Point Equation Under the Ansatz. The fixed point equations of dQik/dt, dRin/dt (Eqs. 22 and Eqs. 21) simplify after substituting the ansatz defined in Eqs. 8.…
Figure 9 displays the order parameters (R∗, Q∗) obtained directly from the numerical solver for K = 18 and M = 17 for different k1. The resulting structures and losses closely match those shown in …
Figure 10 illustrates the impact of finite dimensionality on the final population loss distribution. We compare the empirical histograms obtained from finite d simulations against the idealised infinite-dimensional case. While the discrete, quantised hierarchy of the local minima is strictly preserved, finite dimensions introduce variance. This results in a progressive broadening of the loss distribution peaks arou…
Figure 11: Student-teacher overlap matrix (R) across different dimensions. Heatmaps of the R matrix for 10 randomly sampled configurations converging to the first-order local minimum (k1 = 1). The top row represents the idealised d → ∞ limit, while subsequent rows correspond to finite dimensions d = 784, 392, and 196. The block-symmetric structure predicted by our theoretical ansatz clearly emerges across all regime…
Figure 12: Illustration of the string method. In this section, we briefly introduce the string method [Weinan et al., 2002, Ren et al., 2007, Samanta and Weinan, 2013], which is used to investigate whether distinct minima are separated by energy barriers or connected by flat valleys. This algorithm seeks the Minimum Energy Path (MEP) connecting two fixed configurations in the order parameter space. Conceptually simi…
Figure 13: R of different settings. Examples of endpoint configurations within the k1 = 2 family in the overparameterised regime, shown here through the corresponding R matrices. For the strings in …
Figure 14: Strings of different settings. Loss values along strings connecting the endpoint configurations shown in …
Figure 15: Order Parameters of Leaky ReLU in K = M = 12. G.3 Sigmoidal (erf). Consider the sigmoidal activation function defined by the error function: $g(x) = \mathrm{erf}\left(x/\sqrt{2}\right)$ (60). The derivative is proportional to a Gaussian: $g'(x) = \sqrt{2/\pi}\, e^{-x^2/2}$. Two-Variable Case ($I_2$): Based on the arcsine law for Gaussian integrals, the result is $I_2^{\mathrm{erf}}(\Sigma) = \frac{2}{\pi} \arcsin\left(\frac{C_{12}}{\sqrt{(1 + C_{11})(1 + C_{22})}}\right)$ (61). Three-Variable Case (I…
Figure 16: Order Parameters of erf in K = M = 12.
Figure 17: Loss distribution in different constraints. These histograms show the empirical density of the final population loss after 20,000,000 running steps. (a) Normalized Gradient Descent (nGD) with K = 19, M = 17. Despite overparameterisation, the loss distribution is strictly multimodal, exhibiting discrete peaks at high loss values. (b) Orthonormalised Gradient Descent (onGD) with K = 17, M = 17. The distribut…
Figure 18: Order parameters of local minima in nGD (K = 19, M = 17). The columns correspond to different final losses. Left Two Columns (Best Minima): The R matrices exhibit a near-perfect diagonal structure, indicating successful retrieval of the 17 teacher features. However, the irreducible error persists because the 2 excess students (constrained to Qii = 1) cannot decay to zero. Right Four Columns (Suboptimal Mi…
Figure 19: Final order parameters for onGD (K = M = 17). The columns correspond to different final losses. Bottom Row: The student-student overlap matrices Q retain a strict identity structure (Q = I), satisfying the orthogonality constraint. Top Row: The student-teacher overlap matrices R exhibit a disordered, noise-like pattern. Unlike successful learning scenarios shown in …
Figure 20: Dynamics of onGD in a setting with small hidden layers (K = M = 2). (a) The training loss (log scale) decreases monotonically and converges to a near-zero value (∼ $10^{-3}$), indicating successful optimisation. (b) The final student-teacher overlap matrix R exhibits a clear diagonal structure (red blocks indicate high positive correlation). This confirms that, unlike the case with larger hidden layers (K = M…
Figure 21: Distribution of loss for small networks in onGD under mean-field dynamics. Histograms of the final population loss obtained after long-time integration of the mean-field ODEs under Orthonormalised GD over 10,000 random initialisations. Each panel corresponds to a small student-teacher size configuration (K, M). The top row shows the equal case K = M and the bottom row shows the overparameterised case K = M…
Figure 22: Distribution of loss for wider networks in onGD under mean-field dynamics. Histograms of the final population loss obtained after long-time integration of the mean-field ODEs under Orthonormalised GD over 10,000 random initialisations. Each panel corresponds to a student-teacher size configuration (K, M) with larger widths than those shown in …
Original abstract

We study the population loss landscape of two-layer ReLU networks of the form $\sum_{k=1}^K \mathrm{ReLU}(w_k^\top x)$ in a realisable teacher-student setting with Gaussian covariates. We show that local minima admit an exact low-dimensional representation in terms of summary statistics, yielding a sharp and interpretable characterisation of the landscape. We further establish a direct link with one-pass SGD: local minima correspond to attractive fixed points of the dynamics in summary statistics space. This perspective reveals a hierarchical organisation of minima into discrete families and shows how overparameterisation changes their stability and reachability under gradient-based dynamics. In this overparameterised regime, global minima become increasingly accessible, attracting the dynamics and reducing convergence to spurious solutions. Overall, our results reveal intrinsic limitations of common simplifying assumptions, which may miss essential features of the loss landscape even in minimal neural network models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes the population loss landscape of two-layer ReLU networks of the form ∑_{k=1}^K ReLU(w_k^T x) in a realizable teacher-student setting with Gaussian covariates. It claims that local minima admit an exact low-dimensional representation in terms of summary statistics (inner products with the teacher direction and neuron norms), yielding a sharp and interpretable characterization. It further links these minima to attractive fixed points of one-pass SGD dynamics in summary-statistics space, revealing a hierarchical structure: minima are typically isolated in the well-specified regime but become connected by flat directions as width increases, making global minima more accessible and reducing convergence to spurious solutions.

Significance. If the exact closure and SGD correspondence hold, the work provides a valuable precise description of the loss landscape for a minimal neural-network model, moving beyond bounds or approximations to an exact low-dimensional reduction. The direct connection between landscape geometry and SGD fixed-point dynamics is a notable strength, as is the explanation of how overparameterization creates flat directions that improve accessibility to global minima. These insights could inform analyses of optimization in wider or deeper networks and challenge simplifying assumptions common in the literature.

major comments (1)
  1. [§3 (summary-statistics closure) and §4 (characterization of critical points)] The central claim requires that both the population loss and its gradient (hence the stationarity condition) close exactly under a fixed low-dimensional set of summary statistics. While rotational invariance makes loss closure plausible, the manuscript must verify that ReLU active/inactive sets at critical points introduce no extra degrees of freedom outside the chosen statistics (K inner products plus K norms). If this verification is only for the loss and not the gradient, or if it implicitly assumes all neurons are fully aligned or orthogonal, the representation of local minima is incomplete.
minor comments (2)
  1. [§2] Clarify the precise definition of the summary-statistics vector early in the paper and state explicitly which quantities are assumed known versus derived.
  2. [§5] The hierarchical-structure claim would benefit from a small illustrative example or figure showing the transition from isolated to connected minima as K grows.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading and for highlighting the need to explicitly confirm closure of both the loss and its gradient. We address this point below and will revise the manuscript to strengthen the presentation.

Point-by-point responses
  1. Referee: [§3 (summary-statistics closure) and §4 (characterization of critical points)] The central claim requires that both the population loss and its gradient (hence the stationarity condition) close exactly under a fixed low-dimensional set of summary statistics. While rotational invariance makes loss closure plausible, the manuscript must verify that ReLU active/inactive sets at critical points introduce no extra degrees of freedom outside the chosen statistics (K inner products plus K norms). If this verification is only for the loss and not the gradient, or if it implicitly assumes all neurons are fully aligned or orthogonal, the representation of local minima is incomplete.

    Authors: We agree that explicit verification for the gradient is essential. In §3 the population loss is shown to depend only on the K inner products with the teacher and the K neuron norms, by rotational invariance of the isotropic Gaussian measure. In §4 the stationarity equations are obtained by differentiating under the integral; the ReLU active-set indicator for each neuron has expectation and conditional moments that are functions solely of the norm of w_k and its inner product with the teacher direction, because the only distinguished direction in the problem is the teacher vector. Consequently the active/inactive sets introduce no additional degrees of freedom beyond the chosen summary statistics. The derivation nowhere assumes full alignment or orthogonality; it holds for arbitrary configurations of the summary statistics. To make this closure fully transparent we will add a dedicated lemma in the revised version that states the gradient closure explicitly.

    Revision: partial
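The closure at issue can be made concrete with a short worked formula. The sketch below assumes isotropic inputs $x \sim \mathcal{N}(0, I_d)$, a loss normalised as one half of the mean squared error, non-zero weight vectors, and gradient flow rather than discrete one-pass SGD. These conventions, the helper $g$, and the notation $w^\star_m$ for teacher weights are assumptions for illustration; this is not the lemma promised in the rebuttal.

```latex
% Assumed loss:
%   L(W) = (1/2) E_x [ ( \sum_k ReLU(w_k^T x) - \sum_m ReLU(w*_m^T x) )^2 ]
\nabla_{w_k} L
  = \sum_{j=1}^{K} g(w_k, w_j) - \sum_{m=1}^{M} g(w_k, w^\star_m),
\qquad
g(w, v) := \mathbb{E}_x\!\left[\mathbf{1}\{w^\top x > 0\}\,\mathrm{ReLU}(v^\top x)\,x\right]
  = \frac{1}{2\pi}\left[\|v\|\,\sin\theta\,\frac{w}{\|w\|} + (\pi - \theta)\,v\right],
\qquad
\theta = \arccos\frac{w^\top v}{\|w\|\,\|v\|}.

% Gradient flow \dot w_k = -\nabla_{w_k} L, written for the overlaps
% Q_{kj} = w_k^\top w_j and R_{kn} = w_k^\top w^\star_n:
\dot R_{kn} = -\,{w^\star_n}^{\!\top}\,\nabla_{w_k} L,
\qquad
\dot Q_{kj} = -\,w_j^\top \nabla_{w_k} L \;-\; w_k^\top \nabla_{w_j} L.
```

Since $g(w, v)$ is a combination of $w$ and $v$ whose coefficients depend only on norms and the angle $\theta$, every inner product on the right-hand side is a function of student-student, student-teacher, and teacher-teacher overlaps. In this sketch the full overlap matrices enter, including the off-diagonal student-student entries.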

Circularity Check

0 steps flagged

No circularity: low-dimensional closure follows from Gaussian rotational invariance and explicit ReLU integration

full rationale

The central claim is that the population loss L(w) and its gradient close exactly under a fixed set of summary statistics (inner products with the teacher vector plus weight norms). This reduction is obtained by direct integration against the Gaussian measure and the piecewise-linear structure of ReLU; it is not obtained by fitting parameters to data, by redefining the target in terms of the statistics, or by invoking a self-citation chain. The stationarity condition is then solved inside the reduced coordinates. Because the derivation begins from the explicit model assumptions (realizable teacher-student, isotropic Gaussian covariates) and produces the closure by explicit calculation rather than by construction or renaming, the chain is self-contained and non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

This ledger is based on the abstract only, which mentions no specific free parameters or invented entities. The summary statistics may involve implicit parameters, but the abstract does not detail them.

axioms (2)
  • domain assumption: Gaussian covariates
    The setting uses Gaussian inputs, which simplifies the analysis but is a modeling choice.
  • domain assumption: realizable teacher-student setting
    Assumes the data are generated by a teacher network of the same form.

pith-pipeline@v0.9.0 · 5467 in / 1359 out tokens · 49697 ms · 2026-05-10T16:40:16.865754+00:00 · methodology
