pith. machine review for the scientific record.

arxiv: 1706.04454 · v3 · submitted 2017-06-14 · 💻 cs.LG

Recognition: unknown

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Levent Sagun, Utku Evci, V. Ugur Guney, Yann Dauphin, Leon Bottou

classification 💻 cs.LG
keywords: data, hessian, basins, bulk, large, spectrum, attraction, connected
read the original abstract

We study the properties of common loss surfaces through their Hessian matrix. In particular, in the context of deep learning, we empirically show that the spectrum of the Hessian is composed of two parts: (1) the bulk, centered near zero, and (2) outliers away from the bulk. We present numerical evidence and mathematical justification for the following conjectures laid out by Sagun et al. (2016): fixing the data and increasing the number of parameters merely scales the bulk of the spectrum; fixing the dimension and changing the data (for instance, adding more clusters or making the data less separable) only affects the outliers. We believe that our observations have striking implications for non-convex optimization in high dimensions. First, the flatness of such landscapes (which can be measured by the singularity of the Hessian) implies that classical notions of basins of attraction may be quite misleading, and that the discussion of wide versus narrow basins may need a new perspective centered on over-parametrization and redundancy, which can create large connected components at the bottom of the landscape. Second, the dependence of the small number of large eigenvalues on the data distribution can be linked to the spectrum of the covariance matrix of gradients of the model outputs. With this in mind, we may reevaluate the connections within the data-architecture-algorithm framework of a model, hoping to shed light on the geometry of high-dimensional, non-convex spaces in modern applications. In particular, we present a case that links the two observations: small-batch and large-batch gradient descent appear to converge to different basins of attraction, but we show that they are in fact connected through their flat region and so belong to the same basin.
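The bulk-plus-outliers picture in the abstract can be probed directly on a toy problem. The sketch below (not the paper's code; model size, data, and learning rate are all illustrative assumptions) trains a tiny MLP on two-cluster data, estimates the full Hessian at the end of training by finite differences, and prints the extremes of its eigenspectrum, where one would expect many eigenvalues concentrated near zero and a few large outliers.

```python
# Minimal sketch of the bulk/outlier Hessian experiment. Everything here
# (architecture, data, hyperparameters) is a made-up illustration, and the
# Hessian is estimated by central finite differences rather than autodiff.
import numpy as np

rng = np.random.default_rng(0)

# Toy two-cluster regression data: targets 0 and 1.
X = np.vstack([rng.normal(-1, 0.3, (20, 2)), rng.normal(1, 0.3, (20, 2))])
y = np.array([0.0] * 20 + [1.0] * 20)

# Tiny MLP 2 -> 8 -> 1, all parameters flattened into one vector.
sizes = [(2, 8), (8,), (8, 1), (1,)]
n_params = int(sum(np.prod(s) for s in sizes))

def unpack(theta):
    out, i = [], 0
    for s in sizes:
        n = int(np.prod(s))
        out.append(theta[i:i + n].reshape(s))
        i += n
    return out

def loss(theta):
    W1, b1, W2, b2 = unpack(theta)
    h = np.tanh(X @ W1 + b1)
    pred = (h @ W2 + b2).ravel()
    return 0.5 * np.mean((pred - y) ** 2)

def grad(theta, eps=1e-5):
    # Central finite-difference gradient of the loss.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def hessian(theta, eps=1e-4):
    # Central finite differences of the gradient; symmetrize numerical noise.
    H = np.zeros((theta.size, theta.size))
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        H[:, i] = (grad(theta + e) - grad(theta - e)) / (2 * eps)
    return 0.5 * (H + H.T)

theta = 0.5 * rng.standard_normal(n_params)
for _ in range(500):          # plain gradient descent toward a flat region
    theta -= 0.1 * grad(theta)

eigs = np.sort(np.linalg.eigvalsh(hessian(theta)))
print("smallest eigenvalues:", eigs[:3])   # expect: bulk concentrated near zero
print("largest eigenvalues: ", eigs[-3:])  # expect: a few large outliers
```

On this toy setup the qualitative picture matches the paper's claim: most of the spectrum sits near zero while a handful of eigenvalues, tied to the data's cluster structure, stand apart. Finite differences keep the sketch dependency-free but scale poorly; at realistic sizes one would use Hessian-vector products with an autodiff framework instead.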

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

  2. Backdoor Channels Hidden in Latent Space: Cryptographic Undetectability in Modern Neural Networks

    cs.CR 2026-05 unverdicted novelty 7.0

    Backdoors can be realized as statistically natural latent directions in modern neural networks, achieving high attack success with negligible clean accuracy loss and resisting existing defenses.

  3. Select-then-differentiate: Solving Bilevel Optimization with Manifold Lower-level Solution Sets

    math.OC 2026-05 unverdicted novelty 7.0

    Optimistic bilevel optimization with manifold lower-level minimizers is differentiable if the optimistic selection is unique, yielding a pseudoinverse hyper-gradient and a convergent HG-MS algorithm whose rate depends...

  4. Spectral Surgery: Class-Targeted Post-Hoc Rebalancing via Hessian Spike Perturbation

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral Surgery uses a sensitivity matrix and constrained optimization to perturb weights along Hessian spike eigenvectors, rebalancing per-class accuracy on CIFAR-10 and ISIC-2019 without retraining.

  5. Fast Gauss-Newton for Multiclass Cross-Entropy

    cs.LG 2026-05 unverdicted novelty 7.0

    FGN is a positive semidefinite under-approximation of the multiclass GGN obtained by exact decomposition into true-vs-rest and within-competitor terms, exact for binary classification and implemented via matrix-free c...

  6. The Lifecycle of the Spectral Edge: From Gradient Learning to Weight-Decay Compression

    cs.LG 2026-04 unverdicted novelty 7.0

    The spectral edge transitions from a gradient-driven functional direction before grokking to a perturbation-flat, ablation-critical compression axis at grokking, forming three universality classes predicted by a gap f...

  7. Selection Plateau and a Sparsity-Dependent Hierarchy of Pruning Features

    cs.LG 2026-05 unverdicted novelty 6.0

    All rank-monotone pruning scorers converge to identical accuracy at fixed sparsity, but non-monotone features with sparsity-dependent complexity can escape this plateau, as shown by the SICS hypothesis on ViT-Small/CIFAR-10.

  8. Quantum Tilted Loss in Variational Optimization: Theory and Applications

    quant-ph 2026-05 unverdicted novelty 6.0

    QTL unifies expectation-value minimization with CVaR and Gibbs heuristics under one tunable operator, amplifying gradients in structured cases while preserving global minima and shifting the bottleneck to measurement ...

  9. Generalization at the Edge of Stability

    cs.LG 2026-04 unverdicted novelty 6.0

    Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...

  10. Exploiting Correlations in Federated Learning: Opportunities and Practical Limitations

    cs.IT 2026-04 unverdicted novelty 6.0

    A correlation-based taxonomy unifies existing FL compression methods, experiments show correlation strengths vary by task and architecture, and adaptive mode-switching designs are proposed to exploit this.

  11. Escape dynamics and implicit bias of one-pass SGD in overparameterized quadratic networks

    cond-mat.dis-nn 2026-04 unverdicted novelty 6.0

    In overparameterized quadratic networks, one-pass SGD escapes generalization plateaus only modestly faster and selects the initialization-closest zero-loss solution due to a conserved quantity in the overlap ODEs.

  12. Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

    cs.LG 2026-03 unverdicted novelty 6.0

    Spectral gaps in the Gram matrix of parameter updates control phase transitions such as grokking in neural network training.