pith. sign in

arxiv: 2606.28662 · v1 · pith:SNBONQASnew · submitted 2026-06-27 · 💻 cs.LG · cs.AI· cs.NE

Closed-Form Steepest Descent Direction toward Flat Minima: Reducing Upper Bounds on the Loss Hessian Eigenspectrum in Neural Networks

Pith reviewed 2026-06-30 09:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.NE
keywords flat minimaHessian eigenvaluesWolkowicz-Styan boundneural network regularizationsteepest descentcross-entropy lossthree-layer networks
0
0 comments X

The pith

Analytically deriving the gradient of the Wolkowicz-Styan upper bound supplies a closed-form steepest descent direction that reduces an upper limit on the largest loss Hessian eigenvalue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an exact gradient for the Wolkowicz-Styan upper bound on the maximum eigenvalue of the cross-entropy loss Hessian in three-layer networks. This gradient identifies parameter directions that shrink the bound and thereby favor flatter regions of the loss surface. The authors introduce Hessian Spectral Range Regularization, which performs updates along the resulting steepest descent vector. If the approach works, training can be guided toward flat minima using only closed-form expressions rather than numerical estimates of the bound. A reader would care because the flatness hypothesis links smaller Hessian eigenvalues to better generalization performance.

Core claim

We analytically derive the gradient of the Wolkowicz-Styan upper bound on the maximum eigenvalue of the cross-entropy loss Hessian in three-layer neural networks. This closed-form gradient characterizes directions leading to flat minima. We propose Hessian Spectral Range (HSR) Regularization, which updates parameters along the steepest descent direction of the WS bound. Experiments show that HSR Regularization narrows the Hessian eigenvalue spectrum, avoids sharp minima and saddle points, and promotes convergence to flat minima. This is the first reported closed-form gradient that promotes flat minima without numerical approximations.

What carries the argument

The analytically derived gradient of the Wolkowicz-Styan upper bound on the maximum eigenvalue of the cross-entropy loss Hessian, which directly supplies the steepest descent direction for the bound.

If this is right

  • HSR Regularization narrows the range of Hessian eigenvalues during training.
  • The method steers optimization away from sharp minima and saddle points.
  • Convergence occurs toward flat minima using only the closed-form gradient.
  • The approach applies to cross-entropy loss on three-layer architectures without requiring numerical gradient approximations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the WS bound is sufficiently tight, the same gradient could be used to control the actual maximum eigenvalue rather than only its upper limit.
  • The closed-form nature of the gradient may allow direct comparison of flatness-seeking directions across different data distributions.
  • Extending the derivation to deeper networks would require analogous differentiable bounds on their Hessian spectra.

Load-bearing premise

That shrinking the Wolkowicz-Styan upper bound on the largest Hessian eigenvalue will produce flatter minima that generalize better.

What would settle it

Training runs in which HSR Regularization is applied yet the measured maximum Hessian eigenvalue stays as large as in standard training or test accuracy does not improve.

Figures

Figures reproduced from arXiv: 2606.28662 by Hirotaka Takahashi, Kazuki Sakai, Makoto Sasaki, Yohei Kakimoto, Yusuke Sakai, Yuto Omae.

Figure 1
Figure 1. Figure 1: Three-layer hierarchical NN analyzed in this study. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Computation time of ∂λsup(θ)/ ∂θ. “Num.” denotes the numerical solution, and “Ana.” denotes the analytical solution. Left: Variation with respect to the dimensionality D, with the training data size fixed at I = 200. Right: Variation with respect to the data size I, with the dimensionality fixed at D = 21. All computations were executed via serial processing on an Apple M2 CPU (clock frequency: 3.49 GHz). … view at source ↗
Figure 3
Figure 3. Figure 3: Comparison between numerical and analytical solu [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Relationship between the covariance of the bivariate [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison of gradient norms between the mean and [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Top: Training dynamics of the WS upper bound and top [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Left: Training data (50 samples), Right: Test data [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Left: Comparison between λ1 and λsup(θ ♯ ). Right: Histogram of λsup(θ ♯ ). Results are shown for 1,124 unique critical points that satisfy the convergence criteria [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Comparison of training dynamics for each method. [PITH_FULL_IMAGE:figures/full_fig_p010_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Loss landscapes at critical points for Sharp Minima [PITH_FULL_IMAGE:figures/full_fig_p011_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Statistics of the Hessian eigenspectrum at critical points. Asterisks indicate p-values from the two-sided Wilcoxon [PITH_FULL_IMAGE:figures/full_fig_p012_12.png] view at source ↗
Figure 15
Figure 15. Figure 15: Relationship between decision boundaries and the [PITH_FULL_IMAGE:figures/full_fig_p013_15.png] view at source ↗
Figure 14
Figure 14. Figure 14: Macro F1-scores at the critical point. Asterisks indi [PITH_FULL_IMAGE:figures/full_fig_p013_14.png] view at source ↗
read the original abstract

The flatness hypothesis suggests that flatness of the loss landscape, as measured by the eigenvalues of the loss Hessian, correlates with better neural network generalization. While various algorithms reduce these eigenvalues, most focus on procedural design, leaving it unclear how data distributions and NN parameters structurally determine directions toward flat minima. Characterizing these directions analytically is generally intractable. To overcome this mathematical difficulty, recent studies derived the Wolkowicz-Styan (WS) upper bound on the maximum eigenvalue of the cross-entropy loss Hessian in three-layer NNs. Although this upper bound is differentiable, its gradient was not derived. Therefore, we analytically derive the gradient of the WS upper bound to characterize directions leading to flat minima. Based on this, we propose Hessian Spectral Range (HSR) Regularization, which updates parameters along the steepest descent direction of the WS bound. Experiments demonstrate that HSR Regularization narrows the Hessian eigenvalue spectrum, avoids sharp minima and saddle points, and promotes convergence to flat minima. Although the applicability of this method is currently limited to cross-entropy loss and three-layer architectures, to the best of the authors' knowledge, this is the first study to report a closed-form gradient that promotes convergence to flat minima without numerical approximations. Therefore, the theoretical analysis of this gradient is expected to contribute to the further development of NNs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analytically derives the gradient of the Wolkowicz-Styan (WS) upper bound on the largest eigenvalue of the cross-entropy loss Hessian for three-layer neural networks. It uses this closed-form gradient to define Hessian Spectral Range (HSR) Regularization, which performs parameter updates along the steepest descent direction of the bound. The authors claim that this procedure narrows the Hessian eigenspectrum, avoids sharp minima and saddle points, and promotes convergence to flat minima that improve generalization. The work is restricted to cross-entropy loss and three-layer architectures and presents itself as the first closed-form (non-numerical) gradient for this purpose.

Significance. If the derivation is free of algebraic error and if reduction of the WS bound reliably produces a corresponding reduction in the true spectral radius, the result would supply an explicit, differentiable characterization of directions toward flatter minima. The closed-form nature of the gradient is a concrete strength that enables future theoretical analysis without reliance on finite-difference or automatic-differentiation approximations. The limitation to three-layer networks and cross-entropy loss, however, confines the immediate practical scope.

major comments (2)
  1. [Abstract, §3 (gradient derivation), §4 (experiments)] The central claim that HSR Regularization reaches flat minima rests on the unverified assumption that descent on the WS upper bound produces a corresponding decrease in the actual maximum Hessian eigenvalue. No analytic argument or empirical diagnostic is supplied showing that the bound remains sufficiently tight once the regularization term is active (e.g., that the gap between bound and λ_max does not widen under the induced parameter updates).
  2. [§4] §4 (experimental results): the reported narrowing of the Hessian eigenvalue spectrum is presented without quantitative controls that isolate the effect of bound minimization from other optimization dynamics. In particular, there is no comparison against a baseline that directly penalizes an estimate of λ_max, nor any measurement of the tightness ratio (bound / λ_max) before and after HSR updates.
minor comments (2)
  1. [Abstract] The abstract states that the method 'avoids sharp minima and saddle points,' yet the manuscript provides no explicit diagnostic (e.g., eigenvalue sign checks or escape-time statistics) that would substantiate avoidance of saddles.
  2. [§3] Notation for the WS bound and its gradient should be introduced with a self-contained definition before the differentiation step; readers must currently consult the cited prior work to follow the algebra.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract, §3 (gradient derivation), §4 (experiments)] The central claim that HSR Regularization reaches flat minima rests on the unverified assumption that descent on the WS upper bound produces a corresponding decrease in the actual maximum Hessian eigenvalue. No analytic argument or empirical diagnostic is supplied showing that the bound remains sufficiently tight once the regularization term is active (e.g., that the gap between bound and λ_max does not widen under the induced parameter updates).

    Authors: We acknowledge that the manuscript does not include an explicit check of bound tightness under HSR updates. Because the WS expression is a proven upper bound, descent on it necessarily constrains possible values of λ_max, but we agree that empirical verification of the gap is valuable. In the revision we will add plots of (WS bound − λ_max) before and after HSR steps on the reported architectures. revision: yes

  2. Referee: [§4] §4 (experimental results): the reported narrowing of the Hessian eigenvalue spectrum is presented without quantitative controls that isolate the effect of bound minimization from other optimization dynamics. In particular, there is no comparison against a baseline that directly penalizes an estimate of λ_max, nor any measurement of the tightness ratio (bound / λ_max) before and after HSR updates.

    Authors: We agree that reporting the tightness ratio (WS bound / λ_max) would strengthen the experimental claims and will include these measurements in the revised §4. A direct baseline that penalizes a numerical estimate of λ_max is computationally prohibitive for the network sizes considered, which is precisely why the closed-form WS gradient is useful; we will add a brief discussion of this distinction rather than a full comparison. revision: partial

Circularity Check

0 steps flagged

Analytic differentiation of external WS bound is self-contained; no reduction to inputs or self-citation chain

full rationale

The paper's central derivation is an analytic computation of the gradient of the Wolkowicz-Styan upper bound (previously derived in cited recent studies, not by these authors). This step consists of standard differentiation applied to an existing closed-form expression and does not embed the target eigenvalue, fit parameters to the outcome, or rely on a self-citation for uniqueness or ansatz. The subsequent HSR Regularization is defined directly from the resulting gradient expression. No load-bearing step reduces by construction to the paper's own fitted values or prior claims; the derivation chain remains independent of the flat-minima outcome it is later tested against.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The work rests on the prior existence and applicability of the Wolkowicz-Styan bound to three-layer cross-entropy Hessians and on the flatness hypothesis linking smaller max eigenvalue to better generalization. No new free parameters or invented entities beyond the regularization itself are introduced in the abstract.

axioms (2)
  • domain assumption The Wolkowicz-Styan upper bound is valid and differentiable for the cross-entropy loss Hessian of three-layer neural networks
    Cited from recent studies; invoked as the starting point for the gradient derivation
  • domain assumption Reducing the WS upper bound on the maximum Hessian eigenvalue moves the network toward flatter minima that generalize better
    Core motivation for proposing HSR Regularization
invented entities (1)
  • HSR Regularization no independent evidence
    purpose: Parameter update rule that follows the steepest descent direction of the WS bound
    New regularization term defined from the derived gradient

pith-pipeline@v0.9.1-grok · 5801 in / 1506 out tokens · 15709 ms · 2026-06-30T09:38:25.849921+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

40 extracted references · 26 canonical work pages · 5 internal anchors

  1. [1]

    J. Chai, H. Zeng, A. Li, E. W. T. Ngai, Deep learning in computer vision: A critical review of emerging techniques and application scenarios, Machine Learning with Applications 6 (2021) 100134.doi:10. 1016/j.mlwa.2021.100134

  2. [2]

    Mehrish, N

    A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, S. Poria, A re- view of deep learning techniques for speech processing, Information Fu- sion 99 (2023) 101869.doi:10.1016/j.inffus.2023.101869

  3. [3]

    E. O. Arkhangelskaya, S. I. Nikolenko, Deep Learning for Natural Language Processing: A Survey, Journal of Mathematical Sciences 273 (4) (2023) 533–582.doi:10.1007/s10958-023-06519-6

  4. [4]

    Neural Computation9(1), 1–42 (01 1997)

    S. Hochreiter, J. Schmidhuber, Flat minima, Neural Computation 9 (1) (1997) 1–42.doi:10.1162/neco.1997.9.1.1

  5. [5]

    Y . Liu, S. Yu, T. Lin, Hessian regularization of deep neural networks: A novel approach based on stochastic estimators of Hessian trace, Neurocomputing 536 (2023) 13–20.doi:10.1016/j.neucom. 2023.03.017

  6. [6]

    Arora, Z

    S. Arora, Z. Li, A. Panigrahi, Understanding Gradient Descent on Edge of Stability in Deep Learning (2022).arXiv:2205.09745,doi: 10.48550/arXiv.2205.09745

  7. [7]

    K. Lyu, Z. Li, S. Arora, Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction (2023).arXiv:2206. 07085,doi:10.48550/arXiv.2206.07085

  8. [8]

    Sharpness-Aware Minimization for Efficiently Improving Generalization

    P. Foret, A. Kleiner, H. Mobahi, B. Neyshabur, Sharpness-Aware Min- imization for Efficiently Improving Generalization (2021).arXiv: 2010.01412,doi:10.48550/arXiv.2010.01412

  9. [9]

    Y . Omae, K. Sakai, Y . Kakimoto, M. Sasaki, Y . Sakai, H. Takahashi, Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks, arXiv.org (2026).doi:10.48550/arXiv.2604.10202

  10. [10]

    L. Dinh, R. Pascanu, S. Bengio, Y . Bengio, Sharp Minima Can Gener- alize For Deep Nets, Proceedings of the 34th International Conference on Machine Learning (2017).doi:arXiv:1703.04933

  11. [11]

    Chen, C.-J

    X. Chen, C.-J. Hsieh, B. Gong, When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations (2022). arXiv:2106.01548,doi:10.48550/arXiv.2106.01548

  12. [12]

    Huang, X

    W. Huang, X. Liu, X. Wang, J. Yamagishi, Y . Qian, From Sharpness to Better Generalization for Speech Deepfake Detection (2025).arXiv: 2506.11532,doi:10.48550/arXiv.2506.11532

  13. [13]

    H. Li, Z. Xu, G. Taylor, C. Studer, T. Goldstein, Visualizing the Loss Landscape of Neural Nets, Advances in Neural Information Processing Systems 31 (2018)

  14. [14]

    N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (2017).arXiv:1609.04836,doi:10.48550/ arXiv.1609.04836

  15. [15]

    Ghorbani, S

    B. Ghorbani, S. Krishnan, Y . Xiao, An Investigation into Neural Net Optimization via Hessian Eigenvalue Density, Proceedings of the 36th International Conference on Machine Learning (2019) 2232–2241

  16. [16]

    L. Wu, W. J. Su, The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent (2023).arXiv:2305.17490,doi: 10.48550/arXiv.2305.17490

  17. [17]

    M. Wei, D. J. Schwab, How noise affects the Hessian spectrum in overparameterized neural networks (2019).arXiv:1910.00195, doi:10.48550/arXiv.1910.00195

  18. [18]

    Torchvision — Torchvision 0.27 documentation, https://docs.pytorch.org/vision/stable/index.html

  19. [19]

    Simonyan, A

    K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, CoRR (2014)

  20. [20]

    Kaiming, Z

    H. Kaiming, Z. Xiangyu, R. Shaoqing, S. Jian, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1 (2016) 770–778.doi:10.1109/ cvpr.2016.90

  21. [21]

    M. Hutchinson, A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines, Communications in Statistics - Simulation and Computation 18 (3) (1989) 1059–1076.doi:10. 1080/03610918908812806

  22. [22]

    C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, Journal of Research of the National Bureau of Standards 45 (4) (1950) 255–282

  23. [23]

    Z. Yao, A. Gholami, K. Keutzer, M. W. Mahoney, PyHessian: Neural Networks Through the Lens of the Hessian, 2020 IEEE International Conference on Big Data (Big Data) (2020) 581–590doi:10.1109/ BigData50022.2020.9378171

  24. [24]

    Z. Dong, Z. Yao, D. Arfeen, A. Gholami, M. W. Mahoney, K. Keutzer, HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Net- works, Advances in Neural Information Processing Systems 33 (2020) 18518–18529

  25. [25]

    Bishop, Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation 4 (4) (1992) 494–501.doi:10

    C. Bishop, Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation 4 (4) (1992) 494–501.doi:10. 1162/neco.1992.4.4.494

  26. [26]

    S. P. Singh, G. Bachmann, T. Hofmann, Analytic Insights into Structure and Rank of Neural Network Hessian Maps (2021).arXiv:2106. 16225,doi:10.48550/arXiv.2106.16225

  27. [27]

    Y . Wu, X. Zhu, C. Wu, A. Wang, R. Ge, Dissecting Hessian: Under- standing Common Structure of Hessian in Neural Networks (2022). arXiv:2010.04261,doi:10.48550/arXiv.2010.04261

  28. [28]

    S. P. Singh, W. Ormaniec, T. Hofmann, Cracking the Hessian: Closed- Form Hessian Spectra for Fundamental Neural Networks, OpenReview in ICLR2026 (2026)

  29. [29]

    X. Yue, M. Nouiehed, R. A. Kontar, SALR: Sharpness-aware Learning Rate Scheduler for Improved Generalization, IEEE Transactions on Neural Networks and Learning Systems 35 (9) (2024) 12518–12527. arXiv:2011.05348,doi:10.1109/TNNLS.2023.3263393. PREPRINT 25

  30. [30]

    A. R. Sankar, Y . Khasbage, R. Vigneswaran, V . N Balasubrama- nian, A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization, Proceedings of the AAAI Conference on Artificial Intelligence 35 (11) (2021) 9481–9488. doi:10.1609/aaai.v35i11.17142

  31. [31]

    H. Luo, T. Truong, T. Pham, M. Harandi, D. Phung, T. Le, Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization (2025).arXiv:2501.12666,doi:10.48550/arXiv.2501. 12666

  32. [32]

    J. Kwon, J. Kim, H. Park, I. K. Choi, ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, Proceedings of the 38th International Conference on Machine Learning (2021) 5905–5914

  33. [33]

    Y . Zhou, Y . Qu, X. Xu, H. Shen, ImbSAM: A Closer Look at Sharpness- Aware Minimization in Class-Imbalanced Recognition, 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023) 11311– 11321doi:10.1109/ICCV51070.2023.01042

  34. [34]

    Andriushchenko, N

    M. Andriushchenko, N. Flammarion, Towards Understanding Sharpness- Aware Minimization, Proceedings of the 39th International Conference on Machine Learning (2022) 639–668

  35. [35]

    H. R. Zhang, D. Li, H. Ju, Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach (2024).arXiv: 2306.08553,doi:10.48550/arXiv.2306.08553

  36. [36]

    Wolkowicz, G

    H. Wolkowicz, G. P. H. Styan, Bounds for eigenvalues using traces, Linear Algebra and its Applications 29 (1980) 471–506.doi:10. 1016/0024-3795(80)90258-X

  37. [37]

    Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    L. Sagun, L. Bottou, Y . LeCun, Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond (2017).arXiv:1611.07476, doi:10.48550/arXiv.1611.07476

  38. [38]

    Xie, Q.-Y

    Z. Xie, Q.-Y . Tang, Y . Cai, M. Sun, P. Li, On the Power-Law Hessian Spectrums in Deep Learning (2022).arXiv:2201.13011,doi: 10.48550/arXiv.2201.13011

  39. [39]

    Papyan, Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra, Journal of Machine Learning Research 21 (2020)

    V . Papyan, Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra, Journal of Machine Learning Research 21 (2020)

  40. [40]

    M. A. Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, M. Schleich, Learning Models over Relational Data using Sparse Tensors and Functional Dependencies (2020).arXiv:1703.04780,doi:10. 48550/arXiv.1703.04780