Closed-Form Steepest Descent Direction toward Flat Minima: Reducing Upper Bounds on the Loss Hessian Eigenspectrum in Neural Networks

Hirotaka Takahashi; Kazuki Sakai; Makoto Sasaki; Yohei Kakimoto; Yusuke Sakai; Yuto Omae

arXiv: 2606.28662 · v1 · pith:SNBONQASnew · submitted 2026-06-27 · 💻 cs.LG · cs.AI · cs.NE

Yuto Omae, Kazuki Sakai, Yohei Kakimoto, Makoto Sasaki, Yusuke Sakai, Hirotaka Takahashi

Pith reviewed 2026-06-30 09:38 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE

keywords flat minima · Hessian eigenvalues · Wolkowicz-Styan bound · neural network regularization · steepest descent · cross-entropy loss · three-layer networks

The pith

Analytically deriving the gradient of the Wolkowicz-Styan upper bound supplies a closed-form steepest descent direction that reduces an upper limit on the largest loss Hessian eigenvalue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives an exact gradient for the Wolkowicz-Styan upper bound on the maximum eigenvalue of the cross-entropy loss Hessian in three-layer networks. This gradient identifies parameter directions that shrink the bound and thereby favor flatter regions of the loss surface. The authors introduce Hessian Spectral Range Regularization, which performs updates along the resulting steepest descent vector. If the approach works, training can be guided toward flat minima using only closed-form expressions rather than numerical estimates of the bound. A reader would care because the flatness hypothesis links smaller Hessian eigenvalues to better generalization performance.

Core claim

We analytically derive the gradient of the Wolkowicz-Styan upper bound on the maximum eigenvalue of the cross-entropy loss Hessian in three-layer neural networks. This closed-form gradient characterizes directions leading to flat minima. We propose Hessian Spectral Range (HSR) Regularization, which updates parameters along the steepest descent direction of the WS bound. Experiments show that HSR Regularization narrows the Hessian eigenvalue spectrum, avoids sharp minima and saddle points, and promotes convergence to flat minima. This is the first reported closed-form gradient that promotes flat minima without numerical approximations.
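For orientation, the general trace-based Wolkowicz-Styan bound for a real symmetric n × n matrix can be written as below, together with the chain-rule form its gradient takes for any differentiable H(θ). The symbols n, m, and s are notation introduced here for illustration; the paper's specific closed form for three-layer networks is not reproduced, so this is only a sketch of the general structure.

```latex
m(\theta) = \frac{\operatorname{tr} H(\theta)}{n}, \qquad
s^2(\theta) = \frac{\operatorname{tr} H(\theta)^2}{n} - m(\theta)^2, \qquad
\lambda_{\max}\bigl(H(\theta)\bigr) \le \lambda_{\sup}(\theta) = m(\theta) + s(\theta)\sqrt{n-1}

\frac{\partial \lambda_{\sup}}{\partial \theta_k}
= \frac{\partial m}{\partial \theta_k}
+ \frac{\sqrt{n-1}}{2\,s}
\left( \frac{1}{n}\,\frac{\partial \operatorname{tr} H^2}{\partial \theta_k}
- 2m\,\frac{\partial m}{\partial \theta_k} \right),
\qquad s > 0
```

The bound depends on the Hessian only through tr H and tr H², which is why it is differentiable wherever s > 0.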

What carries the argument

The analytically derived gradient of the Wolkowicz-Styan upper bound on the maximum eigenvalue of the cross-entropy loss Hessian, which directly supplies the steepest descent direction for the bound.
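As a rough illustration of the quantity being differentiated, the following sketch evaluates the general Wolkowicz-Styan upper bound on a random symmetric matrix and compares it with the exact largest eigenvalue. The function name `ws_upper_bound` and the random stand-in matrix are assumptions made here; the matrix is not a Hessian from the paper.

```python
import numpy as np

def ws_upper_bound(H):
    """Wolkowicz-Styan upper bound on the largest eigenvalue of a symmetric matrix."""
    n = H.shape[0]
    m = np.trace(H) / n                      # mean of the eigenvalues
    s2 = np.trace(H @ H) / n - m**2          # variance of the eigenvalues
    s = np.sqrt(max(s2, 0.0))                # guard against tiny negative round-off
    return m + s * np.sqrt(n - 1)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
H = (A + A.T) / 2                            # random symmetric stand-in for a Hessian
lam_max = np.linalg.eigvalsh(H)[-1]          # eigvalsh returns eigenvalues in ascending order
bound = ws_upper_bound(H)
print(f"lambda_max = {lam_max:.4f}, WS bound = {bound:.4f}")
```

The bound needs only two traces, so it avoids an eigendecomposition; the `eigvalsh` call is there purely for comparison.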

If this is right

HSR Regularization narrows the range of Hessian eigenvalues during training.
The method steers optimization away from sharp minima and saddle points.
Convergence occurs toward flat minima using only the closed-form gradient.
The approach applies to cross-entropy loss on three-layer architectures without requiring numerical gradient approximations.
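The idea of stepping along the steepest descent direction of the bound can be sketched on a toy problem. The example below assumes a hypothetical linear matrix model H(θ) = B0 + Σ θ_k B_k in place of a real loss Hessian, and a plain fixed-step update; neither is taken from the paper, whose actual HSR update rule and network-specific gradient are not shown on this page.

```python
import numpy as np

def sym(rng, n):
    """Random symmetric matrix (illustrative stand-in, not a real Hessian)."""
    A = rng.standard_normal((n, n))
    return (A + A.T) / 2

def ws_bound_and_grad(theta, B0, Bs):
    """WS bound of H(theta) = B0 + sum_k theta_k * Bs[k] and its closed-form gradient."""
    n = B0.shape[0]
    H = B0 + np.tensordot(theta, Bs, axes=1)
    m = np.trace(H) / n
    s = np.sqrt(np.trace(H @ H) / n - m**2)
    dm = np.trace(Bs, axis1=1, axis2=2) / n              # d m / d theta_k = tr(B_k) / n
    dtr2 = 2.0 * np.einsum("ij,kji->k", H, Bs)           # d tr(H^2) / d theta_k = 2 tr(H B_k)
    ds = (dtr2 / n - 2.0 * m * dm) / (2.0 * s)           # chain rule through s = sqrt(s^2)
    return m + s * np.sqrt(n - 1), dm + np.sqrt(n - 1) * ds

rng = np.random.default_rng(0)
n, K = 5, 3
B0 = sym(rng, n)
Bs = np.stack([sym(rng, n) for _ in range(K)])
theta = np.zeros(K)
eta = 0.02                                               # hypothetical step size
history = []
for _ in range(100):
    bound, grad = ws_bound_and_grad(theta, B0, Bs)
    history.append(bound)
    theta = theta - eta * grad                           # move along the steepest descent direction
print(f"WS bound: {history[0]:.4f} -> {history[-1]:.4f}")
```

In this toy setting the bound is a convex function of θ, so small fixed steps reduce it; nothing here says how the true largest eigenvalue behaves along the same path.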

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the WS bound is sufficiently tight, the same gradient could be used to control the actual maximum eigenvalue rather than only its upper limit.
The closed-form nature of the gradient may allow direct comparison of flatness-seeking directions across different data distributions.
Extending the derivation to deeper networks would require analogous differentiable bounds on their Hessian spectra.

Load-bearing premise

That shrinking the Wolkowicz-Styan upper bound on the largest Hessian eigenvalue will produce flatter minima that generalize better.

What would settle it

Training runs in which HSR Regularization is applied yet the measured maximum Hessian eigenvalue stays as large as in standard training or test accuracy does not improve.

Figures

Figures reproduced from arXiv: 2606.28662 by Hirotaka Takahashi, Kazuki Sakai, Makoto Sasaki, Yohei Kakimoto, Yusuke Sakai, Yuto Omae.

**Figure 2.** Computation time of ∂λsup(θ)/∂θ. “Num.” denotes the numerical solution, and “Ana.” denotes the analytical solution. Left: Variation with respect to the dimensionality D, with the training data size fixed at I = 200. Right: Variation with respect to the data size I, with the dimensionality fixed at D = 21. All computations were executed via serial processing on an Apple M2 CPU (clock frequency: 3.49 GHz). …

**Figure 3.** Comparison between numerical and analytical solu…

**Figure 5.** Relationship between the covariance of the bivariate…

**Figure 6.** Comparison of gradient norms between the mean and…

**Figure 7.** Top: Training dynamics of the WS upper bound and top…

**Figure 8.** Left: Training data (50 samples), Right: Test data…

**Figure 9.** Left: Comparison between λ1 and λsup(θ♯). Right: Histogram of λsup(θ♯). Results are shown for 1,124 unique critical points that satisfy the convergence criteria…

**Figure 10.** Comparison of training dynamics for each method.

**Figure 11.** Loss landscapes at critical points for Sharp Minima…

**Figure 12.** Statistics of the Hessian eigenspectrum at critical points. Asterisks indicate p-values from the two-sided Wilcoxon…

**Figure 14.** Macro F1-scores at the critical point. Asterisks indi…

**Figure 15.** Relationship between decision boundaries and the…

Original abstract

The flatness hypothesis suggests that flatness of the loss landscape, as measured by the eigenvalues of the loss Hessian, correlates with better neural network generalization. While various algorithms reduce these eigenvalues, most focus on procedural design, leaving it unclear how data distributions and NN parameters structurally determine directions toward flat minima. Characterizing these directions analytically is generally intractable. To overcome this mathematical difficulty, recent studies derived the Wolkowicz-Styan (WS) upper bound on the maximum eigenvalue of the cross-entropy loss Hessian in three-layer NNs. Although this upper bound is differentiable, its gradient was not derived. Therefore, we analytically derive the gradient of the WS upper bound to characterize directions leading to flat minima. Based on this, we propose Hessian Spectral Range (HSR) Regularization, which updates parameters along the steepest descent direction of the WS bound. Experiments demonstrate that HSR Regularization narrows the Hessian eigenvalue spectrum, avoids sharp minima and saddle points, and promotes convergence to flat minima. Although the applicability of this method is currently limited to cross-entropy loss and three-layer architectures, to the best of the authors' knowledge, this is the first study to report a closed-form gradient that promotes convergence to flat minima without numerical approximations. Therefore, the theoretical analysis of this gradient is expected to contribute to the further development of NNs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives a closed-form gradient for the WS upper bound on Hessian eigenvalues in three-layer nets, but descending on the bound does not guarantee the true max eigenvalue shrinks.

The main thing here is the analytic gradient of the Wolkowicz-Styan upper bound for the largest Hessian eigenvalue under cross-entropy loss in three-layer networks. That step is new; earlier papers had the bound but left the gradient undescribed, so this lets them run steepest descent on the bound itself and turn it into the HSR regularizer.

The derivation is the part that stands on its own. It gives an explicit direction for parameter updates without needing numerical Hessian approximations, which is a concrete technical move for this restricted case.

The soft spot is exactly the one flagged in the stress-test. The method minimizes the bound, not λ_max directly. Nothing in the abstract shows that the bound stays tight once the regularizer starts moving the weights, so it is possible for the bound to drop while the actual largest eigenvalue stays large or grows. The flat-minima and spectrum-narrowing claims rest on that link, and the paper supplies no analytic check or plot confirming the bound tracks the eigenvalue under the proposed updates.
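A small numerical example makes this concern concrete. The two diagonal matrices below are hand-picked for illustration and are not Hessians from the paper; `ws_upper_bound` is a name assumed here for the general Wolkowicz-Styan bound.

```python
import numpy as np

def ws_upper_bound(H):
    """General Wolkowicz-Styan upper bound for a symmetric matrix."""
    n = H.shape[0]
    m = np.trace(H) / n
    s = np.sqrt(max(np.trace(H @ H) / n - m**2, 0.0))
    return m + s * np.sqrt(n - 1)

# Hand-picked stand-ins for a Hessian before and after a parameter update.
H_before = np.diag([2.0, 0.0, -2.0])
H_after = np.diag([2.1, -1.05, -1.05])

for name, H in [("before", H_before), ("after", H_after)]:
    lam_max = np.linalg.eigvalsh(H)[-1]
    print(f"{name}: lambda_max = {lam_max:.3f}, WS bound = {ws_upper_bound(H):.3f}")
```

Here the bound falls from about 2.309 to 2.100 while the largest eigenvalue rises from 2.0 to 2.1, so a decrease in the bound alone does not establish a decrease in λ_max.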

Scope is narrow by design—three layers, cross-entropy only—which the authors note. Experiments are asserted to work but without reported controls for bound tightness or comparisons that isolate the effect.

This is for people already working on analytic curvature control in small networks who might want the gradient formula to test or extend. A reader could pull the closed-form expression and run their own checks.

I would send it to peer review. The derivation itself is worth referee time even if the generalization argument needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript analytically derives the gradient of the Wolkowicz-Styan (WS) upper bound on the largest eigenvalue of the cross-entropy loss Hessian for three-layer neural networks. It uses this closed-form gradient to define Hessian Spectral Range (HSR) Regularization, which performs parameter updates along the steepest descent direction of the bound. The authors claim that this procedure narrows the Hessian eigenspectrum, avoids sharp minima and saddle points, and promotes convergence to flat minima that improve generalization. The work is restricted to cross-entropy loss and three-layer architectures and presents itself as the first closed-form (non-numerical) gradient for this purpose.

Significance. If the derivation is free of algebraic error and if reduction of the WS bound reliably produces a corresponding reduction in the true largest eigenvalue, the result would supply an explicit, differentiable characterization of directions toward flatter minima. The closed-form nature of the gradient is a concrete strength that enables future theoretical analysis without reliance on finite-difference or automatic-differentiation approximations. The limitation to three-layer networks and cross-entropy loss, however, confines the immediate practical scope.
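One way to check a closed-form gradient for algebraic error is to compare it with central finite differences. The sketch below does this for the general WS bound under a hypothetical linear matrix model H(θ) = B0 + Σ θ_k B_k; the model, the function names, and the step `eps` are assumptions for illustration and do not reproduce the paper's three-layer derivation.

```python
import numpy as np

def hessian_model(theta, B0, Bs):
    """Hypothetical stand-in: H(theta) = B0 + sum_k theta_k * Bs[k]."""
    return B0 + np.tensordot(theta, Bs, axes=1)

def ws_bound(H):
    n = H.shape[0]
    m = np.trace(H) / n
    s = np.sqrt(np.trace(H @ H) / n - m**2)
    return m + s * np.sqrt(n - 1)

def analytic_grad(theta, B0, Bs):
    """Closed-form gradient of the WS bound for the linear model above."""
    H = hessian_model(theta, B0, Bs)
    n = H.shape[0]
    m = np.trace(H) / n
    s = np.sqrt(np.trace(H @ H) / n - m**2)
    dm = np.trace(Bs, axis1=1, axis2=2) / n
    dtr2 = 2.0 * np.einsum("ij,kji->k", H, Bs)   # 2 tr(H B_k)
    return dm + np.sqrt(n - 1) * (dtr2 / n - 2.0 * m * dm) / (2.0 * s)

def numerical_grad(theta, B0, Bs, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        f_plus = ws_bound(hessian_model(theta + e, B0, Bs))
        f_minus = ws_bound(hessian_model(theta - e, B0, Bs))
        g[k] = (f_plus - f_minus) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, K = 5, 4
raw = rng.standard_normal((K + 1, n, n))
mats = (raw + raw.transpose(0, 2, 1)) / 2        # symmetrize each matrix
B0, Bs = mats[0], mats[1:]
theta = rng.standard_normal(K)
g_ana = analytic_grad(theta, B0, Bs)
g_num = numerical_grad(theta, B0, Bs)
print("max abs difference:", np.max(np.abs(g_ana - g_num)))
```

The numerical version needs two bound evaluations per parameter, whereas the analytic version is a single pass, which is consistent with the kind of timing comparison described in the Figure 2 caption.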

major comments (2)

[Abstract, §3 (gradient derivation), §4 (experiments)] The central claim that HSR Regularization reaches flat minima rests on the unverified assumption that descent on the WS upper bound produces a corresponding decrease in the actual maximum Hessian eigenvalue. No analytic argument or empirical diagnostic is supplied showing that the bound remains sufficiently tight once the regularization term is active (e.g., that the gap between bound and λ_max does not widen under the induced parameter updates).
[§4 (experimental results)] The reported narrowing of the Hessian eigenvalue spectrum is presented without quantitative controls that isolate the effect of bound minimization from other optimization dynamics. In particular, there is no comparison against a baseline that directly penalizes an estimate of λ_max, nor any measurement of the tightness ratio (bound / λ_max) before and after HSR updates.
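The tightness diagnostics requested in these comments could be computed along the following lines. This is a minimal sketch: the function `tightness` and the random positive semidefinite matrices are assumptions made here, standing in for Hessians that would in practice be evaluated at checkpoints before and after HSR updates.

```python
import numpy as np

def tightness(H):
    """Return (lambda_max, WS bound, gap, ratio) for a symmetric matrix H with lambda_max > 0."""
    n = H.shape[0]
    m = np.trace(H) / n
    s = np.sqrt(max(np.trace(H @ H) / n - m**2, 0.0))
    bound = m + s * np.sqrt(n - 1)
    lam_max = np.linalg.eigvalsh(H)[-1]
    return lam_max, bound, bound - lam_max, bound / lam_max

rng = np.random.default_rng(1)
rows = []
for step in range(3):
    A = rng.standard_normal((8, 8))
    H = A @ A.T / 8                      # random PSD stand-in for a Hessian at one checkpoint
    rows.append(tightness(H))

for step, (lam_max, bound, gap, ratio) in enumerate(rows):
    print(f"checkpoint {step}: lambda_max={lam_max:.3f} bound={bound:.3f} "
          f"gap={gap:.3f} ratio={ratio:.3f}")
```

A ratio that stays close to 1 across checkpoints would support reading a falling bound as a falling λ_max; a growing ratio would indicate the bound is loosening.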

minor comments (2)

[Abstract] The abstract states that the method 'avoids sharp minima and saddle points,' yet the manuscript provides no explicit diagnostic (e.g., eigenvalue sign checks or escape-time statistics) that would substantiate avoidance of saddles.
[§3] Notation for the WS bound and its gradient should be introduced with a self-contained definition before the differentiation step; readers must currently consult the cited prior work to follow the algebra.
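The eigenvalue sign check mentioned in the first minor comment could look like the sketch below. The function name and the tolerance `tol` are hypothetical choices, and the test matrices are simple diagonal examples rather than Hessians from the paper.

```python
import numpy as np

def classify_critical_point(H, tol=1e-8):
    """Classify a critical point from the signs of the Hessian eigenvalues."""
    eig = np.linalg.eigvalsh(H)          # ascending order
    if eig[0] > tol:
        return "strict local minimum"
    if eig[-1] < -tol:
        return "strict local maximum"
    if eig[0] < -tol and eig[-1] > tol:
        return "saddle point"
    return "degenerate (zero eigenvalues within tolerance)"

print(classify_critical_point(np.diag([1.0, 2.0])))    # all eigenvalues positive
print(classify_critical_point(np.diag([1.0, -1.0])))   # mixed signs
print(classify_critical_point(np.diag([0.0, 1.0])))    # a zero eigenvalue
```

Reporting the share of converged runs in each class, with and without the regularizer, would be one way to substantiate a claim about avoiding saddle points.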

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate planned revisions.

Referee: [Abstract, §3 (gradient derivation), §4 (experiments)] The central claim that HSR Regularization reaches flat minima rests on the unverified assumption that descent on the WS upper bound produces a corresponding decrease in the actual maximum Hessian eigenvalue. No analytic argument or empirical diagnostic is supplied showing that the bound remains sufficiently tight once the regularization term is active (e.g., that the gap between bound and λ_max does not widen under the induced parameter updates).

Authors: We acknowledge that the manuscript does not include an explicit check of bound tightness under HSR updates. Because the WS expression is a proven upper bound, descent on it necessarily constrains possible values of λ_max, but we agree that empirical verification of the gap is valuable. In the revision we will add plots of (WS bound − λ_max) before and after HSR steps on the reported architectures.

revision: yes

Referee: [§4 (experimental results)] The reported narrowing of the Hessian eigenvalue spectrum is presented without quantitative controls that isolate the effect of bound minimization from other optimization dynamics. In particular, there is no comparison against a baseline that directly penalizes an estimate of λ_max, nor any measurement of the tightness ratio (bound / λ_max) before and after HSR updates.

Authors: We agree that reporting the tightness ratio (WS bound / λ_max) would strengthen the experimental claims and will include these measurements in the revised §4. A direct baseline that penalizes a numerical estimate of λ_max is computationally prohibitive for the network sizes considered, which is precisely why the closed-form WS gradient is useful; we will add a brief discussion of this distinction rather than a full comparison.

revision: partial

Circularity Check

0 steps flagged

Analytic differentiation of external WS bound is self-contained; no reduction to inputs or self-citation chain

The paper's central derivation is an analytic computation of the gradient of the Wolkowicz-Styan upper bound, which was previously derived in the cited earlier study [9] (its author list matches this paper's). Although [9] is a self-citation, this step consists of standard differentiation applied to an existing closed-form expression and does not embed the target eigenvalue, fit parameters to the outcome, or rely on that citation for uniqueness or an ansatz. The subsequent HSR Regularization is defined directly from the resulting gradient expression. No load-bearing step reduces by construction to the paper's own fitted values or prior claims; the derivation chain remains independent of the flat-minima outcome it is later tested against.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work rests on the prior existence and applicability of the Wolkowicz-Styan bound to three-layer cross-entropy Hessians and on the flatness hypothesis linking smaller max eigenvalue to better generalization. No new free parameters or invented entities beyond the regularization itself are introduced in the abstract.

axioms (2)

Domain assumption: The Wolkowicz-Styan upper bound is valid and differentiable for the cross-entropy loss Hessian of three-layer neural networks.
Cited from recent studies; invoked as the starting point for the gradient derivation.
Domain assumption: Reducing the WS upper bound on the maximum Hessian eigenvalue moves the network toward flatter minima that generalize better.
Core motivation for proposing HSR Regularization.

invented entities (1)

HSR Regularization (no independent evidence)
Purpose: Parameter update rule that follows the steepest descent direction of the WS bound.
New regularization term defined from the derived gradient.

pith-pipeline@v0.9.1-grok · 5801 in / 1506 out tokens · 15709 ms · 2026-06-30T09:38:25.849921+00:00 · methodology

Reference graph

Works this paper leans on

40 extracted references · 26 canonical work pages · 5 internal anchors

[1] J. Chai, H. Zeng, A. Li, E. W. T. Ngai, Deep learning in computer vision: A critical review of emerging techniques and application scenarios, Machine Learning with Applications 6 (2021) 100134. doi:10.1016/j.mlwa.2021.100134
[2] A. Mehrish, N. Majumder, R. Bharadwaj, R. Mihalcea, S. Poria, A review of deep learning techniques for speech processing, Information Fusion 99 (2023) 101869. doi:10.1016/j.inffus.2023.101869
[3] E. O. Arkhangelskaya, S. I. Nikolenko, Deep Learning for Natural Language Processing: A Survey, Journal of Mathematical Sciences 273 (4) (2023) 533–582. doi:10.1007/s10958-023-06519-6
[4] S. Hochreiter, J. Schmidhuber, Flat minima, Neural Computation 9 (1) (1997) 1–42. doi:10.1162/neco.1997.9.1.1
[5] Y. Liu, S. Yu, T. Lin, Hessian regularization of deep neural networks: A novel approach based on stochastic estimators of Hessian trace, Neurocomputing 536 (2023) 13–20. doi:10.1016/j.neucom.2023.03.017
[6] S. Arora, Z. Li, A. Panigrahi, Understanding Gradient Descent on Edge of Stability in Deep Learning (2022). arXiv:2205.09745, doi:10.48550/arXiv.2205.09745
[7] K. Lyu, Z. Li, S. Arora, Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction (2023). arXiv:2206.07085, doi:10.48550/arXiv.2206.07085
[8] P. Foret, A. Kleiner, H. Mobahi, B. Neyshabur, Sharpness-Aware Minimization for Efficiently Improving Generalization (2021). arXiv:2010.01412, doi:10.48550/arXiv.2010.01412
[9] Y. Omae, K. Sakai, Y. Kakimoto, M. Sasaki, Y. Sakai, H. Takahashi, Wolkowicz-Styan Upper Bound on the Hessian Eigenspectrum for Cross-Entropy Loss in Nonlinear Smooth Neural Networks, arXiv.org (2026). doi:10.48550/arXiv.2604.10202
[10] L. Dinh, R. Pascanu, S. Bengio, Y. Bengio, Sharp Minima Can Generalize For Deep Nets, Proceedings of the 34th International Conference on Machine Learning (2017). arXiv:1703.04933
[11] X. Chen, C.-J. Hsieh, B. Gong, When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations (2022). arXiv:2106.01548, doi:10.48550/arXiv.2106.01548
[12] W. Huang, X. Liu, X. Wang, J. Yamagishi, Y. Qian, From Sharpness to Better Generalization for Speech Deepfake Detection (2025). arXiv:2506.11532, doi:10.48550/arXiv.2506.11532
[13] H. Li, Z. Xu, G. Taylor, C. Studer, T. Goldstein, Visualizing the Loss Landscape of Neural Nets, Advances in Neural Information Processing Systems 31 (2018)
[14] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang, On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima (2017). arXiv:1609.04836, doi:10.48550/arXiv.1609.04836
[15] B. Ghorbani, S. Krishnan, Y. Xiao, An Investigation into Neural Net Optimization via Hessian Eigenvalue Density, Proceedings of the 36th International Conference on Machine Learning (2019) 2232–2241
[16] L. Wu, W. J. Su, The Implicit Regularization of Dynamical Stability in Stochastic Gradient Descent (2023). arXiv:2305.17490, doi:10.48550/arXiv.2305.17490
[17] M. Wei, D. J. Schwab, How noise affects the Hessian spectrum in overparameterized neural networks (2019). arXiv:1910.00195, doi:10.48550/arXiv.1910.00195
[18] Torchvision - Torchvision 0.27 documentation, https://docs.pytorch.org/vision/stable/index.html
[19] K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, CoRR (2014)
[20] H. Kaiming, Z. Xiangyu, R. Shaoqing, S. Jian, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 1 (2016) 770–778. doi:10.1109/cvpr.2016.90
[21] M. Hutchinson, A Stochastic Estimator of the Trace of the Influence Matrix for Laplacian Smoothing Splines, Communications in Statistics - Simulation and Computation 18 (3) (1989) 1059–1076. doi:10.1080/03610918908812806
[22] C. Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, Journal of Research of the National Bureau of Standards 45 (4) (1950) 255–282
[23] Z. Yao, A. Gholami, K. Keutzer, M. W. Mahoney, PyHessian: Neural Networks Through the Lens of the Hessian, 2020 IEEE International Conference on Big Data (Big Data) (2020) 581–590. doi:10.1109/BigData50022.2020.9378171
[24] Z. Dong, Z. Yao, D. Arfeen, A. Gholami, M. W. Mahoney, K. Keutzer, HAWQ-V2: Hessian Aware trace-Weighted Quantization of Neural Networks, Advances in Neural Information Processing Systems 33 (2020) 18518–18529
[25] C. Bishop, Exact Calculation of the Hessian Matrix for the Multilayer Perceptron, Neural Computation 4 (4) (1992) 494–501. doi:10.1162/neco.1992.4.4.494
[26] S. P. Singh, G. Bachmann, T. Hofmann, Analytic Insights into Structure and Rank of Neural Network Hessian Maps (2021). arXiv:2106.16225, doi:10.48550/arXiv.2106.16225
[27] Y. Wu, X. Zhu, C. Wu, A. Wang, R. Ge, Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks (2022). arXiv:2010.04261, doi:10.48550/arXiv.2010.04261
[28] S. P. Singh, W. Ormaniec, T. Hofmann, Cracking the Hessian: Closed-Form Hessian Spectra for Fundamental Neural Networks, OpenReview in ICLR2026 (2026)
[29] X. Yue, M. Nouiehed, R. A. Kontar, SALR: Sharpness-aware Learning Rate Scheduler for Improved Generalization, IEEE Transactions on Neural Networks and Learning Systems 35 (9) (2024) 12518–12527. arXiv:2011.05348, doi:10.1109/TNNLS.2023.3263393
[30] A. R. Sankar, Y. Khasbage, R. Vigneswaran, V. N Balasubramanian, A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization, Proceedings of the AAAI Conference on Artificial Intelligence 35 (11) (2021) 9481–9488. doi:10.1609/aaai.v35i11.17142
[31] H. Luo, T. Truong, T. Pham, M. Harandi, D. Phung, T. Le, Explicit Eigenvalue Regularization Improves Sharpness-Aware Minimization (2025). arXiv:2501.12666, doi:10.48550/arXiv.2501.12666
[32] J. Kwon, J. Kim, H. Park, I. K. Choi, ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks, Proceedings of the 38th International Conference on Machine Learning (2021) 5905–5914
[33] Y. Zhou, Y. Qu, X. Xu, H. Shen, ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition, 2023 IEEE/CVF International Conference on Computer Vision (ICCV) (2023) 11311–11321. doi:10.1109/ICCV51070.2023.01042
[34] M. Andriushchenko, N. Flammarion, Towards Understanding Sharpness-Aware Minimization, Proceedings of the 39th International Conference on Machine Learning (2022) 639–668
[35] H. R. Zhang, D. Li, H. Ju, Noise Stability Optimization for Finding Flat Minima: A Hessian-based Regularization Approach (2024). arXiv:2306.08553, doi:10.48550/arXiv.2306.08553
[36] H. Wolkowicz, G. P. H. Styan, Bounds for eigenvalues using traces, Linear Algebra and its Applications 29 (1980) 471–506. doi:10.1016/0024-3795(80)90258-X
[37] L. Sagun, L. Bottou, Y. LeCun, Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond (2017). arXiv:1611.07476, doi:10.48550/arXiv.1611.07476
[38] Z. Xie, Q.-Y. Tang, Y. Cai, M. Sun, P. Li, On the Power-Law Hessian Spectrums in Deep Learning (2022). arXiv:2201.13011, doi:10.48550/arXiv.2201.13011
[39] V. Papyan, Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra, Journal of Machine Learning Research 21 (2020)
[40] M. A. Khamis, H. Q. Ngo, X. Nguyen, D. Olteanu, M. Schleich, Learning Models over Relational Data using Sparse Tensors and Functional Dependencies (2020). arXiv:1703.04780, doi:10.48550/arXiv.1703.04780
