pith. machine review for the scientific record.

arxiv: 2604.14108 · v1 · submitted 2026-04-15 · 💻 cs.LG · math.DS · math.OC · stat.ML

Recognition: unknown

Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 13:09 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · math.OC · stat.ML
keywords SGD with momentum · edge of stochastic stability · batch sharpness · optimization dynamics · stochastic gradient descent · linear stability analysis · deep learning optimization

The pith

SGD with momentum stabilizes batch sharpness at two different plateaus depending on batch size near the stochastic stability edge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding momentum to mini-batch stochastic gradient descent produces an edge-of-stability regime whose sharpness behavior splits by batch size. For small batches, sharpness settles at the lower value 2(1-β)/η because momentum amplifies stochastic noise and therefore selects flatter regions than plain SGD. For large batches, sharpness instead settles at the higher value 2(1+β)/η, recovering the classical stabilizing action of momentum that is seen in full-batch training. A reader cares because the result ties a common optimizer choice directly to the sharpness of the minima that training reaches and therefore to generalization. It also shows that the usual single-threshold picture of stability must be replaced by two distinct regimes when momentum and batch size are varied together.
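For orientation, here are the update rule and the two claimed limits, written in a common heavy-ball parameterization; the exact convention is an assumption here and the paper's may differ by a rescaling of η or the sign of v:

```latex
v_{t+1} = \beta\, v_t - \eta\, \nabla L_{B_t}(\theta_t),
\qquad
\theta_{t+1} = \theta_t + v_{t+1},
\qquad
\mathrm{BS} \;\to\;
\begin{cases}
  2(1-\beta)/\eta & \text{small batches,}\\[2pt]
  2(1+\beta)/\eta & \text{large batches.}
\end{cases}
```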

Core claim

SGD with momentum exhibits an Edge of Stochastic Stability regime in which batch sharpness, the expected directional mini-batch curvature, converges to one of two batch-size-dependent plateaus. At small batch sizes it reaches the lower plateau 2(1-β)/η, which reflects momentum amplification of stochastic fluctuations and favors flatter solutions than vanilla SGD. At large batch sizes it reaches the higher plateau 2(1+β)/η, where momentum recovers its classical stabilizing effect and favors sharper solutions consistent with deterministic gradient flow. These two limits align with linear stability thresholds and cannot be captured by any single momentum-adjusted threshold.
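The claim invokes batch sharpness only as "expected directional mini-batch curvature." One plausible formalization, consistent with that phrase but not guaranteed to match the paper's Definition 3.1:

```latex
\mathrm{BS}(\theta) \;=\;
\mathbb{E}_{B}\!\left[
  \frac{\nabla L_B(\theta)^{\top}\, \nabla^{2} L_B(\theta)\, \nabla L_B(\theta)}
       {\lVert \nabla L_B(\theta) \rVert^{2}}
\right],
\qquad \beta \ \text{the momentum coefficient},\ \eta \ \text{the step size.}
```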

What carries the argument

Batch sharpness, defined as expected directional mini-batch curvature, and its convergence to the two momentum-dependent plateaus 2(1-β)/η and 2(1+β)/η at the instability boundary.

If this is right

  • Momentum favors flatter regions than vanilla SGD when batch size is small because it amplifies stochastic fluctuations.
  • Momentum favors sharper regions consistent with full-batch dynamics when batch size is large.
  • Hyperparameter tuning for momentum must treat small-batch and large-batch regimes separately rather than using one stability threshold.
  • The observed sharpness plateaus match the predictions of linear stability analysis applied to the momentum update (see the numerical check after this list).
  • The coupling of momentum and batch size directly shapes which solutions the optimizer selects.
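The linear-stability bullet above is directly checkable. A minimal sketch, assuming the standard heavy-ball recursion on a single quadratic mode with curvature λ, namely θ_{t+1} = (1 + β − ηλ)θ_t − βθ_{t−1}: the spectral radius of the companion matrix crosses 1 exactly at λ = 2(1 + β)/η, the paper's large-batch threshold.

```python
import numpy as np

def spectral_radius(lam, eta, beta):
    # One-step heavy-ball map on a quadratic mode with curvature lam:
    # theta_{t+1} = (1 + beta - eta*lam) * theta_t - beta * theta_{t-1}
    M = np.array([[1 + beta - eta * lam, -beta],
                  [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(M)))

eta, beta = 0.004, 0.9                    # illustrative values matching the figures
lam_star = 2 * (1 + beta) / eta           # predicted large-batch threshold
for lam in (0.99 * lam_star, lam_star, 1.01 * lam_star):
    print(f"lambda = {lam:8.1f}  rho = {spectral_radius(lam, eta, beta):.4f}")
# rho < 1 below the threshold, rho = 1 exactly at it, rho > 1 above it.
```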

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The split regimes may explain why practitioners often pair momentum with small batches to improve generalization.
  • The same two-regime structure could appear in other momentum-based methods such as Nesterov or Adam and would be testable by measuring batch sharpness across batch sizes (see the probe sketched after this list).
  • Adjusting the momentum coefficient as a function of batch size might allow explicit control over the sharpness of the final solution.
  • Large-batch training with momentum may require different learning-rate scaling rules than small-batch training because the effective stability threshold changes.
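The measurement these extensions call for needs only double backpropagation. A minimal PyTorch sketch, assuming batch sharpness means the expected directional mini-batch curvature g_Bᵀ H_B g_B / ‖g_B‖² (the paper's Definition 3.1 may normalize differently); `model`, `loss_fn`, and `loader` are hypothetical placeholders:

```python
import torch

def batch_sharpness(model, loss_fn, loader, n_batches=32):
    # Estimate E_B[ g_B^T H_B g_B / ||g_B||^2 ] over mini-batches B.
    params = [p for p in model.parameters() if p.requires_grad]
    vals = []
    for i, (x, y) in enumerate(loader):
        if i >= n_batches:
            break
        loss = loss_fn(model(x), y)
        g = torch.autograd.grad(loss, params, create_graph=True)
        gsq = sum((gi * gi).sum() for gi in g)        # ||g_B||^2
        hg = torch.autograd.grad(0.5 * gsq, params)   # H_B g_B, since grad(||g||^2 / 2) = H g
        ghg = sum((gi.detach() * hgi).sum() for gi, hgi in zip(g, hg))
        vals.append((ghg / gsq.detach()).item())
    return sum(vals) / len(vals)
```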

Load-bearing premise

Finite simulations of training reach the same asymptotic sharpness plateaus that linear stability analysis predicts near the instability boundary.

What would settle it

A long training run at several batch sizes in which measured batch sharpness fails to approach either 2(1-β)/η or 2(1+β)/η as training time increases.
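With a probe like the one sketched above, the proposed test reduces to checking whether the measured plateau approaches either predicted value as the training horizon grows; the helper below is a hypothetical sketch of that check.

```python
def plateau_gap(bs_measured, eta, beta):
    # Relative distance of a measured batch-sharpness plateau from the nearer
    # of the two predicted values; the claim would fail if this gap does not
    # shrink with training time at any batch size.
    low, high = 2 * (1 - beta) / eta, 2 * (1 + beta) / eta
    return min(abs(bs_measured - low), abs(bs_measured - high)) / high
```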

Figures

Figures reproduced from arXiv: 2604.14108 by Advikar Ananthkumar, Arseniy Andreyev, Marc Walden, Pierfrancesco Beneventano, Tomaso Poggio.

Figure 1. λmax under full-batch GD with momentum (left) and mini-batch SGD with momentum (right). MLP on an 8k subset of CIFAR-10 at fixed step size η = 0.004 and varying β. The stabilization level of Batch Sharpness (Definition 3.1) inverts its monotonicity in β.
Figure 2. EoSS phenomenon using SGDM (left) and SGDN (right). MLPs on an 8k subset of CIFAR-10 under different step sizes η and with β = 0.9. Batch Sharpness stabilizes around the 2(1 − β)/η = 1/(5η) threshold, shown by the dotted lines, in the small-batch (noise-dominated) regime; in the large-batch (deterministic) regime it stabilizes near 2(1 + β)/η for SGDM and 2(1 + β)/(η(1 + 2β)) for SGDN. The small-batch plateau is stric…
Figure 3. Stabilization levels of Batch Sharpness and λmax across varying batch sizes for an MLP trained with SGDM (top) and SGDN (bottom) at η = 0.005 and β = 0.9. The critical batch size, defined heuristically as the threshold at which training dynamics enter the large-batch regime, is marked for each optimizer. Notably, SGDN reaches this regime at a batch size almost an order of magnitude smaller than SGDM.
Figure 4. Dynamics of curvature statistics for SGDM with β = 0.5. Top row: MLP; bottom row: CNN. Columns correspond to batch sizes b ∈ {4, 64, 256}. Batch Sharpness and λmax rise and then plateau, with larger batches yielding higher plateau levels. For Batch Sharpness, the left column is near the small-batch level 2(1 − β)/η, the middle column lies in transition, and the right column approaches the large-batch level…
Figure 5. Within-run dynamics for an MLP with batch size b = 4. The SGDM run uses learning rate η = 0.001 with momentum β = 0.9, while the SGD run uses learning rate η = 0.01, chosen to match the effective step size. Empirically, as in the case of vanilla SGD, stabilization of Batch Sharpness induces a corresponding stabilization of the full-batch top eigenvalue λmax…
Figure 6. Within-run EoSS dynamics for an MLP under destabilizing interventions at step 75k with batch size b = 16, learning rate η = 0.004, and momentum β = 0.9. Left: destabilizing momentum intervention, increasing β to 0.95. Middle: destabilizing learning-rate intervention, increasing η to 0.0067. Right: destabilizing batch-size intervention, decreasing b to 8. Top: Batch Sharpness and λmax. Bottom: training loss…
Figure 7. Within-run EoSS dynamics for early destabilizing interventions during the progressive sharpening phase at step 10k on an MLP with baseline learning rate η = 0.004, momentum β = 0.9, and batch size b = 16. Left: destabilizing momentum intervention, increasing β to 0.95. Middle: destabilizing learning-rate intervention, increasing η to 0.0067. Right: destabilizing batch-size intervention, decreasing batch size…
Figure 9. Within-run EoSS dynamics for destabilizing interventions at high batch size at step 50k on an MLP with baseline learning rate η = 0.03, momentum β = 0.5, and batch size b = 16384. Left: destabilizing momentum intervention, increasing β to 0.52. Middle: destabilizing learning-rate intervention, increasing η to 0.035. Right: destabilizing batch-size intervention, decreasing b to 12288. Top: Batch Sharpness and λmax…
Figure 10. Within-run EoSS dynamics for stabilizing interventions with low batch sizes at step 150k on an MLP with batch size b = 16, learning rate η = 0.004, and momentum β = 0.9. Left: stabilizing momentum intervention, decreasing β to 0.875. Middle: stabilizing learning-rate intervention, decreasing η to 0.003. Right: stabilizing batch-size intervention, increasing batch size b to 32. Top: Batch Sharpness and λmax…
Figure 11. Within-run EoSS dynamics for early stabilizing interventions during the progressive sharpening phase at step 10k on an MLP with baseline learning rate η = 0.004, momentum β = 0.9, and batch size b = 16. Left: stabilizing momentum intervention, decreasing β to 0.875. Middle: stabilizing learning-rate intervention, decreasing η to 0.003. Right: stabilizing batch-size intervention, increasing batch size b to…
Figure 12. Within-run EoSS dynamics for stabilizing interventions with intermediate batch sizes at step 75k on an MLP with baseline learning rate η = 0.004, momentum β = 0.9, and batch size b = 512. Left: stabilizing momentum intervention, decreasing β to 0.875. Middle: stabilizing learning-rate intervention, decreasing η to 0.003. Right: stabilizing batch-size intervention, increasing batch size b to 768. Top: Batch Sharpness and λmax…
Figure 13. Distance from initialization, used primarily as a baseline to provide context for the separation between SGD and SGDM trajectories. While both runs move a similar total distance through parameter space, the distance between them is of a comparable order of magnitude to their distance from initialization. This lack of point-by-point proximity suggests that matching Batch Sharpness stabilization levels d…
Figure 16. MLP, η = 0.004, β = 0.9.
Figure 18. MLP, η = 0.001, β = 0.9.
Figure 23. CNN, η = 0.001, β = 0.9.
original abstract

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime in which batch sharpness (expected directional mini-batch curvature) converges to two distinct, batch-size-dependent plateaus: a lower value of 2(1-β)/η at small batch sizes (reflecting momentum-amplified stochastic fluctuations) and a higher value of 2(1+β)/η at large batch sizes (recovering classical momentum stabilization). This regime separation cannot be captured by any single momentum-adjusted stability threshold and is shown to align with linear stability analysis, with implications for hyperparameter tuning and optimization dynamics.

Significance. If the empirical plateaus and their alignment with linear thresholds hold, the work supplies explicit, testable formulas that refine the EoSS picture for momentum and mini-batching, clarifying why momentum favors flatter regions under small-batch stochasticity while recovering sharper solutions under large-batch or full-batch conditions. The parameter-free expressions in terms of β and η constitute a concrete prediction that could guide practical tuning and connect optimization dynamics to generalization.

major comments (2)
  1. [Abstract] Abstract and the linear-stability derivation: the central claim that linear stability thresholds directly dictate the observed nonlinear batch-sharpness plateaus is load-bearing, yet the manuscript provides no explicit argument or perturbation analysis showing that higher-order curvature terms or transient nonlinear effects do not shift the effective thresholds away from 2(1-β)/η and 2(1+β)/η.
  2. [Empirical section] Simulation results (finite-time convergence): the reported stabilization to the two plateaus rests on the assumption that finite-length runs accurately reflect infinite-time asymptotic behavior near the instability boundary; without reported training horizons relative to the stability time scale, convergence diagnostics, or error bars on the sharpness estimator, transient effects could produce apparent regime separation.
minor comments (2)
  1. [Abstract] The symbols β (momentum) and η (learning rate) are used in the plateau formulas without an early, self-contained definition; a brief reminder in the abstract or introduction would improve readability.
  2. [Abstract] The phrase 'batch-size-dependent behavior' is repeated; a single consolidated statement of the two regimes would reduce redundancy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. Below we provide point-by-point responses to the major comments, indicating where revisions will be made to address the concerns.

point-by-point responses
  1. Referee: [Abstract] Abstract and the linear-stability derivation: the central claim that linear stability thresholds directly dictate the observed nonlinear batch-sharpness plateaus is load-bearing, yet the manuscript provides no explicit argument or perturbation analysis showing that higher-order curvature terms or transient nonlinear effects do not shift the effective thresholds away from 2(1-β)/η and 2(1+β)/η.

    Authors: While the linear stability analysis provides the thresholds that match our empirical observations, we acknowledge the absence of an explicit perturbation argument in the manuscript. In the revision, we will add a paragraph in the theory section arguing that, consistent with the EoSS framework, the system self-organizes such that nonlinear effects are suppressed near the boundary, preserving the linear thresholds as the effective plateaus. This is supported by the close agreement in our simulations. We will also cite related literature on linear approximations in stochastic optimization. revision: partial

  2. Referee: [Empirical section] Simulation results (finite-time convergence): the reported stabilization to the two plateaus rests on the assumption that finite-length runs accurately reflect infinite-time asymptotic behavior near the instability boundary; without reported training horizons relative to the stability time scale, convergence diagnostics, or error bars on the sharpness estimator, transient effects could produce apparent regime separation.

    Authors: We agree that additional details on convergence would strengthen the empirical claims. The manuscript reports results after 10^5 training steps, which exceeds the characteristic time scales derived from the linear analysis (approximately 1/|log(stability factor)|). In the revised version, we will include plots showing the evolution of batch sharpness over time to demonstrate convergence, report standard errors from 5 independent runs, and add a discussion comparing the simulation length to the stability time scale. revision: yes
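To make the rebuttal's time-scale comparison concrete: a minimal sketch, assuming the "stability factor" is the spectral radius ρ of the linearized heavy-ball map just below the threshold, so transients decay over roughly 1/|log ρ| steps. The numerical values are illustrative, not the paper's.

```python
import numpy as np

eta, beta = 0.004, 0.9
lam = 0.99 * 2 * (1 + beta) / eta         # curvature just below the threshold
M = np.array([[1 + beta - eta * lam, -beta],
              [1.0, 0.0]])
rho = max(abs(np.linalg.eigvals(M)))      # stability factor (spectral radius)
print(1 / abs(np.log(rho)))               # characteristic decay time in steps,
                                          # to compare against the 1e5-step horizon
```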

Circularity Check

0 steps flagged

No significant circularity; stability thresholds derived independently from linear analysis

full rationale

The paper performs linear stability analysis on the momentum SGD update rule to obtain the two batch-size-dependent thresholds 2(1-β)/η and 2(1+β)/η. These are presented as the analytically expected plateaus to which batch sharpness converges. Simulations are then used to verify that observed sharpness approaches these values, which is a non-circular empirical check rather than a re-statement of fitted inputs or self-definitions. No load-bearing self-citations, ansatz smuggling, or renaming of known results are required for the central claim. The derivation chain remains self-contained against the linear dynamics of the optimizer.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full derivations and assumptions not available. The stability thresholds appear to rest on linearization of the momentum update rule and assumptions about noise statistics in mini-batch gradients.

axioms (1)
  • domain assumption Linear stability analysis of the momentum-augmented gradient update governs the long-term behavior of batch sharpness near the instability boundary.
    Invoked to link the observed plateaus to theoretical thresholds.

pith-pipeline@v0.9.0 · 5495 in / 1249 out tokens · 36071 ms · 2026-05-10T13:09:26.771620+00:00 · methodology

