pith. machine review for the scientific record.

arxiv: 2604.14108 · v1 · submitted 2026-04-15 · 💻 cs.LG · math.DS · math.OC · stat.ML

Recognition: unknown

Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 13:09 UTC · model grok-4.3

classification 💻 cs.LG · math.DS · math.OC · stat.ML
keywords SGD with momentum · edge of stochastic stability · batch sharpness · optimization dynamics · stochastic gradient descent · linear stability analysis · deep learning optimization

The pith

SGD with momentum stabilizes batch sharpness at two different plateaus depending on batch size near the stochastic stability edge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that adding momentum to mini-batch stochastic gradient descent produces an edge-of-stability regime whose sharpness behavior splits by batch size. For small batches, sharpness settles at the lower value 2(1-β)/η because momentum amplifies stochastic noise and therefore selects flatter regions than plain SGD. For large batches, sharpness instead settles at the higher value 2(1+β)/η, recovering the classical stabilizing action of momentum that is seen in full-batch training. A reader cares because the result ties a common optimizer choice directly to the sharpness of the minima that training reaches and therefore to generalization. It also shows that the usual single-threshold picture of stability must be replaced by two distinct regimes when momentum and batch size are varied together.
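For orientation, here are the update rule and the two claimed limits, written in a common heavy-ball parameterization; the exact convention is an assumption here and the paper's may differ by a rescaling of η or the sign of v:

```latex
v_{t+1} = \beta\, v_t - \eta\, \nabla L_{B_t}(\theta_t),
\qquad
\theta_{t+1} = \theta_t + v_{t+1},
\qquad
\mathrm{BS} \;\to\;
\begin{cases}
  2(1-\beta)/\eta & \text{small batches,}\\[2pt]
  2(1+\beta)/\eta & \text{large batches.}
\end{cases}
```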

Core claim

SGD with momentum exhibits an Edge of Stochastic Stability regime in which batch sharpness, the expected directional mini-batch curvature, converges to one of two batch-size-dependent plateaus. At small batch sizes it reaches the lower plateau 2(1-β)/η, which reflects momentum amplification of stochastic fluctuations and favors flatter solutions than vanilla SGD. At large batch sizes it reaches the higher plateau 2(1+β)/η, where momentum recovers its classical stabilizing effect and favors sharper solutions consistent with deterministic gradient flow. These two limits align with linear stability thresholds and cannot be captured by any single momentum-adjusted threshold.
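The claim invokes batch sharpness only as "expected directional mini-batch curvature." One plausible formalization, consistent with that phrase but not guaranteed to match the paper's Definition 3.1:

```latex
\mathrm{BS}(\theta) \;=\;
\mathbb{E}_{B}\!\left[
  \frac{\nabla L_B(\theta)^{\top}\, \nabla^{2} L_B(\theta)\, \nabla L_B(\theta)}
       {\lVert \nabla L_B(\theta) \rVert^{2}}
\right],
\qquad \beta \ \text{the momentum coefficient},\ \eta \ \text{the step size.}
```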

What carries the argument

Batch sharpness, defined as expected directional mini-batch curvature, and its convergence to the two momentum-dependent plateaus 2(1-β)/η and 2(1+β)/η at the instability boundary.

If this is right

  • Momentum favors flatter regions than vanilla SGD when batch size is small because it amplifies stochastic fluctuations.
  • Momentum favors sharper regions consistent with full-batch dynamics when batch size is large.
  • Hyperparameter tuning for momentum must treat small-batch and large-batch regimes separately rather than using one stability threshold.
  • The observed sharpness plateaus match the predictions of linear stability analysis applied to the momentum update (see the numerical check after this list).
  • The coupling of momentum and batch size directly shapes which solutions the optimizer selects.
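The linear-stability bullet above is directly checkable. A minimal sketch, assuming the standard heavy-ball recursion on a single quadratic mode with curvature λ, namely θ_{t+1} = (1 + β − ηλ)θ_t − βθ_{t−1}: the spectral radius of the companion matrix crosses 1 exactly at λ = 2(1 + β)/η, the paper's large-batch threshold.

```python
import numpy as np

def spectral_radius(lam, eta, beta):
    # One-step heavy-ball map on a quadratic mode with curvature lam:
    # theta_{t+1} = (1 + beta - eta*lam) * theta_t - beta * theta_{t-1}
    M = np.array([[1 + beta - eta * lam, -beta],
                  [1.0, 0.0]])
    return max(abs(np.linalg.eigvals(M)))

eta, beta = 0.004, 0.9                    # illustrative values matching the figures
lam_star = 2 * (1 + beta) / eta           # predicted large-batch threshold
for lam in (0.99 * lam_star, lam_star, 1.01 * lam_star):
    print(f"lambda = {lam:8.1f}  rho = {spectral_radius(lam, eta, beta):.4f}")
# rho < 1 below the threshold, rho = 1 exactly at it, rho > 1 above it.
```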

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The split regimes may explain why practitioners often pair momentum with small batches to improve generalization.
  • The same two-regime structure could appear in other momentum-based methods such as Nesterov or Adam and would be testable by measuring batch sharpness across batch sizes (see the probe sketched after this list).
  • Adjusting the momentum coefficient as a function of batch size might allow explicit control over the sharpness of the final solution.
  • Large-batch training with momentum may require different learning-rate scaling rules than small-batch training because the effective stability threshold changes.
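The measurement these extensions call for needs only double backpropagation. A minimal PyTorch sketch, assuming batch sharpness means the expected directional mini-batch curvature g_Bᵀ H_B g_B / ‖g_B‖² (the paper's Definition 3.1 may normalize differently); `model`, `loss_fn`, and `loader` are hypothetical placeholders:

```python
import torch

def batch_sharpness(model, loss_fn, loader, n_batches=32):
    # Estimate E_B[ g_B^T H_B g_B / ||g_B||^2 ] over mini-batches B.
    params = [p for p in model.parameters() if p.requires_grad]
    vals = []
    for i, (x, y) in enumerate(loader):
        if i >= n_batches:
            break
        loss = loss_fn(model(x), y)
        g = torch.autograd.grad(loss, params, create_graph=True)
        gsq = sum((gi * gi).sum() for gi in g)        # ||g_B||^2
        hg = torch.autograd.grad(0.5 * gsq, params)   # H_B g_B, since grad(||g||^2 / 2) = H g
        ghg = sum((gi.detach() * hgi).sum() for gi, hgi in zip(g, hg))
        vals.append((ghg / gsq.detach()).item())
    return sum(vals) / len(vals)
```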

Load-bearing premise

Finite simulations of training reach the same asymptotic sharpness plateaus that linear stability analysis predicts near the instability boundary.

What would settle it

A long training run at several batch sizes in which measured batch sharpness fails to approach either 2(1-β)/η or 2(1+β)/η as training time increases.
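With a probe like the one sketched above, the proposed test reduces to checking whether the measured plateau approaches either predicted value as the training horizon grows; the helper below is a hypothetical sketch of that check.

```python
def plateau_gap(bs_measured, eta, beta):
    # Relative distance of a measured batch-sharpness plateau from the nearer
    # of the two predicted values; the claim would fail if this gap does not
    # shrink with training time at any batch size.
    low, high = 2 * (1 - beta) / eta, 2 * (1 + beta) / eta
    return min(abs(bs_measured - low), abs(bs_measured - high)) / high
```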

Figures

Figures reproduced from arXiv: 2604.14108 by Advikar Ananthkumar, Arseniy Andreyev, Marc Walden, Pierfrancesco Beneventano, Tomaso Poggio.

Figure 1. λmax under full-batch GD with momentum (left) and mini-batch SGD with momentum (right). MLP on an 8k subset of CIFAR-10 at fixed step size η = 0.004 and varying β. The stabilization level of Batch Sharpness (Definition 3.1) inverts its monotonicity in β.
Figure 2. EoSS phenomenon using SGDM (left) and SGDN (right). MLPs on an 8k subset of CIFAR-10 under different step sizes η and with β = 0.9. Batch Sharpness stabilizes around the 2(1 − β)/η = 1/(5η) threshold, shown by the dotted lines, in the small-batch (noise-dominated) regime; in the large-batch (deterministic) regime it stabilizes near 2(1 + β)/η for SGDM and 2(1 + β)/(η(1 + 2β)) for SGDN. The small-batch plateau is stric…
Figure 3. Stabilization levels of Batch Sharpness and λmax across varying batch sizes for an MLP trained with SGDM (top) and SGDN (bottom) at η = 0.005 and β = 0.9. The critical batch size, defined heuristically as the threshold at which training dynamics enter the large-batch regime, is marked for each optimizer. Notably, SGDN reaches this regime at a batch size almost an order of magnitude smaller than SGDM.
Figure 4. Dynamics of curvature statistics for SGDM with β = 0.5. Top row: MLP; bottom row: CNN. Columns correspond to batch sizes b ∈ {4, 64, 256}. Batch Sharpness and λmax rise and then plateau, with larger batches yielding higher plateau levels. For Batch Sharpness, the left column is near the small-batch level 2(1 − β)/η, the middle column lies in transition, and the right column approaches the large-batch level…
Figure 5. Within-run dynamics for an MLP with batch size b = 4. The SGDM run uses learning rate η = 0.001 with momentum β = 0.9, while the SGD run uses learning rate η = 0.01, chosen to match the effective step size. Empirically, as in the case of vanilla SGD, stabilization of Batch Sharpness induces a corresponding stabilization of the full-batch top eigenvalue λmax…
Figure 6. Within-run EoSS dynamics for an MLP under destabilizing interventions at step 75k with batch size b = 16, learning rate η = 0.004, and momentum β = 0.9. Left: destabilizing momentum intervention, increasing β to 0.95. Middle: destabilizing learning-rate intervention, increasing η to 0.0067. Right: destabilizing batch-size intervention, decreasing b to 8. Top: Batch Sharpness and λmax. Bottom: training loss…
Figure 7. Within-run EoSS dynamics for early destabilizing interventions during the progressive sharpening phase at step 10k on an MLP with baseline learning rate η = 0.004, momentum β = 0.9, and batch size b = 16. Left: destabilizing momentum intervention, increasing β to 0.95. Middle: destabilizing learning-rate intervention, increasing η to 0.0067. Right: destabilizing batch-size intervention, decreasing batch size…
Figure 9. Within-run EoSS dynamics for destabilizing interventions at high batch size at step 50k on an MLP with baseline learning rate η = 0.03, momentum β = 0.5, and batch size b = 16384. Left: destabilizing momentum intervention, increasing β to 0.52. Middle: destabilizing learning-rate intervention, increasing η to 0.035. Right: destabilizing batch-size intervention, decreasing b to 12288. Top: Batch Sharpness and λmax…
Figure 10. Within-run EoSS dynamics for stabilizing interventions with low batch sizes at step 150k on an MLP with batch size b = 16, learning rate η = 0.004, and momentum β = 0.9. Left: stabilizing momentum intervention, decreasing β to 0.875. Middle: stabilizing learning-rate intervention, decreasing η to 0.003. Right: stabilizing batch-size intervention, increasing batch size b to 32. Top: Batch Sharpness and λmax…
Figure 11. Within-run EoSS dynamics for early stabilizing interventions during the progressive sharpening phase at step 10k on an MLP with baseline learning rate η = 0.004, momentum β = 0.9, and batch size b = 16. Left: stabilizing momentum intervention, decreasing β to 0.875. Middle: stabilizing learning-rate intervention, decreasing η to 0.003. Right: stabilizing batch-size intervention, increasing batch size b to…
Figure 12. Within-run EoSS dynamics for stabilizing interventions with intermediate batch sizes at step 75k on an MLP with baseline learning rate η = 0.004, momentum β = 0.9, and batch size b = 512. Left: stabilizing momentum intervention, decreasing β to 0.875. Middle: stabilizing learning-rate intervention, decreasing η to 0.003. Right: stabilizing batch-size intervention, increasing batch size b to 768. Top: Batch Sharpness and λmax…
Figure 13. Distance from initialization, used primarily as a baseline to provide context for the separation between SGD and SGDM trajectories. While both runs move a similar total distance through parameter space, the distance between them is of a comparable order of magnitude to their distance from initialization. This lack of point-by-point proximity suggests that matching Batch Sharpness stabilization levels d…
Figure 16. MLP, η = 0.004, β = 0.9.
Figure 18. MLP, η = 0.001, β = 0.9.
Figure 23. CNN, η = 0.001, β = 0.9.
original abstract

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime in which batch sharpness (expected directional mini-batch curvature) converges to two distinct, batch-size-dependent plateaus: a lower value of 2(1-β)/η at small batch sizes (reflecting momentum-amplified stochastic fluctuations) and a higher value of 2(1+β)/η at large batch sizes (recovering classical momentum stabilization). This regime separation cannot be captured by any single momentum-adjusted stability threshold and is shown to align with linear stability analysis, with implications for hyperparameter tuning and optimization dynamics.

Significance. If the empirical plateaus and their alignment with linear thresholds hold, the work supplies explicit, testable formulas that refine the EoSS picture for momentum and mini-batching, clarifying why momentum favors flatter regions under small-batch stochasticity while recovering sharper solutions under large-batch or full-batch conditions. The parameter-free expressions in terms of β and η constitute a concrete prediction that could guide practical tuning and connect optimization dynamics to generalization.

major comments (2)
  1. [Abstract] Abstract and the linear-stability derivation: the central claim that linear stability thresholds directly dictate the observed nonlinear batch-sharpness plateaus is load-bearing, yet the manuscript provides no explicit argument or perturbation analysis showing that higher-order curvature terms or transient nonlinear effects do not shift the effective thresholds away from 2(1-β)/η and 2(1+β)/η.
  2. [Empirical section] Simulation results (finite-time convergence): the reported stabilization to the two plateaus rests on the assumption that finite-length runs accurately reflect infinite-time asymptotic behavior near the instability boundary; without reported training horizons relative to the stability time scale, convergence diagnostics, or error bars on the sharpness estimator, transient effects could produce apparent regime separation.
minor comments (2)
  1. [Abstract] The symbols β (momentum) and η (learning rate) are used in the plateau formulas without an early, self-contained definition; a brief reminder in the abstract or introduction would improve readability.
  2. [Abstract] The phrase 'batch-size-dependent behavior' is repeated; a single consolidated statement of the two regimes would reduce redundancy.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. Below we provide point-by-point responses to the major comments, indicating where revisions will be made to address the concerns.

point-by-point responses
  1. Referee: [Abstract] Abstract and the linear-stability derivation: the central claim that linear stability thresholds directly dictate the observed nonlinear batch-sharpness plateaus is load-bearing, yet the manuscript provides no explicit argument or perturbation analysis showing that higher-order curvature terms or transient nonlinear effects do not shift the effective thresholds away from 2(1-β)/η and 2(1+β)/η.

    Authors: While the linear stability analysis provides the thresholds that match our empirical observations, we acknowledge the absence of an explicit perturbation argument in the manuscript. In the revision, we will add a paragraph in the theory section arguing that, consistent with the EoSS framework, the system self-organizes such that nonlinear effects are suppressed near the boundary, preserving the linear thresholds as the effective plateaus. This is supported by the close agreement in our simulations. We will also cite related literature on linear approximations in stochastic optimization. revision: partial

  2. Referee: [Empirical section] Simulation results (finite-time convergence): the reported stabilization to the two plateaus rests on the assumption that finite-length runs accurately reflect infinite-time asymptotic behavior near the instability boundary; without reported training horizons relative to the stability time scale, convergence diagnostics, or error bars on the sharpness estimator, transient effects could produce apparent regime separation.

    Authors: We agree that additional details on convergence would strengthen the empirical claims. The manuscript reports results after 10^5 training steps, which exceeds the characteristic time scales derived from the linear analysis (approximately 1/|log(stability factor)|). In the revised version, we will include plots showing the evolution of batch sharpness over time to demonstrate convergence, report standard errors from 5 independent runs, and add a discussion comparing the simulation length to the stability time scale. revision: yes
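To make the rebuttal's time-scale comparison concrete: a minimal sketch, assuming the "stability factor" is the spectral radius ρ of the linearized heavy-ball map just below the threshold, so transients decay over roughly 1/|log ρ| steps. The numerical values are illustrative, not the paper's.

```python
import numpy as np

eta, beta = 0.004, 0.9
lam = 0.99 * 2 * (1 + beta) / eta         # curvature just below the threshold
M = np.array([[1 + beta - eta * lam, -beta],
              [1.0, 0.0]])
rho = max(abs(np.linalg.eigvals(M)))      # stability factor (spectral radius)
print(1 / abs(np.log(rho)))               # characteristic decay time in steps,
                                          # to compare against the 1e5-step horizon
```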

Circularity Check

0 steps flagged

No significant circularity; stability thresholds derived independently from linear analysis

full rationale

The paper performs linear stability analysis on the momentum SGD update rule to obtain the two batch-size-dependent thresholds 2(1-β)/η and 2(1+β)/η. These are presented as the analytically expected plateaus to which batch sharpness converges. Simulations are then used to verify that observed sharpness approaches these values, which is a non-circular empirical check rather than a re-statement of fitted inputs or self-definitions. No load-bearing self-citations, ansatz smuggling, or renaming of known results are required for the central claim. The derivation chain remains self-contained against the linear dynamics of the optimizer.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review performed on abstract only; full derivations and assumptions not available. The stability thresholds appear to rest on linearization of the momentum update rule and assumptions about noise statistics in mini-batch gradients.

axioms (1)
  • domain assumption Linear stability analysis of the momentum-augmented gradient update governs the long-term behavior of batch sharpness near the instability boundary.
    Invoked to link the observed plateaus to theoretical thresholds.

pith-pipeline@v0.9.0 · 5495 in / 1249 out tokens · 36071 ms · 2026-05-10T13:09:26.771620+00:00 · methodology

