Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

2 Pith papers cite this work. Polarity classification is still indexing.

abstract

Recent work suggests that (stochastic) gradient descent self-organizes near an instability boundary, shaping both optimization and the solutions found. Momentum and mini-batch gradients are widely used in practical deep learning optimization, but it remains unclear whether they operate in a comparable regime of instability. We demonstrate that SGD with momentum exhibits an Edge of Stochastic Stability (EoSS)-like regime with batch-size-dependent behavior that cannot be explained by a single momentum-adjusted stability threshold. Batch Sharpness (the expected directional mini-batch curvature) stabilizes in two distinct regimes: at small batch sizes it converges to a lower plateau $2(1-\beta)/\eta$, reflecting amplification of stochastic fluctuations by momentum and favoring flatter regions than vanilla SGD; at large batch sizes it converges to a higher plateau $2(1+\beta)/\eta$, where momentum recovers its classical stabilizing effect and favors sharper regions consistent with full-batch dynamics. We further show that this aligns with linear stability thresholds and discuss the implications for hyperparameter tuning and coupling.
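As a rough illustration of the two plateaus named in the abstract, the sketch below evaluates $2(1-\beta)/\eta$ and $2(1+\beta)/\eta$ for one choice of hyperparameters. It also checks the upper value against the linear stability of deterministic heavy-ball momentum on a one-dimensional quadratic. The update convention $x_{t+1} = x_t - \eta \lambda x_t + \beta (x_t - x_{t-1})$, the function names, and the example values $\eta = 0.01$, $\beta = 0.9$ are assumptions made for illustration and are not taken from the paper. The code does not simulate mini-batch noise, so it says nothing about how the lower plateau arises.

```python
import cmath


def batch_sharpness_plateaus(eta, beta):
    """Return the (small-batch, large-batch) plateau values quoted in the abstract.

    eta is the learning rate and beta the momentum coefficient (assumed 0 <= beta < 1).
    """
    if eta <= 0 or not (0 <= beta < 1):
        raise ValueError("expected eta > 0 and 0 <= beta < 1")
    low = 2 * (1 - beta) / eta   # small-batch plateau
    high = 2 * (1 + beta) / eta  # large-batch plateau
    return low, high


def heavy_ball_spectral_radius(curvature, eta, beta):
    """Spectral radius of the full-batch heavy-ball iteration on f(x) = curvature * x^2 / 2.

    Assumed update: x_{t+1} = x_t - eta * curvature * x_t + beta * (x_t - x_{t-1}).
    Its characteristic polynomial is z^2 - trace * z + det = 0.
    """
    trace = 1 + beta - eta * curvature
    det = beta
    disc = cmath.sqrt(trace * trace - 4 * det)
    roots = ((trace + disc) / 2, (trace - disc) / 2)
    return max(abs(r) for r in roots)


# Hypothetical hyperparameters chosen only for illustration.
eta, beta = 0.01, 0.9
low, high = batch_sharpness_plateaus(eta, beta)
print(f"small-batch plateau 2(1-beta)/eta = {low:.1f}")
print(f"large-batch plateau 2(1+beta)/eta = {high:.1f}")

# The deterministic iteration contracts just below the upper plateau
# and diverges just above it.
print("radius just below:", heavy_ball_spectral_radius(0.99 * high, eta, beta))
print("radius just above:", heavy_ball_spectral_radius(1.01 * high, eta, beta))
```

Under these assumptions the two plateaus differ by a factor of $(1+\beta)/(1-\beta)$, which is 19 for $\beta = 0.9$. This gives a sense of how strongly the regime depends on batch size at high momentum.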

representative citing papers

Does Weight Decay Enhance Training Stability?

cs.LG · 2026-05-15 · conditional · novelty 6.0

Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.

citing papers explorer

Showing 2 of 2 citing papers.

  • Edge of Stability Selectively Shapes Learning Across the Data Distribution cs.LG · 2026-06-02 · unverdicted · none · ref 10 · internal anchor

    Edge of stability acts as a selective mechanism that amplifies learning on data groups with aligned persistent gradients while suppressing others.

  • Does Weight Decay Enhance Training Stability? cs.LG · 2026-05-15 · conditional · none · ref 21 · internal anchor

    Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.