On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

Cameron J. Hogan; Martin T. Wells; Sharan Sahu

arxiv: 2601.12238 · v4 · pith:GITJNNV6new · submitted 2026-01-18 · 📊 stat.ML · cs.LG· math.OC

On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization

Sharan Sahu , Cameron J. Hogan , Martin T. Wells This is my paper

Pith reviewed 2026-05-21 15:29 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OC

keywords momentum SGDnonstationary optimizationtracking errordistribution shiftdrift amplificationminimax lower boundstime-varying objectives

0 comments

The pith

Momentum SGD incurs a drift-amplification penalty under distribution shift that grows as the momentum parameter approaches 1, producing systematic tracking lag that vanilla SGD avoids.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines SGD and its momentum variants for tracking optima in time-varying stochastic problems that remain strongly convex and smooth. It derives finite-time bounds that split the tracking error into transient, noise-driven, and drift-driven parts. The decomposition reveals that higher momentum values amplify the drift component, creating lag that worsens as the momentum parameter nears 1. Matching minimax lower bounds under gradient-variation limits show the penalty is unavoidable in drift-heavy regimes rather than an artifact of the upper-bound analysis.

Core claim

Under the stated strong-convexity and smoothness assumptions, momentum methods introduce an explicit drift-amplification term in the tracking error that diverges with the momentum parameter β approaching 1. This produces systematic lag behind the moving optimum. The upper bounds are paired with minimax lower bounds over gradient-variation classes, establishing that stale-gradient averaging forces an information-theoretic barrier: in drift-dominated regimes momentum is provably inferior to plain SGD.

What carries the argument

The tracking-error decomposition that isolates a drift-induced component whose size grows with the momentum parameter β.

If this is right

In drift-dominated regimes vanilla SGD provably tracks the moving optimum more accurately than momentum variants.
The information-theoretic lower bounds delimit the regime boundaries where momentum becomes harmful.
The explicit penalty explains observed instability of momentum in dynamic environments.
Practitioners can use the decomposition to decide when to disable momentum.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar lag penalties may appear in other accelerated first-order methods under nonstationarity.
The result suggests testing momentum decay schedules that keep β well below 1 during periods of high drift.
Extensions could relax strong convexity to see whether the barrier persists in convex or nonconvex drifting settings.

Load-bearing premise

The time-varying objectives remain strongly convex and smooth throughout the analysis.

What would settle it

In a controlled strongly convex and smooth nonstationary problem with measurable gradient variations, run momentum SGD at several β values approaching 1 and record whether the observed tracking error increases with β; if the error stays flat or decreases, the claimed penalty does not hold.

Figures

Figures reproduced from arXiv: 2601.12238 by Cameron J. Hogan, Martin T. Wells, Sharan Sahu.

**Figure 1.** Figure 1: Two minimax regimes and the inertia window. (a) After a regime change, momentum averages past gradients: this reduces noise but induces inertia, so the iterate lags behind the drifting minimizer for an “inertia window” whose length grows with β. (b) The minimax lower bound in Theorem 3.7 is the maximum of two contributions: a noise/variation-limited term (information-limited) and an inertia-limited term (d… view at source ↗

**Figure 2.** Figure 2: Tracking a drifting minimizer under strong convexity. Mean squared tracking error versus time for SGD, Heavy-Ball (HB), and Nesterov (NAG) on a strongly convex quadratic with γ = 0.1, β = 0.9, σ2 = 0.8 [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗

**Figure 3.** Figure 3: Structure of the proof: Arrows represent the main logical implications between results. The section Tracking Bounds in Expectation develops the core recursive analysis for both (SGD) (C.1–C.4) and (SGDM) (C.5–C.9), including the key Φe → Φ conversion (C.7). High-Probability Bounds lifts these results to pathwise guarantees via concentration arguments (D.1–D.4). The Minimax Lower Bounds section constructs t… view at source ↗

**Figure 4.** Figure 4: Non-stationary benchmark suite. Results for Linear and Logistic tasks. (Continued on next page.) 63 [PITH_FULL_IMAGE:figures/full_fig_p063_4.png] view at source ↗

**Figure 4.** Figure 4: Non-stationary benchmark suite. Mean ± 1 std over 20 seeds for well-conditioned (κ = 10) and ill-conditioned (κ = 1000) regimes. Tracking error: parameter-space for linear/logistic, prediction-space for MLP. Step sizes respect the stability bounds from Theorem 3.3. 64 [PITH_FULL_IMAGE:figures/full_fig_p064_4.png] view at source ↗

read the original abstract

In this paper, we provide a comprehensive theoretical analysis of Stochastic Gradient Descent (SGD) and its momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Our finite-time bounds reveal a sharp decomposition of tracking error into transient, noise-induced, and drift-induced components. This decomposition exposes a fundamental trade-off: while momentum is often used as a gradient-smoothing heuristic, under distribution shift it incurs an explicit drift-amplification penalty that diverges as the momentum parameter $\beta$ approaches 1, yielding systematic tracking lag. We complement these upper bounds with minimax lower bounds under gradient-variation constraints, proving this momentum-induced tracking penalty is not an analytical artifact but an information-theoretic barrier: in drift-dominated regimes, momentum is unavoidably worse because stale-gradient averaging forces systematic lag. Our results provide theoretical grounding for the empirical instability of momentum in dynamic settings and precisely delineate regime boundaries where vanilla SGD provably outperforms its accelerated counterparts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Momentum SGD picks up a clear drift-amplification penalty under distribution shift that vanilla SGD avoids, with matching upper and lower bounds.

read the letter

The main point is that momentum SGD incurs an explicit drift-amplification penalty in nonstationary stochastic optimization that diverges as beta approaches 1, leading to tracking lag, while vanilla SGD does not. The paper backs this with finite-time upper bounds and minimax lower bounds under gradient-variation constraints. This decomposition of tracking error into transient, noise-induced, and drift-induced terms is new and helpful. It clarifies the trade-off and shows regime boundaries where SGD outperforms momentum. The lower bounds turn the penalty into an information-theoretic result rather than an artifact of the upper bound analysis. Within the strongly convex and smooth assumptions, the argument is consistent and the drift term is expressed directly without circularity. The softer part is the reliance on strong convexity and smoothness. This is standard for such analyses but limits direct application to non-convex problems common in practice. The abstract leaves some constant dependencies implicit, which might affect how sharp the bounds are in finite time or moderate dimensions. These issues are minor given the focused scope. This paper targets readers working on theoretical aspects of optimization under distribution shift or time-varying objectives. It would be valuable for anyone needing precise guidance on when to prefer SGD over momentum in dynamic settings. I recommend sending it for peer review. The work is grounded and adds useful insight on algorithm choice.

Referee Report

1 major / 2 minor

Summary. The manuscript analyzes SGD and momentum variants (Polyak Heavy-Ball and Nesterov) for tracking time-varying optima under strong convexity and smoothness. Finite-time upper bounds decompose tracking error into transient, noise-induced, and drift-induced terms, exposing a drift-amplification penalty for momentum that diverges as β approaches 1. These upper bounds are paired with minimax lower bounds under gradient-variation constraints, establishing the penalty as an information-theoretic barrier in drift-dominated regimes where momentum is provably worse than vanilla SGD.

Significance. If the bounds are correct, the work supplies a precise theoretical account of why momentum can exhibit instability under distribution shift, while identifying explicit regime boundaries where SGD outperforms its accelerated counterparts. The upper/lower bound pairing is a strength, as it converts an empirical observation into a falsifiable, information-theoretic statement rather than an artifact of analysis technique.

major comments (1)

[lower-bound section (minimax result)] The central claim that the drift penalty is information-theoretic rests on the gradient-variation constraint in the lower-bound construction. Please confirm in the proof of the minimax lower bound that the variation measure is defined identically to the one appearing in the upper-bound drift term; any mismatch would weaken the matching claim.

minor comments (2)

The abstract states that both Polyak and Nesterov momentum incur the same drift penalty; the main text should explicitly note whether the constants differ between the two variants or whether the analysis treats them uniformly.
[finite-time bounds] All finite-time bounds should list the explicit dependence of the drift term on β and on the gradient-variation measure so that readers can directly recover the regime boundaries without re-deriving the expressions.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their thorough review and insightful comment on the minimax lower bound. We are pleased that the significance of the work is recognized. Below we provide a point-by-point response to the major comment.

read point-by-point responses

Referee: [lower-bound section (minimax result)] The central claim that the drift penalty is information-theoretic rests on the gradient-variation constraint in the lower-bound construction. Please confirm in the proof of the minimax lower bound that the variation measure is defined identically to the one appearing in the upper-bound drift term; any mismatch would weaken the matching claim.

Authors: We confirm that the variation measure is defined identically in both the upper and lower bounds. In the upper-bound analysis (Theorem 3), the drift-induced error term is bounded in terms of the total gradient variation V_G = sum_{t=1}^T ||nabla f_t(x_t^*) - nabla f_{t-1}(x_{t-1}^*)||, where x_t^* denotes the time-varying optimum. The minimax lower bound construction in Section 4 employs precisely the same constraint: the adversary is restricted to sequences where the cumulative gradient variation is at most V_G. This identical definition ensures that the lower bound applies directly to the regimes analyzed in the upper bounds, establishing that the drift-amplification penalty for momentum is indeed an information-theoretic barrier rather than an artifact of the analysis. To make this equivalence more explicit, we will include a clarifying remark in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper derives finite-time upper bounds decomposing tracking error into transient, noise-induced, and drift-induced terms for SGD and momentum methods under strong convexity and smoothness, explicitly expressing the drift-amplification penalty in terms of the momentum parameter β and gradient variation. These are paired with independent minimax lower bounds under gradient-variation constraints. No step reduces by construction to a fitted parameter, self-definition, or load-bearing self-citation chain; the bounds are self-contained mathematical derivations within the stated assumptions and do not rely on renaming or smuggling ansatzes from prior self-work. This matches the default expectation of no significant circularity (score 0-2) for papers with explicit, falsifiable bound derivations.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard domain assumptions of strong convexity and smoothness plus a gradient-variation constraint; no free parameters are fitted and no new entities are postulated.

axioms (2)

domain assumption The time-varying objective functions are strongly convex and smooth.
Invoked to obtain the finite-time tracking error bounds and the decomposition into transient, noise, and drift components.
domain assumption Gradient variation is bounded in the minimax lower-bound construction.
Used to establish the information-theoretic barrier for momentum methods.

pith-pipeline@v0.9.0 · 5711 in / 1380 out tokens · 54206 ms · 2026-05-21T15:29:14.951043+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adapt or Forget: Provable Tradeoffs Between Adam and SGD in Nonstationary Optimization
stat.ML 2026-05 unverdicted novelty 6.0

Adam's adaptive preconditioning and first-moment averaging improve high-probability tracking error in noise-dominated nonstationary regimes but can increase it under strong drift, where SGD achieves a smaller floor, w...

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Cevher, V ., Becker, S., and Schmidt, M

doi: 10.23919/ACC.2019.8814889. Cevher, V ., Becker, S., and Schmidt, M. Convex optimiza- tion for big data: Scalable, randomized, and parallel al- gorithms for big data analytics.IEEE Signal Processing Magazine, 31(5):32–43, 2014. doi: 10.1109/MSP.2014. 2329397. Chen*, J., Pan*, X., Monga, R., Bengio, S., and Joze- fowicz, R. Revisiting distributed synch...

work page doi:10.23919/acc.2019.8814889 2019
[2]

, author Zhou, A

URL https://openreview.net/forum? id=HyAddcLge. Chen, J., Wolfe, C., Li, Z., and Kyrillidis, A. Demon: Improved neural network training with momentum decay. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3958–3962, 2022. doi: 10.1109/ICASSP43922.2022. 9746839. Chen, X., Wang, Y ., and Wang, Y .-...

work page doi:10.1109/icassp43922.2022 2022
[3]

cc/paper_files/paper/2021/file/ 62e7f2e090fe150ef8deb4466fdc81b3-Paper

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 62e7f2e090fe150ef8deb4466fdc81b3-Paper. pdf. Cutler, J., Drusvyatskiy, D., and Harchaoui, Z. Stochas- tic optimization under distributional drift.Jour- nal of Machine Learning Research, 24(147):1–56,

work page 2021
[4]

Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P

URL http://jmlr.org/papers/v24/ 21-1410.html. Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P. Koala: A Kalman opti- mization algorithm with loss adaptivity.Proceedings of the AAAI Conference on Artificial Intelligence, 36 (6):6471–6479, Jun. 2022. doi: 10.1609/aaai.v36i6. 20599. URL https://ojs.aaai.org/index. php/AAAI/art...

work page doi:10.1609/aaai.v36i6 2022
[5]

Gitman, I., Lang, H., Zhang, P., and Xiao, L.Understanding the role of momentum in stochastic gradient methods

URL https://proceedings.mlr.press/ v97/ghorbani19b.html. Gitman, I., Lang, H., Zhang, P., and Xiao, L.Understanding the role of momentum in stochastic gradient methods. Curran Associates Inc., Red Hook, NY , USA, 2019. Godichon-Baggioni, A., Lu, W., and Portier, B. Online estimation of the inverse of the hessian for stochastic op- timization with applicat...

work page doi:10.1109/tit.1980.1056141 2019
[6]

doi: 10.1561/2400000013

ISSN 2167-3888. doi: 10.1561/2400000013. URL https://doi.org/10.1561/2400000013. Huang, T., Zhu, Z., Jin, G., Liu, L., Wang, Z., and Liu, S. SPAM: Spike-aware Adam with momentum reset for sta- ble LLM training. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=L9eBxTCpQG. Ilboudo, W. E. L.,...

work page doi:10.1561/2400000013 2025
[7]

Jagannath, A., Jones-McCormick, T., and Sarangian, V

doi: 10.1109/TNNLS.2020.3041755. Jagannath, A., Jones-McCormick, T., and Sarangian, V . High-dimensional limit theorems for SGD: Momentum and adaptive step-sizes, 2025. URL https://arxiv. org/abs/2511.03952. arXiv:2511.03952. Jastrz˛ ebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y ., and Storkey, A. On the relation between the sharpest direction...

work page doi:10.1109/tnnls.2020.3041755 2020
[8]

Jelassi, S

URL https://openreview.net/forum? id=SkgEaj05t7. Jelassi, S. and Li, Y . Towards understanding how momentum improves generalization in deep learning,

work page
[9]

Jentzen, A., Kranz, J., and Riekert, A

URL https://openreview.net/forum? id=lf0W6tcWmh-. Jentzen, A., Kranz, J., and Riekert, A. PADAM: Parallel av- eraged Adam reduces the error for stochastic optimization in scientific machine learning, 2025. URL https:// arxiv.org/abs/2505.22085. arXiv:2505.22085. Kingma, D. P. and Ba, J. Adam: A method for stochas- tic optimization. In Bengio, Y . and LeCu...

work page doi:10.1007/978-1-4899-2696-8_9 2025
[10]

org/CorpusID:15039054

URL https://api.semanticscholar. org/CorpusID:15039054. 10 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Li, X., Luo, J., Zheng, Z., Wang, H., Luo, L., Wen, L., Wu, L., and Xu, S. On the performance analysis of momentum method: A frequency domain perspective. InThe Thirteenth International Conference on Learning Re...

work page 2025
[11]

URL https://proceedings.mlr.press/ v237/liao24a.html. Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y ., Qin, Y ., Xu, W., Lu, E., Yan, J., Chen, Y ., Zheng, H., Liu, Y ., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y ., Wang, J., Dong, M., Zhang, Z., Kang, Y ., Zhang, H., Xu, X., Zhang, Y ., Wu, Y ., Zhou, X., and Yang, Z. Muon is scalable for llm train...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Luo, L., Xiong, Y ., and Liu, Y

URL https://openreview.net/forum? id=Syxt5oC5YQ. Luo, L., Xiong, Y ., and Liu, Y . Adaptive gradient methods with dynamic bound of learning rate. InIn- ternational Conference on Learning Representations,

work page
[13]

Nesterov, Y

URL https://openreview.net/forum? id=Bkg3g2R9FX. Nesterov, Y . A method for solving the convex programming problem with convergence rateo(1/k2).Dokl Akad Nauk SSSR, 269:543, 1983. URL https://cir.nii.ac. jp/crid/1370862715914709505. Nesterov, Y . Introductory lectures on convex opti- mization - a basic course. InApplied Optimization,

work page arXiv 1983
[14]

The Extended Kalman Filter is a Natural Gradient Descent in Trajectory Space

URL https://api.semanticscholar. org/CorpusID:62288331. Ollivier, Y . The extended Kalman filter is a natural gradi- ent descent in trajectory space, 2019. URL https:// arxiv.org/abs/1901.00696. arXiv:1901.00696. Papyan, V . Traces of class/cross-class structure pervade deep learning spectra.Journal of Machine Learning Research, 21(252):1–64, 2020. URL ht...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/0041-5553(63)90382-3 2019
[15]

URL https://arxiv.org/abs/2506. 11904. arXiv:2506.11904. Vuckovic, J. Kalman gradient descent: Adap- tive variance reduction in stochastic optimization,

work page arXiv
[16]

URL https://arxiv.org/abs/1810. 12273. arXiv:1810.12273. Vyas, N., Morwani, D., Zhao, R., Shapira, I., Brandfon- brener, D., Janson, L., and Kakade, S. M. SOAP: Im- proving and stabilizing shampoo using Adam for lan- guage modeling. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=IDxZhXrp...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

URL https: //doi.org/10.1155/2017/9478952

doi: 10.1155/2017/9478952. URL https: //doi.org/10.1155/2017/9478952. Article ID 9478952, 9 pages. Yang, X. Towards stochastic gradient variance reduction by solving a filtering problem, 2023. URL https:// openreview.net/forum?id=0sxmoci9Ma. Yao, Z., Gholami, A., Shen, S., Mustafa, M., Keutzer, K., and Mahoney, M. ADAHESSIAN: An adaptive second order opti...

work page doi:10.1155/2017/9478952 2017
[18]

ADADELTA: An Adaptive Learning Rate Method

URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3-Paper. pdf. Zeiler, M. D. Adadelta: An adaptive learning rate method, 2012. URL https://arxiv.org/abs/ 1212.5701. arXiv:1212.5701. Zhang, J. and Mitliagkas, I. Yellowfin and the art of momentum tuning. In Talwalkar, A., Smith, V ., and Zaharia, M. (eds.),Proc...

work page internal anchor Pith review Pith/arXiv arXiv 2018
[19]

org/paper_files/paper/2019/file/ b205b525b7ce002baae53228bab6d26b-Paper

URL https://proceedings.mlsys. org/paper_files/paper/2019/file/ b205b525b7ce002baae53228bab6d26b-Paper. pdf. Zhang, T. Solving large scale linear prediction prob- lems using stochastic gradient descent algorithms. In 12 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Proceedings of the Twenty-First International Con-...

work page doi:10.1016/j.future.2019.09 2019
[20]

Zhao, P., Zhang, Y .-J., Zhang, L., and Zhou, Z.-H

URL https://www.sciencedirect.com/ science/article/pii/S0167739X18332540. Zhao, P., Zhang, Y .-J., Zhang, L., and Zhou, Z.-H. Adaptiv- ity and non-stationarity: Problem-dependent dynamic re- gret for online convex optimization.Journal of Machine Learning Research, 25(98):1–52, 2024. URL http: //jmlr.org/papers/v25/21-0748.html. Zhou, K., Jin, Y ., Ding, Q...

work page 2024
[21]

time- resolved

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ eddea82ad2755b24c4e168c5fc2ebd40-Paper. pdf. Ziyin, L., Wang, Z. T., and Ueda, M. Laprop: Separating mo- mentum and adaptivity in Adam, 2021. URL https:// arxiv.org/abs/2002.04839. arXiv:2002.04839. 13 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization App...

work page arXiv 2021
[22]

Then ξt+1(θ) is Ft+1–measurable and satisfies E[ξt+1(θ)| Ft] =0a.s

(Martingale difference noise)Define the conditional mean gradient mt+1(θ) =E ∇θg(θ, Xt+1)| F t , and the gradient noise ξt+1(θ) =∇ θg(θ, Xt+1)−m t+1(θ). Then ξt+1(θ) is Ft+1–measurable and satisfies E[ξt+1(θ)| Ft] =0a.s. for allθ∈R d. Under this framework, SGD admits the decomposition θt+1 =θ t −γ tmt+1(θt)−γ tξt+1(θt).(SGD) For momentum, we use a two-par...

work page
[23]

(Gradient noise along iterates)The gradient noise ξt+1 satisfies E ∥ξt+1(θt)∥2 ≤σ 2 and E ∥ξt+1(ψt)∥2 ≤ 18 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization σ2 a.s. C.1. Proof for SGD tracking error bound First we will prove a recursive relation for the tracking error that we will subsequently use for our expectation a...

work page 2021
[24]

eθt eθt−1 # = 1 1−β

(Step-decay schedule in the low drift-to-noise regime).Suppose γ⋆ <1/2L (equivalently, the minimizer of E(γ) is not at the smoothness cap), so that γ⋆ = 8∆2 µσ2 1/3 ,E= 3 ∆σ2 µ2 2/3 . Define epochsk= 0,1, . . . , K−1with γ0 := 1 2L , γ k := γk−1 +γ ⋆ 2 (k≥1), K:= 1 + l log2 γ0 γ⋆ m , and epoch lengths T0 := l 2 µγ0 log 2∥θ0 −θ ⋆ 0∥2 E(γ0) m , T k := l2 lo...

work page 2016
[25]

mode-splitting

(Step-decay schedule with momentum restart).Suppose γ⋆ β < µ(1−β) 2/(4L2) (i.e., the minimizer of Eβ(γ) is not at the stability cap). Define the epoch stepsizes γ0 := µ(1−β) 2 4L2 , γ k := γk−1 +γ ⋆ β 2 (k≥1), K:= 1 + l log2 γ0 γ⋆ β m . 31 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Define epoch lengths T0 := l1−...

work page 2012
[26]

(Minimizers.) Eachg u admits a unique minimizerθ ⋆ u, andθ ⋆ u =uae 1

work page
[27]

Proof of Lemma E.4

(Discrepancy.) The two-point discrepancy satisfiesχ(g +, g−)≥µa 2/8. Proof of Lemma E.4. On the Euclidean ball {θ:∥θ∥ 2 ≤r/2} , we have ψr(θ) = 1 and ∇ψr(θ) = 0 by construction of the bump. Therefore, for all ∥θ∥2 ≤r/2 , ∇gu(θ) =µθ−uµae 1. Setting ∇gu(θ) = 0 yields θ=uae 1, and this point indeed lies in the region where the above simplification holds sinc...

work page
[28]

The Markov chain (St)t≥0 admits a unique stationary distribution π∞, which is a centered Gaussian N(0,Σ ∞) on R2

work page
[29]

The stationary covarianceΣ ∞ is the unique positive semidefinite solution of the discrete-time Lyapunov equation Σ =AΣA ⊤ +σ 2BB ⊤.(172)

work page
[30]

WritingΣ ∞ = ( v c c v )withv= Var π∞(xt)andc= Cov π∞(xt, xt−1), we have the exact closed form v= (1 +β)γ 2σ2 (1−β) (1 +β) 2 −(1 +β−γµ) 2 = (1 +β)γσ 2 (1−β)µ 2(1 +β)−γµ .(173)

work page
[31]

Moreover, if additionallyγµ≤1 +β(a common small-step regime), then v≥ γσ2 4µ(1−β) .(174) Proof of Lemma E.7

In the stable regime one necessarily has γµ <2(1 +β) , so the denominator in (173) is positive. Moreover, if additionallyγµ≤1 +β(a common small-step regime), then v≥ γσ2 4µ(1−β) .(174) Proof of Lemma E.7. We proceed in four steps: (1) existence/uniqueness of a stationary law, (2) characterization via a Lyapunov equation, (3) explicit solution of the(1,1)e...

work page 2019
[32]

Gaussian-gradient KL bound:from Lemma E.12, we have supu,v∈U DKL(P π u ∥P π v )≤Cµ 2a2T /(σ2/(1−β)),∀π

work page
[33]

LetV T >0be the variation budget

Packing size lower bound:from Lemma E.9, we have the packing size lower bound log|U | ≥c J for a universal constantc >0(e.g., from Varshamov–Gilbert or constant-weight codes). LetV T >0be the variation budget. Choose a:= VT C µ J1/q 1/α .(209) ThenGVar p,q(Gu 1:T )≤V T for allu∈ U. Moreover, ifJadditionally satisfies C µ2a2 σ2/(1−β) T≤ c 2 J,(210) then Fa...

work page
[34]

sub-Gaussian × sub-Gaussian ⇒ sub-exponential

Subsequently invoking Lemma E.1, we conclude that there does not exist an admissible policy π∈Π β such that supg∈Gp,q(VT ) Rπ,ϕG T (g)≤ 1 9 inf g,˜g∈Θ PT t=1 χ(gt,˜gt). This gives us a lower bound for the minimax regret. Now we must just enforce the gradient variational budget and the information constraint. To enforce the gradient variational budget, we ...

work page arXiv 2016
[35]

for a constantK, then∥X| G 1∥Ψ2 ≤Ka.s

IfG 1 ⊆ G2 and∥X| G 2∥Ψ2 ≤Ka.s. for a constantK, then∥X| G 1∥Ψ2 ≤Ka.s

work page
[36]

for a constantK, then∥XY| G∥ Ψα ≤K∥X| G∥ Ψα a.s

If|Y| ≤Ka.s. for a constantK, then∥XY| G∥ Ψα ≤K∥X| G∥ Ψα a.s

work page
[37]

IfXis independent ofG, then∥X| G∥ Ψα =∥X∥ Ψα a.s

work page
[38]

IfX⊥Y| G, thenE[XY| G] =E[X| G]E[Y| G]a.s., andE[X|Y,G] =E[X| G]a.s. Proof. (i) By assumption, E[exp(|X/K|2)| G 2]≤2 a.s.; taking G1-conditional expectations and using the tower property yields E[exp(|X/K|2)| G 1]≤2 a.s., hence ∥X| G 1∥Ψ2 ≤K . (ii) If ϕ is admissible for X given G, then Kϕ is admissible for XY given G since |XY|/(Kϕ)≤ |X|/ϕ . Taking essen...

work page
[39]

If supu∈Sd−1,u∈F E exp |u⊤X|2/K2 F | F ≤2a.s , then E[XX⊤ | F]⪯K 2 F Id a.s and henceE[∥X∥ 2 2 | F]≤dK 2 F a.s Proof.We first prove the scalar case

(Vector).Let X∈R d. If supu∈Sd−1,u∈F E exp |u⊤X|2/K2 F | F ≤2a.s , then E[XX⊤ | F]⪯K 2 F Id a.s and henceE[∥X∥ 2 2 | F]≤dK 2 F a.s Proof.We first prove the scalar case. Bye y ≥1 +yfory≥0, 2≥E 1 + |X|2 K2 F F = 1 + 1 K2 F E[|X|2 | F]a.s., soE[|X| 2 | F]≤K 2 F. The vector case follows very similarly. Fix any F-measurable u∈S d−1 and apply the scalar part to...

work page
[40]

Then E[Mτ] =E[M 0]

(Unbounded stopping time via truncation).Assume E[|Mτ∧n |]<∞ for all n and that {Mτ∧n }n≥1 is uniformly integrable. Then E[Mτ] =E[M 0]. More generally, if(Mt) is an integrable supermartingale (resp. submartingale), then ineither(i) or (ii) we have E[Mτ]≤ E[M0](resp.E[M τ]≥E[M 0]). Proof.We first show that the stopped process is a martingale: Step 1: Show ...

work page

[1] [1]

Cevher, V ., Becker, S., and Schmidt, M

doi: 10.23919/ACC.2019.8814889. Cevher, V ., Becker, S., and Schmidt, M. Convex optimiza- tion for big data: Scalable, randomized, and parallel al- gorithms for big data analytics.IEEE Signal Processing Magazine, 31(5):32–43, 2014. doi: 10.1109/MSP.2014. 2329397. Chen*, J., Pan*, X., Monga, R., Bengio, S., and Joze- fowicz, R. Revisiting distributed synch...

work page doi:10.23919/acc.2019.8814889 2019

[2] [2]

, author Zhou, A

URL https://openreview.net/forum? id=HyAddcLge. Chen, J., Wolfe, C., Li, Z., and Kyrillidis, A. Demon: Improved neural network training with momentum decay. InICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3958–3962, 2022. doi: 10.1109/ICASSP43922.2022. 9746839. Chen, X., Wang, Y ., and Wang, Y .-...

work page doi:10.1109/icassp43922.2022 2022

[3] [3]

cc/paper_files/paper/2021/file/ 62e7f2e090fe150ef8deb4466fdc81b3-Paper

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 62e7f2e090fe150ef8deb4466fdc81b3-Paper. pdf. Cutler, J., Drusvyatskiy, D., and Harchaoui, Z. Stochas- tic optimization under distributional drift.Jour- nal of Machine Learning Research, 24(147):1–56,

work page 2021

[4] [4]

Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P

URL http://jmlr.org/papers/v24/ 21-1410.html. Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P. Koala: A Kalman opti- mization algorithm with loss adaptivity.Proceedings of the AAAI Conference on Artificial Intelligence, 36 (6):6471–6479, Jun. 2022. doi: 10.1609/aaai.v36i6. 20599. URL https://ojs.aaai.org/index. php/AAAI/art...

work page doi:10.1609/aaai.v36i6 2022

[5] [5]

Gitman, I., Lang, H., Zhang, P., and Xiao, L.Understanding the role of momentum in stochastic gradient methods

URL https://proceedings.mlr.press/ v97/ghorbani19b.html. Gitman, I., Lang, H., Zhang, P., and Xiao, L.Understanding the role of momentum in stochastic gradient methods. Curran Associates Inc., Red Hook, NY , USA, 2019. Godichon-Baggioni, A., Lu, W., and Portier, B. Online estimation of the inverse of the hessian for stochastic op- timization with applicat...

work page doi:10.1109/tit.1980.1056141 2019

[6] [6]

doi: 10.1561/2400000013

ISSN 2167-3888. doi: 10.1561/2400000013. URL https://doi.org/10.1561/2400000013. Huang, T., Zhu, Z., Jin, G., Liu, L., Wang, Z., and Liu, S. SPAM: Spike-aware Adam with momentum reset for sta- ble LLM training. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=L9eBxTCpQG. Ilboudo, W. E. L.,...

work page doi:10.1561/2400000013 2025

[7] [7]

Jagannath, A., Jones-McCormick, T., and Sarangian, V

doi: 10.1109/TNNLS.2020.3041755. Jagannath, A., Jones-McCormick, T., and Sarangian, V . High-dimensional limit theorems for SGD: Momentum and adaptive step-sizes, 2025. URL https://arxiv. org/abs/2511.03952. arXiv:2511.03952. Jastrz˛ ebski, S., Kenton, Z., Ballas, N., Fischer, A., Bengio, Y ., and Storkey, A. On the relation between the sharpest direction...

work page doi:10.1109/tnnls.2020.3041755 2020

[8] [8]

Jelassi, S

URL https://openreview.net/forum? id=SkgEaj05t7. Jelassi, S. and Li, Y . Towards understanding how momentum improves generalization in deep learning,

work page

[9] [9]

Jentzen, A., Kranz, J., and Riekert, A

URL https://openreview.net/forum? id=lf0W6tcWmh-. Jentzen, A., Kranz, J., and Riekert, A. PADAM: Parallel av- eraged Adam reduces the error for stochastic optimization in scientific machine learning, 2025. URL https:// arxiv.org/abs/2505.22085. arXiv:2505.22085. Kingma, D. P. and Ba, J. Adam: A method for stochas- tic optimization. In Bengio, Y . and LeCu...

work page doi:10.1007/978-1-4899-2696-8_9 2025

[10] [10]

org/CorpusID:15039054

URL https://api.semanticscholar. org/CorpusID:15039054. 10 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Li, X., Luo, J., Zheng, Z., Wang, H., Luo, L., Wen, L., Wu, L., and Xu, S. On the performance analysis of momentum method: A frequency domain perspective. InThe Thirteenth International Conference on Learning Re...

work page 2025

[11] [11]

URL https://proceedings.mlr.press/ v237/liao24a.html. Liu, J., Su, J., Yao, X., Jiang, Z., Lai, G., Du, Y ., Qin, Y ., Xu, W., Lu, E., Yan, J., Chen, Y ., Zheng, H., Liu, Y ., Liu, S., Yin, B., He, W., Zhu, H., Wang, Y ., Wang, J., Dong, M., Zhang, Z., Kang, Y ., Zhang, H., Xu, X., Zhang, Y ., Wu, Y ., Zhou, X., and Yang, Z. Muon is scalable for llm train...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[12] [12]

Luo, L., Xiong, Y ., and Liu, Y

URL https://openreview.net/forum? id=Syxt5oC5YQ. Luo, L., Xiong, Y ., and Liu, Y . Adaptive gradient methods with dynamic bound of learning rate. InIn- ternational Conference on Learning Representations,

work page

[13] [13]

Nesterov, Y

URL https://openreview.net/forum? id=Bkg3g2R9FX. Nesterov, Y . A method for solving the convex programming problem with convergence rateo(1/k2).Dokl Akad Nauk SSSR, 269:543, 1983. URL https://cir.nii.ac. jp/crid/1370862715914709505. Nesterov, Y . Introductory lectures on convex opti- mization - a basic course. InApplied Optimization,

work page arXiv 1983

[14] [14]

The Extended Kalman Filter is a Natural Gradient Descent in Trajectory Space

URL https://api.semanticscholar. org/CorpusID:62288331. Ollivier, Y . The extended Kalman filter is a natural gradi- ent descent in trajectory space, 2019. URL https:// arxiv.org/abs/1901.00696. arXiv:1901.00696. Papyan, V . Traces of class/cross-class structure pervade deep learning spectra.Journal of Machine Learning Research, 21(252):1–64, 2020. URL ht...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1016/0041-5553(63)90382-3 2019

[15] [15]

URL https://arxiv.org/abs/2506. 11904. arXiv:2506.11904. Vuckovic, J. Kalman gradient descent: Adap- tive variance reduction in stochastic optimization,

work page arXiv

[16] [16]

URL https://arxiv.org/abs/1810. 12273. arXiv:1810.12273. Vyas, N., Morwani, D., Zhao, R., Shapira, I., Brandfon- brener, D., Janson, L., and Kakade, S. M. SOAP: Im- proving and stabilizing shampoo using Adam for lan- guage modeling. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=IDxZhXrp...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

URL https: //doi.org/10.1155/2017/9478952

doi: 10.1155/2017/9478952. URL https: //doi.org/10.1155/2017/9478952. Article ID 9478952, 9 pages. Yang, X. Towards stochastic gradient variance reduction by solving a filtering problem, 2023. URL https:// openreview.net/forum?id=0sxmoci9Ma. Yao, Z., Gholami, A., Shen, S., Mustafa, M., Keutzer, K., and Mahoney, M. ADAHESSIAN: An adaptive second order opti...

work page doi:10.1155/2017/9478952 2017

[18] [18]

ADADELTA: An Adaptive Learning Rate Method

URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3-Paper. pdf. Zeiler, M. D. Adadelta: An adaptive learning rate method, 2012. URL https://arxiv.org/abs/ 1212.5701. arXiv:1212.5701. Zhang, J. and Mitliagkas, I. Yellowfin and the art of momentum tuning. In Talwalkar, A., Smith, V ., and Zaharia, M. (eds.),Proc...

work page internal anchor Pith review Pith/arXiv arXiv 2018

[19] [19]

org/paper_files/paper/2019/file/ b205b525b7ce002baae53228bab6d26b-Paper

URL https://proceedings.mlsys. org/paper_files/paper/2019/file/ b205b525b7ce002baae53228bab6d26b-Paper. pdf. Zhang, T. Solving large scale linear prediction prob- lems using stochastic gradient descent algorithms. In 12 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Proceedings of the Twenty-First International Con-...

work page doi:10.1016/j.future.2019.09 2019

[20] [20]

Zhao, P., Zhang, Y .-J., Zhang, L., and Zhou, Z.-H

URL https://www.sciencedirect.com/ science/article/pii/S0167739X18332540. Zhao, P., Zhang, Y .-J., Zhang, L., and Zhou, Z.-H. Adaptiv- ity and non-stationarity: Problem-dependent dynamic re- gret for online convex optimization.Journal of Machine Learning Research, 25(98):1–52, 2024. URL http: //jmlr.org/papers/v25/21-0748.html. Zhou, K., Jin, Y ., Ding, Q...

work page 2024

[21] [21]

time- resolved

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ eddea82ad2755b24c4e168c5fc2ebd40-Paper. pdf. Ziyin, L., Wang, Z. T., and Ueda, M. Laprop: Separating mo- mentum and adaptivity in Adam, 2021. URL https:// arxiv.org/abs/2002.04839. arXiv:2002.04839. 13 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization App...

work page arXiv 2021

[22] [22]

Then ξt+1(θ) is Ft+1–measurable and satisfies E[ξt+1(θ)| Ft] =0a.s

(Martingale difference noise)Define the conditional mean gradient mt+1(θ) =E ∇θg(θ, Xt+1)| F t , and the gradient noise ξt+1(θ) =∇ θg(θ, Xt+1)−m t+1(θ). Then ξt+1(θ) is Ft+1–measurable and satisfies E[ξt+1(θ)| Ft] =0a.s. for allθ∈R d. Under this framework, SGD admits the decomposition θt+1 =θ t −γ tmt+1(θt)−γ tξt+1(θt).(SGD) For momentum, we use a two-par...

work page

[23] [23]

(Gradient noise along iterates)The gradient noise ξt+1 satisfies E ∥ξt+1(θt)∥2 ≤σ 2 and E ∥ξt+1(ψt)∥2 ≤ 18 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization σ2 a.s. C.1. Proof for SGD tracking error bound First we will prove a recursive relation for the tracking error that we will subsequently use for our expectation a...

work page 2021

[24] [24]

eθt eθt−1 # = 1 1−β

(Step-decay schedule in the low drift-to-noise regime).Suppose γ⋆ <1/2L (equivalently, the minimizer of E(γ) is not at the smoothness cap), so that γ⋆ = 8∆2 µσ2 1/3 ,E= 3 ∆σ2 µ2 2/3 . Define epochsk= 0,1, . . . , K−1with γ0 := 1 2L , γ k := γk−1 +γ ⋆ 2 (k≥1), K:= 1 + l log2 γ0 γ⋆ m , and epoch lengths T0 := l 2 µγ0 log 2∥θ0 −θ ⋆ 0∥2 E(γ0) m , T k := l2 lo...

work page 2016

[25] [25]

mode-splitting

(Step-decay schedule with momentum restart).Suppose γ⋆ β < µ(1−β) 2/(4L2) (i.e., the minimizer of Eβ(γ) is not at the stability cap). Define the epoch stepsizes γ0 := µ(1−β) 2 4L2 , γ k := γk−1 +γ ⋆ β 2 (k≥1), K:= 1 + l log2 γ0 γ⋆ β m . 31 On the Provable Suboptimality of Momentum SGD in Nonstationary Stochastic Optimization Define epoch lengths T0 := l1−...

work page 2012

[26] [26]

(Minimizers.) Eachg u admits a unique minimizerθ ⋆ u, andθ ⋆ u =uae 1

work page

[27] [27]

Proof of Lemma E.4

(Discrepancy.) The two-point discrepancy satisfiesχ(g +, g−)≥µa 2/8. Proof of Lemma E.4. On the Euclidean ball {θ:∥θ∥ 2 ≤r/2} , we have ψr(θ) = 1 and ∇ψr(θ) = 0 by construction of the bump. Therefore, for all ∥θ∥2 ≤r/2 , ∇gu(θ) =µθ−uµae 1. Setting ∇gu(θ) = 0 yields θ=uae 1, and this point indeed lies in the region where the above simplification holds sinc...

work page

[28] [28]

The Markov chain (St)t≥0 admits a unique stationary distribution π∞, which is a centered Gaussian N(0,Σ ∞) on R2

work page

[29] [29]

The stationary covarianceΣ ∞ is the unique positive semidefinite solution of the discrete-time Lyapunov equation Σ =AΣA ⊤ +σ 2BB ⊤.(172)

work page

[30] [30]

WritingΣ ∞ = ( v c c v )withv= Var π∞(xt)andc= Cov π∞(xt, xt−1), we have the exact closed form v= (1 +β)γ 2σ2 (1−β) (1 +β) 2 −(1 +β−γµ) 2 = (1 +β)γσ 2 (1−β)µ 2(1 +β)−γµ .(173)

work page

[31] [31]

Moreover, if additionallyγµ≤1 +β(a common small-step regime), then v≥ γσ2 4µ(1−β) .(174) Proof of Lemma E.7

In the stable regime one necessarily has γµ <2(1 +β) , so the denominator in (173) is positive. Moreover, if additionallyγµ≤1 +β(a common small-step regime), then v≥ γσ2 4µ(1−β) .(174) Proof of Lemma E.7. We proceed in four steps: (1) existence/uniqueness of a stationary law, (2) characterization via a Lyapunov equation, (3) explicit solution of the(1,1)e...

work page 2019

[32] [32]

Gaussian-gradient KL bound:from Lemma E.12, we have supu,v∈U DKL(P π u ∥P π v )≤Cµ 2a2T /(σ2/(1−β)),∀π

work page

[33] [33]

LetV T >0be the variation budget

Packing size lower bound:from Lemma E.9, we have the packing size lower bound log|U | ≥c J for a universal constantc >0(e.g., from Varshamov–Gilbert or constant-weight codes). LetV T >0be the variation budget. Choose a:= VT C µ J1/q 1/α .(209) ThenGVar p,q(Gu 1:T )≤V T for allu∈ U. Moreover, ifJadditionally satisfies C µ2a2 σ2/(1−β) T≤ c 2 J,(210) then Fa...

work page

[34] [34]

sub-Gaussian × sub-Gaussian ⇒ sub-exponential

Subsequently invoking Lemma E.1, we conclude that there does not exist an admissible policy π∈Π β such that supg∈Gp,q(VT ) Rπ,ϕG T (g)≤ 1 9 inf g,˜g∈Θ PT t=1 χ(gt,˜gt). This gives us a lower bound for the minimax regret. Now we must just enforce the gradient variational budget and the information constraint. To enforce the gradient variational budget, we ...

work page arXiv 2016

[35] [35]

for a constantK, then∥X| G 1∥Ψ2 ≤Ka.s

IfG 1 ⊆ G2 and∥X| G 2∥Ψ2 ≤Ka.s. for a constantK, then∥X| G 1∥Ψ2 ≤Ka.s

work page

[36] [36]

for a constantK, then∥XY| G∥ Ψα ≤K∥X| G∥ Ψα a.s

If|Y| ≤Ka.s. for a constantK, then∥XY| G∥ Ψα ≤K∥X| G∥ Ψα a.s

work page

[37] [37]

IfXis independent ofG, then∥X| G∥ Ψα =∥X∥ Ψα a.s

work page

[38] [38]

IfX⊥Y| G, thenE[XY| G] =E[X| G]E[Y| G]a.s., andE[X|Y,G] =E[X| G]a.s. Proof. (i) By assumption, E[exp(|X/K|2)| G 2]≤2 a.s.; taking G1-conditional expectations and using the tower property yields E[exp(|X/K|2)| G 1]≤2 a.s., hence ∥X| G 1∥Ψ2 ≤K . (ii) If ϕ is admissible for X given G, then Kϕ is admissible for XY given G since |XY|/(Kϕ)≤ |X|/ϕ . Taking essen...

work page

[39] [39]

If supu∈Sd−1,u∈F E exp |u⊤X|2/K2 F | F ≤2a.s , then E[XX⊤ | F]⪯K 2 F Id a.s and henceE[∥X∥ 2 2 | F]≤dK 2 F a.s Proof.We first prove the scalar case

(Vector).Let X∈R d. If supu∈Sd−1,u∈F E exp |u⊤X|2/K2 F | F ≤2a.s , then E[XX⊤ | F]⪯K 2 F Id a.s and henceE[∥X∥ 2 2 | F]≤dK 2 F a.s Proof.We first prove the scalar case. Bye y ≥1 +yfory≥0, 2≥E 1 + |X|2 K2 F F = 1 + 1 K2 F E[|X|2 | F]a.s., soE[|X| 2 | F]≤K 2 F. The vector case follows very similarly. Fix any F-measurable u∈S d−1 and apply the scalar part to...

work page

[40] [40]

Then E[Mτ] =E[M 0]

(Unbounded stopping time via truncation).Assume E[|Mτ∧n |]<∞ for all n and that {Mτ∧n }n≥1 is uniformly integrable. Then E[Mτ] =E[M 0]. More generally, if(Mt) is an integrable supermartingale (resp. submartingale), then ineither(i) or (ii) we have E[Mτ]≤ E[M0](resp.E[M τ]≥E[M 0]). Proof.We first show that the stopped process is a martingale: Step 1: Show ...

work page