Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Zijian Liu

arxiv: 2605.18694 · v2 · pith:EXA7RSNMnew · submitted 2026-05-18 · 🧮 math.OC · cs.LG· stat.ML

Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Zijian Liu This is my paper

Pith reviewed 2026-06-30 18:13 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML

keywords AdaGradheavy-tailed noisenon-convex optimizationconvergence rateadaptive gradient methodstail indexstochastic optimization

0 comments

The pith

AdaGrad converges at a provable rate in non-convex settings under heavy-tailed gradient noise with tail index above 4/3, without knowing the index in advance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that AdaGrad reaches a convergence guarantee for non-convex problems when gradient noise follows a power-law tail with index p satisfying 4/3 < p ≤ 2. This holds without any knowledge of p and without adding clipping or normalization steps. The result is presented as the first such rate for this adaptive method under the stated noise condition. A matching lower bound indicates that AdaGrad cannot attain the known minimax rate for heavy-tailed optimization. An improved rate is also derived for the AdaGrad-Norm variant that extends down to any p > 1 under one extra assumption.

Core claim

AdaGrad attains the first provable convergence rate for non-convex optimization under heavy-tailed noise when the tail index lies in (4/3, 2], and the rate is obtained without any prior knowledge of the index. The same analysis yields an algorithm-dependent lower bound showing that AdaGrad cannot match the existing minimax rate. For the variant AdaGrad-Norm the paper obtains a stronger rate that holds for every tail index in (1, 2] once an additional mild assumption is imposed.

What carries the argument

AdaGrad's per-coordinate accumulation of squared past gradients, which produces adaptive step sizes that respond to the observed noise magnitude.

If this is right

AdaGrad converges on non-convex losses under heavy-tailed noise without gradient clipping when p exceeds 4/3.
The convergence rate adapts automatically to the unknown tail index.
AdaGrad-Norm reaches a faster rate that covers every tail index down to p > 1 under one extra assumption.
The minimax optimal rate for heavy-tailed optimization is not achievable by AdaGrad.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar adaptive mechanisms in Adam may also tolerate heavy tails without clipping once the analysis is extended.
The gap between AdaGrad's rate and the minimax rate suggests that some form of explicit noise control remains necessary to reach optimality.
The result supplies a concrete benchmark that later adaptive methods can be tested against under the same noise model.

Load-bearing premise

The gradient noise must have a tail index strictly larger than 4/3.

What would settle it

A numerical run of plain AdaGrad on a non-convex problem whose stochastic gradients are drawn from a stable distribution with tail index 1.3, showing either divergence or failure to reach the claimed rate.

read the original abstract

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives the first convergence rate for plain AdaGrad under heavy-tailed noise in non-convex smooth problems when the unknown tail index p sits in (4/3,2], plus a matching lower bound showing it misses the minimax rate.

read the letter

The core result is that standard AdaGrad converges at a rate that adapts to the unknown tail index p without any clipping or tuning, at least when 4/3 < p ≤ 2. They also prove an algorithm-specific lower bound that shows AdaGrad cannot match the best possible rate under the same noise model. For the normalized variant they recover a rate down to p > 1 under one extra assumption.

This is the first explicit guarantee of that form for unmodified AdaGrad. The adaptivity claim is the main technical contribution, and the lower bound is a useful negative result that clarifies the gap to minimax optimality. Both pieces rest on standard non-convex smoothness plus finite p-moment noise, which matches the setting the abstract advertises.

The range p > 4/3 looks like a technical limit of the current proof rather than a fundamental barrier; tightening it would strengthen the paper. The extra assumption needed for AdaGrad-Norm is mild but still narrows the result. No circularity or hidden parameter tuning appears in the stated claims.

The work is aimed at researchers who care about convergence theory for adaptive methods under realistic noise. It is solid enough on its own terms to merit a serious referee, even if the rates are not yet optimal. I would send it out for review.

Referee Report

0 major / 3 minor

Summary. The paper claims to establish the first provable convergence guarantee for standard AdaGrad in non-convex smooth optimization under gradient noise whose p-th moment is finite for unknown p in (4/3,2], without any p-dependent tuning or clipping. It supplies both an upper bound on the convergence rate and an algorithm-dependent lower bound showing that AdaGrad cannot attain the known minimax rate for heavy-tailed optimization. It further shows an improved rate for the AdaGrad-Norm variant that holds for any 1 < p ≤ 2 under one additional mild assumption.

Significance. If the central claims hold, the work is significant because it supplies the first rigorous evidence that a popular adaptive method can converge under heavy-tailed noise without algorithmic modifications, and does so adaptively to the unknown tail index. The algorithm-dependent lower bound is a valuable contribution that clarifies the limitations of AdaGrad relative to the minimax rate. These results are directly relevant to modern ML practice where heavy-tailed gradient noise is frequently observed.

minor comments (3)

[Abstract] The abstract states the result is 'adaptive to the tail index' but does not display the explicit dependence of the rate on p; adding the precise rate expression (even in big-O notation) would improve readability.
[§2] Notation for the noise tail index p and the smoothness constant L is introduced without an early consolidated table of assumptions; a short 'Assumptions' paragraph or table in §2 would help readers track the precise conditions under which the rates hold.
[Lower bound section] The lower-bound construction in the section on algorithm-dependent lower bounds is described at a high level; a brief remark on whether the hard instance satisfies the same L-smoothness assumption used in the upper bound would remove any potential ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the accurate summary of our contributions, and the recommendation for minor revision. We are pleased that the significance of the first provable convergence guarantees for AdaGrad under heavy-tailed noise (without knowledge of p) is recognized, along with the algorithm-dependent lower bound and the improved rate for AdaGrad-Norm.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central contribution is a new convergence analysis for standard AdaGrad (and a variant) under non-convex L-smoothness plus gradient noise with finite but unknown p-th moment for p in (4/3,2]. The abstract and skeptic summary present this as an independent proof that supplies both upper and lower bounds without requiring knowledge of p, with no equations, fitted parameters, or self-citations described as load-bearing. No self-definitional steps, renamed known results, or ansatzes smuggled via prior author work are visible in the provided material; the derivation is therefore treated as self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the main domain assumption is the heavy-tailed noise model with tail index p. No free parameters or invented entities are mentioned.

axioms (1)

domain assumption Gradient noise has tail index p satisfying 4/3 < p ≤ 2
This defines the problem setting for which the convergence rate is claimed.

pith-pipeline@v0.9.1-grok · 5800 in / 1109 out tokens · 28747 ms · 2026-06-30T18:13:09.194398+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise
math.OC 2026-05 unverdicted novelty 6.0

New in-expectation convergence guarantees for SMD, ASMD (convex) and SGD, SGDM (nonconvex) under heavy-tailed noise without bounded-domain restrictions or algorithmic modifications.
Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?
cs.LG 2026-06 unverdicted novelty 5.0

The paper poses whether AdamW converges under heavy-tailed stochastic gradient noise and supplies a weighted-metric benchmark plus a corridor lower-bound showing how denominator memory can obscure large gradients.

Reference graph

Works this paper leans on

25 extracted references · 5 canonical work pages · cited by 2 Pith papers · 2 internal anchors

[1]

Arjevani, Y ., Carmon, Y ., Duchi, J

URL https://openreview.net/forum? id=0uI5415ry7. Arjevani, Y ., Carmon, Y ., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199 (1-2):165–214, 2023. Attia, A. and Koren, T. SGD with AdaGrad step- sizes: Full adaptivity with high probability to un- known parameters, u...

2023
[2]

Battash, B., Wolf, L., and Lindenbaum, O

URL https://proceedings.mlr.press/ v202/attia23a.html. Battash, B., Wolf, L., and Lindenbaum, O. Revisiting the noise model of stochastic gradient descent. In Das- gupta, S., Mandt, S., and Li, Y . (eds.),Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pp. 4...
[3]

Curtis, and Jorge Nocedal

URL https://proceedings.mlr.press/ v238/battash24a.html. Bernstein, J., Wang, Y .-X., Azizzadenesheli, K., and Anand- kumar, A. signSGD: Compressed optimisation for non- convex problems. In Dy, J. and Krause, A. (eds.),Pro- ceedings of the 35th International Conference on Ma- chine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 560–56...

work page doi:10.1137/16m1080173 2018
[4]

Crawshaw, M

URL https://proceedings.mlr.press/ v267/chezhegov25a.html. Crawshaw, M. and Liu, M. Complexity lower bounds of adaptive gradient algorithms for non-convex stochastic optimization under relaxed smoothness. InThe Thirteenth International Conference on Learning Representations,
[5]

Cutkosky, A

URL https://openreview.net/forum? id=ZjOXuAfS6l. Cutkosky, A. and Mehta, H. Momentum improves nor- malized SGD. In III, H. D. and Singh, A. (eds.),Pro- ceedings of the 37th International Conference on Ma- chine Learning, volume 119 ofProceedings of Machine Learning Research, pp. 2260–2268. PMLR, 13–18 Jul
[6]

Cutkosky, A

URL https://proceedings.mlr.press/ v119/cutkosky20b.html. Cutkosky, A. and Mehta, H. High-probability bounds for non-convex stochastic optimization with heavy tails. In Ranzato, M., Beygelzimer, A., Dauphin, Y ., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, vol- ume 34, pp. 4883–4895. Curran Associates, Inc.,
[7]

Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 26901debb30ea03f0aa833c9de6b81e9- Paper.pdf. De, S., Mukherjee, A., and Ullah, E. Convergence guaran- tees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018. Défossez, A., Bottou, L., Bach, F., and Usun...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[8]

9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y

URL https://openreview.net/forum? id=ZPQhzTSWA7. 9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y . Adaptive subgra- dient methods for online learning and stochastic opti- mization.Journal of Machine Learning Research, 12 (61):2121–2159, 2011. URL http://jmlr.org/ papers/v12/duc...

2011
[9]

Faw, M., Rout, L., Caramanis, C., and Shakkottai, S

URL https://proceedings.mlr.press/ v178/faw22a.html. Faw, M., Rout, L., Caramanis, C., and Shakkottai, S. Beyond uniform smoothness: A stopped analy- sis of adaptive sgd. In Neu, G. and Rosasco, L. (eds.),Proceedings of Thirty Sixth Conference on Learn- ing Theory, volume 195 ofProceedings of Machine Learning Research, pp. 89–160. PMLR, 12–15 Jul
[10]

Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P

URL https://proceedings.mlr.press/ v195/faw23a.html. Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P. On proximal policy optimization’s heavy- tailed gradients. In Meila, M. and Zhang, T. (eds.),Pro- ceedings of the 38th International Conference on Ma- chine Learning, volume 1...
[11]

URL https://proceedings.mlr.press/ v139/garg21b.html. Hong, Y . and Lin, J. On convergence of adam for stochastic optimization under relaxed assumptions. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Sys- tems, volume 37, pp. 10827–10877. Curran Asso- ciates, I...

work page doi:10.52202/079017-0346 2024
[12]

Jiang, R., Maladkar, D., and Mokhtari, A

URL https://proceedings.mlr.press/ v258/hubler25a.html. Jiang, R., Maladkar, D., and Mokhtari, A. Provable complexity improvement of adagrad over sgd: Upper and lower bounds in stochastic non-convex optimiza- tion. In Haghtalab, N. and Moitra, A. (eds.),Pro- ceedings of Thirty Eighth Conference on Learning The- ory, volume 291 ofProceedings of Machine Lea...
[13]

Adam: A Method for Stochastic Optimization

URL https://proceedings.mlr.press/ v291/jiang25c.html. Kavis, A., Levy, K. Y ., and Cevher, V . High probability bounds for a class of nonconvex algorithms with ada- grad stepsize. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=dSw0QtRMJkO. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimizat...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf. Li, H., Dong, Y ., and Lin, Z. On the o( √ d/t1/4) con- vergence rate of rmsprop and its momentum extension measured by ℓ1 norm.Journal of Machine Learning Research, 26(131):1–25, 2025. URL http://jmlr. org/papers/v26/24-0523.html. Li, X. and Orab...

2021
[15]

URL https://proceedings.mlr.press/ v313/liu26a.html. Liu, Z. and Zhou, Z. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=NKotdPUc3L. Liu, Z., Nguyen, T. D., Ene, A., and Nguyen, H....

2025
[16]

cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf. Reddi, S. J., Kale, S., and Kumar, S. On the conver- gence of adam and beyond. InInternational Confer- ence on Learning Representations, 2018. URL https: //openreview.net/forum?id=ryQu7f-RZ. Sadiev, A., Danilova, M., Gorbunov, E., Horvá...

2023
[17]

Shazeer, N

URL https://proceedings.mlr.press/ v202/sadiev23a.html. Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Dy, J. and Krause, A. (eds.),Proceedings of the 35th International Confer- ence on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 4596–4604. PMLR, 10–15 Jul 2018. URLhttps://procee...

2018
[18]

Sun, T., Liu, X., and Yuan, K

URL https://proceedings.mlr.press/ v97/simsekli19a.html. Sun, T., Liu, X., and Yuan, K. Revisiting gradient nor- malization and clipping for nonconvex sgd under heavy- tailed noise: Necessity, sufficiency, and acceleration. Journal of Machine Learning Research, 26(237):1–42,
[19]

Tieleman, T., Hinton, G., et al

URL http://jmlr.org/papers/v26/24- 1991.html. Tieleman, T., Hinton, G., et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26–31, 2012. 11 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Vaswani, A., Shazeer, N., Parmar...

1991
[20]

cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf. Wang, B., Fu, J., Zhang, H., Zheng, N., and Chen, W. Closing the gap between the upper bound and lower bound of adam's iteration complexity. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.),Advances in Neural Inf...

work page doi:10.1145/3637528.3671718 2017
[21]

cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf. Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex opti- mization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.),Advances in Neural Information P...

2023
[22]

cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf. Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gra- dient clipping accelerates training: A theoretical jus- tification for adaptivity. InInternational Conference on Learning Representations, 2020a. URL https: //openreview.net/forum?id=BJgnXpVYw...

2018
[23]

Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q

URL https://openreview.net/forum? id=QIzRdjIWnS. Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam can converge without any modification on update rules. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.),Ad- vances in Neural Information Processing Systems, volume 35, pp. 28386–28399. Curran Associates, Inc.,
[24]

TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤∆ +γ dX i=1 σ2 i ci TX t=1 E

URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ b6260ae5566442da053e5ab5d691067a- Paper-Conference.pdf. Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 12 Can Adaptive Grad...

2022
[25]

s uT,i λ+ p vT,i +u T,i +c 2 i × λ+ q vT,i +u T,i +c 2 i # (c) ≤ vuutE

Divide both sides of the above inequality by γ 2 to obtain E " TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤ 2∆ γ + 2γ∥L∥ 1 lnK T + 8∥σ∥ 1 T 1 p − 1 2 ln 1 ¯p+ 1 2 KT .(9) 13 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Next, recall thatu t,i =Pt s=1(∇if(x s))2,∀t∈N(introduced in Lemma A.4), we hence have E √uT,i =E "s uT,i λ+...

2019

[1] [1]

Arjevani, Y ., Carmon, Y ., Duchi, J

URL https://openreview.net/forum? id=0uI5415ry7. Arjevani, Y ., Carmon, Y ., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199 (1-2):165–214, 2023. Attia, A. and Koren, T. SGD with AdaGrad step- sizes: Full adaptivity with high probability to un- known parameters, u...

2023

[2] [2]

Battash, B., Wolf, L., and Lindenbaum, O

URL https://proceedings.mlr.press/ v202/attia23a.html. Battash, B., Wolf, L., and Lindenbaum, O. Revisiting the noise model of stochastic gradient descent. In Das- gupta, S., Mandt, S., and Li, Y . (eds.),Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pp. 4...

[3] [3]

Curtis, and Jorge Nocedal

URL https://proceedings.mlr.press/ v238/battash24a.html. Bernstein, J., Wang, Y .-X., Azizzadenesheli, K., and Anand- kumar, A. signSGD: Compressed optimisation for non- convex problems. In Dy, J. and Krause, A. (eds.),Pro- ceedings of the 35th International Conference on Ma- chine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 560–56...

work page doi:10.1137/16m1080173 2018

[4] [4]

Crawshaw, M

URL https://proceedings.mlr.press/ v267/chezhegov25a.html. Crawshaw, M. and Liu, M. Complexity lower bounds of adaptive gradient algorithms for non-convex stochastic optimization under relaxed smoothness. InThe Thirteenth International Conference on Learning Representations,

[5] [5]

Cutkosky, A

URL https://openreview.net/forum? id=ZjOXuAfS6l. Cutkosky, A. and Mehta, H. Momentum improves nor- malized SGD. In III, H. D. and Singh, A. (eds.),Pro- ceedings of the 37th International Conference on Ma- chine Learning, volume 119 ofProceedings of Machine Learning Research, pp. 2260–2268. PMLR, 13–18 Jul

[6] [6]

Cutkosky, A

URL https://proceedings.mlr.press/ v119/cutkosky20b.html. Cutkosky, A. and Mehta, H. High-probability bounds for non-convex stochastic optimization with heavy tails. In Ranzato, M., Beygelzimer, A., Dauphin, Y ., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, vol- ume 34, pp. 4883–4895. Curran Associates, Inc.,

[7] [7]

Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 26901debb30ea03f0aa833c9de6b81e9- Paper.pdf. De, S., Mukherjee, A., and Ullah, E. Convergence guaran- tees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018. Défossez, A., Bottou, L., Bach, F., and Usun...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[8] [8]

9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y

URL https://openreview.net/forum? id=ZPQhzTSWA7. 9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y . Adaptive subgra- dient methods for online learning and stochastic opti- mization.Journal of Machine Learning Research, 12 (61):2121–2159, 2011. URL http://jmlr.org/ papers/v12/duc...

2011

[9] [9]

Faw, M., Rout, L., Caramanis, C., and Shakkottai, S

URL https://proceedings.mlr.press/ v178/faw22a.html. Faw, M., Rout, L., Caramanis, C., and Shakkottai, S. Beyond uniform smoothness: A stopped analy- sis of adaptive sgd. In Neu, G. and Rosasco, L. (eds.),Proceedings of Thirty Sixth Conference on Learn- ing Theory, volume 195 ofProceedings of Machine Learning Research, pp. 89–160. PMLR, 12–15 Jul

[10] [10]

Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P

URL https://proceedings.mlr.press/ v195/faw23a.html. Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P. On proximal policy optimization’s heavy- tailed gradients. In Meila, M. and Zhang, T. (eds.),Pro- ceedings of the 38th International Conference on Ma- chine Learning, volume 1...

[11] [11]

URL https://proceedings.mlr.press/ v139/garg21b.html. Hong, Y . and Lin, J. On convergence of adam for stochastic optimization under relaxed assumptions. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Sys- tems, volume 37, pp. 10827–10877. Curran Asso- ciates, I...

work page doi:10.52202/079017-0346 2024

[12] [12]

Jiang, R., Maladkar, D., and Mokhtari, A

URL https://proceedings.mlr.press/ v258/hubler25a.html. Jiang, R., Maladkar, D., and Mokhtari, A. Provable complexity improvement of adagrad over sgd: Upper and lower bounds in stochastic non-convex optimiza- tion. In Haghtalab, N. and Moitra, A. (eds.),Pro- ceedings of Thirty Eighth Conference on Learning The- ory, volume 291 ofProceedings of Machine Lea...

[13] [13]

Adam: A Method for Stochastic Optimization

URL https://proceedings.mlr.press/ v291/jiang25c.html. Kavis, A., Levy, K. Y ., and Cevher, V . High probability bounds for a class of nonconvex algorithms with ada- grad stepsize. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=dSw0QtRMJkO. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimizat...

work page internal anchor Pith review Pith/arXiv arXiv 2022

[14] [14]

cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf. Li, H., Dong, Y ., and Lin, Z. On the o( √ d/t1/4) con- vergence rate of rmsprop and its momentum extension measured by ℓ1 norm.Journal of Machine Learning Research, 26(131):1–25, 2025. URL http://jmlr. org/papers/v26/24-0523.html. Li, X. and Orab...

2021

[15] [15]

URL https://proceedings.mlr.press/ v313/liu26a.html. Liu, Z. and Zhou, Z. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=NKotdPUc3L. Liu, Z., Nguyen, T. D., Ene, A., and Nguyen, H....

2025

[16] [16]

cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf. Reddi, S. J., Kale, S., and Kumar, S. On the conver- gence of adam and beyond. InInternational Confer- ence on Learning Representations, 2018. URL https: //openreview.net/forum?id=ryQu7f-RZ. Sadiev, A., Danilova, M., Gorbunov, E., Horvá...

2023

[17] [17]

Shazeer, N

URL https://proceedings.mlr.press/ v202/sadiev23a.html. Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Dy, J. and Krause, A. (eds.),Proceedings of the 35th International Confer- ence on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 4596–4604. PMLR, 10–15 Jul 2018. URLhttps://procee...

2018

[18] [18]

Sun, T., Liu, X., and Yuan, K

URL https://proceedings.mlr.press/ v97/simsekli19a.html. Sun, T., Liu, X., and Yuan, K. Revisiting gradient nor- malization and clipping for nonconvex sgd under heavy- tailed noise: Necessity, sufficiency, and acceleration. Journal of Machine Learning Research, 26(237):1–42,

[19] [19]

Tieleman, T., Hinton, G., et al

URL http://jmlr.org/papers/v26/24- 1991.html. Tieleman, T., Hinton, G., et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26–31, 2012. 11 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Vaswani, A., Shazeer, N., Parmar...

1991

[20] [20]

cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf. Wang, B., Fu, J., Zhang, H., Zheng, N., and Chen, W. Closing the gap between the upper bound and lower bound of adam's iteration complexity. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.),Advances in Neural Inf...

work page doi:10.1145/3637528.3671718 2017

[21] [21]

cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf. Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex opti- mization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.),Advances in Neural Information P...

2023

[22] [22]

cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf

URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf. Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gra- dient clipping accelerates training: A theoretical jus- tification for adaptivity. InInternational Conference on Learning Representations, 2020a. URL https: //openreview.net/forum?id=BJgnXpVYw...

2018

[23] [23]

Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q

URL https://openreview.net/forum? id=QIzRdjIWnS. Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam can converge without any modification on update rules. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.),Ad- vances in Neural Information Processing Systems, volume 35, pp. 28386–28399. Curran Associates, Inc.,

[24] [24]

TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤∆ +γ dX i=1 σ2 i ci TX t=1 E

URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ b6260ae5566442da053e5ab5d691067a- Paper-Conference.pdf. Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 12 Can Adaptive Grad...

2022

[25] [25]

s uT,i λ+ p vT,i +u T,i +c 2 i × λ+ q vT,i +u T,i +c 2 i # (c) ≤ vuutE

Divide both sides of the above inequality by γ 2 to obtain E " TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤ 2∆ γ + 2γ∥L∥ 1 lnK T + 8∥σ∥ 1 T 1 p − 1 2 ln 1 ¯p+ 1 2 KT .(9) 13 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Next, recall thatu t,i =Pt s=1(∇if(x s))2,∀t∈N(introduced in Lemma A.4), we hence have E √uT,i =E "s uT,i λ+...

2019