Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad
Pith reviewed 2026-06-30 18:13 UTC · model grok-4.3
The pith
AdaGrad converges at a provable rate in non-convex settings under heavy-tailed gradient noise with tail index above 4/3, without knowing the index in advance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AdaGrad attains the first provable convergence rate for non-convex optimization under heavy-tailed noise when the tail index lies in (4/3, 2], and the rate is obtained without any prior knowledge of the index. The same analysis yields an algorithm-dependent lower bound showing that AdaGrad cannot match the existing minimax rate. For the variant AdaGrad-Norm the paper obtains a stronger rate that holds for every tail index in (1, 2] once an additional mild assumption is imposed.
What carries the argument
AdaGrad's per-coordinate accumulation of squared past gradients, which produces adaptive step sizes that respond to the observed noise magnitude.
If this is right
- AdaGrad converges on non-convex losses under heavy-tailed noise without gradient clipping when p exceeds 4/3.
- The convergence rate adapts automatically to the unknown tail index.
- AdaGrad-Norm reaches a faster rate that covers every tail index down to p > 1 under one extra assumption.
- The minimax optimal rate for heavy-tailed optimization is not achievable by AdaGrad.
Where Pith is reading between the lines
- Similar adaptive mechanisms in Adam may also tolerate heavy tails without clipping once the analysis is extended.
- The gap between AdaGrad's rate and the minimax rate suggests that some form of explicit noise control remains necessary to reach optimality.
- The result supplies a concrete benchmark that later adaptive methods can be tested against under the same noise model.
Load-bearing premise
The gradient noise must have a tail index strictly larger than 4/3.
What would settle it
A numerical run of plain AdaGrad on a non-convex problem whose stochastic gradients are drawn from a stable distribution with tail index 1.3, showing either divergence or failure to reach the claimed rate.
read the original abstract
Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish the first provable convergence guarantee for standard AdaGrad in non-convex smooth optimization under gradient noise whose p-th moment is finite for unknown p in (4/3,2], without any p-dependent tuning or clipping. It supplies both an upper bound on the convergence rate and an algorithm-dependent lower bound showing that AdaGrad cannot attain the known minimax rate for heavy-tailed optimization. It further shows an improved rate for the AdaGrad-Norm variant that holds for any 1 < p ≤ 2 under one additional mild assumption.
Significance. If the central claims hold, the work is significant because it supplies the first rigorous evidence that a popular adaptive method can converge under heavy-tailed noise without algorithmic modifications, and does so adaptively to the unknown tail index. The algorithm-dependent lower bound is a valuable contribution that clarifies the limitations of AdaGrad relative to the minimax rate. These results are directly relevant to modern ML practice where heavy-tailed gradient noise is frequently observed.
minor comments (3)
- [Abstract] The abstract states the result is 'adaptive to the tail index' but does not display the explicit dependence of the rate on p; adding the precise rate expression (even in big-O notation) would improve readability.
- [§2] Notation for the noise tail index p and the smoothness constant L is introduced without an early consolidated table of assumptions; a short 'Assumptions' paragraph or table in §2 would help readers track the precise conditions under which the rates hold.
- [Lower bound section] The lower-bound construction in the section on algorithm-dependent lower bounds is described at a high level; a brief remark on whether the hard instance satisfies the same L-smoothness assumption used in the upper bound would remove any potential ambiguity.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our work, the accurate summary of our contributions, and the recommendation for minor revision. We are pleased that the significance of the first provable convergence guarantees for AdaGrad under heavy-tailed noise (without knowledge of p) is recognized, along with the algorithm-dependent lower bound and the improved rate for AdaGrad-Norm.
Circularity Check
No significant circularity identified
full rationale
The paper's central contribution is a new convergence analysis for standard AdaGrad (and a variant) under non-convex L-smoothness plus gradient noise with finite but unknown p-th moment for p in (4/3,2]. The abstract and skeptic summary present this as an independent proof that supplies both upper and lower bounds without requiring knowledge of p, with no equations, fitted parameters, or self-citations described as load-bearing. No self-definitional steps, renamed known results, or ansatzes smuggled via prior author work are visible in the provided material; the derivation is therefore treated as self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Gradient noise has tail index p satisfying 4/3 < p ≤ 2
Forward citations
Cited by 2 Pith papers
-
In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise
New in-expectation convergence guarantees for SMD, ASMD (convex) and SGD, SGDM (nonconvex) under heavy-tailed noise without bounded-domain restrictions or algorithmic modifications.
-
Open Problem: Is AdamW Effective Under Heavy-Tailed Noise?
The paper poses whether AdamW converges under heavy-tailed stochastic gradient noise and supplies a weighted-metric benchmark plus a corridor lower-bound showing how denominator memory can obscure large gradients.
Reference graph
Works this paper leans on
-
[1]
Arjevani, Y ., Carmon, Y ., Duchi, J
URL https://openreview.net/forum? id=0uI5415ry7. Arjevani, Y ., Carmon, Y ., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199 (1-2):165–214, 2023. Attia, A. and Koren, T. SGD with AdaGrad step- sizes: Full adaptivity with high probability to un- known parameters, u...
2023
-
[2]
Battash, B., Wolf, L., and Lindenbaum, O
URL https://proceedings.mlr.press/ v202/attia23a.html. Battash, B., Wolf, L., and Lindenbaum, O. Revisiting the noise model of stochastic gradient descent. In Das- gupta, S., Mandt, S., and Li, Y . (eds.),Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pp. 4...
-
[3]
URL https://proceedings.mlr.press/ v238/battash24a.html. Bernstein, J., Wang, Y .-X., Azizzadenesheli, K., and Anand- kumar, A. signSGD: Compressed optimisation for non- convex problems. In Dy, J. and Krause, A. (eds.),Pro- ceedings of the 35th International Conference on Ma- chine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 560–56...
-
[4]
Crawshaw, M
URL https://proceedings.mlr.press/ v267/chezhegov25a.html. Crawshaw, M. and Liu, M. Complexity lower bounds of adaptive gradient algorithms for non-convex stochastic optimization under relaxed smoothness. InThe Thirteenth International Conference on Learning Representations,
-
[5]
Cutkosky, A
URL https://openreview.net/forum? id=ZjOXuAfS6l. Cutkosky, A. and Mehta, H. Momentum improves nor- malized SGD. In III, H. D. and Singh, A. (eds.),Pro- ceedings of the 37th International Conference on Ma- chine Learning, volume 119 ofProceedings of Machine Learning Research, pp. 2260–2268. PMLR, 13–18 Jul
-
[6]
Cutkosky, A
URL https://proceedings.mlr.press/ v119/cutkosky20b.html. Cutkosky, A. and Mehta, H. High-probability bounds for non-convex stochastic optimization with heavy tails. In Ranzato, M., Beygelzimer, A., Dauphin, Y ., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, vol- ume 34, pp. 4883–4895. Curran Associates, Inc.,
-
[7]
URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 26901debb30ea03f0aa833c9de6b81e9- Paper.pdf. De, S., Mukherjee, A., and Ullah, E. Convergence guaran- tees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018. Défossez, A., Bottou, L., Bach, F., and Usun...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[8]
9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y
URL https://openreview.net/forum? id=ZPQhzTSWA7. 9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y . Adaptive subgra- dient methods for online learning and stochastic opti- mization.Journal of Machine Learning Research, 12 (61):2121–2159, 2011. URL http://jmlr.org/ papers/v12/duc...
2011
-
[9]
Faw, M., Rout, L., Caramanis, C., and Shakkottai, S
URL https://proceedings.mlr.press/ v178/faw22a.html. Faw, M., Rout, L., Caramanis, C., and Shakkottai, S. Beyond uniform smoothness: A stopped analy- sis of adaptive sgd. In Neu, G. and Rosasco, L. (eds.),Proceedings of Thirty Sixth Conference on Learn- ing Theory, volume 195 ofProceedings of Machine Learning Research, pp. 89–160. PMLR, 12–15 Jul
-
[10]
Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P
URL https://proceedings.mlr.press/ v195/faw23a.html. Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P. On proximal policy optimization’s heavy- tailed gradients. In Meila, M. and Zhang, T. (eds.),Pro- ceedings of the 38th International Conference on Ma- chine Learning, volume 1...
-
[11]
URL https://proceedings.mlr.press/ v139/garg21b.html. Hong, Y . and Lin, J. On convergence of adam for stochastic optimization under relaxed assumptions. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Sys- tems, volume 37, pp. 10827–10877. Curran Asso- ciates, I...
-
[12]
Jiang, R., Maladkar, D., and Mokhtari, A
URL https://proceedings.mlr.press/ v258/hubler25a.html. Jiang, R., Maladkar, D., and Mokhtari, A. Provable complexity improvement of adagrad over sgd: Upper and lower bounds in stochastic non-convex optimiza- tion. In Haghtalab, N. and Moitra, A. (eds.),Pro- ceedings of Thirty Eighth Conference on Learning The- ory, volume 291 ofProceedings of Machine Lea...
-
[13]
Adam: A Method for Stochastic Optimization
URL https://proceedings.mlr.press/ v291/jiang25c.html. Kavis, A., Levy, K. Y ., and Cevher, V . High probability bounds for a class of nonconvex algorithms with ada- grad stepsize. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=dSw0QtRMJkO. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimizat...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[14]
cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf
URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf. Li, H., Dong, Y ., and Lin, Z. On the o( √ d/t1/4) con- vergence rate of rmsprop and its momentum extension measured by ℓ1 norm.Journal of Machine Learning Research, 26(131):1–25, 2025. URL http://jmlr. org/papers/v26/24-0523.html. Li, X. and Orab...
2021
-
[15]
URL https://proceedings.mlr.press/ v313/liu26a.html. Liu, Z. and Zhou, Z. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=NKotdPUc3L. Liu, Z., Nguyen, T. D., Ene, A., and Nguyen, H....
2025
-
[16]
cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf
URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf. Reddi, S. J., Kale, S., and Kumar, S. On the conver- gence of adam and beyond. InInternational Confer- ence on Learning Representations, 2018. URL https: //openreview.net/forum?id=ryQu7f-RZ. Sadiev, A., Danilova, M., Gorbunov, E., Horvá...
2023
-
[17]
Shazeer, N
URL https://proceedings.mlr.press/ v202/sadiev23a.html. Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Dy, J. and Krause, A. (eds.),Proceedings of the 35th International Confer- ence on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 4596–4604. PMLR, 10–15 Jul 2018. URLhttps://procee...
2018
-
[18]
Sun, T., Liu, X., and Yuan, K
URL https://proceedings.mlr.press/ v97/simsekli19a.html. Sun, T., Liu, X., and Yuan, K. Revisiting gradient nor- malization and clipping for nonconvex sgd under heavy- tailed noise: Necessity, sufficiency, and acceleration. Journal of Machine Learning Research, 26(237):1–42,
-
[19]
Tieleman, T., Hinton, G., et al
URL http://jmlr.org/papers/v26/24- 1991.html. Tieleman, T., Hinton, G., et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26–31, 2012. 11 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Vaswani, A., Shazeer, N., Parmar...
1991
-
[20]
cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf
URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf. Wang, B., Fu, J., Zhang, H., Zheng, N., and Chen, W. Closing the gap between the upper bound and lower bound of adam's iteration complexity. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.),Advances in Neural Inf...
-
[21]
cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf
URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf. Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex opti- mization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.),Advances in Neural Information P...
2023
-
[22]
cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf
URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf. Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gra- dient clipping accelerates training: A theoretical jus- tification for adaptivity. InInternational Conference on Learning Representations, 2020a. URL https: //openreview.net/forum?id=BJgnXpVYw...
2018
-
[23]
Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q
URL https://openreview.net/forum? id=QIzRdjIWnS. Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam can converge without any modification on update rules. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.),Ad- vances in Neural Information Processing Systems, volume 35, pp. 28386–28399. Curran Associates, Inc.,
-
[24]
TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤∆ +γ dX i=1 σ2 i ci TX t=1 E
URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ b6260ae5566442da053e5ab5d691067a- Paper-Conference.pdf. Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 12 Can Adaptive Grad...
2022
-
[25]
s uT,i λ+ p vT,i +u T,i +c 2 i × λ+ q vT,i +u T,i +c 2 i # (c) ≤ vuutE
Divide both sides of the above inequality by γ 2 to obtain E " TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤ 2∆ γ + 2γ∥L∥ 1 lnK T + 8∥σ∥ 1 T 1 p − 1 2 ln 1 ¯p+ 1 2 KT .(9) 13 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Next, recall thatu t,i =Pt s=1(∇if(x s))2,∀t∈N(introduced in Lemma A.4), we hence have E √uT,i =E "s uT,i λ+...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.