pith. sign in

arxiv: 2605.18694 · v1 · pith:EXA7RSNMnew · submitted 2026-05-18 · 🧮 math.OC · cs.LG· stat.ML

Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

Pith reviewed 2026-05-20 08:48 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML
keywords AdaGradheavy-tailed noisenon-convex optimizationconvergence rateadaptive gradient methodsstochastic optimizationtail index
0
0 comments X

The pith

AdaGrad converges at a provable rate in non-convex optimization under heavy-tailed gradient noise with unknown tail index between 4/3 and 2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that AdaGrad reaches a convergence guarantee in non-convex stochastic optimization even when gradient noise follows a heavy-tailed distribution with index p in (4/3, 2]. The proof works without any prior knowledge of p and without extra steps such as clipping. This matters because heavy-tailed noise appears often in machine learning yet adaptive methods like AdaGrad have been observed to work in practice. The authors also give an algorithm-specific lower bound indicating AdaGrad cannot match the best possible rate for this noise class, plus a stronger rate for the normalized variant AdaGrad-Norm under one mild extra condition.

Core claim

The central claim is that AdaGrad attains the first known convergence rate in non-convex settings when the gradient noise has finite p-moment for unknown p satisfying 4/3 < p ≤ 2. The analysis is adaptive to p. An algorithm-dependent lower bound is derived showing that the existing minimax rate is not achievable by AdaGrad. For AdaGrad-Norm an improved rate holds for any 1 < p ≤ 2 under an extra mild assumption.

What carries the argument

The adaptive step-size rule of AdaGrad, which scales updates using the accumulated sum of squared past gradients, allowing the method to respond to the observed noise scale without knowing the tail index in advance.

If this is right

  • AdaGrad converges in non-convex optimization under heavy-tailed noise for any unknown p in (4/3, 2].
  • No prior knowledge of the tail index or additional clipping is needed for the guarantee.
  • AdaGrad cannot attain the minimax optimal rate for heavy-tailed noise.
  • AdaGrad-Norm obtains a stronger convergence rate for every p in (1, 2] once one mild extra assumption holds.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other adaptive methods such as Adam may admit similar tail-adaptive convergence arguments.
  • The robustness observed in practice for adaptive optimizers under heavy tails may stem directly from their built-in scaling rather than from external fixes.
  • Empirical tests on problems engineered with exact tail index 1.5 could directly verify or refute the predicted rate.

Load-bearing premise

Stochastic gradients possess a finite p-moment for some unknown p strictly larger than 4/3 inside the standard non-convex stochastic optimization framework.

What would settle it

An explicit non-convex stochastic problem with gradient noise of tail index p = 1.5 in which AdaGrad either diverges or converges slower than the claimed rate would disprove the main result.

read the original abstract

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates the convergence of AdaGrad (and its norm variant) in non-convex stochastic optimization under heavy-tailed gradient noise with finite but unknown p-moment for 4/3 < p ≤ 2. It claims the first provable convergence rate for standard AdaGrad that requires no prior knowledge of p (hence adaptive to the tail index), develops an algorithm-dependent lower bound indicating that existing minimax rates for heavy-tailed optimization are not attainable by AdaGrad, and shows an improved rate for AdaGrad-Norm that holds for any 1 < p ≤ 2 under an extra mild assumption.

Significance. If the central convergence guarantee holds without hidden second-moment assumptions or p-dependent constants in the proof, the result would be significant: it provides the first rigorous evidence that a popular adaptive method can succeed under realistic heavy-tailed noise without algorithmic modifications such as clipping, and the p-adaptivity strengthens its practical relevance. The algorithm-dependent lower bound usefully delineates the limitations of AdaGrad relative to general minimax rates in this regime.

major comments (2)
  1. [Theorem 3.1] Theorem 3.1 (main convergence result for AdaGrad): the claimed rate is stated to hold without knowledge of p and to be adaptive to the tail index. However, the key bounding step for the adaptive term (the telescoping sum involving 1/sqrt(v_t) or sum ||g_t||/sqrt(v_t)) relies on Holder/Young inequalities whose exponents close only for p > 4/3; it is not immediate from the argument whether auxiliary constants or truncation thresholds remain independent of p or whether a uniform second-moment proxy is implicitly invoked to control the denominator when p approaches 4/3 from above. This must be verified explicitly, as any p-dependent choice would contradict the adaptivity claim.
  2. [Section 4] Section 4 (algorithm-dependent lower bound): the construction shows that the minimax rate is not attained by AdaGrad, but the hard-instance analysis must confirm that the state-dependent step-size sequence produced by AdaGrad is properly accounted for when deriving the lower bound; otherwise the gap between the upper bound of Theorem 3.1 and this lower bound may be overstated.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state whether the final convergence rate depends on p or is uniform over the interval (4/3,2].
  2. [Section 2] Notation for the p-moment assumption and the precise definition of the tail index should be introduced earlier and used consistently in the statement of all theorems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review of our manuscript. We address each major comment below, providing clarifications on the proof details and outlining planned revisions to improve transparency.

read point-by-point responses
  1. Referee: [Theorem 3.1] Theorem 3.1 (main convergence result for AdaGrad): the claimed rate is stated to hold without knowledge of p and to be adaptive to the tail index. However, the key bounding step for the adaptive term (the telescoping sum involving 1/sqrt(v_t) or sum ||g_t||/sqrt(v_t)) relies on Holder/Young inequalities whose exponents close only for p > 4/3; it is not immediate from the argument whether auxiliary constants or truncation thresholds remain independent of p or whether a uniform second-moment proxy is implicitly invoked to control the denominator when p approaches 4/3 from above. This must be verified explicitly, as any p-dependent choice would contradict the adaptivity claim.

    Authors: We thank the referee for this precise observation on the proof mechanics. In the analysis of Theorem 3.1, Holder's inequality is applied to the sum involving ||g_t|| / sqrt(v_t) using conjugate exponents that are valid uniformly for any fixed p > 4/3; the resulting bound depends on p only through the moment assumption itself, with no p-dependent truncation thresholds or auxiliary constants chosen in the algorithm or proof. No implicit second-moment proxy is used—the denominator is controlled directly via the non-decreasing property of v_t and the p-moment bound on the noise, without invoking higher moments. To make this explicit and address the concern about uniformity as p approaches 4/3, we will insert a short clarifying remark immediately after the key inequality in the proof of Theorem 3.1 (and a corresponding note in the main text) that highlights the p-independence of all algorithmic choices and the absence of hidden assumptions. revision: partial

  2. Referee: [Section 4] Section 4 (algorithm-dependent lower bound): the construction shows that the minimax rate is not attained by AdaGrad, but the hard-instance analysis must confirm that the state-dependent step-size sequence produced by AdaGrad is properly accounted for when deriving the lower bound; otherwise the gap between the upper bound of Theorem 3.1 and this lower bound may be overstated.

    Authors: We agree that explicitly tracking the state-dependent step sizes is essential for a rigorous lower bound. In the hard-instance construction of Section 4, the gradient sequence is chosen so that the accumulated v_t (and thus the per-coordinate step sizes) can be computed in closed form at each iteration; the lower-bound derivation then substitutes these explicit adaptive steps into the progress inequality and shows that the resulting convergence rate remains strictly slower than the known minimax rate for heavy-tailed non-convex optimization. We will revise Section 4 to include an additional lemma that isolates the step-size sequence produced by AdaGrad on this instance and verifies its incorporation at every step of the lower-bound argument, thereby making the accounting fully transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: new convergence proof under p-moment assumptions

full rationale

The paper derives a convergence rate for AdaGrad in non-convex settings under heavy-tailed noise with finite but unknown p-moment for 4/3 < p ≤ 2. This is presented as the first such result that is adaptive to p without requiring its knowledge in advance. The abstract and reader's summary indicate a standard extension of stochastic optimization analysis replacing variance bounds with p-moment assumptions, without any fitted parameters, self-definitional rates, or load-bearing self-citations that reduce the claim to its own inputs. The derivation chain remains self-contained against external benchmarks for heavy-tailed optimization, yielding an independent theoretical contribution rather than a renaming or construction that forces the result.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Based on abstract only: the central claim rests on standard non-convex stochastic optimization assumptions plus the existence of p-moments for unknown p > 4/3; no free parameters or invented entities are mentioned.

axioms (2)
  • domain assumption Gradient noise possesses a finite p-moment for some unknown p in (4/3,2]
    Invoked to obtain the stated convergence rate without prior knowledge of p.
  • standard math Standard non-convex stochastic optimization setting
    Background assumption for the convergence analysis in non-convex regime.

pith-pipeline@v0.9.0 · 5800 in / 1473 out tokens · 34892 ms · 2026-05-20T08:48:16.383851+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 2 internal anchors

  1. [1]

    Arjevani, Y ., Carmon, Y ., Duchi, J

    URL https://openreview.net/forum? id=0uI5415ry7. Arjevani, Y ., Carmon, Y ., Duchi, J. C., Foster, D. J., Srebro, N., and Woodworth, B. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199 (1-2):165–214, 2023. Attia, A. and Koren, T. SGD with AdaGrad step- sizes: Full adaptivity with high probability to un- known parameters, u...

  2. [2]

    Battash, B., Wolf, L., and Lindenbaum, O

    URL https://proceedings.mlr.press/ v202/attia23a.html. Battash, B., Wolf, L., and Lindenbaum, O. Revisiting the noise model of stochastic gradient descent. In Das- gupta, S., Mandt, S., and Li, Y . (eds.),Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, volume 238 ofProceedings of Machine Learning Research, pp. 4...

  3. [3]

    Bernstein, J., Wang, Y .-X., Azizzadenesheli, K., and Anand- kumar, A

    URL https://proceedings.mlr.press/ v238/battash24a.html. Bernstein, J., Wang, Y .-X., Azizzadenesheli, K., and Anand- kumar, A. signSGD: Compressed optimisation for non- convex problems. In Dy, J. and Krause, A. (eds.),Pro- ceedings of the 35th International Conference on Ma- chine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 560–56...

  4. [4]

    Crawshaw, M

    URL https://proceedings.mlr.press/ v267/chezhegov25a.html. Crawshaw, M. and Liu, M. Complexity lower bounds of adaptive gradient algorithms for non-convex stochastic optimization under relaxed smoothness. InThe Thirteenth International Conference on Learning Representations,

  5. [5]

    Cutkosky, A

    URL https://openreview.net/forum? id=ZjOXuAfS6l. Cutkosky, A. and Mehta, H. Momentum improves nor- malized SGD. In III, H. D. and Singh, A. (eds.),Pro- ceedings of the 37th International Conference on Ma- chine Learning, volume 119 ofProceedings of Machine Learning Research, pp. 2260–2268. PMLR, 13–18 Jul

  6. [6]

    Cutkosky, A

    URL https://proceedings.mlr.press/ v119/cutkosky20b.html. Cutkosky, A. and Mehta, H. High-probability bounds for non-convex stochastic optimization with heavy tails. In Ranzato, M., Beygelzimer, A., Dauphin, Y ., Liang, P., and Vaughan, J. W. (eds.),Advances in Neural Information Processing Systems, vol- ume 34, pp. 4883–4895. Curran Associates, Inc.,

  7. [7]

    Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

    URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ 26901debb30ea03f0aa833c9de6b81e9- Paper.pdf. De, S., Mukherjee, A., and Ullah, E. Convergence guaran- tees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766, 2018. Défossez, A., Bottou, L., Bach, F., and Usun...

  8. [8]

    9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y

    URL https://openreview.net/forum? id=ZPQhzTSWA7. 9 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Duchi, J., Hazan, E., and Singer, Y . Adaptive subgra- dient methods for online learning and stochastic opti- mization.Journal of Machine Learning Research, 12 (61):2121–2159, 2011. URL http://jmlr.org/ papers/v12/duc...

  9. [9]

    Faw, M., Rout, L., Caramanis, C., and Shakkottai, S

    URL https://proceedings.mlr.press/ v178/faw22a.html. Faw, M., Rout, L., Caramanis, C., and Shakkottai, S. Beyond uniform smoothness: A stopped analy- sis of adaptive sgd. In Neu, G. and Rosasco, L. (eds.),Proceedings of Thirty Sixth Conference on Learn- ing Theory, volume 195 ofProceedings of Machine Learning Research, pp. 89–160. PMLR, 12–15 Jul

  10. [10]

    Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P

    URL https://proceedings.mlr.press/ v195/faw23a.html. Garg, S., Zhanson, J., Parisotto, E., Prasad, A., Kolter, Z., Lipton, Z., Balakrishnan, S., Salakhutdinov, R., and Ravikumar, P. On proximal policy optimization’s heavy- tailed gradients. In Meila, M. and Zhang, T. (eds.),Pro- ceedings of the 38th International Conference on Ma- chine Learning, volume 1...

  11. [11]

    URL https://proceedings.mlr.press/ v139/garg21b.html. Hong, Y . and Lin, J. On convergence of adam for stochastic optimization under relaxed assumptions. In Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., and Zhang, C. (eds.), Advances in Neural Information Processing Sys- tems, volume 37, pp. 10827–10877. Curran Asso- ciates, I...

  12. [12]

    Jiang, R., Maladkar, D., and Mokhtari, A

    URL https://proceedings.mlr.press/ v258/hubler25a.html. Jiang, R., Maladkar, D., and Mokhtari, A. Provable complexity improvement of adagrad over sgd: Upper and lower bounds in stochastic non-convex optimiza- tion. In Haghtalab, N. and Moitra, A. (eds.),Pro- ceedings of Thirty Eighth Conference on Learning The- ory, volume 291 ofProceedings of Machine Lea...

  13. [13]

    Adam: A Method for Stochastic Optimization

    URL https://proceedings.mlr.press/ v291/jiang25c.html. Kavis, A., Levy, K. Y ., and Cevher, V . High probability bounds for a class of nonconvex algorithms with ada- grad stepsize. InInternational Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=dSw0QtRMJkO. Kingma, D. P. and Ba, J. Adam: A method for stochastic optimizat...

  14. [14]

    cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2021/file/ ac10ff1941c540cd87c107330996f4f6- Paper.pdf. Li, H., Dong, Y ., and Lin, Z. On the o( √ d/t1/4) con- vergence rate of rmsprop and its momentum extension measured by ℓ1 norm.Journal of Machine Learning Research, 26(131):1–25, 2025. URL http://jmlr. org/papers/v26/24-0523.html. Li, X. and Orab...

  15. [15]

    URL https://proceedings.mlr.press/ v313/liu26a.html. Liu, Z. and Zhou, Z. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping. InThe Thirteenth International Confer- ence on Learning Representations, 2025. URL https: //openreview.net/forum?id=NKotdPUc3L. Liu, Z., Nguyen, T. D., Ene, A., and Nguyen, H....

  16. [16]

    cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ 4c454d34f3a4c8d6b4ca85a918e5d7ba- Paper-Conference.pdf. Reddi, S. J., Kale, S., and Kumar, S. On the conver- gence of adam and beyond. InInternational Confer- ence on Learning Representations, 2018. URL https: //openreview.net/forum?id=ryQu7f-RZ. Sadiev, A., Danilova, M., Gorbunov, E., Horvá...

  17. [17]

    Shazeer, N

    URL https://proceedings.mlr.press/ v202/sadiev23a.html. Shazeer, N. and Stern, M. Adafactor: Adaptive learning rates with sublinear memory cost. In Dy, J. and Krause, A. (eds.),Proceedings of the 35th International Confer- ence on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pp. 4596–4604. PMLR, 10–15 Jul 2018. URLhttps://procee...

  18. [18]

    Sun, T., Liu, X., and Yuan, K

    URL https://proceedings.mlr.press/ v97/simsekli19a.html. Sun, T., Liu, X., and Yuan, K. Revisiting gradient nor- malization and clipping for nonconvex sgd under heavy- tailed noise: Necessity, sufficiency, and acceleration. Journal of Machine Learning Research, 26(237):1–42,

  19. [19]

    Tieleman, T., Hinton, G., et al

    URL http://jmlr.org/papers/v26/24- 1991.html. Tieleman, T., Hinton, G., et al. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4 (2):26–31, 2012. 11 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Vaswani, A., Shazeer, N., Parmar...

  20. [20]

    cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2017/file/ 3f5ee243547dee91fbd053c1c4a845aa- Paper.pdf. Wang, B., Fu, J., Zhang, H., Zheng, N., and Chen, W. Closing the gap between the upper bound and lower bound of adam's iteration complexity. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.),Advances in Neural Inf...

  21. [21]

    cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2023/file/ eb1a323fa10d4102ff13422476a744ff- Paper-Conference.pdf. Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex opti- mization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.),Advances in Neural Information P...

  22. [22]

    cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf

    URL https://proceedings.neurips. cc/paper_files/paper/2018/file/ 90365351ccc7437a1309dc64e4db32a3- Paper.pdf. Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gra- dient clipping accelerates training: A theoretical jus- tification for adaptivity. InInternational Conference on Learning Representations, 2020a. URL https: //openreview.net/forum?id=BJgnXpVYw...

  23. [23]

    Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q

    URL https://openreview.net/forum? id=QIzRdjIWnS. Zhang, Y ., Chen, C., Shi, N., Sun, R., and Luo, Z.-Q. Adam can converge without any modification on update rules. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.),Ad- vances in Neural Information Processing Systems, volume 35, pp. 28386–28399. Curran Associates, Inc.,

  24. [24]

    TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤∆ +γ dX i=1 σ2 i ci TX t=1 E

    URL https://proceedings.neurips. cc/paper_files/paper/2022/file/ b6260ae5566442da053e5ab5d691067a- Paper-Conference.pdf. Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. A sufficient condition for convergences of adam and rmsprop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019. 12 Can Adaptive Grad...

  25. [25]

    s uT,i λ+ p vT,i +u T,i +c 2 i × λ+ q vT,i +u T,i +c 2 i # (c) ≤ vuutE

    Divide both sides of the above inequality by γ 2 to obtain E " TX t=1 (∇f(x t))2 λ+ √wt 1 # ≤ 2∆ γ + 2γ∥L∥ 1 lnK T + 8∥σ∥ 1 T 1 p − 1 2 ln 1 ¯p+ 1 2 KT .(9) 13 Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad Next, recall thatu t,i =Pt s=1(∇if(x s))2,∀t∈N(introduced in Lemma A.4), we hence have E √uT,i =E "s uT,i λ+...