Adam Converges in Nonsmooth Nonconvex Optimization
Pith reviewed 2026-06-26 10:28 UTC · model grok-4.3
The pith
Classical Adam with bias correction converges at rate 1/T^{2/13} in nonsmooth nonconvex optimization when the learning rate is randomly scaled.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The classical Adam update, with its built-in bias-correction term and without clipping or other modifications, converges to a stationary point at rate 1/T^{2/13} for nonsmooth nonconvex objectives when the momentum parameters satisfy β1 = β2 and the learning rate is drawn from a random scaling schedule; the same bound holds under heavy-tailed gradient noise.
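To get a feel for how slow a 1/T^{2/13} rate is, the bound can be inverted. Ignoring constants and problem-dependent factors (an assumption, since this summary does not state them), reaching a stationarity level ε takes on the order of ε^{-13/2} iterations. The helper name `iterations_for_target` below is hypothetical and only illustrates the arithmetic.

```python
import math


def iterations_for_target(eps: float, rate_num: int = 2, rate_den: int = 13) -> int:
    """Smallest integer T with 1 / T**(rate_num / rate_den) <= eps.

    Assumption: all constants in the bound are treated as 1, so the result
    only indicates an order of magnitude, not an actual guarantee.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    # 1 / T**(2/13) <= eps  is equivalent to  T >= eps**(-13/2).
    return math.ceil(eps ** (-rate_den / rate_num))


if __name__ == "__main__":
    for eps in (0.5, 0.1, 0.05):
        print(eps, iterations_for_target(eps))
```

Under this simplification, halving the target ε multiplies the required horizon by 2^{6.5}, roughly a factor of 90.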
What carries the argument
The bias-correction term inside the standard Adam update together with a randomly scaled learning-rate schedule, analyzed through the Online-to-Nonconvex Conversion framework.
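The ingredients named above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the moment updates and bias correction follow the standard Adam update, β1 = β2 is enforced by using a single `beta`, and the random scaling is assumed here to be an independent Exponential(1) multiplier on a base learning rate at every step. The summary does not specify the scaling distribution, so that choice, along with the names `adam_random_scaling` and `abs_sum_subgradient`, is hypothetical.

```python
import math
import random


def abs_sum_subgradient(x):
    """A subgradient of the nonsmooth test function f(x) = sum_i |x_i|."""
    return [float((xi > 0) - (xi < 0)) for xi in x]


def adam_random_scaling(grad_fn, x0, steps, base_lr=1e-2, beta=0.9,
                        eps=1e-8, seed=0):
    """Adam with bias correction, beta1 = beta2 = beta, and a randomly
    scaled learning rate. No clipping or extra normalization is applied."""
    rng = random.Random(seed)
    x = list(x0)
    m = [0.0] * len(x)  # first-moment (momentum) estimate
    v = [0.0] * len(x)  # second-moment estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # ASSUMPTION: Exponential(1) scaling, drawn fresh at each step.
        lr = base_lr * rng.expovariate(1.0)
        correction = 1.0 - beta ** t  # same factor for m and v since beta1 = beta2
        for i, gi in enumerate(g):
            m[i] = beta * m[i] + (1.0 - beta) * gi
            v[i] = beta * v[i] + (1.0 - beta) * gi * gi
            m_hat = m[i] / correction
            v_hat = v[i] / correction
            x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


if __name__ == "__main__":
    print(adam_random_scaling(abs_sum_subgradient, [1.0, -2.0], steps=2000))
```

With equal decay rates the two bias-correction factors coincide, and the normalized step `m_hat / sqrt(v_hat)` has magnitude at most 1, so each coordinate moves by at most the sampled learning rate per step.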
If this is right
- Adam can be used directly on nonsmooth problems such as neural-network training without algorithmic changes.
- The same convergence guarantee applies when gradient noise has heavy tails.
- Setting the two momentum parameters equal is theoretically justified.
- Bias correction does not prevent a convergence proof under the stated conditions.
Where Pith is reading between the lines
- Random scaling of the learning rate may be worth testing in other first-order methods for nonsmooth nonconvex problems.
- The 2/13 exponent might be improved by relaxing the equal-momentum requirement or by using a different conversion framework.
- The result suggests that empirical success of Adam on deep networks may stem from its ability to tolerate heavy-tailed noise without extra clipping.
Load-bearing premise
The proof requires that the two momentum decay rates are set equal and that the learning rate is randomly scaled rather than following the usual deterministic schedule.
What would settle it
An explicit nonsmooth nonconvex function and heavy-tailed noise distribution on which Adam with β1 = β2 and random scaling fails to achieve a 1/T^{2/13} rate to stationarity.
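As a rough diagnostic for such a candidate counterexample, one could record a stationarity measure at several horizons T and fit the slope on a log-log scale. The sketch below is hypothetical: the function name `fitted_rate_exponent` is invented here, the choice of stationarity measure is left to the user because this summary does not define it, and a finite-sample fit can only suggest, never prove or refute, an asymptotic rate.

```python
import math


def fitted_rate_exponent(horizons, stationarity_values):
    """Least-squares estimate of a in: value ~ C / T**a, via a log-log fit."""
    xs = [math.log(t) for t in horizons]
    ys = [math.log(s) for s in stationarity_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is -a, so negate it to report the decay exponent.
    return -sxy / sxx


if __name__ == "__main__":
    # Synthetic data that decays exactly like T**(-2/13).
    horizons = [10 ** k for k in range(2, 7)]
    values = [3.0 * t ** (-2 / 13) for t in horizons]
    print(fitted_rate_exponent(horizons, values), 2 / 13)
```

A fitted exponent persistently below 2/13 over growing horizons would be a reason to look more closely, while an exponent at or above 2/13 is consistent with the claimed bound.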
Original abstract
Adam is one of the most widely implemented and influential modern optimizers. Why is it effective across different optimization problems in practice? This question arguably lies at the center of the optimization community over the last decade and has motivated a substantial body of work aimed at understanding its convergence behavior. However, existing studies have mainly focused on the convergence rate of Adam in smooth nonconvex optimization, which unfortunately does not adequately capture practical settings, since many real-world problems are nonsmooth, such as those arising in training neural networks. Thus, these studies cannot fully explain the popularity and empirical success of Adam. Recently, an insightful and powerful framework called Online-to-Nonconvex Conversion has opened a new way to analyze Adam for nonsmooth nonconvex optimization. Unfortunately, prior works along this line share two common limitations. First, all of them ignore the important bias-correction term in the original Adam algorithm. Second and more importantly, many of them require extra operations that are not used in Adam, such as a clipping step. Therefore, the convergence guarantee for the original Adam method still remains unclear. In this work, we present the first finite-time analysis for the classical form of Adam, i.e., with the bias-correction step and without further algorithmic modifications, and prove that a randomly scaled learning rate ensures a convergence rate of $1/T^{\frac{2}{13}}$ for nonsmooth nonconvex optimization. Moreover, our result provably applies to the modern heavy-tailed noise regime, which is closer to practice. Interestingly, our theory is established under the parameter choice $\beta_1=\beta_2$, aligning with the recent empirical studies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims the first finite-time convergence guarantee for the unmodified classical Adam algorithm (including bias correction) in nonsmooth nonconvex optimization. It establishes a rate of 1/T^{2/13} under a randomly scaled learning rate, the restriction β1=β2, and heavy-tailed noise, by building on the Online-to-Nonconvex Conversion framework; prior works are critiqued for omitting bias correction or adding clipping.
Significance. If the result holds under the stated conditions, it would strengthen the theoretical basis for Adam in practical nonsmooth settings with heavy-tailed gradients, closing two documented gaps in the Online-to-Nonconvex Conversion line. The explicit alignment with empirical studies on β1=β2 is a positive feature, as is the extension to heavy-tailed noise.
major comments (2)
- [Abstract] Abstract and introduction: the claim of analysis for the 'classical form of Adam... without further algorithmic modifications' is load-bearing yet in tension with the required randomly scaled learning rate (instead of any deterministic schedule) and the restriction β1=β2; these choices are necessary to close the reduction inside the Online-to-Nonconvex Conversion framework but alter the algorithm relative to standard Adam.
- [Abstract] The precise function class, noise moment conditions, and boundedness assumptions inherited from the Online-to-Nonconvex Conversion framework are not stated in the abstract or introduction; without them it is impossible to verify whether the 1/T^{2/13} guarantee applies to the nonsmooth nonconvex problems the claim targets.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. Below we respond point-by-point to the major comments, indicating where we agree and where we maintain a different view while remaining faithful to the manuscript.
- Referee: [Abstract] Abstract and introduction: the claim of analysis for the 'classical form of Adam... without further algorithmic modifications' is load-bearing yet in tension with the required randomly scaled learning rate (instead of any deterministic schedule) and the restriction β1=β2; these choices are necessary to close the reduction inside the Online-to-Nonconvex Conversion framework but alter the algorithm relative to standard Adam.
  Authors: We disagree that the randomly scaled learning rate or the choice β1=β2 constitutes an algorithmic modification. The classical Adam algorithm consists of the momentum and second-moment updates together with the bias-correction terms; no clipping, extra normalization, or other operations are added. The learning-rate schedule is an external hyper-parameter choice, and the paper explicitly states that a randomly scaled schedule yields the stated rate. Likewise, β1=β2 is a parameter setting (not a change to the update rules) that our analysis requires and that the manuscript notes aligns with recent empirical findings. We will add a clarifying sentence in the introduction to distinguish the core algorithm from the schedule and parameter choice.
  Revision: partial
- Referee: [Abstract] The precise function class, noise moment conditions, and boundedness assumptions inherited from the Online-to-Nonconvex Conversion framework are not stated in the abstract or introduction; without them it is impossible to verify whether the 1/T^{2/13} guarantee applies to the nonsmooth nonconvex problems the claim targets.
  Authors: We agree that the abstract and introduction would benefit from an explicit, concise statement of the inherited assumptions. In the revision we will insert a short paragraph (or bullet list) in the introduction summarizing the function class (nonsmooth nonconvex), the heavy-tailed noise moment conditions, and the boundedness requirements from the Online-to-Nonconvex Conversion framework.
  Revision: yes
Circularity Check
No significant circularity; analysis extends external framework under explicit restrictions
The paper extends the cited Online-to-Nonconvex Conversion framework (prior independent work) to the classical Adam with bias correction, explicitly stating the required choices β1=β2 and randomly scaled learning rate to obtain the 1/T^{2/13} rate. No load-bearing step reduces by the paper's own equations to a fitted parameter renamed as prediction, a self-definitional loop, or a self-citation chain that renders the result tautological. The derivation remains self-contained against the stated assumptions of the external framework.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard technical assumptions of the Online-to-Nonconvex Conversion framework for nonsmooth nonconvex optimization.