Adam Converges in Nonsmooth Nonconvex Optimization

Zijian Liu

arxiv: 2606.22326 · v1 · pith:2KHDVUD4new · submitted 2026-06-21 · 🧮 math.OC · cs.LG · stat.ML

Pith reviewed 2026-06-26 10:28 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · stat.ML

keywords Adam optimizer · nonsmooth nonconvex optimization · finite-time convergence · bias correction · heavy-tailed noise · random learning rate · momentum parameters

The pith

Classical Adam with bias correction converges at rate 1/T^{2/13} in nonsmooth nonconvex optimization when the learning rate is randomly scaled.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper gives the first finite-time guarantee for the original Adam algorithm, including its bias-correction step, in nonsmooth nonconvex problems. It proves that setting the two momentum parameters equal and randomly scaling the learning rate yields a convergence rate of 1/T^{2/13}. The guarantee also covers heavy-tailed gradient noise, a regime the abstract describes as closer to practice. Prior analyses either omitted the bias correction or added extra steps not present in standard Adam. The result therefore supplies a theoretical basis for Adam's practical effectiveness on the kinds of problems that arise in neural network training.
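To make the rate concrete, it can be inverted into an iteration count. The following is a back-of-the-envelope reading, assuming the bound has the form C/T^{2/13} for some problem-dependent constant C (a hypothetical symbol; the abstract gives neither the constants nor the exact stationarity measure):

$$
\frac{C}{T^{2/13}} \le \epsilon
\quad\Longleftrightarrow\quad
T \ge \left(\frac{C}{\epsilon}\right)^{13/2}.
$$

Under that assumption, halving the target accuracy ε multiplies the required number of iterations by 2^{13/2} ≈ 90.5.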

Core claim

The classical Adam update, with its built-in bias-correction term and without clipping or other modifications, converges to a stationary point at rate 1/T^{2/13} for nonsmooth nonconvex objectives when the momentum parameters satisfy β1 = β2 and the learning rate is drawn from a random scaling schedule; the same bound holds under heavy-tailed gradient noise.
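The update described above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the default values, and in particular the choice of an Exp(1) random scale are assumptions made here, since the abstract does not specify the distribution of the random learning-rate scaling.

```python
import numpy as np

def adam_step(x, m, v, grad, t, base_lr, beta, rng, eps=1e-8):
    """One classical Adam step with bias correction and beta1 = beta2 = beta."""
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta * m + (1.0 - beta) * grad
    v = beta * v + (1.0 - beta) * grad ** 2
    # Bias correction; both factors coincide because beta1 = beta2.
    m_hat = m / (1.0 - beta ** t)
    v_hat = v / (1.0 - beta ** t)
    # ASSUMPTION: Exp(1) random scale; the paper's schedule may differ.
    scale = rng.exponential(1.0)
    x = x - base_lr * scale * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

def run_adam(subgrad, x0, steps, base_lr=1e-2, beta=0.9, seed=0):
    """Run the sketch above for a fixed number of steps (t starts at 1)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        x, m, v = adam_step(x, m, v, subgrad(x), t, base_lr, beta, rng)
    return x

# Toy nonsmooth objective f(x) = |x_1| + |x_2|; np.sign returns a subgradient.
x_final = run_adam(np.sign, x0=[1.0, -2.0], steps=2000)
```

Note that no clipping or extra normalization appears in the step; the only departures from a textbook Adam call are the shared `beta` and the random multiplier on `base_lr`.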

What carries the argument

The bias-correction term inside the standard Adam update together with a randomly scaled learning-rate schedule, analyzed through the Online-to-Nonconvex Conversion framework.

If this is right

Adam can be used directly on nonsmooth problems such as neural-network training without algorithmic changes.
The same convergence guarantee applies when gradient noise has heavy tails.
Setting the two momentum parameters equal is theoretically justified.
Bias correction does not prevent a convergence proof under the stated conditions.
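For readers unfamiliar with the term, heavy-tailed noise is commonly formalized as a bounded p-th central moment of the stochastic gradient instead of a bounded variance. The statement below is that common formalization, with g(x) denoting the stochastic gradient and p and σ as illustrative symbols; the abstract does not state the paper's exact condition, which may differ:

$$
\mathbb{E}\left[\,\lVert g(x) - \nabla f(x) \rVert^{p}\,\right] \le \sigma^{p}
\quad \text{for some } p \in (1, 2].
$$

The case p = 2 recovers the usual bounded-variance assumption, while p < 2 allows the variance to be infinite.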

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Random scaling of the learning rate may be worth testing in other first-order methods for nonsmooth nonconvex problems.
The 2/13 exponent might be improved by relaxing the equal-momentum requirement or by using a different conversion framework.
The result suggests that empirical success of Adam on deep networks may stem from its ability to tolerate heavy-tailed noise without extra clipping.

Load-bearing premise

The proof requires that the two momentum decay rates are set equal and that the learning rate is randomly scaled rather than following the usual deterministic schedule.
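One elementary consequence of equal decay rates is worth writing out. Using the standard Adam notation (m_t and v_t for the first- and second-moment averages, hats for their bias-corrected versions) and ignoring the small stabilizing constant in the denominator, setting β1 = β2 = β gives:

$$
\frac{\hat m_t}{\sqrt{\hat v_t}}
= \frac{m_t / (1-\beta^{t})}{\sqrt{v_t / (1-\beta^{t})}}
= \frac{1}{\sqrt{1-\beta^{t}}} \cdot \frac{m_t}{\sqrt{v_t}}.
$$

So with equal momentum parameters, bias correction acts as a single scalar multiplier on the uncorrected direction, and that multiplier tends to 1 as t grows. This is an observation about the update rule itself; whether the paper's proof relies on it is not stated in the abstract.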

What would settle it

An explicit nonsmooth nonconvex function and heavy-tailed noise distribution on which Adam with β1 = β2 and random scaling fails to achieve a 1/T^{2/13} rate to stationarity.
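An empirical probe of such a candidate counterexample could start by fitting the decay exponent on a log-log scale. The sketch below assumes the experimenter already has some stationarity measure evaluated at several horizons T (the function name and inputs are hypothetical), and it is checked here only on synthetic data. A fitted exponent is suggestive at best: the theorem is an upper bound with unspecified constants, so a finite experiment cannot by itself confirm or refute it.

```python
import numpy as np

def fitted_rate_exponent(horizons, measures):
    """Return alpha such that measures ~ const * horizons**(-alpha) in least squares."""
    # np.polyfit returns coefficients from highest degree down: [slope, intercept].
    slope, _ = np.polyfit(np.log(horizons), np.log(measures), deg=1)
    return -slope

# Synthetic sanity check: values that decay exactly as T^(-2/13).
T = np.array([1e3, 1e4, 1e5, 1e6])
synthetic = 5.0 * T ** (-2.0 / 13.0)
alpha = fitted_rate_exponent(T, synthetic)
```

On real runs the measures would be averaged over many seeds, since both the gradient noise and the learning-rate scaling are random.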

Original abstract

Adam is one of the most widely implemented and influential modern optimizers. Why is it effective across different optimization problems in practice? This question arguably lies at the center of the optimization community over the last decade and has motivated a substantial body of work aimed at understanding its convergence behavior. However, existing studies have mainly focused on the convergence rate of Adam in smooth nonconvex optimization, which unfortunately does not adequately capture practical settings, since many real-world problems are nonsmooth, such as those arising in training neural networks. Thus, these studies cannot fully explain the popularity and empirical success of Adam. Recently, an insightful and powerful framework called Online-to-Nonconvex Conversion has opened a new way to analyze Adam for nonsmooth nonconvex optimization. Unfortunately, prior works along this line share two common limitations. First, all of them ignore the important bias-correction term in the original Adam algorithm. Second and more importantly, many of them require extra operations that are not used in Adam, such as a clipping step. Therefore, the convergence guarantee for the original Adam method still remains unclear. In this work, we present the first finite-time analysis for the classical form of Adam, i.e., with the bias-correction step and without further algorithmic modifications, and prove that a randomly scaled learning rate ensures a convergence rate of $1/T^{\frac{2}{13}}$ for nonsmooth nonconvex optimization. Moreover, our result provably applies to the modern heavy-tailed noise regime, which is closer to practice. Interestingly, our theory is established under the parameter choice $\beta_1=\beta_2$, aligning with the recent empirical studies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper supplies the first convergence bound for Adam that keeps bias correction and skips clipping in nonsmooth nonconvex problems, but only after setting β1=β2 and switching to random learning-rate scaling.

This paper gives the first finite-time guarantee for the classical Adam update (bias correction included, no added clipping) on nonsmooth nonconvex objectives with heavy-tailed noise, at rate 1/T^{2/13}. That is the concrete advance over earlier Online-to-Nonconvex Conversion results.

The authors correctly identify the two gaps in prior work and close them by retaining the bias-correction term while staying inside the same framework. The note that β1=β2 matches recent empirical studies is a small but useful point.

The result still rests on two restrictions that move it away from unmodified Adam: β1 must equal β2, and the learning rate must be randomly scaled rather than following the usual deterministic schedule. The precise function class, moment conditions on the noise, and boundedness assumptions inherited from the conversion framework are not stated in the abstract, so the breadth of the claim is hard to judge without the full proof. If those assumptions turn out to be strong, the practical reach shrinks.

The paper is aimed at researchers who track theoretical analyses of adaptive methods and want to see the nonsmooth nonconvex case addressed. A reader already familiar with the Online-to-Nonconvex Conversion line will extract the most value; others will mainly see the rate and the parameter restrictions.

It deserves a serious referee because it targets a documented open question with a new bound, even though the proof will need close checking and the scope will likely require clarification.

Referee Report

2 major / 0 minor

Summary. The paper claims the first finite-time convergence guarantee for the unmodified classical Adam algorithm (including bias correction) in nonsmooth nonconvex optimization. It establishes a rate of 1/T^{2/13} under a randomly scaled learning rate, the restriction β1=β2, and heavy-tailed noise, by building on the Online-to-Nonconvex Conversion framework; prior works are critiqued for omitting bias correction or adding clipping.

Significance. If the result holds under the stated conditions, it would strengthen the theoretical basis for Adam in practical nonsmooth settings with heavy-tailed gradients, closing two documented gaps in the Online-to-Nonconvex Conversion line. The explicit alignment with empirical studies on β1=β2 is a positive feature, as is the extension to heavy-tailed noise.

major comments (2)

[Abstract] Abstract and introduction: the claim of analysis for the 'classical form of Adam... without further algorithmic modifications' is load-bearing yet in tension with the required randomly scaled learning rate (instead of any deterministic schedule) and the restriction β1=β2; these choices are necessary to close the reduction inside the Online-to-Nonconvex Conversion framework but alter the algorithm relative to standard Adam.
[Abstract] The precise function class, noise moment conditions, and boundedness assumptions inherited from the Online-to-Nonconvex Conversion framework are not stated in the abstract or introduction; without them it is impossible to verify whether the 1/T^{2/13} guarantee applies to the nonsmooth nonconvex problems the claim targets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point-by-point to the major comments, indicating where we agree and where we maintain a different view while remaining faithful to the manuscript.

Referee: [Abstract] Abstract and introduction: the claim of analysis for the 'classical form of Adam... without further algorithmic modifications' is load-bearing yet in tension with the required randomly scaled learning rate (instead of any deterministic schedule) and the restriction β1=β2; these choices are necessary to close the reduction inside the Online-to-Nonconvex Conversion framework but alter the algorithm relative to standard Adam.

Authors: We disagree that the randomly scaled learning rate or the choice β1=β2 constitutes an algorithmic modification. The classical Adam algorithm consists of the momentum and second-moment updates together with the bias-correction terms; no clipping, extra normalization, or other operations are added. The learning-rate schedule is an external hyper-parameter choice, and the paper explicitly states that a randomly scaled schedule yields the stated rate. Likewise, β1=β2 is a parameter setting (not a change to the update rules) that our analysis requires and that the manuscript notes aligns with recent empirical findings. We will add a clarifying sentence in the introduction to distinguish the core algorithm from the schedule and parameter choice.
Revision: partial

Referee: [Abstract] The precise function class, noise moment conditions, and boundedness assumptions inherited from the Online-to-Nonconvex Conversion framework are not stated in the abstract or introduction; without them it is impossible to verify whether the 1/T^{2/13} guarantee applies to the nonsmooth nonconvex problems the claim targets.

Authors: We agree that the abstract and introduction would benefit from an explicit, concise statement of the inherited assumptions. In the revision we will insert a short paragraph (or bullet list) in the introduction summarizing the function class (nonsmooth nonconvex), the heavy-tailed noise moment conditions, and the boundedness requirements from the Online-to-Nonconvex Conversion framework.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; analysis extends external framework under explicit restrictions

The paper extends the cited Online-to-Nonconvex Conversion framework (prior independent work) to classical Adam with bias correction, explicitly stating the choices required to obtain the 1/T^{2/13} rate: β1=β2 and a randomly scaled learning rate. No load-bearing step reduces, by the paper's own equations, to a fitted parameter renamed as a prediction, a self-definitional loop, or a self-citation chain that would render the result tautological. The derivation remains self-contained given the stated assumptions of the external framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the Online-to-Nonconvex Conversion framework applied to nonsmooth nonconvex functions and heavy-tailed noise; the abstract supplies no explicit list of free parameters or invented entities.

axioms (1)

Domain assumption: the standard technical assumptions of the Online-to-Nonconvex Conversion framework for nonsmooth nonconvex optimization.
The proof is described as building directly on this framework while removing two of its prior limitations.

