arxiv: 2605.12878 · v1 · submitted 2026-05-13 · 🧮 math.OC · cs.LG

Recognition: unknown

Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization

Long Chen, Minfu Feng, Yaxin Yu

Pith reviewed 2026-05-14 18:40 UTC · model grok-4.3

keywords: Adam optimizer · stochastic convex optimization · convergence in expectation · adaptive preconditioning · lagged preconditioner · Lyapunov analysis · momentum methods

The pith

Adam-SHANG converges in expectation for stochastic smooth convex optimization under a flexible stepsize condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adam-SHANG as an Adam-type optimizer that adds a lagged preconditioner to stabilize momentum and adaptive preconditioning. It proves convergence in expectation for stochastic smooth convex problems by showing that an admissible stepsize rule, which can always be met via a conservative spectral bound, produces a Lyapunov decrease. This removes the need to assume global monotonicity on the second-moment sequence, a condition often hard to verify in practice. The work also supplies a computable trace-ratio stepsize based on local alignment and tests the structure outside the convex case.

Core claim

Adam-SHANG couples momentum, adaptive preconditioning, and a curvature-aware correction through a more stable lagged-preconditioner update. For stochastic smooth convex optimization, the method converges in expectation under an admissible stepsize condition satisfiable by a conservative spectral bound, without requiring global monotonicity on the second-moment sequence. A trace-ratio stepsize rule motivated by local coordinatewise alignment offers a less conservative practical choice.

What carries the argument

The lagged-preconditioner update that stabilizes the curvature estimate while coupling momentum and adaptive steps.
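
A minimal sketch of what a lagged preconditioner can look like in a diagonal Adam-type step. This is not the paper's Adam-SHANG update, which also couples momentum and a curvature-aware correction that this page does not specify. The function name `lagged_adam_step`, the parameters `lr`, `beta2`, `eps`, and the choice of `sqrt(v) + eps` as the diagonal preconditioner are illustrative assumptions. The one feature taken from the source is that the step applied to the new gradient uses the preconditioner from the previous iteration.

```python
import numpy as np

def lagged_adam_step(x, g, v_prev, lr=0.1, beta2=0.999, eps=1e-8):
    """One diagonal step preconditioned with the *previous* second-moment estimate."""
    # Lagged preconditioner (assumed form): built from v_prev only,
    # so it does not depend on the current gradient g.
    p_lagged = np.sqrt(v_prev) + eps
    x_new = x - lr * g / p_lagged
    # The second-moment estimate is refreshed after the step, for use next time.
    v_new = beta2 * v_prev + (1.0 - beta2) * g * g
    return x_new, v_new

# Usage on a deterministic quadratic f(x) = 0.5 * sum(a * x**2).
a = np.array([1.0, 10.0])
x = np.array([1.0, 1.0])
v = np.ones_like(x)              # hypothetical initialization of the estimate
f0 = 0.5 * np.sum(a * x**2)
for _ in range(200):
    g = a * x                    # exact gradient of the quadratic
    x, v = lagged_adam_step(x, g, v, lr=0.05)
f_final = 0.5 * np.sum(a * x**2)
```

Because the preconditioner is fixed before the gradient is drawn, it is independent of the current noise sample, which is the kind of decoupling a Lyapunov argument in expectation can exploit.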

If this is right

Expected convergence holds whenever the conservative spectral bound is respected.
The trace-ratio stepsize yields a practical rule under local coordinatewise alignment.
The same lagged update structure applies beyond convex problems with simplified parameters.
Experiments confirm the predicted stochastic decay rates and show competitive performance versus Adam and AdamW on deep learning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The relaxed monotonicity requirement may explain why Adam-type methods often succeed empirically even when classical analyses do not apply.
Lyapunov constructions centered on lagged preconditioners could be reused to analyze other adaptive first-order methods.
In non-convex deep learning, focusing on local curvature stability rather than global second-moment behavior might guide more reliable stepsize selection.

Load-bearing premise

The lagged preconditioner must keep the effective curvature estimate stable enough that the chosen stepsize satisfies the admissibility condition at every iteration.

What would settle it

A simple strongly convex quadratic where the preconditioner produces steps that violate the spectral bound and the expected objective value stops decreasing toward the minimum.
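
A test of this kind could be set up along the following lines. This is a sketch under stated assumptions, not the paper's benchmark: the noise oracle `noisy_grad` (additive level `sigma0`, coordinatewise multiplicative level `sigma1`), the helper `mean_objective_curve`, the diagonal lagged preconditioner, and all default values are hypothetical choices made for illustration. The paper's own settings are in its Appendix E.1, which is not reproduced on this page.

```python
import numpy as np

def noisy_grad(x, a, sigma0, sigma1, rng):
    """Unbiased gradient of f(x) = 0.5 * sum(a * x**2) with additive and multiplicative noise."""
    grad = a * x
    d = x.size
    mult = sigma1 * grad * rng.standard_normal(d)        # E||mult||^2 = sigma1^2 * ||grad||^2
    add = sigma0 / np.sqrt(d) * rng.standard_normal(d)   # E||add||^2 = sigma0^2
    return grad + mult + add

def mean_objective_curve(lr, sigma0, sigma1, d=16, steps=300, runs=50, seed=0):
    """Average f(x_k) over independent runs of a lagged diagonal preconditioned method."""
    a = np.linspace(1.0, 10.0, d)
    curve = np.zeros(steps + 1)
    for r in range(runs):
        rng = np.random.default_rng(seed + r)
        x = np.ones(d)
        v = np.ones(d)
        curve[0] += 0.5 * np.sum(a * x**2)
        for k in range(steps):
            g = noisy_grad(x, a, sigma0, sigma1, rng)
            x = x - lr * g / (np.sqrt(v) + 1e-8)   # step uses the previous v (lagged)
            v = 0.99 * v + 0.01 * g * g            # estimate refreshed after the step
            curve[k + 1] += 0.5 * np.sum(a * x**2)
    return curve / runs

curve = mean_objective_curve(lr=0.01, sigma0=0.0, sigma1=1.0)
```

Raising `lr` or `sigma1` is one way to probe where the averaged curve stops decreasing. Which thresholds correspond to the paper's spectral bound is not determined by this sketch.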

Figures

Figures reproduced from arXiv: 2605.12878 by Long Chen, Minfu Feng, Yaxin Yu.

**Figure 1.** Convex optimization benchmark with d = 16. Top: pure multiplicative noise with σ0 = 0 and σ1 ∈ {0, 10, 30}. Bottom: additive-multiplicative noise with σ1 = 10 and σ0 ∈ {0.5, 1, 3}. For Adam, we use the decaying stepsize ℓ0/√(k+1), where ℓ0 is selected by grid search separately for each setting. Full hyperparameter settings and implementation details are provided in Appendix E.1.

**Figure 2.** Iterate trajectories x_t on the Reddi et al. [34] counterexample. Left: deterministic sequence. Right: stochastic sequence (median over 30 independent runs). Hyperparameter settings and full results, including average regret curves, are provided in Appendix E.3. Deep Learning Tasks. This section evaluates whether the proposed methods remain competitive on practical deep learning tasks and, in particular, whe…

**Figure 3.** Validation loss for Transformer language modeling on text8. All methods are trained for 30,000 steps under batch sizes B ∈ {32, 128} to examine sensitivity to gradient variance and effective data throughput. Model architecture, data split, and hyperparameter settings are provided in Appendix E.4. Character-level language modeling on text8. We evaluate the proposed methods on character-level language modeli…

**Figure 4.** Figure 4 shows that both Adam-SHANG and Adam-SHANG-s achieve final test accuracies competitive with Adam and AdamW, while reaching high accuracy earlier. At batch size 32, Adam-SHANG attains the baseline accuracy level in roughly half the number of steps. This suggests that the trace-ratio stepsize accelerates the transient optimization phase by extracting useful stepsize information from the evolving preconditione…

**Figure 5.** Empirical verification of the admissibility condition for Adam-SHANG.

**Figure 6.** Ordering violation rate for Assumption 2.2. A value of …

**Figure 7.** The deterministic version of the counterexample from Reddi et al. [34]. Each panel shows average regret R_t/t (left) and iterate trajectory x_t (right). E.4 Experimental and hyperparameter settings for text8. We evaluate character-level language modeling on text8 [31] using a 4-layer Transformer encoder [38] with Pre-LayerNorm, 8 attention heads, hidden dimension 256, feedforward dimension 1024, and about 3.…

**Figure 8.** The stochastic version of the counterexample from Reddi et al. [34]. Each panel reports the mean average regret R_t/t (left) and the mean iterate trajectory x_t (right) over 30 independent runs. The large initial regret of Adam-SHANG reflects a small fraction of runs with an anomalously large first step caused by g_0 = 1010 at initialization; this is a non-typical artifact, as confirmed by the median in …

**Figure 9.** The stochastic version of the counterexample from Reddi et al. [34]. Each panel reports the median average regret R_t/t (left) and the median iterate trajectory x_t (right) over 30 independent runs. E.5 Hyperparameter settings for CIFAR-100. We summarize here the hyperparameter choices used in the CIFAR-100 experiments. For the Adam-SHANG variants, Adam-SHANG and Adam-SHANG-s use (λ, β, γ) = (0.1, 0.1, 0.005)…

**Figure 10.** Training loss and test accuracy of different optimizers for training ResNet-34 on CIFAR-10. … 10^{-5} for Adam and SHANG++, following common practice. Each model is trained for 50 epochs with three independent random seeds, and we report the mean and standard deviation over the runs.

**Figure 11.** Ablation on the curvature-aware correction term for training the 4-layer Transformer on text8. E.7 Ablation on the curvature correction. To examine the role of the curvature-aware correction term, we include two additional ablation variants in the deep learning experiments: Adam-SHANG (β = 0, γ = 1) and Adam-SHANG-s (β = 0, γ = 1). In these ablations, the base scale λ is kept the same as in the corresponding …

**Figure 12.** Ablation on the curvature-aware correction term for training ResNet-34 on CIFAR-10.

**Figure 13.** Ablation on the curvature-aware correction term for training ResNet-50 on CIFAR-100.
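
For readers unfamiliar with the counterexample behind Figures 2 and 7 to 9, the following sketch runs plain projected Adam on the synthetic online problem from Reddi et al. [34]: f_t(x) = 1010·x when t mod 101 = 1 and f_t(x) = -10·x otherwise, with x constrained to [-1, 1]. The first gradient of this problem is 1010, which matches the g_0 = 1010 mentioned in the Figure 8 caption, but whether the paper uses exactly this instance is an assumption. The function name `adam_on_reddi_example` and the values of `steps` and `alpha` are illustrative and are not the settings of Appendix E.3.

```python
import math

def adam_on_reddi_example(steps=20000, alpha=1.0, beta1=0.9, beta2=0.99, eps=1e-8):
    """Projected Adam on f_t(x) = 1010*x if t % 101 == 1 else -10*x, with x in [-1, 1]."""
    x, m, v = 0.0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = 1010.0 if t % 101 == 1 else -10.0    # gradient of the linear loss f_t
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** t)           # standard Adam bias correction
        v_hat = v / (1.0 - beta2 ** t)
        x -= (alpha / math.sqrt(t)) * m_hat / (math.sqrt(v_hat) + eps)
        x = max(-1.0, min(1.0, x))               # projection onto [-1, 1]
    return x

x_final = adam_on_reddi_example()
```

Over one period of 101 steps the losses sum to 10·x, so the best fixed point is x = -1. Reddi et al. report that Adam drifts toward x = +1 on this problem, because the rare large gradient is scaled down by the inflated second-moment estimate while the frequent small gradients are not. Comparing `x_final` against -1 shows which way a given configuration moves.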

Original abstract

We propose Adam-SHANG, a Lyapunov-guided Adam-type method that couples momentum, adaptive preconditioning, and a curvature-aware correction through a more stable lagged-preconditioner update. For stochastic smooth convex optimization, we prove convergence in expectation under an admissible stepsize condition that can always be satisfied by a conservative spectral bound, without imposing global monotonicity on the second-moment sequence. To obtain a less conservative practical rule, we introduce a computable trace-ratio stepsize, motivated by a local coordinatewise alignment condition. The same structural update is also tested beyond the convex setting with simplified parameters. Experiments validate the predicted stochastic decay and show competitive training performance against Adam and AdamW on deep learning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Adam-SHANG adds a lagged preconditioner and trace-ratio stepsize to remove the global monotonicity assumption on second moments, but the stability of that preconditioner under stochastic noise is the part that still needs checking.

The main takeaway is that this paper gives a new Adam-type method, Adam-SHANG, built around a lagged preconditioner update and a trace-ratio stepsize rule. It proves expected convergence for stochastic smooth convex problems under an admissible stepsize that can be met by a conservative spectral bound, without forcing the second-moment sequence to be monotone everywhere. That structural choice is the actual novelty; it is not just another parameter tweak on existing Adam variants.

The trace-ratio rule is motivated by a local alignment condition, which is a reasonable way to make the stepsize less conservative in practice. They also run the same update structure on some non-convex deep-learning tasks with simplified parameters and report competitive results against Adam and AdamW. That combination of a distinct lagged structure plus a concrete practical rule is what stands out and what the paper does cleanly. The claim is stated directly and the motivation for dropping the monotonicity requirement is explicit.

On the soft side, the whole argument rests on the lagged preconditioner keeping the effective curvature estimate stable enough for the Lyapunov decrease to go through. If stochastic fluctuations push the lagged estimate outside the region where the inequality holds, the convergence step fails even when the spectral bound is respected. The abstract gives no quantitative control on lag size or on the probability that the local alignment condition holds, so the removal of the monotonicity assumption is not yet fully secured. Experiments are described only at a high level, so it is hard to see how strongly the predicted decay rates are validated or how often the alignment condition is actually met.

This paper is for researchers who work on adaptive stochastic optimizers and want a Lyapunov-based route that relaxes one common assumption. A reader who cares about stepsize rules or moment-sequence conditions in convex settings will get the most from the structural ideas. It deserves a serious referee because the core claim is specific and the machinery is new enough that full review can check the derivation and the experiments directly.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Adam-SHANG, a Lyapunov-guided Adam-type method for stochastic smooth convex optimization that combines momentum, adaptive preconditioning, and a curvature-aware correction via a lagged-preconditioner update. It claims to prove convergence in expectation under an admissible stepsize condition (satisfied by a conservative spectral bound or a practical trace-ratio rule motivated by local coordinatewise alignment), without requiring global monotonicity on the second-moment sequence. The same update is tested in non-convex settings, and experiments are said to validate the predicted O(1/sqrt(T)) decay while showing competitive performance against Adam and AdamW on deep learning tasks.

Significance. If the convergence result holds with verifiable assumptions, the work is significant for relaxing a standard monotonicity assumption in analyses of adaptive stochastic methods through the lagged preconditioner and admissible stepsize framework. The dual provision of conservative and computable stepsize rules, together with the extension beyond convexity, offers both theoretical and practical value. Explicit credit is due for attempting a parameter-free derivation route via the spectral bound and for including empirical checks of the predicted decay.

major comments (2)

[Convergence theorem] Main convergence theorem: the removal of the global monotonicity requirement on the second-moment sequence is load-bearing for the central claim, yet the analysis provides no quantitative bound on the lag parameter or on the probability that the local alignment condition holds under stochastic fluctuations; if the lagged v_t drifts, the Lyapunov decrease may fail even when the spectral bound is respected.
[Theory section] Admissible stepsize condition (abstract and § on theory): the claim that the condition 'can always be satisfied' by the conservative spectral bound is not accompanied by explicit error bounds or a complete derivation showing how the trace-ratio rule avoids circularity with the fitted alignment; this leaves the O(1/sqrt(T)) rate unverified from the given information.

minor comments (2)

[Experiments] Experiments: the validation of predicted stochastic decay is described only at a high level; adding concrete plots or tables with measured rates versus T and explicit comparison metrics would strengthen the empirical support.
[Method] Notation: the lagged-preconditioner update (v_t = beta v_{t-1} + (1-beta) g_t^2 with lag) should be defined with a single consistent equation number and clearly distinguished from the standard Adam second-moment update.
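
For orientation, the distinction requested in the notation comment can be written as follows. The first line is the standard Adam second-moment recursion and step, with bias correction omitted for brevity. The second line is the lagged step as it appears in the appendix excerpt at the end of this page, where the step taken from x_{k+1} uses P_k. How P_k is built from the second-moment estimate in Adam-SHANG is not stated on this page; the diagonal form shown is an assumption for illustration only.

```latex
% Standard Adam: the preconditioner uses v_t, which already contains g_t
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2, \qquad
x_{t+1} = x_t - \eta_t\, \frac{m_t}{\sqrt{v_t} + \epsilon}.

% Lagged step (from the appendix excerpt): P_k does not depend on g_{k+1}
x_{k+1}^{+} = x_{k+1} - \eta_{k+1} P_k^{-1} g_{k+1}, \qquad
P_k = \operatorname{diag}\!\big(\sqrt{v_k} + \epsilon\big) \quad \text{(assumed form)}.
```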

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for recognizing the potential significance of relaxing the global monotonicity assumption via the lagged preconditioner. We address each major comment below and will revise the manuscript to improve clarity on the admissible stepsize framework.

Referee: [Convergence theorem] Main convergence theorem: the removal of the global monotonicity requirement on the second-moment sequence is load-bearing for the central claim, yet the analysis provides no quantitative bound on the lag parameter or on the probability that the local alignment condition holds under stochastic fluctuations; if the lagged v_t drifts, the Lyapunov decrease may fail even when the spectral bound is respected.

Authors: The main theorem establishes convergence in expectation under the admissible stepsize condition, which is formulated to guarantee a sufficient Lyapunov decrease for any fixed lag parameter (chosen as a small constant in the algorithm). The conservative spectral bound is derived from an upper estimate on the operator norm of the preconditioner and holds deterministically, independent of stochastic fluctuations in v_t or any alignment probability. Thus, the Lyapunov decrease is ensured whenever the stepsize satisfies this bound, without requiring quantitative control on the lag or probabilistic statements on local alignment. The local alignment condition is used only to motivate the practical trace-ratio rule and is not invoked in the convergence proof. We will add a clarifying remark in the theory section explaining this deterministic guarantee. revision: partial
Referee: [Theory section] Admissible stepsize condition (abstract and § on theory): the claim that the condition 'can always be satisfied' by the conservative spectral bound is not accompanied by explicit error bounds or a complete derivation showing how the trace-ratio rule avoids circularity with the fitted alignment; this leaves the O(1/sqrt(T)) rate unverified from the given information.

Authors: The conservative spectral bound is constructed by replacing the preconditioner with its maximum possible eigenvalue (a uniform upper bound independent of the specific second-moment estimates), which directly yields an admissible stepsize that satisfies the required inequality by design; we will include the full derivation of this bound in the revised theory section, together with the resulting explicit constants in the O(1/sqrt(T)) rate. The trace-ratio rule is presented as a practical, computable alternative motivated by an empirical local alignment observation and is not used in the proof of the rate; the rate is established solely under the admissible condition satisfied by the spectral bound, avoiding any circularity. We will expand the manuscript to make this separation explicit and add the missing derivation steps. revision: yes
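
To make the dependence on constants concrete, the one-step bound quoted in the appendix excerpt at the end of this page can be unrolled. The sketch below assumes a constant, deterministic α_k ≡ α, which is a simplification made here for illustration and not a choice taken from the paper. It writes E_k for the Lyapunov value at (z_k^+, P_k).

```latex
% One-step bound with constant alpha:
%   E[E_{k+1}] <= (1+alpha)^{-1} E[E_k] + 2 alpha^2 sigma_0^2
\mathbb{E}[\mathcal{E}_K]
  \le (1+\alpha)^{-K}\,\mathcal{E}_0
      + 2\alpha^2\sigma_0^2 \sum_{j=0}^{K-1} (1+\alpha)^{-j}
  \le (1+\alpha)^{-K}\,\mathcal{E}_0 + 2\alpha(1+\alpha)\,\sigma_0^2 .
```

The last step uses the geometric sum bound (1+α)/α. With σ_0 = 0 the bound contracts geometrically. With σ_0 > 0 it leaves a floor of order ασ_0^2, and the choice α = 1/√K gives a floor of order σ_0^2/√K. Whether this corresponds to the O(1/sqrt(T)) rate discussed in the report depends on how the Lyapunov function is scaled in the paper, which is not given on this page.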

Circularity Check

0 steps flagged

No significant circularity in the convergence derivation

The paper's central claim rests on a standard Lyapunov analysis for stochastic smooth convex optimization. The admissible stepsize condition is satisfied independently via a conservative spectral bound (or local trace-ratio alignment), without the convergence rate being defined in terms of itself or a fitted parameter from the same data. The lagged preconditioner update is presented as a structural choice motivated by stability considerations, not a self-definitional renaming or post-hoc fit. No load-bearing self-citation chains or ansatz smuggling reduce the result to its inputs by construction. The derivation remains self-contained against external benchmarks for convex stochastic problems.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of stochastic smooth convex optimization plus one new algorithmic structure (lagged preconditioner) whose stability is justified by the Lyapunov argument rather than by new postulates.

free parameters (1)

admissible stepsize bound
Conservative spectral bound chosen to guarantee the stepsize condition without requiring global monotonicity of the second-moment sequence.

axioms (1)

domain assumption: Stochastic gradients are unbiased with bounded variance; objective is smooth and convex.
Standard assumptions invoked to obtain convergence in expectation for the proposed update.
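
Written out, assumptions of this kind usually take the form below. This is the common textbook statement, given as a sketch. The ledger says "bounded variance", which is the case σ_1 = 0; the Figure 1 caption and the appendix excerpt both use σ_0 and σ_1, which suggests the additive-multiplicative noise model shown in the second line. The exact norms and constants used in the paper are not given on this page.

```latex
% Unbiased stochastic gradient
\mathbb{E}[g(x)] = \nabla f(x),
% Additive (sigma_0) and multiplicative (sigma_1) noise; sigma_1 = 0 is bounded variance
\mathbb{E}\,\|g(x) - \nabla f(x)\|^2 \le \sigma_0^2 + \sigma_1^2\,\|\nabla f(x)\|^2,
% L-smoothness and convexity of the objective
\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|, \qquad
f(y) \ge f(x) + \langle \nabla f(x),\, y - x \rangle .
```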

pith-pipeline@v0.9.0 · 5415 in / 1230 out tokens · 33988 ms · 2026-05-14T18:40:08.667591+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 11 canonical work pages · 1 internal anchor

[1] Ahmet Alacaoglu, Yura Malitsky, and Volkan Cevher. Convergence of adaptive algorithms for weakly convex constrained optimization, 2020. URL https://arxiv.org/abs/2006.06650

[2] Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. Asgo: Adaptive structured gradient optimization, 2025.

[3] Uri M. Ascher, Steven J. Ruuth, and Brian T. R. Wetton. Implicit-explicit methods for time-dependent partial differential equations. SIAM Journal on Numerical Analysis, 32(3):797–823, 1995.

[4] Anas Barakat and Pascal Bianchi. Convergence and dynamical behavior of the adam algorithm for non-convex stochastic optimization, 2020.

[5] Abhinab Bhattacharjee, Andrey A. Popov, Arash Sarshar, and Adrian Sandu. Improving adam through an implicit-explicit (imex) time-stepping approach. Journal of Machine Learning for Modeling and Computing, 5(3):47–68, 2024. ISSN 2689-3967. doi: 10.1615/jmachlearnmodelcomput.2024053508.

[6] Long Chen and Hao Luo. First order optimization methods based on hessian-driven nesterov accelerated gradient flow, 2019.

[7] Long Chen, Luo Hao, and Jingrong Wei. Accelerated gradient methods through variable and operator splitting, 2025.

[8] Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of adam-type algorithms for non-convex optimization, 2019.

[9] André Belotto da Silva and Maxime Gazeau. A general system of differential equations to model first order adaptive algorithms, 2019.

[10] Soham De, Anirbit Mukherjee, and Enayat Ullah. Convergence guarantees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration, 2018.

[11] Steffen Dereich and Arnulf Jentzen. Convergence rates for the adam optimizer, 2024. URL https://arxiv.org/abs/2407.21078

[12] Steffen Dereich, Arnulf Jentzen, and Sebastian Kassing. Ode approximation for the adam algorithm: General and overparametrized setting, 2025.

[13] Steffen Dereich, Arnulf Jentzen, and Adrian Riekert. Sharp higher order convergence rates for the adam optimizer, 2025. URL https://arxiv.org/abs/2504.19426

[14] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.

[15] Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad, 2022.

[16] Rhys Gould and Hidenori Tanaka. Continuous-time analysis of adaptive optimization and normalization, 2024.

[17] Kanan Gupta, Jonathan W. Siegel, and Stephan Wojtowytsch. Nesterov acceleration despite very noisy gradients, 2024. URL https://arxiv.org/abs/2302.05515

[18] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization, 2018.

[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90.

[20] Carlos Heredia. Modeling adagrad, rmsprop, and adam with integro-differential equations, 2025.

[21] Carlos Heredia. From adam to adam-like lagrangians: Second-order nonlocal dynamics, 2026.

[22] Haiwen Huang, Chang Wang, and Bin Dong. Nostalgic adam: Weighting more of the past gradients when designing the adaptive learning rate. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-2019, page 2556–2562. International Joint Conferences on Artificial Intelligence Organization, August 2019. doi: 10.24963/ijcai.2019/355.

[23] Yiming Jiang, Jinlan Liu, Dongpo Xu, and Danilo P. Mandic. Uadam: Unified adam-type algorithmic framework for nonconvex optimization. Neural Computation, 36(9):1912–1938, August 2024. ISSN 1530-888X. doi: 10.1162/neco_a_01692. URL http://dx.doi.org/10.1162/neco_a_01692

[24] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

[25] Dmitry Kovalev. Sgd with adaptive preconditioning: Unified analysis and momentum acceleration, 2025.

[26] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.

[27] Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations, 2018.

[28] Yuxing Liu, Rui Pan, and Tong Zhang. Adagrad under anisotropic smoothness, 2024.

[29] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

[30] Chao Ma, Lei Wu, and Weinan E. A qualitative study of the dynamic behavior for adaptive gradient algorithms, 2021.

[31] Matt Mahoney. Large text compression benchmark. http://www.mattmahoney.net/dc/text.html, 2009. Accessed: 2025.

[32] Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. On the sdes and scaling rules for adaptive gradient algorithms, 2024.

[33] D. S. Mitrinović, J. E. Pečarić, and A. M. Fink. Classical and New Inequalities in Analysis. Kluwer Academic Publishers, Dordrecht, 1993.

[34] Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond, 2019.

[35] Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. Adopt: Modified adam can converge with any β2 with the optimal rate, 2024. URL https://arxiv.org/abs/2411.02853

[36] Qianqian Tong, Guannan Liang, and Jinbo Bi. Calibrating the adaptive learning rate to improve convergence of adam. Neurocomputing, 481:333–356, 2022. ISSN 0925-2312. doi: 10.1016/j.neucom.2022.01.014. URL https://www.sciencedirect.com/science/article/pii/S0925231222000340
[37] Stepan Trifonov, Leonid Levin, Savelii Chezhegov, and Aleksandr Beznosikov. Incorporating preconditioning into accelerated approaches: Theoretical guarantees and practical improvement. URL https://arxiv.org/abs/2505.23510

[38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023.

[39] Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, and Zhiyuan Li. Structured preconditioners in adaptive optimization: A unified analysis, 2025.

[40] Yaxin Yu, Long Chen, and Minfu Feng. Shang++: Robust stochastic acceleration under multiplicative noise, 2026.

[41] Yaxin Yu, Long Chen, and Zeyi Xu. Adam-hnag: A convergent reformulation of adam with accelerated rate, 2026. URL https://arxiv.org/abs/2604.08742

[42] Dongruo Zhou, Jinghui Chen, Yuan Cao, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization, 2024.

[43] Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of adam and rmsprop, 2019.

A Convergence analysis of Adam-SHANG

A.1 Proof of Lemma 2.1

Proof of Lemma 2.1. By $L$-smoothness and $x_{k+1}^{+} - x_{k+1} = -\eta_{k+1} P_k^{-1} g_{k+1}$, we have

$$f(x_{k+1}^{+}) \le f(x_{k+1}) - \eta_{k+1} \langle \nabla f(x_{k+1}), g_{k+1} \rangle_{P_k^{-1}} + \frac{L \eta_{k+1}^2}{2} \|g_{k+1}\|_{P_k^{-2}}^2 .$$

Taking expect...

$$\ldots\; \|g_{k+1}\|_{P_k^{-1}}^2 + \frac{\eta_{k+1}\sigma_0^2}{1+\sigma_1^2} + \frac{\alpha_k\gamma_k}{2} \|y_{k+1} - x^\star\|_{P_k^{-1} G_{k+1}^2}^2 + \frac{\alpha_k^2}{2} \|g_{k+1}\|_{P_k^{-1}}^2 .$$

Applying Lemma 2.1 at $x_{k+1}$ and multiplying by $-\alpha_k$, we obtain

$$\mathbb{E}\big[-\alpha_k \mathcal{E}(z_{k+1}, P_{k+1})\big] \le \mathbb{E}\Big[-\alpha_k \mathcal{E}(z_{k+1}^{+}, P_{k+1}) - \frac{\alpha_k\eta_{k+1}}{2(1+\sigma_1^2)} \|g_{k+1}\|_{P_k^{-1}}^2 + \frac{\alpha_k\eta_{k+1}\sigma_0^2}{1+\sigma_1^2}\Big] .$$

Using $\sup_k \|y_k - x^\star\|_\infty \le R$ and $\gamma_k = \alpha_k/R^2$, we further have

$$\frac{\alpha_k\gamma_k}{2} \|y_{k+1} - x^\star\|_{P_k^{-1} G_{k+1}^2}^2 = \frac{\alpha_k^2}{2R^2} \sum_{i=1}^{d} \big(y_{k+1,i} - x_i^\star\big)^2 \big(P_k^{-1}\big)_{ii}\, g_{k+1,i}^2 \le \frac{\alpha_k^2}{2} \|g_{k+1}\|_{P_k^{-1}}^2 .$$

With the chosen parameters $2\alpha_k^2(1+\sigma_1^2) = \eta_{k+1}$,

$$2\alpha_k^2 - (1+\alpha_k)\frac{\eta_{k+1}}{1+\sigma_1^2} \le 0,$$

so the $g_{k+1}$-weighted terms are nonpositive and can be dropped. Therefore,

$$\mathbb{E}\big[\mathcal{E}(z_{k+1}^{+}, P_{k+1})\big] \le \mathbb{E}\Big[\frac{1}{1+\alpha_k}\, \mathcal{E}(z_k^{+}, P_k) + 2\alpha_k^2\sigma_0^2\Big] .$$

This yields the claimed one-step bound, and iterating the recursion proves the theorem.

Remark A.1. For analytical tractability, we assume that there exists $R > 0$ such that $\sup_{k \ge 0} \|y_k - x$...

... $\ge 1$. (30)

The admissibility condition is satisfied whenever Ratio $\ge 1$.

Figure 5: Empirical verification of the admissibility condition for Adam-SHANG. The ratio should be not less than 1.

Observation. In all tested cases, the monitored ratio remains above 1 throughout the whole trajectory. Thus, with the safety factor $\lambda = 0.5$ in the practical rule (13), we do not ...