pith. machine review for the scientific record.

arxiv: 2605.12878 · v1 · submitted 2026-05-13 · 🧮 math.OC · cs.LG

Recognition: unknown

Adam-SHANG: A Convergent Adam-Type Method for Stochastic Smooth Convex Optimization

Long Chen, Minfu Feng, Yaxin Yu

Pith reviewed 2026-05-14 18:40 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG
keywords Adam optimizer · stochastic convex optimization · convergence in expectation · adaptive preconditioning · lagged preconditioner · Lyapunov analysis · momentum methods

The pith

Adam-SHANG converges in expectation for stochastic smooth convex optimization under a flexible stepsize condition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Adam-SHANG as an Adam-type optimizer that adds a lagged preconditioner to stabilize momentum and adaptive preconditioning. It proves convergence in expectation for stochastic smooth convex problems by showing that an admissible stepsize rule, which can always be met via a conservative spectral bound, produces a Lyapunov decrease. This removes the need to assume global monotonicity on the second-moment sequence, a condition often hard to verify in practice. The work also supplies a computable trace-ratio stepsize based on local alignment and tests the structure outside the convex case.

Core claim

Adam-SHANG couples momentum, adaptive preconditioning, and a curvature-aware correction through a more stable lagged-preconditioner update. For stochastic smooth convex optimization, the method converges in expectation under an admissible stepsize condition satisfiable by a conservative spectral bound, without requiring global monotonicity on the second-moment sequence. A trace-ratio stepsize rule motivated by local coordinatewise alignment offers a less conservative practical choice.

What carries the argument

The lagged-preconditioner update that stabilizes the curvature estimate while coupling momentum and adaptive steps.
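
The paper's update is not reproduced on this page, so the following is a minimal illustrative sketch of what an Adam-type step with a lagged diagonal preconditioner can look like, consistent with the extracted proof fragment x⁺_{k+1} − x_{k+1} = −η_{k+1} P_k⁻¹ g_{k+1}: the preconditioner is frozen before the current gradient is drawn. The class name, the hyperparameters, the flooring of the preconditioner, and the omission of the curvature-aware correction are assumptions of this sketch, not the authors' algorithm.

```python
import numpy as np

class LaggedAdamSketch:
    """Illustrative Adam-type step with a lagged diagonal preconditioner (sketch only).

    The preconditioner is built from the second-moment estimate as it stood
    *before* the current gradient arrives, so the current gradient does not
    rescale its own step. The curvature-aware correction of Adam-SHANG is
    omitted and all hyperparameters are placeholders, not the paper's values.
    """

    def __init__(self, dim, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = np.zeros(dim)   # first moment (momentum)
        self.v = np.zeros(dim)   # running second-moment estimate

    def step(self, x, grad):
        # Lagged preconditioner: use self.v from the previous iteration,
        # floored so the very first step is not degenerate (a sketch choice).
        precond = np.sqrt(np.maximum(self.v, 1e-4)) + self.eps
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad      # momentum update
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad**2   # standard second-moment update
        return x - self.lr * self.m / precond                       # step scaled by the lagged estimate
```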

If this is right

  • Expected convergence holds whenever the conservative spectral bound is respected.
  • The trace-ratio stepsize yields a practical rule under local coordinatewise alignment.
  • The same lagged update structure applies beyond convex problems with simplified parameters.
  • Experiments confirm the predicted stochastic decay rates and show competitive performance versus Adam and AdamW on deep learning tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The relaxed monotonicity requirement may explain why Adam-type methods often succeed empirically even when classical analyses do not apply.
  • Lyapunov constructions centered on lagged preconditioners could be reused to analyze other adaptive first-order methods.
  • In non-convex deep learning, focusing on local curvature stability rather than global second-moment behavior might guide more reliable stepsize selection.

Load-bearing premise

The lagged preconditioner must keep the effective curvature estimate stable enough that the chosen stepsize satisfies the admissibility condition at every iteration.

What would settle it

A simple strongly convex quadratic where the preconditioner produces steps that violate the spectral bound and the expected objective value stops decreasing toward the minimum.
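
A hedged sketch of how such a check could be run is below. The quadratic, the noise level, the inlined lagged Adam-type update, and the two stepsizes are placeholders chosen for illustration; the "conservative" threshold here merely stands in for the paper's actual admissibility condition, which is not reproduced on this page.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
A = np.diag(np.linspace(1.0, 10.0, d))   # strongly convex quadratic; L = lambda_max(A) = 10
f = lambda x: 0.5 * x @ A @ x

def mean_final_objective(lr, steps=2000, reps=20, noise=0.1,
                         beta1=0.9, beta2=0.999, eps=1e-8):
    """Mean final f(x) over noisy runs of a lagged Adam-type sketch (not the authors' code)."""
    finals = []
    for _ in range(reps):
        x, m, v = np.ones(d), np.zeros(d), np.zeros(d)
        for _ in range(steps):
            g = A @ x + noise * rng.standard_normal(d)    # unbiased stochastic gradient
            precond = np.sqrt(np.maximum(v, 1e-4)) + eps  # preconditioner lags one step behind g
            m = beta1 * m + (1 - beta1) * g               # momentum update
            v = beta2 * v + (1 - beta2) * g**2            # second-moment update (used next step)
            x = x - lr * m / precond                      # lagged preconditioned step
        finals.append(f(x))
    return float(np.mean(finals))

print("stepsize within the placeholder bound :", mean_final_objective(lr=0.05))
print("stepsize well outside the bound       :", mean_final_objective(lr=5.0))
```

If the expected objective for the small stepsize keeps decreasing toward the minimum while the aggressive stepsize stalls far above it, the behavior matches the theory; a counterexample of the kind described above would show the expected objective failing to decrease even when the bound is respected.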

Figures

Figures reproduced from arXiv: 2605.12878 by Long Chen, Minfu Feng, Yaxin Yu.

Figure 1: Convex optimization benchmark with d = 16. Top: pure multiplicative noise with σ0 = 0 and σ1 ∈ {0, 10, 30}. Bottom: additive-multiplicative noise with σ1 = 10 and σ0 ∈ {0.5, 1, 3}. For Adam, we use the decaying stepsize ℓ0/√(k+1), where ℓ0 is selected by grid search separately for each setting. Full hyperparameter settings and implementation details are provided in Appendix E.1.
Figure 2: Iterate trajectories xt on the Reddi et al. [34] counterexample. Left: deterministic sequence. Right: stochastic sequence (median over 30 independent runs). Hyperparameter settings and full results, including average regret curves, are provided in Appendix E.3.
Figure 3: Validation loss for Transformer language modeling on text8. All methods are trained for 30,000 steps under batch sizes B ∈ {32, 128} to examine sensitivity to gradient variance and effective data throughput. Model architecture, data split, and hyperparameter settings are provided in Appendix E.4.
Figure 4: Both Adam-SHANG and Adam-SHANG-s achieve final test accuracies competitive with Adam and AdamW, while reaching high accuracy earlier. At batch size 32, Adam-SHANG attains the baseline accuracy level in roughly half the number of steps, suggesting that the trace-ratio stepsize accelerates the transient optimization phase by extracting useful stepsize information from the evolving preconditioner.
Figure 5: Empirical verification of the admissibility condition for Adam-SHANG.
Figure 6: Ordering violation rate for Assumption 2.2.
Figure 7: The deterministic version of the counterexample from Reddi et al. [34]. Each panel shows average regret Rt/t (left) and iterate trajectory xt (right).
Figure 8: The stochastic version of the counterexample from Reddi et al. [34]. Each panel reports the mean average regret Rt/t (left) and the mean iterate trajectory xt (right) over 30 independent runs. The large initial regret of Adam-SHANG reflects a small fraction of runs with an anomalously large first step caused by g0 = 1010 at initialization; this is a non-typical artifact, as confirmed by the median.
Figure 9: The stochastic version of the counterexample from Reddi et al. [34]. Each panel reports the median average regret Rt/t (left) and the median iterate trajectory xt (right) over 30 independent runs.
Figure 10: Training loss and test accuracy of different optimizers for training ResNet-34 on CIFAR-10.
Figure 11: Ablation on the curvature-aware correction term for training the 4-layer Transformer on text8.
Figure 12: Ablation on the curvature-aware correction term for training ResNet-34 on CIFAR-10.
Figure 13: Ablation on the curvature-aware correction term for training ResNet-50 on CIFAR-100.
read the original abstract

We propose Adam-SHANG, a Lyapunov-guided Adam-type method that couples momentum, adaptive preconditioning, and a curvature-aware correction through a more stable lagged-preconditioner update. For stochastic smooth convex optimization, we prove convergence in expectation under an admissible stepsize condition that can always be satisfied by a conservative spectral bound, without imposing global monotonicity on the second-moment sequence. To obtain a less conservative practical rule, we introduce a computable trace-ratio stepsize, motivated by a local coordinatewise alignment condition. The same structural update is also tested beyond the convex setting with simplified parameters. Experiments validate the predicted stochastic decay and show competitive training performance against Adam and AdamW on deep learning tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Adam-SHANG, a Lyapunov-guided Adam-type method for stochastic smooth convex optimization that combines momentum, adaptive preconditioning, and a curvature-aware correction via a lagged-preconditioner update. It claims to prove convergence in expectation under an admissible stepsize condition (satisfied by a conservative spectral bound or a practical trace-ratio rule motivated by local coordinatewise alignment), without requiring global monotonicity on the second-moment sequence. The same update is tested in non-convex settings, and experiments are said to validate the predicted O(1/sqrt(T)) decay while showing competitive performance against Adam and AdamW on deep learning tasks.

Significance. If the convergence result holds with verifiable assumptions, the work is significant for relaxing a standard monotonicity assumption in analyses of adaptive stochastic methods through the lagged preconditioner and admissible stepsize framework. The dual provision of conservative and computable stepsize rules, together with the extension beyond convexity, offers both theoretical and practical value. Explicit credit is due for attempting a parameter-free derivation route via the spectral bound and for including empirical checks of the predicted decay.

major comments (2)
  1. [Convergence theorem] Main convergence theorem: the removal of the global monotonicity requirement on the second-moment sequence is load-bearing for the central claim, yet the analysis provides no quantitative bound on the lag parameter or on the probability that the local alignment condition holds under stochastic fluctuations; if the lagged v_t drifts, the Lyapunov decrease may fail even when the spectral bound is respected.
  2. [Theory section] Admissible stepsize condition (abstract and § on theory): the claim that the condition 'can always be satisfied' by the conservative spectral bound is not accompanied by explicit error bounds or a complete derivation showing how the trace-ratio rule avoids circularity with the fitted alignment; this leaves the O(1/sqrt(T)) rate unverified from the given information.
minor comments (2)
  1. [Experiments] Experiments: the validation of predicted stochastic decay is described only at high level; adding concrete plots or tables with measured rates versus T and explicit comparison metrics would strengthen the empirical support.
  2. [Method] Notation: the lagged-preconditioner update (v_t = beta v_{t-1} + (1-beta) g_t^2 with lag) should be defined with a single consistent equation number and clearly distinguished from the standard Adam second-moment update.
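
For reference, the distinction the second minor comment asks for can be written schematically. This is a hedged editorial reconstruction from the review text and the extracted proof fragment, not the manuscript's numbered equations.

```latex
% Standard Adam (schematic, bias correction omitted): the current gradient
% g_{k+1} enters the preconditioner that scales its own step.
\[
  v_{k+1} = \beta_2 v_k + (1-\beta_2)\, g_{k+1}^{2}, \qquad
  x_{k+1} = x_k - \eta_{k+1}\, \frac{m_{k+1}}{\sqrt{v_{k+1}} + \epsilon}.
\]
% Lagged-preconditioner step, as read off the extracted fragment
% x^{+}_{k+1} - x_{k+1} = -\eta_{k+1} P_k^{-1} g_{k+1}: the step is scaled by
% P_k, which depends only on information available before g_{k+1} is drawn.
\[
  x^{+}_{k+1} = x_{k+1} - \eta_{k+1}\, P_k^{-1} g_{k+1},
  \qquad
  P_k = \operatorname{diag}\!\bigl(\sqrt{v_k} + \epsilon\bigr)
  \quad \text{(illustrative choice, not the paper's definition).}
\]
```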

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for recognizing the potential significance of relaxing the global monotonicity assumption via the lagged preconditioner. We address each major comment below and will revise the manuscript to improve clarity on the admissible stepsize framework.

read point-by-point responses
  1. Referee: [Convergence theorem] Main convergence theorem: the removal of the global monotonicity requirement on the second-moment sequence is load-bearing for the central claim, yet the analysis provides no quantitative bound on the lag parameter or on the probability that the local alignment condition holds under stochastic fluctuations; if the lagged v_t drifts, the Lyapunov decrease may fail even when the spectral bound is respected.

    Authors: The main theorem establishes convergence in expectation under the admissible stepsize condition, which is formulated to guarantee a sufficient Lyapunov decrease for any fixed lag parameter (chosen as a small constant in the algorithm). The conservative spectral bound is derived from an upper estimate on the operator norm of the preconditioner and holds deterministically, independent of stochastic fluctuations in v_t or any alignment probability. Thus, the Lyapunov decrease is ensured whenever the stepsize satisfies this bound, without requiring quantitative control on the lag or probabilistic statements on local alignment. The local alignment condition is used only to motivate the practical trace-ratio rule and is not invoked in the convergence proof. We will add a clarifying remark in the theory section explaining this deterministic guarantee. revision: partial

  2. Referee: [Theory section] Admissible stepsize condition (abstract and § on theory): the claim that the condition 'can always be satisfied' by the conservative spectral bound is not accompanied by explicit error bounds or a complete derivation showing how the trace-ratio rule avoids circularity with the fitted alignment; this leaves the O(1/sqrt(T)) rate unverified from the given information.

    Authors: The conservative spectral bound is constructed by replacing the preconditioner with its maximum possible eigenvalue (a uniform upper bound independent of the specific second-moment estimates), which directly yields an admissible stepsize that satisfies the required inequality by design; we will include the full derivation of this bound in the revised theory section, together with the resulting explicit constants in the O(1/sqrt(T)) rate. The trace-ratio rule is presented as a practical, computable alternative motivated by an empirical local alignment observation and is not used in the proof of the rate; the rate is established solely under the admissible condition satisfied by the spectral bound, avoiding any circularity. We will expand the manuscript to make this separation explicit and add the missing derivation steps. revision: yes
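
To illustrate the kind of argument the rebuttal describes (an editorial sketch under stated assumptions, not the authors' derivation): once the preconditioner is confined to fixed spectral limits, a deterministic stepsize choice follows regardless of the realized second-moment estimates.

```latex
% Hedged sketch. Suppose the diagonal preconditioner is kept (by construction
% or by clipping) within fixed spectral limits
%   \epsilon I \preceq P_k \preceq \Lambda I  for all k,
% so that \|P_k^{-1}\|_2 \le 1/\epsilon uniformly. For an L-smooth objective,
% a deterministic stepsize of the form
\[
   \eta_{k+1} \;\le\; c\,\frac{\epsilon}{L}
   \qquad (c \ \text{a small absolute constant})
\]
% guarantees \|\eta_{k+1} P_k^{-1} g_{k+1}\| \le (c/L)\,\|g_{k+1}\|, i.e. the
% preconditioned step is never longer than a plain gradient step with stepsize
% c/L, independently of the stochastic fluctuations in v_k. The paper's actual
% admissibility condition and constants differ.
```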

Circularity Check

0 steps flagged

No significant circularity in the convergence derivation

full rationale

The paper's central claim rests on a standard Lyapunov analysis for stochastic smooth convex optimization. The admissible stepsize condition is satisfied independently via a conservative spectral bound (or local trace-ratio alignment), without the convergence rate being defined in terms of itself or a fitted parameter from the same data. The lagged preconditioner update is presented as a structural choice motivated by stability considerations, not a self-definitional renaming or post-hoc fit. No load-bearing self-citation chains or ansatz smuggling reduce the result to its inputs by construction. The derivation remains self-contained against external benchmarks for convex stochastic problems.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions of stochastic smooth convex optimization plus one new algorithmic structure (lagged preconditioner) whose stability is justified by the Lyapunov argument rather than by new postulates.

free parameters (1)
  • admissible stepsize bound
    Conservative spectral bound chosen to guarantee the stepsize condition without requiring global monotonicity of the second-moment sequence.
axioms (1)
  • domain assumption Stochastic gradients are unbiased with bounded variance; objective is smooth and convex.
    Standard assumptions invoked to obtain convergence in expectation for the proposed update.
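
For completeness, the usual form of these assumptions, written with the additive/multiplicative noise parameters σ0, σ1 that appear in the experiments; the paper's exact statement may differ.

```latex
% Hedged reconstruction of the domain assumptions (the paper's exact
% statement is not reproduced on this page).
\[
  \mathbb{E}\bigl[g_k \mid x_k\bigr] = \nabla f(x_k), \qquad
  \mathbb{E}\bigl[\|g_k - \nabla f(x_k)\|^2 \mid x_k\bigr]
     \;\le\; \sigma_0^2 + \sigma_1^2\,\|\nabla f(x_k)\|^2,
\]
\[
  f \ \text{convex}, \qquad
  \|\nabla f(x) - \nabla f(y)\| \;\le\; L\,\|x - y\| \quad \text{for all } x, y.
\]
```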

pith-pipeline@v0.9.0 · 5415 in / 1230 out tokens · 33988 ms · 2026-05-14T18:40:08.667591+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

48 extracted references · 11 canonical work pages · 1 internal anchor

  1. [1]

    Convergence of adaptive algorithms for weakly convex constrained optimization, 2020

    Ahmet Alacaoglu, Yura Malitsky, and Volkan Cevher. Convergence of adaptive algorithms for weakly convex constrained optimization, 2020. URL https://arxiv.org/abs/2006.06650

  2. [2]

    Asgo: Adaptive structured gradient optimization, 2025

    Kang An, Yuxing Liu, Rui Pan, Yi Ren, Shiqian Ma, Donald Goldfarb, and Tong Zhang. Asgo: Adaptive structured gradient optimization, 2025

  3. [3]

    Implicit-explicit methods for time-dependent partial differential equations, 1995

    Uri M. Ascher, Steven J. Ruuth, and Brian T. R. Wetton. Implicit-explicit methods for time-dependent partial differential equations. SIAM Journal on Numerical Analysis, 32(3):797–823, 1995

  4. [4]

    Convergence and dynamical behavior of the adam algorithm for non-convex stochastic optimization, 2020

    Anas Barakat and Pascal Bianchi. Convergence and dynamical behavior of the adam algorithm for non-convex stochastic optimization, 2020

  5. [5]

    Improving adam through an implicit-explicit (imex) time-stepping approach, 2024

    Abhinab Bhattacharjee, Andrey A. Popov, Arash Sarshar, and Adrian Sandu. Improving adam through an implicit-explicit (imex) time-stepping approach. Journal of Machine Learning for Modeling and Computing, 5(3):47–68, 2024. ISSN 2689-3967. doi: 10.1615/jmachlearnmodelcomput.2024053508

  6. [6]

    First order optimization methods based on hessian-driven nesterov accelerated gradient flow, 2019

    Long Chen and Hao Luo. First order optimization methods based on hessian-driven nesterov accelerated gradient flow, 2019

  7. [7]

    Accelerated gradient methods through variable and operator splitting, 2025

    Long Chen, Hao Luo, and Jingrong Wei. Accelerated gradient methods through variable and operator splitting, 2025

  8. [8]

    On the convergence of a class of adam-type algorithms for non-convex optimization, 2019

    Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of adam-type algorithms for non-convex optimization, 2019

  9. [9]

    A general system of differential equations to model first order adaptive algorithms, 2019

    André Belotto da Silva and Maxime Gazeau. A general system of differential equations to model first order adaptive algorithms, 2019

  10. [10]

    Convergence guarantees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration, 2018

    Soham De, Anirbit Mukherjee, and Enayat Ullah. Convergence guarantees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration, 2018

  11. [11]

    Convergence rates for the adam optimizer, 2024

    Steffen Dereich and Arnulf Jentzen. Convergence rates for the adam optimizer, 2024. URL https://arxiv.org/abs/2407.21078

  12. [12]

    Ode approximation for the adam algorithm: General and overparametrized setting, 2025

    Steffen Dereich, Arnulf Jentzen, and Sebastian Kassing. Ode approximation for the adam algorithm: General and overparametrized setting, 2025

  13. [13]

    Sharp higher order convergence rates for the adam optimizer, 2025

    Steffen Dereich, Arnulf Jentzen, and Adrian Riekert. Sharp higher order convergence rates for the adam optimizer, 2025. URL https://arxiv.org/abs/2504.19426

  14. [14]

    Adaptive subgradient methods for online learning and stochastic optimization, 2011

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011

  15. [15]

    A simple convergence proof of adam and adagrad, 2022

    Alexandre Défossez, Léon Bottou, Francis Bach, and Nicolas Usunier. A simple convergence proof of adam and adagrad, 2022

  16. [16]

    Continuous-time analysis of adaptive optimization and normalization, 2024

    Rhys Gould and Hidenori Tanaka. Continuous-time analysis of adaptive optimization and normalization, 2024

  17. [17]

    Nesterov acceleration despite very noisy gradients, 2024

    Kanan Gupta, Jonathan W. Siegel, and Stephan Wojtowytsch. Nesterov acceleration despite very noisy gradients, 2024. URL https://arxiv.org/abs/2302.05515

  18. [18]

    Shampoo: Preconditioned stochastic tensor optimization, 2018

    Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization, 2018

  19. [19]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016. doi: 10.1109/CVPR.2016.90

  20. [20]

    Modeling adagrad, rmsprop, and adam with integro-differential equations, 2025

    Carlos Heredia. Modeling adagrad, rmsprop, and adam with integro-differential equations, 2025

  21. [21]

    From adam to adam-like lagrangians: Second-order nonlocal dynamics, 2026

    Carlos Heredia. From adam to adam-like lagrangians: Second-order nonlocal dynamics, 2026

  22. [22]

    Nostalgic adam: Weighting more of the past gradients when designing the adaptive learning rate

    Haiwen Huang, Chang Wang, and Bin Dong. Nostalgic adam: Weighting more of the past gradients when designing the adaptive learning rate. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-2019, pages 2556–2562. International Joint Conferences on Artificial Intelligence Organization, August 2019. doi: 10.249...

  23. [23]

    Yiming Jiang, Jinlan Liu, Dongpo Xu, and Danilo P. Mandic. Uadam: Unified adam-type algorithmic framework for nonconvex optimization. Neural Computation, 36(9):1912–1938, August 2024. ISSN 1530-888X. doi: 10.1162/neco_a_01692. URL http://dx.doi.org/10.1162/neco_a_01692

  24. [24]

    Adam: A method for stochastic optimization, 2017

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017

  25. [25]

    Sgd with adaptive preconditioning: Unified analysis and momentum acceleration, 2025

    Dmitry Kovalev. Sgd with adaptive preconditioning: Unified analysis and momentum acceleration, 2025

  26. [26]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009

  27. [27]

    Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations, 2018

    Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations, 2018

  28. [28]

    Adagrad under anisotropic smoothness, 2024

    Yuxing Liu, Rui Pan, and Tong Zhang. Adagrad under anisotropic smoothness, 2024

  29. [29]

    Decoupled weight decay regularization, 2019

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

  30. [30]

    A qualitative study of the dynamic behavior for adaptive gradient algorithms, 2021

    Chao Ma, Lei Wu, and Weinan E. A qualitative study of the dynamic behavior for adaptive gradient algorithms, 2021

  31. [31]

    Large text compression benchmark

    Matt Mahoney. Large text compression benchmark. http://www.mattmahoney.net/dc/text.html, 2009. Accessed: 2025

  32. [32]

    On the sdes and scaling rules for adaptive gradient algorithms, 2024

    Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, and Sanjeev Arora. On the sdes and scaling rules for adaptive gradient algorithms, 2024

  33. [33]

    D. S. Mitrinović, J. E. Pečarić, and A. M. Fink. Classical and New Inequalities in Analysis. Kluwer Academic Publishers, Dordrecht, 1993

  34. [34]

    On the convergence of adam and beyond, 2019

    Sashank J. Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond, 2019

  35. [35]

    Adopt: Modified adam can converge with any β2 with the optimal rate, 2024

    Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, and Yutaka Matsuo. Adopt: Modified adam can converge with any β2 with the optimal rate, 2024. URL https://arxiv.org/abs/2411.02853

  36. [36]

    Calibrating the adaptive learning rate to improve convergence of adam. Neurocomputing, 481:333–356, 2022

    Qianqian Tong, Guannan Liang, and Jinbo Bi. Calibrating the adaptive learning rate to improve convergence of adam. Neurocomputing, 481:333–356, 2022. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2022.01.014. URL https://www.sciencedirect.com/science/article/pii/S0925231222000340

  37. [37]

    Incorporating preconditioning into accelerated approaches: Theoretical guarantees and practical improvement

    Stepan Trifonov, Leonid Levin, Savelii Chezhegov, and Aleksandr Beznosikov. Incorporating preconditioning into accelerated approaches: Theoretical guarantees and practical improvement

  38. [38]

    URL https://arxiv.org/abs/2505.23510

  39. [39]

    Attention is all you need, 2023

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need, 2023

  40. [40]

    Structured preconditioners in adaptive optimization: A unified analysis, 2025

    Shuo Xie, Tianhao Wang, Sashank Reddi, Sanjiv Kumar, and Zhiyuan Li. Structured preconditioners in adaptive optimization: A unified analysis, 2025

  41. [41]

    Shang++: Robust stochastic acceleration under multiplicative noise, 2026

    Yaxin Yu, Long Chen, and Minfu Feng. Shang++: Robust stochastic acceleration under multiplicative noise, 2026

  42. [42]

    Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate

    Yaxin Yu, Long Chen, and Zeyi Xu. Adam-hnag: A convergent reformulation of adam with accelerated rate, 2026. URL https://arxiv.org/abs/2604.08742

  43. [43]

    On the convergence of adaptive gradient methods for nonconvex optimization, 2024

    Dongruo Zhou, Jinghui Chen, Yuan Cao, Ziyan Yang, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization, 2024

  44. [44]

    A sufficient condition for convergences of adam and rmsprop, 2019

    Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, and Wei Liu. A sufficient condition for convergences of adam and rmsprop, 2019


  48. [48]

    Figure 5: Empirical verification of the admissibility condition for Adam-SHANG (internal anchor)

    The admissibility condition (30) is satisfied whenever the monitored ratio is at least 1. Observation: in all tested cases, the monitored ratio remains above 1 throughout the whole trajectory with the safety factor λ = 0.5 in the practical rule (13).
    ≥1.(30) The admissibility condition is satisfied wheneverRatio≥1. Figure 5: Empirical verification of the admissibility condition forAdam-SHANG. The ratio should be not less than1. Observation.In all tested cases, the monitored ratio remains above 1 throughout the whole trajectory. Thus, with the safety factor λ= 0.5 in the practical rule (13), we do not ...