
On the Convergence of Adam and Beyond

Sashank J Reddi, Satyen Kale, Sanjiv Kumar · 2019 · cs.LG · arXiv 1904.09237

15 Pith papers cite this work. Polarity classification is still indexing.

abstract

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g., learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause of such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.
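
To make the abstract's distinction concrete, here is a minimal scalar sketch contrasting an update scaled by an exponential moving average of squared gradients with one that also keeps a running maximum of that average as a form of "long-term memory". The function names, the state dictionary, and the default hyperparameters are assumptions chosen for illustration, and bias correction is omitted for brevity; this is not a reproduction of the paper's algorithms.

```python
import math


def new_state():
    # Hypothetical container for the running statistics of one parameter.
    return {"m": 0.0, "v": 0.0, "v_hat": 0.0}


def adaptive_step(x, grad, state, lr=0.1, beta1=0.9, beta2=0.99,
                  eps=1e-8, long_term_memory=False):
    """Apply one scalar update and return the new parameter value."""
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    if long_term_memory:
        # Keep the largest second-moment estimate seen so far, so the
        # scaling term can never shrink between steps.
        state["v_hat"] = max(state["v_hat"], state["v"])
    else:
        # Plain moving average: the scaling term forgets old gradients.
        state["v_hat"] = state["v"]
    return x - lr * state["m"] / (math.sqrt(state["v_hat"]) + eps)


def trace_v_hat(grads, long_term_memory):
    """Run the update over a gradient sequence and record the scaling term."""
    state = new_state()
    x = 0.0
    trace = []
    for g in grads:
        x = adaptive_step(x, g, state, long_term_memory=long_term_memory)
        trace.append(state["v_hat"])
    return trace


if __name__ == "__main__":
    grads = [10.0, 0.1, 0.1, 0.1]  # one large gradient, then small ones
    print("moving average only:", trace_v_hat(grads, False))
    print("with running maximum:", trace_v_hat(grads, True))
```

With one large gradient followed by small ones, the moving-average scaling term starts to shrink as soon as the large gradient recedes into the past, so the effective step size grows again. The running-maximum version holds the scaling term at its peak, which keeps the effective step size from increasing.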

representative citing papers

Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate

math.OC · 2026-04-09 · unverdicted · novelty 8.0

Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.

VLTI/PIONIER imaging of post-AGB binaries. An INSPIRING hunt for inner rim substructures in circumbinary discs

astro-ph.SR · 2026-05-13 · unverdicted · novelty 7.0

High-resolution interferometric imaging of eight post-AGB circumbinary discs reveals diverse inner-rim substructures including azimuthal brightness enhancements and arc-like features not explained by inclination alone.

FIBER: A Differentially Private Optimizer with Filter-Aware Innovation Bias Correction

cs.LG · 2026-05-05 · unverdicted · novelty 7.0

FiBeR adds a closed-form filter-aware correction A(ω)σ_w² to the second-moment term for temporally filtered DP gradients, improving adaptive optimization performance.

BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition

cs.CV · 2026-04-14 · unverdicted · novelty 7.0

BarbieGait is a new synthetic gait dataset with identity-consistent cloth changes paired with the GaitCLIF model that improves cross-clothing recognition on the new data and existing benchmarks.

VISTA: Decentralized Machine Learning in Adversary Dominated Environments

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

VISTA adaptively tunes consistency thresholds in decentralized SGD so that the system converges asymptotically like standard SGD even when adversaries dominate the worker pool.

Low-Order Explicit Hessian Imitation Method for Large-Scale Supervised Machine Learning

math.OC · 2026-05-07 · unverdicted · novelty 6.0

New optimizer uses auxiliary loss to imitate low-order Hessian information, replacing gradient squares in Adam-like training with convergence guarantee and some experimental gains.

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

Anon: Extrapolating Adaptivity Beyond SGD and Adam

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

Anon optimizer uses tunable adaptivity and incremental delay update to achieve convergence guarantees and outperform existing methods on image classification, diffusion, and language modeling tasks.

A Line-search-free Method for Adaptive Decentralized Optimization

math.OC · 2026-05-01 · unverdicted · novelty 6.0

New adaptive decentralized algorithms select stepsizes from local curvature estimates derived from a Lyapunov function, delivering sublinear convergence for convex problems and linear rates for strongly convex ones.

Delve into the Applicability of Advanced Optimizers for Multi-Task Learning

cs.LG · 2026-04-10 · unverdicted · novelty 6.0

APT augments multi-task learning by adapting advanced optimizers via momentum balancing and light direction preservation, delivering performance gains on four standard MTL datasets.

Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation

cs.LG · 2026-05-12 · unverdicted · novelty 5.0

Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.

Strait: Perceiving Priority and Interference in ML Inference Serving

cs.LG · 2026-04-30 · unverdicted · novelty 5.0

Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.

AstroSURE: Learning to Remove Noise from Astronomical Images Without Ground Truth Data

astro-ph.IM · 2026-04-18 · unverdicted · novelty 5.0

Unsupervised denoising methods improve faint-source detection in astronomical images from HST and CFHT, with better performance when models are initialized on similar-domain data.

Fidelity of Machine Learned Potentials: Quantitative Assessment for Protonated Oxalate

physics.chem-ph · 2026-04-14 · accept · novelty 5.0

Two machine-learned potentials for protonated oxalate agree closely on vibrational energies, IR spectra, and hydrogen tunneling splittings despite using different regression techniques.

Communication-Efficient Gluon in Federated Learning

cs.LG · 2026-04-12 · unverdicted · novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

citing papers explorer


Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate · math.OC · 2026-04-09 · unverdicted · none · ref 2
Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.
VLTI/PIONIER imaging of post-AGB binaries. An INSPIRING hunt for inner rim substructures in circumbinary discs · astro-ph.SR · 2026-05-13 · unverdicted · none · ref 109 · internal anchor
High-resolution interferometric imaging of eight post-AGB circumbinary discs reveals diverse inner-rim substructures including azimuthal brightness enhancements and arc-like features not explained by inclination alone.
FIBER: A Differentially Private Optimizer with Filter-Aware Innovation Bias Correction · cs.LG · 2026-05-05 · unverdicted · none · ref 45
FiBeR adds a closed-form filter-aware correction A(ω)σ_w² to the second-moment term for temporally filtered DP gradients, improving adaptive optimization performance.
BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition · cs.CV · 2026-04-14 · unverdicted · none · ref 52
BarbieGait is a new synthetic gait dataset with identity-consistent cloth changes paired with the GaitCLIF model that improves cross-clothing recognition on the new data and existing benchmarks.
VISTA: Decentralized Machine Learning in Adversary Dominated Environments · cs.LG · 2026-05-08 · unverdicted · none · ref 47
VISTA adaptively tunes consistency thresholds in decentralized SGD so that the system converges asymptotically like standard SGD even when adversaries dominate the worker pool.
Low-Order Explicit Hessian Imitation Method for Large-Scale Supervised Machine Learning · math.OC · 2026-05-07 · unverdicted · none · ref 15
New optimizer uses auxiliary loss to imitate low-order Hessian information, replacing gradient squares in Adam-like training with convergence guarantee and some experimental gains.
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less · cs.LG · 2026-05-07 · unverdicted · none · ref 21
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Anon: Extrapolating Adaptivity Beyond SGD and Adam · cs.AI · 2026-05-04 · unverdicted · none · ref 13
Anon optimizer uses tunable adaptivity and incremental delay update to achieve convergence guarantees and outperform existing methods on image classification, diffusion, and language modeling tasks.
A Line-search-free Method for Adaptive Decentralized Optimization · math.OC · 2026-05-01 · unverdicted · none · ref 20
New adaptive decentralized algorithms select stepsizes from local curvature estimates derived from a Lyapunov function, delivering sublinear convergence for convex problems and linear rates for strongly convex ones.
Delve into the Applicability of Advanced Optimizers for Multi-Task Learning · cs.LG · 2026-04-10 · unverdicted · none · ref 9
APT augments multi-task learning by adapting advanced optimizers via momentum balancing and light direction preservation, delivering performance gains on four standard MTL datasets.
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation · cs.LG · 2026-05-12 · unverdicted · none · ref 64
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
Strait: Perceiving Priority and Interference in ML Inference Serving · cs.LG · 2026-04-30 · unverdicted · none · ref 75
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.
AstroSURE: Learning to Remove Noise from Astronomical Images Without Ground Truth Data · astro-ph.IM · 2026-04-18 · unverdicted · none · ref 44
Unsupervised denoising methods improve faint-source detection in astronomical images from HST and CFHT, with better performance when models are initialized on similar-domain data.
Fidelity of Machine Learned Potentials: Quantitative Assessment for Protonated Oxalate · physics.chem-ph · 2026-04-14 · accept · none · ref 54
Two machine-learned potentials for protonated oxalate agree closely on vibrational energies, IR spectra, and hydrogen tunneling splittings despite using different regression techniques.
Communication-Efficient Gluon in Federated Learning · cs.LG · 2026-04-12 · unverdicted · none · ref 31
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
