On the Convergence of Adam and Beyond
Canonical reference. 83% of citing Pith papers cite this work as background.
abstract
Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with 'long-term memory' of past gradients, and we propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.
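The contrast the abstract draws, between an exponential moving average of squared gradients and a 'long-term memory' of them, can be illustrated with a minimal sketch. It assumes the memory is a running maximum of the moving average (the rule used by the AMSGrad variant proposed in the paper). Bias correction and the parameter update itself are omitted, and the function name, `beta2` value, and gradient sequence are hypothetical choices for illustration only.

```python
def second_moment_trace(grads, beta2=0.9, long_term_memory=False):
    """Return the per-step second-moment term that scales the gradient.

    Illustrative sketch only: no bias correction, scalar gradients.
    """
    v = 0.0      # exponential moving average of squared gradients
    v_hat = 0.0  # running maximum of v (assumed 'long-term memory')
    trace = []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        if long_term_memory:
            # Never let the scaling term shrink below its past maximum.
            v_hat = max(v_hat, v)
            trace.append(v_hat)
        else:
            trace.append(v)
    return trace


# Hypothetical gradient sequence: one large, rare gradient followed by
# many small ones.
grads = [10.0] + [0.1] * 20

adam_trace = second_moment_trace(grads)
amsgrad_trace = second_moment_trace(grads, long_term_memory=True)
```

With plain exponential averaging, the large gradient is forgotten as the small ones arrive, so the scaling term shrinks and the effective step size (proportional to one over its square root) grows again. With the running maximum, the term never decreases, so the effective step size never increases.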
citing papers explorer
- Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate
Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.
- Understanding Dynamics of Adam in Zero-Sum Games: An ODE Approach
Derives ODE limits of Adam-DA showing that first- and second-order momentum parameters reverse their convergence roles in zero-sum games compared to minimization, validated on GAN experiments.
- Riemannian Networks over Full-Rank Correlation Matrices
Riemannian networks are introduced for the full-rank correlation matrix manifold by extending MLR, FC, and convolutional layers to five geometries with backpropagation methods for two, showing effectiveness over SPD and Grassmannian baselines.
- VLTI/PIONIER imaging of post-AGB binaries. An INSPIRING hunt for inner rim substructures in circumbinary discs
High-resolution interferometric imaging of eight post-AGB circumbinary discs reveals diverse inner-rim substructures including azimuthal brightness enhancements and arc-like features not explained by inclination alone.
- DP-FedAdamW: An Efficient Optimizer for Differentially Private Federated Large Models
DP-FedAdamW delivers an unbiased second-moment estimator for AdamW in DPFL, proving linear convergence acceleration without heterogeneity assumptions and outperforming SOTA by 5.83% on Tiny-ImageNet with Swin-Base at ε=1.
- On the Convergence of Muon and Beyond
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
- Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima
Theoretical analysis of accelerated gradient methods showing almost-sure escape from strict saddles and linear exit times, plus a subclass achieving near-optimal convergence to local minima in convex neighborhoods of nonconvex functions.
- FIBER: A Differentially Private Optimizer with Filter-Aware Innovation Bias Correction
FiBeR adds a closed-form filter-aware correction A(ω)σ_w² to the second-moment term for temporally filtered DP gradients, improving adaptive optimization performance.
- BarbieGait: An Identity-Consistent Synthetic Human Dataset with Versatile Cloth-Changing for Gait Recognition
BarbieGait is a new synthetic gait dataset with identity-consistent cloth changes paired with the GaitCLIF model that improves cross-clothing recognition on the new data and existing benchmarks.
- OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality
OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
- Approximating Hartree-Fock theory via an efficiently local reformulation
A reorganized Hartree-Fock framework imposes tunable orbital locality by pairing local degrees of freedom with local solution conditions, maintaining efficient SCF optimization and competitive reaction-energy accuracy.
- Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.
- MLorc: Momentum Low-rank Compression for Memory Efficient Large Language Model Adaptation
MLorc compresses optimizer momentum with low-rank methods to enable memory-efficient full fine-tuning of LLMs, outperforming LoRA and GaLore while matching full-parameter performance at small ranks.
- XNet-Enhanced Deep BSDE Method and Numerical Analysis
Establishes convergence for non-Lipschitz generators via bounded double-well lemma and truncated BSDE analysis, plus XNet architecture for efficient 100D PDE computation.
- A foundation model for atomistic materials chemistry
MACE-MP-0 is a general-purpose atomistic ML force field trained on public data that enables stable simulations of diverse chemical systems with qualitative and sometimes quantitative accuracy, serving as a starting point for fine-tuning.
- Adaptive Federated Optimization
Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
- VISTA: Decentralized Machine Learning in Adversary Dominated Environments
VISTA adaptively tunes consistency thresholds in decentralized SGD so that the system converges asymptotically like standard SGD even when adversaries dominate the worker pool.
- Low-Order Explicit Hessian Imitation Method for Large-Scale Supervised Machine Learning
New optimizer uses auxiliary loss to imitate low-order Hessian information, replacing gradient squares in Adam-like training with convergence guarantee and some experimental gains.
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
- Anon: Extrapolating Adaptivity Beyond SGD and Adam
Anon optimizer uses tunable adaptivity and incremental delay update to achieve convergence guarantees and outperform existing methods on image classification, diffusion, and language modeling tasks.
- A Line-search-free Method for Adaptive Decentralized Optimization
New adaptive decentralized algorithms select stepsizes from local curvature estimates derived from a Lyapunov function, delivering sublinear convergence for convex problems and linear rates for strongly convex ones.
- Delve into the Applicability of Advanced Optimizers for Multi-Task Learning
APT augments multi-task learning by adapting advanced optimizers via momentum balancing and light direction preservation, delivering performance gains on four standard MTL datasets.
- Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling
SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.
- On subspace-constrained preconditioning for randomized iterative methods
Refines subspace preconditioning for randomized linear solvers via QR-like factorization, enabling implicit use and proving expected linear convergence while reducing to a smaller system with good singular values.
- Generative Prior-Guided Neural Interface Reconstruction for 3D Electrical Impedance Tomography
A solver-in-the-loop method combines a differentiable neural shape prior with a hard-constrained boundary integral equation solver to reconstruct 3D interfaces in EIT while enforcing the governing elliptic PDE at every step.
- Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
- Strait: Perceiving Priority and Interference in ML Inference Serving
Strait cuts high-priority deadline violations in ML inference serving by 1-11 percentage points through contention modeling and priority scheduling under high GPU load.
- AstroSURE: Learning to Remove Noise from Astronomical Images Without Ground Truth Data
Unsupervised denoising methods improve faint-source detection in astronomical images from HST and CFHT, with better performance when models are initialized on similar-domain data.
- Fidelity of Machine Learned Potentials: Quantitative Assessment for Protonated Oxalate
Two machine-learned potentials for protonated oxalate agree closely on vibrational energies, IR spectra, and hydrogen tunneling splittings despite using different regression techniques.
- Communication-Efficient Gluon in Federated Learning
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
- A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm
Introduces C-Adam optimizer variant with claimed convergence proof and real-life numerical experiments.
- Stochastic Optimization and Data Science
The paper motivates stochastic optimization problems from statistical perspectives and describes offline and online approaches to solve expectation minimization problems.
- Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics
A comprehensive review of deep learning techniques for computational mechanics, including LSTM for constitutive modeling, PINNs for PDE solving, optimizers, and kernel methods.
- On the Stability of Growth in Structural Plasticity
- Communication-Efficient Decentralized Stochastic Minimax Optimization
- A Physics-Inspired Optimizer: Velocity Regularized Adam