On the Convergence of Adam and Beyond

Canonical reference. 83% of citing Pith papers cite this work as background.

41 Pith papers citing it
Background 83% of classified citations
abstract

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSProp, Adam, Adadelta, and Nadam, are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. In many applications, e.g. learning with large output spaces, it has been empirically observed that these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where Adam does not converge to the optimal solution, and describe the precise problems with the previous analysis of the Adam algorithm. Our analysis suggests that the convergence issues can be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the Adam algorithm which not only fix the convergence issues but often also lead to improved empirical performance.
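
The abstract's contrast between an exponential moving average and a "long-term memory" of past gradients can be illustrated with a small sketch. It assumes the max-based variant commonly referred to as AMSGrad, in which the second-moment estimate used in the denominator is the running maximum of the moving average. The function name `second_moment_trace`, the toy gradient sequence, and the hyperparameter values are hypothetical and chosen only for illustration; bias correction and the first-moment (momentum) term are omitted.

```python
import math


def second_moment_trace(grads, beta2=0.99, long_term_memory=False):
    """Return the second-moment estimate used in the denominator at each step.

    long_term_memory=False: plain exponential moving average (Adam-style).
    long_term_memory=True:  running maximum of that average (AMSGrad-style,
                            assumed here as one "long-term memory" variant).
    """
    v = 0.0       # exponential moving average of squared gradients
    v_hat = 0.0   # value actually used to scale the update
    trace = []
    for g in grads:
        v = beta2 * v + (1.0 - beta2) * g * g
        v_hat = max(v_hat, v) if long_term_memory else v
        trace.append(v_hat)
    return trace


# Toy sequence (illustrative only): one large gradient, then many small ones.
grads = [10.0] + [0.1] * 20

ema_trace = second_moment_trace(grads, beta2=0.9)
max_trace = second_moment_trace(grads, beta2=0.9, long_term_memory=True)

# Effective step size lr / (sqrt(v_hat) + eps) at every step.
lr, eps = 0.1, 1e-8
ema_steps = [lr / (math.sqrt(v) + eps) for v in ema_trace]
max_steps = [lr / (math.sqrt(v) + eps) for v in max_trace]

print("final EMA-based step size:", ema_steps[-1])
print("final max-based step size:", max_steps[-1])
```

With the moving average alone, the estimate forgets the large gradient and the effective step size grows again; with the running maximum, the estimate never decreases, so the effective step size never increases. This is only the mechanism in isolation, not a reproduction of the paper's counterexample or experiments.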

citation-role summary

background: 5 · method: 1

representative citing papers

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

Riemannian Networks over Full-Rank Correlation Matrices

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Riemannian networks are introduced for the full-rank correlation matrix manifold by extending MLR, FC, and convolutional layers to five geometries with backpropagation methods for two, showing effectiveness over SPD and Grassmannian baselines.

On the Convergence of Muon and Beyond

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of Õ(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

XNet-Enhanced Deep BSDE Method and Numerical Analysis

cs.CE · 2025-02-10 · unverdicted · novelty 6.0

Establishes convergence for non-Lipschitz generators via bounded double-well lemma and truncated BSDE analysis, plus XNet architecture for efficient 100D PDE computation.

A foundation model for atomistic materials chemistry

physics.chem-ph · 2023-12-29 · unverdicted · novelty 6.0

MACE-MP-0 is a general-purpose atomistic ML force field trained on public data that enables stable simulations of diverse chemical systems with qualitative and sometimes quantitative accuracy, serving as a starting point for fine-tuning.

Adaptive Federated Optimization

cs.LG · 2020-02-29 · unverdicted · novelty 6.0

Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.

Anon: Extrapolating Adaptivity Beyond SGD and Adam

cs.AI · 2026-05-04 · unverdicted · novelty 6.0

Anon optimizer uses tunable adaptivity and incremental delay update to achieve convergence guarantees and outperform existing methods on image classification, diffusion, and language modeling tasks.

A Line-search-free Method for Adaptive Decentralized Optimization

math.OC · 2026-05-01 · unverdicted · novelty 6.0

New adaptive decentralized algorithms select stepsizes from local curvature estimates derived from a Lyapunov function, delivering sublinear convergence for convex problems and linear rates for strongly convex ones.

Muon Learns More Robust and Transferable Features than Adam

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • A foundation model for atomistic materials chemistry · physics.chem-ph · 2023-12-29 · unverdicted · none · ref 142 · internal anchor
