Quasi-hyperbolic momentum and Adam for deep learning

Denis Yarats; Jerry Ma

arxiv: 1810.06801 · v4 · pith:35L3SYVAnew · submitted 2018-10-16 · 💻 cs.LG · stat.ML

classification 💻 cs.LG stat.ML

keywords momentum, algorithms, adam, deep, learning, propose, qhadam, quasi-hyperbolic

Momentum-based acceleration of stochastic gradient descent (SGD) is widely used in deep learning. We propose the quasi-hyperbolic momentum algorithm (QHM) as an extremely simple alteration of momentum SGD, averaging a plain SGD step with a momentum step. We describe numerous connections to and identities with other algorithms, and we characterize the set of two-state optimization algorithms that QHM can recover. Finally, we propose a QH variant of Adam called QHAdam, and we empirically demonstrate that our algorithms lead to significantly improved training in a variety of settings, including a new state-of-the-art result on WMT16 EN-DE. We hope that these empirical results, combined with the conceptual and practical simplicity of QHM and QHAdam, will spur interest from both practitioners and researchers. Code is immediately available.
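
The abstract's description of QHM as "averaging a plain SGD step with a momentum step" can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function name `qhm_step`, the parameter names `lr`, `beta`, and `nu`, their default values, and the use of an exponentially weighted (dampened) momentum buffer are all choices made here for illustration.

```python
from typing import List, Tuple


def qhm_step(
    theta: List[float],
    grad: List[float],
    buf: List[float],
    lr: float = 0.1,    # step size (illustrative default)
    beta: float = 0.9,  # momentum decay (illustrative default)
    nu: float = 0.7,    # weight on the momentum step (illustrative default)
) -> Tuple[List[float], List[float]]:
    """One hypothetical QHM-style update on plain Python lists."""
    # Momentum buffer: exponentially weighted average of past gradients.
    new_buf = [beta * b + (1.0 - beta) * g for b, g in zip(buf, grad)]
    # Update direction: weighted average of the raw gradient (plain SGD step)
    # and the momentum buffer (momentum step).
    new_theta = [
        t - lr * ((1.0 - nu) * g + nu * b)
        for t, g, b in zip(theta, grad, new_buf)
    ]
    return new_theta, new_buf


if __name__ == "__main__":
    # Toy problem (an assumption for illustration): minimize f(x) = 0.5 * x**2,
    # whose gradient at x is simply x.
    theta, buf = [1.0], [0.0]
    for _ in range(50):
        theta, buf = qhm_step(theta, grad=list(theta), buf=buf)
    print(theta)
```

In this sketch, setting `nu=0.0` reduces the update to a plain SGD step, and setting `nu=1.0` reduces it to a step along the momentum buffer alone; intermediate values interpolate between the two.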

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
cs.LG 2026-04 unverdicted novelty 7.0

Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.

Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods
cs.LG 2026-06 unverdicted novelty 6.0

Lower bounds establish that heavy-ball momentum extends the compute-efficient batch-size window by sqrt(kappa) over SGD in linear regression, with accelerated SGD showing spectrum-dependent CE-serial runtime tradeoffs.