Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations

2018 · cs.LG · arXiv 1811.01558

4 Pith papers cite this work. Polarity classification is still indexing.


abstract

We develop the mathematical foundations of the stochastic modified equations (SME) framework for analyzing the dynamics of stochastic gradient algorithms, which are approximated by a class of stochastic differential equations with small noise parameters. We prove that this approximation can be understood mathematically as a weak approximation, which leads to a number of precise and useful results on the approximation of stochastic gradient descent (SGD), momentum SGD and the stochastic Nesterov accelerated gradient method in the general setting of stochastic objectives. We also demonstrate through explicit calculations that this continuous-time approach can uncover important analytical insights into the stochastic gradient algorithms under consideration that may not be easy to obtain in a purely discrete-time setting.
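To make the notion of weak approximation in the abstract concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration and is not taken from this page: the objective f_gamma(x) = 0.5 * (x - gamma)^2 with gamma ~ N(0, 1), the commonly used first-order SME form dX = -grad f(X) dt + sqrt(eta * Sigma(X)) dW (which here reduces to the Ornstein-Uhlenbeck process dX = -X dt + sqrt(eta) dW), the test function g(x) = x^2, and the helper names. For this toy problem both second moments are available in closed form, so no sampling is needed.

```python
import math


def sgd_second_moment(x0, eta, steps):
    """Exact E[x_k^2] for SGD on the assumed toy objective.

    The SGD update is x_{k+1} = x_k - eta * (x_k - gamma_k)
                              = (1 - eta) * x_k + eta * gamma_k,
    with gamma_k ~ N(0, 1) independent of x_k, so the second moment obeys
    m_{k+1} = (1 - eta)^2 * m_k + eta^2.
    """
    m = x0 * x0
    for _ in range(steps):
        m = (1.0 - eta) ** 2 * m + eta ** 2
    return m


def sme_second_moment(x0, eta, t):
    """Exact E[X_t^2] for the assumed SME dX = -X dt + sqrt(eta) dW, X_0 = x0."""
    decay = math.exp(-2.0 * t)
    return x0 * x0 * decay + 0.5 * eta * (1.0 - decay)


def weak_error(eta, T=1.0, x0=1.0):
    """|E g(x_k) - E g(X_{k*eta})| at time T for the test function g(x) = x^2."""
    steps = round(T / eta)
    return abs(
        sgd_second_moment(x0, eta, steps)
        - sme_second_moment(x0, eta, steps * eta)
    )


if __name__ == "__main__":
    for eta in (0.1, 0.05, 0.01, 0.005):
        print(f"eta={eta:<6} weak error={weak_error(eta):.6f}")
```

In this sketch the gap between the two second moments shrinks roughly in proportion to the step size eta, which is the behaviour one would expect from a first-order weak approximation. A single test function on a one-dimensional quadratic is only an illustration, not a check of the general result.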

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate

math.OC · 2026-04-09 · unverdicted · novelty 8.0

Adam-HNAG is a splitting-based reformulation of Adam that yields the first convergence proof for Adam-type methods, including accelerated rates, in convex smooth optimization.

Homogenization of $\ell_2$-Adversarial Training in High-Dimensions: Exact Dynamics under Stochastic Gradient Descent

math.OC · 2026-06-30 · unverdicted · novelty 7.0

Derives ODE deterministic equivalents and an adversarial homogenized SDE for SGD iterates in high-dim ℓ2-adversarial training, showing no constant learning rate ensures monotone descent for single-class adversarial least squares and equivalence to adaptive regularized standard SGD.

Thermodynamic Irreversibility of Training Algorithms

cond-mat.stat-mech · 2026-05-21 · unverdicted · novelty 6.0

Four characterizations of irreversibility in training algorithms are equivalent to leading order in step size and produce an emergent force that breaks reparametrization symmetries while favoring minimum entropy production trajectories.

Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

stat.ML · 2025-11-04 · unverdicted · novelty 5.0

At the critical step-size scaling for SGD in high-dimensional single-layer networks, effective dynamics gain a diffusive correction term that changes the phase diagram and reduces to an Ornstein-Uhlenbeck process near fixed points, with the information exponent governing sample complexity.

citing papers explorer


Adam-HNAG: A Convergent Reformulation of Adam with Accelerated Rate

math.OC · 2026-04-09 · unverdicted · none · ref 14

Homogenization of $\ell_2$-Adversarial Training in High-Dimensions: Exact Dynamics under Stochastic Gradient Descent

math.OC · 2026-06-30 · unverdicted · none · ref 42 · internal anchor

Thermodynamic Irreversibility of Training Algorithms

cond-mat.stat-mech · 2026-05-21 · unverdicted · none · ref 29 · internal anchor

Limit Theorems for Stochastic Gradient Descent in High-Dimensional Single-Layer Networks

stat.ML · 2025-11-04 · unverdicted · none · ref 18 · internal anchor
