Introduction to stochastic gradient methods

Simon Weissmann

arxiv: 2606.03953 · v1 · pith:HM66DB53new · submitted 2026-06-02 · 🧮 math.OC

Introduction to stochastic gradient methods

Simon Weissmann This is my paper

Pith reviewed 2026-06-28 08:45 UTC · model grok-4.3

classification 🧮 math.OC

keywords stochastic gradient descentconvergence ratesconvex optimizationmachine learningstochastic approximationPolyak-Lojasiewicz conditionvariance reduction

0 comments

The pith

Stochastic gradient descent achieves classical convergence rates under smoothness and convexity assumptions on the objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The lecture notes begin by analyzing deterministic gradient methods for unconstrained optimization under standard assumptions. They then introduce stochastic gradient descent motivated by empirical risk minimization in machine learning and derive its convergence in expectation and almost surely. The notes also discuss variance reduction methods and convergence under the Polyak-Lojasiewicz condition. These results provide the theoretical foundation for why stochastic methods work reliably in large-scale optimization problems.

Core claim

Under the assumptions of smoothness and convexity, stochastic gradient descent converges to the optimum at a rate of O(1/sqrt(T)) after T steps in expectation, with almost sure convergence also holding for appropriate step-size sequences, and faster rates available under strong convexity or with variance reduction techniques.

What carries the argument

The stochastic gradient descent update rule analyzed using descent lemmas derived from smoothness and convexity to bound the progress per iteration.

If this is right

Convergence rates apply directly to minimizing expected risk in machine learning models.
Variance reduction methods improve upon basic SGD by reducing the variance of gradient estimates.
Almost sure convergence guarantees that the iterates converge with probability one under suitable conditions.
The Polyak-Lojasiewicz condition allows linear convergence without requiring strong convexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The derivations imply that step-size schedules must be chosen carefully based on the smoothness parameter to achieve the stated rates.
These results connect stochastic approximation theory to practical machine learning optimization.
Extensions to non-convex problems would require additional analysis beyond the paper's scope.

Load-bearing premise

The objective function must be smooth and satisfy convexity, strong convexity, or the Polyak-Lojasiewicz condition for the convergence proofs to hold.

What would settle it

A smooth convex function where running SGD with standard step sizes fails to achieve the O(1/sqrt(T)) rate in expectation.

Figures

Figures reproduced from arXiv: 2606.03953 by Simon Weissmann.

**Figure 2.1.** Figure 2.1: Illustration of descent methods. Lemma 2.1.2. Let f : R d → R be continuously differentiable in x ∈ R d . Then the condition ∇f(x) ⊤d < 0 (2.2) is sufficient for d ∈ R d being a descent direction of f in x. Proof. We fix x ∈ R d and d ∈ R d satisfying (2.2) and define φ(α) = f(x+αd). By Taylor expansion it follows φ(α) = φ(0) + αφ′ (0) + o(α). Note that φ(0) = f(x) and φ ′ (0) = ∇f(x) ⊤d such that φ(α) −… view at source ↗

**Figure 2.2.** Figure 2.2: Illustration of the Armijo step size rule based on [PITH_FULL_IMAGE:figures/full_fig_p015_2_2.png] view at source ↗

**Figure 2.3.** Figure 2.3: Illustration of possible terminations of the gradient descent method. However, not all stationary points are equally problematic. In particular, one can distinguish between local minima and saddle points. A stationary point x∗ is called a strict saddle point if the Hessian ∇2 f(x∗) has at least one negative eigenvalue. Intuitively, this means that there exists a direction of negative curvature along whic… view at source ↗

**Figure 2.4.** Figure 2.4: Illustration of L-smoothness and µ-strong convexity. We consider the function f(x) = x 2 such that f is L-smooth and µ-strongly convex with 1 := µ < f′′(x) = 2 < L =: 10 for all x ∈ R. This plot illustrates convexity and smoothness of f in x = 1 using the upper and lower bound (2.9) and (2.10) on f(y), y ∈ R. This motivates the definition of the Bregman divergence. Definition 2.2.3. Let f : R d → R be co… view at source ↗

**Figure 2.5.** Figure 2.5: Illustration of sub-gradients for differentiable convex (left) and non-differentiable convex (right) functions. For continuously differentiable and convex f, the sub-gradient is unique, i.e. ∂f(x) = {∇f(x)}. Example 2.3.3. Let fi : R d → R be a family of convex and differentiable functions i = 1, . . . , N for some N ∈ N. We define F(x) = maxi=1,...,N fi(x) and for given x ∈ R d we consider j ∈ arg maxi… view at source ↗

**Figure 2.6.** Figure 2.6: Illustration of Example 2.3.3. We consider N = 2 and define f(x) = max(f1(x), f2(x)) for two convex and differentiable functions f1, f2. In the above Situation, we find unique subgradients for f1(x) > f2(x) given by ∂f(x) = {f ′ 1 (x)}, and similarly for f1(x) < f2(x) given by ∂f(x) = {f ′ 2 (x)}. In the case of f1(x) = f2(x), the sub-differential is given by ∂f(x) = [f ′ 2 (x), f′ 1 (x)]. There are sim… view at source ↗

**Figure 3.1.** Figure 3.1: Contour lines of a quadratic function with increasing condition number κ. 3.1 Polyak’s heavy ball method The idea of incorporating momentum into iterative optimization schemes goes back to Polyak’s heavy ball method [38]. The name is motivated by the physical picture of a heavy ball rolling down a hill towards a valley. A light ball is strongly affected by local curvature and may oscillate in narrow vall… view at source ↗

**Figure 3.2.** Figure 3.2: Comparison of gradient descent and the heavy-ball method in a flat valley. The blue path illustrates the typical zigzag behavior of gradient descent. The orange path illustrates the smoother trajectory induced by momentum. At the current iterate xk, the heavy-ball update combines a gradient force (blue dotted arrow) with a momentum force (green dotted arrow), resulting in the new step (orange arrow) towa… view at source ↗

**Figure 3.3.** Figure 3.3: Illustration of the linear convergence rate depending on the condition number κ = µ L for GD and NAM. The left plot shows the convergence rate c GD(κ) = κ−1 κ+1 2 and c NAM(κ) = √ κ−1 √ κ , whereas the right plot shows the difference to 1, i.e. 1 − c(κ), in logarithmic scale [PITH_FULL_IMAGE:figures/full_fig_p064_3_3.png] view at source ↗

**Figure 4.1.** Figure 4.1: Illustration of ill-posedness through observational noise. The occurrence of noise might shift the observed data outside of the range of the forward map H. Θ R(H) H cY θ1 θ2 H [PITH_FULL_IMAGE:figures/full_fig_p068_4_1.png] view at source ↗

**Figure 4.2.** Figure 4.2: Illustration of ill-posedness through multiple solutions. Two different parameters θ1, θ2 ∈ Θ might map onto the observed data Yb. Outlook 2 (Statistical learning perspective). For example, in the area of statistical learning among other questions one is interested in the consistency of solutions of the empirical risk minimization problem. Let XbN be the minimizer of the empirical risk and x∗ be the mini… view at source ↗

**Figure 4.3.** Figure 4.3: Illustration of ill-posedness through discontinuity. Even if H is invertible, the instability may occur in the solution of the inverse problem resulting from possible discontinuity of the inverse operator. of the corresponding expected risk (provided both exist), then we can decompose the error F(XbN )−F(x∗) = F(XbN )−FN (XbN )+FN (XbN )−FN (x∗)+FN (x∗)−F(x∗) ≤ 2 sup x∈Rd |FN (x)−F(x)|. We emphasize that… view at source ↗

read the original abstract

These lecture notes provide an introduction to first-order optimization methods with a particular emphasis on stochastic gradient methods. We begin with deterministic gradient based methods for unconstrained optimization and study their convergence under standard assumptions such as smoothness, convexity, strong convexity, and the Polyak-Lojasiewicz condition. We then turn to stochastic approximation and stochastic gradient descent, motivated by empirical and expected risk minimization in machine learning. The main focus is on convergence theory: we discuss almost sure convergence and convergence in expectation, derive classical convergence rates, and present selected advanced topics, including almost sure convergence rates and variance reduction methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Lecture notes that restate classical SGD convergence results under standard assumptions with no new derivations or claims.

read the letter

These are lecture notes that walk through deterministic gradient methods before turning to stochastic gradient descent and its convergence theory. They derive the usual rates in expectation and almost surely, plus some variance reduction material, all under the standard smoothness, convexity, strong convexity, and Polyak-Lojasiewicz assumptions.

The notes organize the material logically and stick to well-known proofs from the literature. That makes them reliable for someone who wants a single document covering the basics to the selected advanced topics. The derivations reproduce established results without deviation, so there are no hidden gaps or questionable steps on the classical material.

The limitation is straightforward: nothing is new. No fresh rates, no extensions beyond the usual settings, no comparisons that change how people think about the methods. It is an exposition, not original work. The assumptions are exactly the textbook ones, so the scope stays narrow by design.

This is useful for students or early researchers who need a structured introduction to the topic. Experts already know the content. It does not advance open questions or enable new applications, so it does not belong in a research journal's peer-review pipeline. If the goal is teaching material, it could stand on its own as notes; otherwise, it is better left as an arXiv resource rather than submitted for formal review.

Referee Report

0 major / 0 minor

Summary. The manuscript consists of lecture notes introducing first-order optimization methods, beginning with deterministic gradient descent and progressing to stochastic gradient descent. It derives classical convergence rates (in expectation and almost surely) for problems satisfying smoothness, convexity, strong convexity, and the Polyak-Łojasiewicz condition, motivated by empirical and expected risk minimization, and includes selected topics on almost sure rates and variance reduction.

Significance. The notes reproduce well-established results from the stochastic optimization literature without introducing new theorems, extensions, or parameter-free derivations. While a clear compilation of standard material may have pedagogical utility, the absence of novel contributions limits its significance for advancing research in mathematical optimization.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review of our lecture notes. The manuscript is explicitly positioned as an introductory compilation of established results rather than a research contribution with novel theorems.

read point-by-point responses

Referee: The notes reproduce well-established results from the stochastic optimization literature without introducing new theorems, extensions, or parameter-free derivations. While a clear compilation of standard material may have pedagogical utility, the absence of novel contributions limits its significance for advancing research in mathematical optimization.

Authors: We agree that the notes contain no new theorems and are a compilation of classical convergence results under standard assumptions (smoothness, convexity, strong convexity, PL condition). This is by design: the document is lecture notes whose purpose is to provide a self-contained, accessible derivation of these rates, motivated by empirical and expected risk minimization, for use in teaching or self-study. The referee acknowledges possible pedagogical utility; we believe this is the appropriate evaluation criterion for lecture notes rather than research novelty. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript consists of lecture notes reproducing classical convergence results for gradient descent and stochastic gradient descent under standard external assumptions (smoothness, convexity, strong convexity, Polyak-Łojasiewicz). All derivations follow from these assumptions via standard analysis techniques drawn from the broader literature; no parameters are fitted and then relabeled as predictions, no self-citations form load-bearing uniqueness arguments, and no ansatzes or renamings reduce the claimed results to the inputs by construction. The notes are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As lecture notes on standard topics, the content relies on classical assumptions from optimization theory without introducing new free parameters or entities.

axioms (1)

domain assumption Objective functions satisfy L-smoothness, convexity, strong convexity, or Polyak-Lojasiewicz condition.
Invoked throughout for convergence analysis as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5604 in / 938 out tokens · 27651 ms · 2026-06-28T08:45:47.006581+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

88 extracted references · 5 canonical work pages · 3 internal anchors

[1]

Bertsekas

Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific, 2nd edition, September 2008

2008
[2]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization , 10(3):627–642, 2000

2000
[3]

Julius R. Blum. Approximation Methods which Converge with Probability one. The Annals of Mathematical Statistics , 25(2):382 – 386, 1954

1954
[4]

Curtis, and Jorge Nocedal

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018

2018
[5]

A Kurdyka-Lojasiewicz property for stochastic optimization algorithms in a non-convex setting

Emilie Chouzenoux, Jean-Baptiste Fest, and Audrey Repetti. A Kurdyka-Lojasiewicz property for stochastic optimization algorithms in a non-convex setting. arXiv Preprint , arXiv:2302.06447, 2023

work page arXiv 2023
[6]

Da Prato and J

G. Da Prato and J. Zabczyk. Stochastic Equations in Infinite Dimensions . Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1992

1992
[7]

SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems , volume 27. Curran Associates, Inc., 2014

2014
[8]

Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes

Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes. Journal of Machine Learning , 3(3):245–281, 2024

2024
[9]

Adaptive subgradient methods for online learning and stochastic optimization

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research , 12(61):2121–2159, 2011

2011
[10]

Sharp analysis of stochastic optimization under global Kurdyka-Lojasiewicz inequality

Ilyas Fatkhullin, Jalal Etesami, Niao He, and Negar Kiyavash. Sharp analysis of stochastic optimization under global Kurdyka-Lojasiewicz inequality. In Advances in Neural Information Processing Systems, volume 35, pages 15836–15848. Curran Associates, Inc., 2022. 122 Simon Weissmann Introduction to stochastic gradient methods Page 123

2022
[11]

Convergence rates and approxi- mation results for sgd and its continuous-time counterpart

Xavier Fontaine, Valentin De Bortoli, and Alain Durmus. Convergence rates and approxi- mation results for sgd and its continuous-time counterpart. In Proceedings of Thirty Fourth Conference on Learning Theory , volume 134 of Proceedings of Machine Learning Research , pages 1965–2058. PMLR, 15–19 Aug 2021

1965
[12]

Stochastic heavy ball

Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic Journal of Statistics , 12(1):461 – 529, 2018

2018
[13]

Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv-Preprint, 2023

2023
[14]

Exponential convergence rates for momentum stochas- tic gradient descent in the overparametrized setting

Benjamin Gess and Sebastian Kassing. Exponential convergence rates for momentum stochas- tic gradient descent in the overparametrized setting. Mathematical Programming, 2026

2026
[15]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341–2368, 2013

2013
[16]

SGD for structured nonconvex func- tions: Learning rates, minibatching and interpolation

Robert Gower, Othmane Sebbouh, and Nicolas Loizou. SGD for structured nonconvex func- tions: Learning rates, minibatching and interpolation. In International Conference on Arti- ficial Intelligence and Statistics , pages 1315–1323. PMLR, 2021

2021
[17]

Hadamard

J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin , pages 49–52, 1902

1902
[18]

arXiv preprint arXiv:1905.00313 , title =

Elad Hazan and Sham Kakade. Revisiting the Polyak step size. arXiv Preprint , arXiv:1905.00313, 2022

work page arXiv 1905
[19]

Strong error analysis for stochastic gradient descent optimization algorithms

Arnulf Jentzen, Benno Kuckuck, Ariel Neufeld, and Philippe von Wurstemberger. Strong error analysis for stochastic gradient descent optimization algorithms. IMA Journal of Numerical Analysis, 41(1):455–492, 05 2020

2020
[20]

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

Arnulf Jentzen and Philippe von Wurstemberger. Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates. Journal of Complexity , 57:101438, 2020

2020
[21]

Accelerating stochastic gradient descent using predictive vari- ance reduction

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive vari- ance reduction. In Advances in Neural Information Processing Systems , volume 26. Curran Associates, Inc., 2013

2013
[22]

Linear convergence of gradient and proximal- gradient methods under the Polyak-Łojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal- gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowl- edge Discovery in Databases , pages 795–811, Cham, 2016. Springer International Publishing. Simon Weissmann Introduction to stochastic gradient methods Page 124

2016
[23]

Better theory for SGD in the nonconvex world

Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. Trans- actions on Machine Learning Research , 2023

2023
[24]

H.K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002

2002
[25]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv Preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[26]

Wahrscheinlichkeitstheorie

Achim Klenke. Wahrscheinlichkeitstheorie. Springer, 2006

2006
[27]

First-order and Stochastic Optimization Methods for Machine Learning

Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning . Springer Cham, 1st edition, 2021

2021
[28]

Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I

Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical Programming, 176(1):311–337, July 2019

2019
[29]

Analysis and design of optimization algorithms via integral quadratic constraints

Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. on Optimization , 26(1):57–95, 2016

2016
[30]

On almost sure convergence rates of stochastic gradient methods

Jun Liu and Ye Yuan. On almost sure convergence rates of stochastic gradient methods. In Proceedings of Thirty Fifth Conference on Learning Theory , volume 178 of Proceedings of Machine Learning Research, pages 2963–2983. PMLR, 02–05 Jul 2022

2022
[31]

High prob- ability convergence of stochastic gradient methods

Zijian Liu, Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Nguyen. High prob- ability convergence of stochastic gradient methods. In Proceedings of the 40th International Conference on Machine Learning , volume 202 of Proceedings of Machine Learning Research , pages 21884–21914. PMLR, 23–29 Jul 2023

2023
[32]

Convergence of a stochastic gradient method with mo- mentum for non-smooth non-convex optimization

Vien Mai and Mikael Johansson. Convergence of a stochastic gradient method with mo- mentum for non-smooth non-convex optimization. In Proceedings of the 37th International Conference on Machine Learning , volume 119 of Proceedings of Machine Learning Research , pages 6630–6639. PMLR, 13–18 Jul 2020

2020
[33]

On the almost sure convergence of stochastic gradient descent in non-convex problems

Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. In Advances in Neural Information Processing Systems , volume 33, pages 1117–1128. Curran Associates, Inc., 2020

2020
[34]

Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning. In Advances in Neural Information Processing Systems , vol- ume 24. Curran Associates, Inc., 2011

2011
[35]

Lectures on Convex Optimization

Yurii Nesterov. Lectures on Convex Optimization . Springer Cham, 2nd edition, 2018. Simon Weissmann Introduction to stochastic gradient methods Page 125

2018
[36]

Nguyen, Phuong Ha Nguyen, Peter Richtárik, Katya Scheinberg, Martin Takác, and Marten van Dijk

Lam M. Nguyen, Phuong Ha Nguyen, Peter Richtárik, Katya Scheinberg, Martin Takác, and Marten van Dijk. New convergence aspects of stochastic gradient algorithms. Journal of Machine Learning Research, 20:176:1–176:49, 2018

2018
[37]

Introduction to Optimization

Boris T Polyak. Introduction to Optimization . Translations series in mathematics and engi- neering. Optimization Software, Publications Division, 1987

1987
[38]

B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics , 4(5):1–17, 1964

1964
[39]

On the momentum term in gradient descent learning algorithms

Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999

1999
[40]

Robbins and D

H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartin- gales and some applications. In Jagdish S. Rustagi, editor, Optimizing Methods in Statistics , pages 233–257. Academic Press, 1971

1971
[41]

A Stochastic Approximation Method

Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400 – 407, 1951

1951
[42]

An overview of gradient descent optimization algorithms

Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv Preprint , arXiv:1609.04747, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[43]

Minimizing finite sums with the stochastic average gradient

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017

2017
[44]

Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball

Othmane Sebbouh, Robert M Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Proceedings of Thirty Fourth Conference on Learning Theory . PMLR, 2021

2021
[45]

Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning , 4(2):26– 31, 2012

2012
[46]

Almost sure convergence of stochastic gradient methods under gradient domination

Simon Weissmann, Sara Klein, Waïss Azizian, and Leif Döring. Almost sure convergence of stochastic gradient methods under gradient domination. Transactions on Machine Learning Research, 2025

2025
[47]

Multilevel optimization for inverse prob- lems

Simon Weissmann, Ashia Wilson, and Jakob Zech. Multilevel optimization for inverse prob- lems. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research , pages 5489–5524. PMLR, 02–05 Jul 2022

2022
[48]

Wilson, Benjamin Recht, and Michael I

Ashia C. Wilson, Benjamin Recht, and Michael I. Jordan. A Lyapunov analysis of accelerated methods in Optimization. J. Mach. Learn. Res. , 22(1), jan 2021. Simon Weissmann Introduction to stochastic gradient methods Page 126

2021
[49]

Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv Preprint , arXiv:1212.5701, 2012. A Appendix A.1 Convex sets and functions We give a brief overview of convex functions and properties that are important for optimization. This appendix covers selected material from [ 1, Appendix B], adapted to the notation and scope of these notes. We s...

work page internal anchor Pith review Pith/arXiv arXiv 2012
[50]

Then \i∈ICi is a convex set

Let fCi, i 2 Ig be a family of convex sets. Then \i∈ICi is a convex set
[51]

127 Simon Weissmann Introduction to stochastic gradient methods Page 128

Let C1, C2 Rd be two convex sets, then the set C = fx 2 Rd j x = x1 + x2, x 1 2 C1, x2 2 C2g is a convex set. 127 Simon Weissmann Introduction to stochastic gradient methods Page 128
[52]

The image of a convex set under linear transformation is again a convex set
[53]

Then the level sets Aα = fx 2 C j f (x) αg and Bα = fx 2 C j f (x) < αg are convex sets for all α 2 R

Let C Rd be a convex set and let f : C ! R be a convex function. Then the level sets Aα = fx 2 C j f (x) αg and Bα = fx 2 C j f (x) < αg are convex sets for all α 2 R. Definition A.1.3. Let f : C ! R be a function and C Rd be a convex set. We define the epigraph of f as E(f ) := f(x, w) 2 C R j f (x) wg, the set of all points above the graph of f defined ...
[54]

Every linear function is convex
[55]

Every norm in Rd is convex
[56]

, fn be convex functions and c1,

Let f1, . . . , fn be convex functions and c1, . . . , cn 0. a) Then x 7!Pn i=1 fi(x) is a convex function. b) Then x 7! supi=1,...,n fi(x) is a convex function. Exercise A.1.2. Prove the properties presented in Example A.1.6. Proposition A.1.7. Let C Rd be a convex set and let f : C ! R be differentiable over C. Then the following holds true:
[57]

The function f is convex over C if and only if f (z) f (x) + (z x)⊤rf (x), for all x, z 2 C
[58]

If f (z) > f (x) + (z x)⊤rf (x), x 6= z, for all x, z 2 C, then f is strictly convex. Proof. We only prove the first assertion, the second one follows by similar argumentation. Firstly, assume that f is convex and let x, z 2 C. Since C is a convex set, x+a(z x) 2 C for all a 2 (0, 1). The convexity of f implies that f (x + a(z x)) af (z) + (1 a)f (x), Sim...
[59]

If r2f (x) is positive semi-definite for all x 2 C, then f is convex over C
[60]

If r2f (x) is positive definite for all x 2 C, then f is strictly convex over C
[61]

If C = Rd and f is convex, then r2f (x) is positive semi-definite for all x 2 C
[62]

Moreover, f is strictly convex if and only if Q is positive definite

The quadratic function x 7! f (x) = x⊤Qx is convex if and only if Q is positive semi- definite. Moreover, f is strictly convex if and only if Q is positive definite. Proof. 1. Let x, y 2 C, then by the mean-value theorem there exists α 2 [0, 1] such that f (y) = f (x) + (y x)⊤rf (x) + 1 2 (y x)⊤r2f (x + α(y x))(y x) f (x) + (y x)⊤f (x), Simon Weissmann In...
[63]

Follows similarly as the proof of the first claim
[64]

By assumption we have that x 7! r 2f (x) is continuous, such that for ¯α > 0 small enough it holds true that z⊤r2f (x)z < 0, for α 2 [0, ¯α)

Let f be convex and suppose that there exists x 2 Rd such that z⊤r2f (x)z < 0 for some z 2 Rd. By assumption we have that x 7! r 2f (x) is continuous, such that for ¯α > 0 small enough it holds true that z⊤r2f (x)z < 0, for α 2 [0, ¯α). For an arbitrary ˜α 2 [0, ¯α) we set ˜z = ˜αz. By the mean-value theorem there exists β 2 [0, 1] such that f (x + ˜z) = ...
[65]

The hessian of f is given by r2f (x) = 2 Q for all x 2 Rd. With 1. and 3. it follows that f is convex if and only if Q is positive semi-definite. If Q is positive definite, then convexity of f follows by the 3. assertion. It is left to prove that strict convexity of f implies that Q is positive definite. Assume that f is strictly convex, then by 3. we kno...
[66]

Exercise A.1.3

A function f : C ! R is µ-strongly convex if and only if its hessian r2f (x) is uniformly positive definite with z⊤r2f (x)z µkzk2 for all x 2 C and all z 2 Rd. Exercise A.1.3. Prove Proposition A.1.10. A.2 Optimality conditions We recall a few basic optimality conditions that are used throughout the notes. The selection and presentation are partly guided ...
[67]

r2f (x∗) (strictly) positive definite. Then x∗ is a strict local minimum of f and there exist γ > 0, ε > 0 such that f (x) f (x∗) + γ 2 kx x∗k2 Simon Weissmann Introduction to stochastic gradient methods Page 134 for all x 2 B ε(x∗). Before proving Theorem A.2.4, we will need to prove the following auxiliary result. Lemma A.2.5. Let A 2 Rd×d be a symmetri...
[68]

λ1kzk2 z⊤Az λdkzk2 for all z 2 Rd
[69]

The matrix A is (strict) positive definite if and only if all eigenvalues are (strictly) positive. Proof. We start with the first assertion. Let z 2 Rd and write z = dX i=1 ξivi with coeﬀicients ξi 2 R. Then we can write z⊤Az = dX i=1 ξ2 i hvi, Av i|{z} =λivi i = dX i=1 λiξ2 i kvik2 8 < : λ1 Pd i=1 ξ2 i kvik2 = λikzk2, λd Pd i=1 ξ2 i kvik2 = λdkzk2., whic...
[70]

Every local minimum of f over S is also a global minimum over S
[71]

If f is even strictly convex, then there exists at most one global minimum
[72]

Then the condition rf (x∗) = 0 is a suﬀicient and necessary condition for x∗ 2 S to be a global minimum of f over S

Let S be an open set. Then the condition rf (x∗) = 0 is a suﬀicient and necessary condition for x∗ 2 S to be a global minimum of f over S. A.3 Lyapunov methods for optimization We will motivate the application of Lyapunov theory to optimization methods based on stability analysis of ordinary differential equations (ODEs). Lyapunov theory can be applied to...
[73]

stable, if for all ε > 0 there exists a δ = δ(ε) > 0 such that: kz(0)k < δ =) k z(t)k < ε for all t 0
[74]

unstable, if ¯z is not stable
[75]

locally asymptotically stable , if ¯z is stable and δ > 0 can be chosen such that: kz(0)k < δ =) lim t→∞ z(t) = 0
[76]

globally stable , if ¯z is stable and limε→∞ δ(ε) = 1
[77]

We consider a continuously differentiable function V : Rd ! R - in the following called candidate of a Lyapunov function - which will be used to be described along the trajectories

globally asymptotically stable , if limt→∞ z(t) = 0 for all z0 2 Rd. We consider a continuously differentiable function V : Rd ! R - in the following called candidate of a Lyapunov function - which will be used to be described along the trajectories. With the help of chain rule, we can describe the dynamical behavior of V along the trajectory of z(t): dV ...
[78]

V (¯z) = 0 and V (z) > 0 for all z 2 Rd n f¯zg,
[79]

V (z) ! 1 for kzk ! 1 ,
[80]

Simon Weissmann Introduction to stochastic gradient methods Page 137 If W (z) 0 for all z 2 Rd, then ¯z is globally stable

dV (z(t)) dt = hrzV (z(t)), g(z(t))i W (z(t)), for some continuous function W : Rd ! R. Simon Weissmann Introduction to stochastic gradient methods Page 137 If W (z) 0 for all z 2 Rd, then ¯z is globally stable. Moreover, if W (z) > 0 for all z 2 Rd nf ¯zg, then ¯z is even globally asymptotically stable. A function satisfying the above conditions is somet...

Showing first 80 references.

[1] [1]

Bertsekas

Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific, 2nd edition, September 2008

2008

[2] [2]

Bertsekas and John N

Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization , 10(3):627–642, 2000

2000

[3] [3]

Julius R. Blum. Approximation Methods which Converge with Probability one. The Annals of Mathematical Statistics , 25(2):382 – 386, 1954

1954

[4] [4]

Curtis, and Jorge Nocedal

Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018

2018

[5] [5]

A Kurdyka-Lojasiewicz property for stochastic optimization algorithms in a non-convex setting

Emilie Chouzenoux, Jean-Baptiste Fest, and Audrey Repetti. A Kurdyka-Lojasiewicz property for stochastic optimization algorithms in a non-convex setting. arXiv Preprint , arXiv:2302.06447, 2023

work page arXiv 2023

[6] [6]

Da Prato and J

G. Da Prato and J. Zabczyk. Stochastic Equations in Infinite Dimensions . Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1992

1992

[7] [7]

SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems , volume 27. Curran Associates, Inc., 2014

2014

[8] [8]

Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes

Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes. Journal of Machine Learning , 3(3):245–281, 2024

2024

[9] [9]

Adaptive subgradient methods for online learning and stochastic optimization

John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research , 12(61):2121–2159, 2011

2011

[10] [10]

Sharp analysis of stochastic optimization under global Kurdyka-Lojasiewicz inequality

Ilyas Fatkhullin, Jalal Etesami, Niao He, and Negar Kiyavash. Sharp analysis of stochastic optimization under global Kurdyka-Lojasiewicz inequality. In Advances in Neural Information Processing Systems, volume 35, pages 15836–15848. Curran Associates, Inc., 2022. 122 Simon Weissmann Introduction to stochastic gradient methods Page 123

2022

[11] [11]

Convergence rates and approxi- mation results for sgd and its continuous-time counterpart

Xavier Fontaine, Valentin De Bortoli, and Alain Durmus. Convergence rates and approxi- mation results for sgd and its continuous-time counterpart. In Proceedings of Thirty Fourth Conference on Learning Theory , volume 134 of Proceedings of Machine Learning Research , pages 1965–2058. PMLR, 15–19 Aug 2021

1965

[12] [12]

Stochastic heavy ball

Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic Journal of Statistics , 12(1):461 – 529, 2018

2018

[13] [13]

Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv-Preprint, 2023

2023

[14] [14]

Exponential convergence rates for momentum stochas- tic gradient descent in the overparametrized setting

Benjamin Gess and Sebastian Kassing. Exponential convergence rates for momentum stochas- tic gradient descent in the overparametrized setting. Mathematical Programming, 2026

2026

[15] [15]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming

Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341–2368, 2013

2013

[16] [16]

SGD for structured nonconvex func- tions: Learning rates, minibatching and interpolation

Robert Gower, Othmane Sebbouh, and Nicolas Loizou. SGD for structured nonconvex func- tions: Learning rates, minibatching and interpolation. In International Conference on Arti- ficial Intelligence and Statistics , pages 1315–1323. PMLR, 2021

2021

[17] [17]

Hadamard

J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin , pages 49–52, 1902

1902

[18] [18]

arXiv preprint arXiv:1905.00313 , title =

Elad Hazan and Sham Kakade. Revisiting the Polyak step size. arXiv Preprint , arXiv:1905.00313, 2022

work page arXiv 1905

[19] [19]

Strong error analysis for stochastic gradient descent optimization algorithms

Arnulf Jentzen, Benno Kuckuck, Ariel Neufeld, and Philippe von Wurstemberger. Strong error analysis for stochastic gradient descent optimization algorithms. IMA Journal of Numerical Analysis, 41(1):455–492, 05 2020

2020

[20] [20]

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

Arnulf Jentzen and Philippe von Wurstemberger. Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates. Journal of Complexity , 57:101438, 2020

2020

[21] [21]

Accelerating stochastic gradient descent using predictive vari- ance reduction

Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive vari- ance reduction. In Advances in Neural Information Processing Systems , volume 26. Curran Associates, Inc., 2013

2013

[22] [22]

Linear convergence of gradient and proximal- gradient methods under the Polyak-Łojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal- gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowl- edge Discovery in Databases , pages 795–811, Cham, 2016. Springer International Publishing. Simon Weissmann Introduction to stochastic gradient methods Page 124

2016

[23] [23]

Better theory for SGD in the nonconvex world

Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. Trans- actions on Machine Learning Research , 2023

2023

[24] [24]

H.K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002

2002

[25] [25]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv Preprint arXiv:1412.6980, 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[26] [26]

Wahrscheinlichkeitstheorie

Achim Klenke. Wahrscheinlichkeitstheorie. Springer, 2006

2006

[27] [27]

First-order and Stochastic Optimization Methods for Machine Learning

Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning . Springer Cham, 1st edition, 2021

2021

[28] [28]

Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I

Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical Programming, 176(1):311–337, July 2019

2019

[29] [29]

Analysis and design of optimization algorithms via integral quadratic constraints

Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. on Optimization , 26(1):57–95, 2016

2016

[30] [30]

On almost sure convergence rates of stochastic gradient methods

Jun Liu and Ye Yuan. On almost sure convergence rates of stochastic gradient methods. In Proceedings of Thirty Fifth Conference on Learning Theory , volume 178 of Proceedings of Machine Learning Research, pages 2963–2983. PMLR, 02–05 Jul 2022

2022

[31] [31]

High prob- ability convergence of stochastic gradient methods

Zijian Liu, Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Nguyen. High prob- ability convergence of stochastic gradient methods. In Proceedings of the 40th International Conference on Machine Learning , volume 202 of Proceedings of Machine Learning Research , pages 21884–21914. PMLR, 23–29 Jul 2023

2023

[32] [32]

Convergence of a stochastic gradient method with mo- mentum for non-smooth non-convex optimization

Vien Mai and Mikael Johansson. Convergence of a stochastic gradient method with mo- mentum for non-smooth non-convex optimization. In Proceedings of the 37th International Conference on Machine Learning , volume 119 of Proceedings of Machine Learning Research , pages 6630–6639. PMLR, 13–18 Jul 2020

2020

[33] [33]

On the almost sure convergence of stochastic gradient descent in non-convex problems

Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. In Advances in Neural Information Processing Systems , volume 33, pages 1117–1128. Curran Associates, Inc., 2020

2020

[34] [34]

Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning. In Advances in Neural Information Processing Systems , vol- ume 24. Curran Associates, Inc., 2011

2011

[35] [35]

Lectures on Convex Optimization

Yurii Nesterov. Lectures on Convex Optimization . Springer Cham, 2nd edition, 2018. Simon Weissmann Introduction to stochastic gradient methods Page 125

2018

[36] [36]

Nguyen, Phuong Ha Nguyen, Peter Richtárik, Katya Scheinberg, Martin Takác, and Marten van Dijk

Lam M. Nguyen, Phuong Ha Nguyen, Peter Richtárik, Katya Scheinberg, Martin Takác, and Marten van Dijk. New convergence aspects of stochastic gradient algorithms. Journal of Machine Learning Research, 20:176:1–176:49, 2018

2018

[37] [37]

Introduction to Optimization

Boris T Polyak. Introduction to Optimization . Translations series in mathematics and engi- neering. Optimization Software, Publications Division, 1987

1987

[38] [38]

B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics , 4(5):1–17, 1964

1964

[39] [39]

On the momentum term in gradient descent learning algorithms

Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999

1999

[40] [40]

Robbins and D

H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartin- gales and some applications. In Jagdish S. Rustagi, editor, Optimizing Methods in Statistics , pages 233–257. Academic Press, 1971

1971

[41] [41]

A Stochastic Approximation Method

Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400 – 407, 1951

1951

[42] [42]

An overview of gradient descent optimization algorithms

Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv Preprint , arXiv:1609.04747, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[43] [43]

Minimizing finite sums with the stochastic average gradient

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017

2017

[44] [44]

Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball

Othmane Sebbouh, Robert M Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Proceedings of Thirty Fourth Conference on Learning Theory . PMLR, 2021

2021

[45] [45]

Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude

Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning , 4(2):26– 31, 2012

2012

[46] [46]

Almost sure convergence of stochastic gradient methods under gradient domination

Simon Weissmann, Sara Klein, Waïss Azizian, and Leif Döring. Almost sure convergence of stochastic gradient methods under gradient domination. Transactions on Machine Learning Research, 2025

2025

[47] [47]

Multilevel optimization for inverse prob- lems

Simon Weissmann, Ashia Wilson, and Jakob Zech. Multilevel optimization for inverse prob- lems. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research , pages 5489–5524. PMLR, 02–05 Jul 2022

2022

[48] [48]

Wilson, Benjamin Recht, and Michael I

Ashia C. Wilson, Benjamin Recht, and Michael I. Jordan. A Lyapunov analysis of accelerated methods in Optimization. J. Mach. Learn. Res. , 22(1), jan 2021. Simon Weissmann Introduction to stochastic gradient methods Page 126

2021

[49] [49]

Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv Preprint , arXiv:1212.5701, 2012. A Appendix A.1 Convex sets and functions We give a brief overview of convex functions and properties that are important for optimization. This appendix covers selected material from [ 1, Appendix B], adapted to the notation and scope of these notes. We s...

work page internal anchor Pith review Pith/arXiv arXiv 2012

[50] [50]

Then \i∈ICi is a convex set

Let fCi, i 2 Ig be a family of convex sets. Then \i∈ICi is a convex set

[51] [51]

127 Simon Weissmann Introduction to stochastic gradient methods Page 128

Let C1, C2 Rd be two convex sets, then the set C = fx 2 Rd j x = x1 + x2, x 1 2 C1, x2 2 C2g is a convex set. 127 Simon Weissmann Introduction to stochastic gradient methods Page 128

[52] [52]

The image of a convex set under linear transformation is again a convex set

[53] [53]

Then the level sets Aα = fx 2 C j f (x) αg and Bα = fx 2 C j f (x) < αg are convex sets for all α 2 R

Let C Rd be a convex set and let f : C ! R be a convex function. Then the level sets Aα = fx 2 C j f (x) αg and Bα = fx 2 C j f (x) < αg are convex sets for all α 2 R. Definition A.1.3. Let f : C ! R be a function and C Rd be a convex set. We define the epigraph of f as E(f ) := f(x, w) 2 C R j f (x) wg, the set of all points above the graph of f defined ...

[54] [54]

Every linear function is convex

[55] [55]

Every norm in Rd is convex

[56] [56]

, fn be convex functions and c1,

Let f1, . . . , fn be convex functions and c1, . . . , cn 0. a) Then x 7!Pn i=1 fi(x) is a convex function. b) Then x 7! supi=1,...,n fi(x) is a convex function. Exercise A.1.2. Prove the properties presented in Example A.1.6. Proposition A.1.7. Let C Rd be a convex set and let f : C ! R be differentiable over C. Then the following holds true:

[57] [57]

The function f is convex over C if and only if f (z) f (x) + (z x)⊤rf (x), for all x, z 2 C

[58] [58]

If f (z) > f (x) + (z x)⊤rf (x), x 6= z, for all x, z 2 C, then f is strictly convex. Proof. We only prove the first assertion, the second one follows by similar argumentation. Firstly, assume that f is convex and let x, z 2 C. Since C is a convex set, x+a(z x) 2 C for all a 2 (0, 1). The convexity of f implies that f (x + a(z x)) af (z) + (1 a)f (x), Sim...

[59] [59]

If r2f (x) is positive semi-definite for all x 2 C, then f is convex over C

[60] [60]

If r2f (x) is positive definite for all x 2 C, then f is strictly convex over C

[61] [61]

If C = Rd and f is convex, then r2f (x) is positive semi-definite for all x 2 C

[62] [62]

Moreover, f is strictly convex if and only if Q is positive definite

The quadratic function x 7! f (x) = x⊤Qx is convex if and only if Q is positive semi- definite. Moreover, f is strictly convex if and only if Q is positive definite. Proof. 1. Let x, y 2 C, then by the mean-value theorem there exists α 2 [0, 1] such that f (y) = f (x) + (y x)⊤rf (x) + 1 2 (y x)⊤r2f (x + α(y x))(y x) f (x) + (y x)⊤f (x), Simon Weissmann In...

[63] [63]

Follows similarly as the proof of the first claim

[64] [64]

By assumption we have that x 7! r 2f (x) is continuous, such that for ¯α > 0 small enough it holds true that z⊤r2f (x)z < 0, for α 2 [0, ¯α)

Let f be convex and suppose that there exists x 2 Rd such that z⊤r2f (x)z < 0 for some z 2 Rd. By assumption we have that x 7! r 2f (x) is continuous, such that for ¯α > 0 small enough it holds true that z⊤r2f (x)z < 0, for α 2 [0, ¯α). For an arbitrary ˜α 2 [0, ¯α) we set ˜z = ˜αz. By the mean-value theorem there exists β 2 [0, 1] such that f (x + ˜z) = ...

[65] [65]

The hessian of f is given by r2f (x) = 2 Q for all x 2 Rd. With 1. and 3. it follows that f is convex if and only if Q is positive semi-definite. If Q is positive definite, then convexity of f follows by the 3. assertion. It is left to prove that strict convexity of f implies that Q is positive definite. Assume that f is strictly convex, then by 3. we kno...

[66] [66]

Exercise A.1.3

A function f : C ! R is µ-strongly convex if and only if its hessian r2f (x) is uniformly positive definite with z⊤r2f (x)z µkzk2 for all x 2 C and all z 2 Rd. Exercise A.1.3. Prove Proposition A.1.10. A.2 Optimality conditions We recall a few basic optimality conditions that are used throughout the notes. The selection and presentation are partly guided ...

[67] [67]

r2f (x∗) (strictly) positive definite. Then x∗ is a strict local minimum of f and there exist γ > 0, ε > 0 such that f (x) f (x∗) + γ 2 kx x∗k2 Simon Weissmann Introduction to stochastic gradient methods Page 134 for all x 2 B ε(x∗). Before proving Theorem A.2.4, we will need to prove the following auxiliary result. Lemma A.2.5. Let A 2 Rd×d be a symmetri...

[68] [68]

λ1kzk2 z⊤Az λdkzk2 for all z 2 Rd

[69] [69]

The matrix A is (strict) positive definite if and only if all eigenvalues are (strictly) positive. Proof. We start with the first assertion. Let z 2 Rd and write z = dX i=1 ξivi with coeﬀicients ξi 2 R. Then we can write z⊤Az = dX i=1 ξ2 i hvi, Av i|{z} =λivi i = dX i=1 λiξ2 i kvik2 8 < : λ1 Pd i=1 ξ2 i kvik2 = λikzk2, λd Pd i=1 ξ2 i kvik2 = λdkzk2., whic...

[70] [70]

Every local minimum of f over S is also a global minimum over S

[71] [71]

If f is even strictly convex, then there exists at most one global minimum

[72] [72]

Then the condition rf (x∗) = 0 is a suﬀicient and necessary condition for x∗ 2 S to be a global minimum of f over S

Let S be an open set. Then the condition rf (x∗) = 0 is a suﬀicient and necessary condition for x∗ 2 S to be a global minimum of f over S. A.3 Lyapunov methods for optimization We will motivate the application of Lyapunov theory to optimization methods based on stability analysis of ordinary differential equations (ODEs). Lyapunov theory can be applied to...

[73] [73]

stable, if for all ε > 0 there exists a δ = δ(ε) > 0 such that: kz(0)k < δ =) k z(t)k < ε for all t 0

[74] [74]

unstable, if ¯z is not stable

[75] [75]

locally asymptotically stable , if ¯z is stable and δ > 0 can be chosen such that: kz(0)k < δ =) lim t→∞ z(t) = 0

[76] [76]

globally stable , if ¯z is stable and limε→∞ δ(ε) = 1

[77] [77]

We consider a continuously differentiable function V : Rd ! R - in the following called candidate of a Lyapunov function - which will be used to be described along the trajectories

globally asymptotically stable , if limt→∞ z(t) = 0 for all z0 2 Rd. We consider a continuously differentiable function V : Rd ! R - in the following called candidate of a Lyapunov function - which will be used to be described along the trajectories. With the help of chain rule, we can describe the dynamical behavior of V along the trajectory of z(t): dV ...

[78] [78]

V (¯z) = 0 and V (z) > 0 for all z 2 Rd n f¯zg,

[79] [79]

V (z) ! 1 for kzk ! 1 ,

[80] [80]

Simon Weissmann Introduction to stochastic gradient methods Page 137 If W (z) 0 for all z 2 Rd, then ¯z is globally stable

dV (z(t)) dt = hrzV (z(t)), g(z(t))i W (z(t)), for some continuous function W : Rd ! R. Simon Weissmann Introduction to stochastic gradient methods Page 137 If W (z) 0 for all z 2 Rd, then ¯z is globally stable. Moreover, if W (z) > 0 for all z 2 Rd nf ¯zg, then ¯z is even globally asymptotically stable. A function satisfying the above conditions is somet...