pith. sign in

arxiv: 2606.03953 · v1 · pith:HM66DB53new · submitted 2026-06-02 · 🧮 math.OC

Introduction to stochastic gradient methods

Pith reviewed 2026-06-28 08:45 UTC · model grok-4.3

classification 🧮 math.OC
keywords stochastic gradient descentconvergence ratesconvex optimizationmachine learningstochastic approximationPolyak-Lojasiewicz conditionvariance reduction
0
0 comments X

The pith

Stochastic gradient descent achieves classical convergence rates under smoothness and convexity assumptions on the objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The lecture notes begin by analyzing deterministic gradient methods for unconstrained optimization under standard assumptions. They then introduce stochastic gradient descent motivated by empirical risk minimization in machine learning and derive its convergence in expectation and almost surely. The notes also discuss variance reduction methods and convergence under the Polyak-Lojasiewicz condition. These results provide the theoretical foundation for why stochastic methods work reliably in large-scale optimization problems.

Core claim

Under the assumptions of smoothness and convexity, stochastic gradient descent converges to the optimum at a rate of O(1/sqrt(T)) after T steps in expectation, with almost sure convergence also holding for appropriate step-size sequences, and faster rates available under strong convexity or with variance reduction techniques.

What carries the argument

The stochastic gradient descent update rule analyzed using descent lemmas derived from smoothness and convexity to bound the progress per iteration.

If this is right

  • Convergence rates apply directly to minimizing expected risk in machine learning models.
  • Variance reduction methods improve upon basic SGD by reducing the variance of gradient estimates.
  • Almost sure convergence guarantees that the iterates converge with probability one under suitable conditions.
  • The Polyak-Lojasiewicz condition allows linear convergence without requiring strong convexity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The derivations imply that step-size schedules must be chosen carefully based on the smoothness parameter to achieve the stated rates.
  • These results connect stochastic approximation theory to practical machine learning optimization.
  • Extensions to non-convex problems would require additional analysis beyond the paper's scope.

Load-bearing premise

The objective function must be smooth and satisfy convexity, strong convexity, or the Polyak-Lojasiewicz condition for the convergence proofs to hold.

What would settle it

A smooth convex function where running SGD with standard step sizes fails to achieve the O(1/sqrt(T)) rate in expectation.

Figures

Figures reproduced from arXiv: 2606.03953 by Simon Weissmann.

Figure 2.1
Figure 2.1. Figure 2.1: Illustration of descent methods. Lemma 2.1.2. Let f : R d → R be continuously differentiable in x ∈ R d . Then the condition ∇f(x) ⊤d < 0 (2.2) is sufficient for d ∈ R d being a descent direction of f in x. Proof. We fix x ∈ R d and d ∈ R d satisfying (2.2) and define φ(α) = f(x+αd). By Taylor expansion it follows φ(α) = φ(0) + αφ′ (0) + o(α). Note that φ(0) = f(x) and φ ′ (0) = ∇f(x) ⊤d such that φ(α) −… view at source ↗
Figure 2.2
Figure 2.2. Figure 2.2: Illustration of the Armijo step size rule based on [PITH_FULL_IMAGE:figures/full_fig_p015_2_2.png] view at source ↗
Figure 2.3
Figure 2.3. Figure 2.3: Illustration of possible terminations of the gradient descent method. However, not all stationary points are equally problematic. In particular, one can distinguish between local minima and saddle points. A stationary point x∗ is called a strict saddle point if the Hessian ∇2 f(x∗) has at least one negative eigenvalue. Intuitively, this means that there exists a direction of negative curvature along whic… view at source ↗
Figure 2.4
Figure 2.4. Figure 2.4: Illustration of L-smoothness and µ-strong convexity. We consider the function f(x) = x 2 such that f is L-smooth and µ-strongly convex with 1 := µ < f′′(x) = 2 < L =: 10 for all x ∈ R. This plot illustrates convexity and smoothness of f in x = 1 using the upper and lower bound (2.9) and (2.10) on f(y), y ∈ R. This motivates the definition of the Bregman divergence. Definition 2.2.3. Let f : R d → R be co… view at source ↗
Figure 2.5
Figure 2.5. Figure 2.5: Illustration of sub-gradients for differentiable convex (left) and non-differentiable con￾vex (right) functions. For continuously differentiable and convex f, the sub-gradient is unique, i.e. ∂f(x) = {∇f(x)}. Example 2.3.3. Let fi : R d → R be a family of convex and differentiable functions i = 1, . . . , N for some N ∈ N. We define F(x) = maxi=1,...,N fi(x) and for given x ∈ R d we consider j ∈ arg maxi… view at source ↗
Figure 2.6
Figure 2.6. Figure 2.6: Illustration of Example 2.3.3. We consider N = 2 and define f(x) = max(f1(x), f2(x)) for two convex and differentiable functions f1, f2. In the above Situation, we find unique sub￾gradients for f1(x) > f2(x) given by ∂f(x) = {f ′ 1 (x)}, and similarly for f1(x) < f2(x) given by ∂f(x) = {f ′ 2 (x)}. In the case of f1(x) = f2(x), the sub-differential is given by ∂f(x) = [f ′ 2 (x), f′ 1 (x)]. There are sim… view at source ↗
Figure 3.1
Figure 3.1. Figure 3.1: Contour lines of a quadratic function with increasing condition number κ. 3.1 Polyak’s heavy ball method The idea of incorporating momentum into iterative optimization schemes goes back to Polyak’s heavy ball method [38]. The name is motivated by the physical picture of a heavy ball rolling down a hill towards a valley. A light ball is strongly affected by local curvature and may oscillate in narrow vall… view at source ↗
Figure 3.2
Figure 3.2. Figure 3.2: Comparison of gradient descent and the heavy-ball method in a flat valley. The blue path illustrates the typical zigzag behavior of gradient descent. The orange path illustrates the smoother trajectory induced by momentum. At the current iterate xk, the heavy-ball update combines a gradient force (blue dotted arrow) with a momentum force (green dotted arrow), resulting in the new step (orange arrow) towa… view at source ↗
Figure 3.3
Figure 3.3. Figure 3.3: Illustration of the linear convergence rate depending on the condition number κ = µ L for GD and NAM. The left plot shows the convergence rate c GD(κ) = κ−1 κ+1 2 and c NAM(κ) =  √ κ−1 √ κ  , whereas the right plot shows the difference to 1, i.e. 1 − c(κ), in logarithmic scale [PITH_FULL_IMAGE:figures/full_fig_p064_3_3.png] view at source ↗
Figure 4.1
Figure 4.1. Figure 4.1: Illustration of ill-posedness through observational noise. The occurrence of noise might shift the observed data outside of the range of the forward map H. Θ R(H) H cY θ1 θ2 H [PITH_FULL_IMAGE:figures/full_fig_p068_4_1.png] view at source ↗
Figure 4.2
Figure 4.2. Figure 4.2: Illustration of ill-posedness through multiple solutions. Two different parameters θ1, θ2 ∈ Θ might map onto the observed data Yb. Outlook 2 (Statistical learning perspective). For example, in the area of statistical learning among other questions one is interested in the consistency of solutions of the empirical risk minimization problem. Let XbN be the minimizer of the empirical risk and x∗ be the mini… view at source ↗
Figure 4.3
Figure 4.3. Figure 4.3: Illustration of ill-posedness through discontinuity. Even if H is invertible, the instability may occur in the solution of the inverse problem resulting from possible discontinuity of the inverse operator. of the corresponding expected risk (provided both exist), then we can decompose the error F(XbN )−F(x∗) = F(XbN )−FN (XbN )+FN (XbN )−FN (x∗)+FN (x∗)−F(x∗) ≤ 2 sup x∈Rd |FN (x)−F(x)|. We emphasize that… view at source ↗
read the original abstract

These lecture notes provide an introduction to first-order optimization methods with a particular emphasis on stochastic gradient methods. We begin with deterministic gradient based methods for unconstrained optimization and study their convergence under standard assumptions such as smoothness, convexity, strong convexity, and the Polyak-Lojasiewicz condition. We then turn to stochastic approximation and stochastic gradient descent, motivated by empirical and expected risk minimization in machine learning. The main focus is on convergence theory: we discuss almost sure convergence and convergence in expectation, derive classical convergence rates, and present selected advanced topics, including almost sure convergence rates and variance reduction methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 0 minor

Summary. The manuscript consists of lecture notes introducing first-order optimization methods, beginning with deterministic gradient descent and progressing to stochastic gradient descent. It derives classical convergence rates (in expectation and almost surely) for problems satisfying smoothness, convexity, strong convexity, and the Polyak-Łojasiewicz condition, motivated by empirical and expected risk minimization, and includes selected topics on almost sure rates and variance reduction.

Significance. The notes reproduce well-established results from the stochastic optimization literature without introducing new theorems, extensions, or parameter-free derivations. While a clear compilation of standard material may have pedagogical utility, the absence of novel contributions limits its significance for advancing research in mathematical optimization.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review of our lecture notes. The manuscript is explicitly positioned as an introductory compilation of established results rather than a research contribution with novel theorems.

read point-by-point responses
  1. Referee: The notes reproduce well-established results from the stochastic optimization literature without introducing new theorems, extensions, or parameter-free derivations. While a clear compilation of standard material may have pedagogical utility, the absence of novel contributions limits its significance for advancing research in mathematical optimization.

    Authors: We agree that the notes contain no new theorems and are a compilation of classical convergence results under standard assumptions (smoothness, convexity, strong convexity, PL condition). This is by design: the document is lecture notes whose purpose is to provide a self-contained, accessible derivation of these rates, motivated by empirical and expected risk minimization, for use in teaching or self-study. The referee acknowledges possible pedagogical utility; we believe this is the appropriate evaluation criterion for lecture notes rather than research novelty. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The manuscript consists of lecture notes reproducing classical convergence results for gradient descent and stochastic gradient descent under standard external assumptions (smoothness, convexity, strong convexity, Polyak-Łojasiewicz). All derivations follow from these assumptions via standard analysis techniques drawn from the broader literature; no parameters are fitted and then relabeled as predictions, no self-citations form load-bearing uniqueness arguments, and no ansatzes or renamings reduce the claimed results to the inputs by construction. The notes are therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As lecture notes on standard topics, the content relies on classical assumptions from optimization theory without introducing new free parameters or entities.

axioms (1)
  • domain assumption Objective functions satisfy L-smoothness, convexity, strong convexity, or Polyak-Lojasiewicz condition.
    Invoked throughout for convergence analysis as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5604 in / 938 out tokens · 27651 ms · 2026-06-28T08:45:47.006581+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

88 extracted references · 5 canonical work pages · 3 internal anchors

  1. [1]

    Bertsekas

    Dimitri P. Bertsekas. Nonlinear programming. Athena Scientific, 2nd edition, September 2008

  2. [2]

    Bertsekas and John N

    Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization , 10(3):627–642, 2000

  3. [3]

    Julius R. Blum. Approximation Methods which Converge with Probability one. The Annals of Mathematical Statistics , 25(2):382 – 386, 1954

  4. [4]

    Curtis, and Jorge Nocedal

    Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018

  5. [5]

    A Kurdyka-Lojasiewicz property for stochastic optimization algorithms in a non-convex setting

    Emilie Chouzenoux, Jean-Baptiste Fest, and Audrey Repetti. A Kurdyka-Lojasiewicz property for stochastic optimization algorithms in a non-convex setting. arXiv Preprint , arXiv:2302.06447, 2023

  6. [6]

    Da Prato and J

    G. Da Prato and J. Zabczyk. Stochastic Equations in Infinite Dimensions . Encyclopedia of Mathematics and its Applications. Cambridge University Press, 1992

  7. [7]

    SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives

    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems , volume 27. Curran Associates, Inc., 2014

  8. [8]

    Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes

    Steffen Dereich and Sebastian Kassing. Convergence of stochastic gradient descent schemes for Łojasiewicz-landscapes. Journal of Machine Learning , 3(3):245–281, 2024

  9. [9]

    Adaptive subgradient methods for online learning and stochastic optimization

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research , 12(61):2121–2159, 2011

  10. [10]

    Sharp analysis of stochastic optimization under global Kurdyka-Lojasiewicz inequality

    Ilyas Fatkhullin, Jalal Etesami, Niao He, and Negar Kiyavash. Sharp analysis of stochastic optimization under global Kurdyka-Lojasiewicz inequality. In Advances in Neural Information Processing Systems, volume 35, pages 15836–15848. Curran Associates, Inc., 2022. 122 Simon Weissmann Introduction to stochastic gradient methods Page 123

  11. [11]

    Convergence rates and approxi- mation results for sgd and its continuous-time counterpart

    Xavier Fontaine, Valentin De Bortoli, and Alain Durmus. Convergence rates and approxi- mation results for sgd and its continuous-time counterpart. In Proceedings of Thirty Fourth Conference on Learning Theory , volume 134 of Proceedings of Machine Learning Research , pages 1965–2058. PMLR, 15–19 Aug 2021

  12. [12]

    Stochastic heavy ball

    Sébastien Gadat, Fabien Panloup, and Sofiane Saadane. Stochastic heavy ball. Electronic Journal of Statistics , 12(1):461 – 529, 2018

  13. [13]

    Guillaume Garrigos and Robert M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv-Preprint, 2023

  14. [14]

    Exponential convergence rates for momentum stochas- tic gradient descent in the overparametrized setting

    Benjamin Gess and Sebastian Kassing. Exponential convergence rates for momentum stochas- tic gradient descent in the overparametrized setting. Mathematical Programming, 2026

  15. [15]

    Stochastic first- and zeroth-order methods for nonconvex stochastic programming

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization , 23(4):2341–2368, 2013

  16. [16]

    SGD for structured nonconvex func- tions: Learning rates, minibatching and interpolation

    Robert Gower, Othmane Sebbouh, and Nicolas Loizou. SGD for structured nonconvex func- tions: Learning rates, minibatching and interpolation. In International Conference on Arti- ficial Intelligence and Statistics , pages 1315–1323. PMLR, 2021

  17. [17]

    Hadamard

    J. Hadamard. Sur les problèmes aux dérivées partielles et leur signification physique. Princeton University Bulletin , pages 49–52, 1902

  18. [18]

    arXiv preprint arXiv:1905.00313 , title =

    Elad Hazan and Sham Kakade. Revisiting the Polyak step size. arXiv Preprint , arXiv:1905.00313, 2022

  19. [19]

    Strong error analysis for stochastic gradient descent optimization algorithms

    Arnulf Jentzen, Benno Kuckuck, Ariel Neufeld, and Philippe von Wurstemberger. Strong error analysis for stochastic gradient descent optimization algorithms. IMA Journal of Numerical Analysis, 41(1):455–492, 05 2020

  20. [20]

    Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates

    Arnulf Jentzen and Philippe von Wurstemberger. Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowly and fast decaying learning rates. Journal of Complexity , 57:101438, 2020

  21. [21]

    Accelerating stochastic gradient descent using predictive vari- ance reduction

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive vari- ance reduction. In Advances in Neural Information Processing Systems , volume 26. Curran Associates, Inc., 2013

  22. [22]

    Linear convergence of gradient and proximal- gradient methods under the Polyak-Łojasiewicz condition

    Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal- gradient methods under the Polyak-Łojasiewicz condition. In Machine Learning and Knowl- edge Discovery in Databases , pages 795–811, Cham, 2016. Springer International Publishing. Simon Weissmann Introduction to stochastic gradient methods Page 124

  23. [23]

    Better theory for SGD in the nonconvex world

    Ahmed Khaled and Peter Richtárik. Better theory for SGD in the nonconvex world. Trans- actions on Machine Learning Research , 2023

  24. [24]

    H.K. Khalil. Nonlinear Systems. Pearson Education. Prentice Hall, 2002

  25. [25]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv Preprint arXiv:1412.6980, 2014

  26. [26]

    Wahrscheinlichkeitstheorie

    Achim Klenke. Wahrscheinlichkeitstheorie. Springer, 2006

  27. [27]

    First-order and Stochastic Optimization Methods for Machine Learning

    Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning . Springer Cham, 1st edition, 2021

  28. [28]

    Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I

    Jason D. Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I. Jordan, and Benjamin Recht. First-order methods almost always avoid strict saddle points. Mathematical Programming, 176(1):311–337, July 2019

  29. [29]

    Analysis and design of optimization algorithms via integral quadratic constraints

    Laurent Lessard, Benjamin Recht, and Andrew Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. on Optimization , 26(1):57–95, 2016

  30. [30]

    On almost sure convergence rates of stochastic gradient methods

    Jun Liu and Ye Yuan. On almost sure convergence rates of stochastic gradient methods. In Proceedings of Thirty Fifth Conference on Learning Theory , volume 178 of Proceedings of Machine Learning Research, pages 2963–2983. PMLR, 02–05 Jul 2022

  31. [31]

    High prob- ability convergence of stochastic gradient methods

    Zijian Liu, Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Nguyen. High prob- ability convergence of stochastic gradient methods. In Proceedings of the 40th International Conference on Machine Learning , volume 202 of Proceedings of Machine Learning Research , pages 21884–21914. PMLR, 23–29 Jul 2023

  32. [32]

    Convergence of a stochastic gradient method with mo- mentum for non-smooth non-convex optimization

    Vien Mai and Mikael Johansson. Convergence of a stochastic gradient method with mo- mentum for non-smooth non-convex optimization. In Proceedings of the 37th International Conference on Machine Learning , volume 119 of Proceedings of Machine Learning Research , pages 6630–6639. PMLR, 13–18 Jul 2020

  33. [33]

    On the almost sure convergence of stochastic gradient descent in non-convex problems

    Panayotis Mertikopoulos, Nadav Hallak, Ali Kavis, and Volkan Cevher. On the almost sure convergence of stochastic gradient descent in non-convex problems. In Advances in Neural Information Processing Systems , volume 33, pages 1117–1128. Curran Associates, Inc., 2020

  34. [34]

    Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning

    Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning. In Advances in Neural Information Processing Systems , vol- ume 24. Curran Associates, Inc., 2011

  35. [35]

    Lectures on Convex Optimization

    Yurii Nesterov. Lectures on Convex Optimization . Springer Cham, 2nd edition, 2018. Simon Weissmann Introduction to stochastic gradient methods Page 125

  36. [36]

    Nguyen, Phuong Ha Nguyen, Peter Richtárik, Katya Scheinberg, Martin Takác, and Marten van Dijk

    Lam M. Nguyen, Phuong Ha Nguyen, Peter Richtárik, Katya Scheinberg, Martin Takác, and Marten van Dijk. New convergence aspects of stochastic gradient algorithms. Journal of Machine Learning Research, 20:176:1–176:49, 2018

  37. [37]

    Introduction to Optimization

    Boris T Polyak. Introduction to Optimization . Translations series in mathematics and engi- neering. Optimization Software, Publications Division, 1987

  38. [38]

    B.T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics , 4(5):1–17, 1964

  39. [39]

    On the momentum term in gradient descent learning algorithms

    Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999

  40. [40]

    Robbins and D

    H. Robbins and D. Siegmund. A convergence theorem for non negative almost supermartin- gales and some applications. In Jagdish S. Rustagi, editor, Optimizing Methods in Statistics , pages 233–257. Academic Press, 1971

  41. [41]

    A Stochastic Approximation Method

    Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400 – 407, 1951

  42. [42]

    An overview of gradient descent optimization algorithms

    Sebastian Ruder. An overview of gradient descent optimization algorithms. arXiv Preprint , arXiv:1609.04747, 2017

  43. [43]

    Minimizing finite sums with the stochastic average gradient

    Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017

  44. [44]

    Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball

    Othmane Sebbouh, Robert M Gower, and Aaron Defazio. Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball. In Proceedings of Thirty Fourth Conference on Learning Theory . PMLR, 2021

  45. [45]

    Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude

    Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning , 4(2):26– 31, 2012

  46. [46]

    Almost sure convergence of stochastic gradient methods under gradient domination

    Simon Weissmann, Sara Klein, Waïss Azizian, and Leif Döring. Almost sure convergence of stochastic gradient methods under gradient domination. Transactions on Machine Learning Research, 2025

  47. [47]

    Multilevel optimization for inverse prob- lems

    Simon Weissmann, Ashia Wilson, and Jakob Zech. Multilevel optimization for inverse prob- lems. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research , pages 5489–5524. PMLR, 02–05 Jul 2022

  48. [48]

    Wilson, Benjamin Recht, and Michael I

    Ashia C. Wilson, Benjamin Recht, and Michael I. Jordan. A Lyapunov analysis of accelerated methods in Optimization. J. Mach. Learn. Res. , 22(1), jan 2021. Simon Weissmann Introduction to stochastic gradient methods Page 126

  49. [49]

    Matthew D. Zeiler. Adadelta: An adaptive learning rate method. arXiv Preprint , arXiv:1212.5701, 2012. A Appendix A.1 Convex sets and functions We give a brief overview of convex functions and properties that are important for optimization. This appendix covers selected material from [ 1, Appendix B], adapted to the notation and scope of these notes. We s...

  50. [50]

    Then \i∈ICi is a convex set

    Let fCi, i 2 Ig be a family of convex sets. Then \i∈ICi is a convex set

  51. [51]

    127 Simon Weissmann Introduction to stochastic gradient methods Page 128

    Let C1, C2 Rd be two convex sets, then the set C = fx 2 Rd j x = x1 + x2, x 1 2 C1, x2 2 C2g is a convex set. 127 Simon Weissmann Introduction to stochastic gradient methods Page 128

  52. [52]

    The image of a convex set under linear transformation is again a convex set

  53. [53]

    Then the level sets Aα = fx 2 C j f (x) αg and Bα = fx 2 C j f (x) < αg are convex sets for all α 2 R

    Let C Rd be a convex set and let f : C ! R be a convex function. Then the level sets Aα = fx 2 C j f (x) αg and Bα = fx 2 C j f (x) < αg are convex sets for all α 2 R. Definition A.1.3. Let f : C ! R be a function and C Rd be a convex set. We define the epigraph of f as E(f ) := f(x, w) 2 C R j f (x) wg, the set of all points above the graph of f defined ...

  54. [54]

    Every linear function is convex

  55. [55]

    Every norm in Rd is convex

  56. [56]

    , fn be convex functions and c1,

    Let f1, . . . , fn be convex functions and c1, . . . , cn 0. a) Then x 7!Pn i=1 fi(x) is a convex function. b) Then x 7! supi=1,...,n fi(x) is a convex function. Exercise A.1.2. Prove the properties presented in Example A.1.6. Proposition A.1.7. Let C Rd be a convex set and let f : C ! R be differentiable over C. Then the following holds true:

  57. [57]

    The function f is convex over C if and only if f (z) f (x) + (z x)⊤rf (x), for all x, z 2 C

  58. [58]

    If f (z) > f (x) + (z x)⊤rf (x), x 6= z, for all x, z 2 C, then f is strictly convex. Proof. We only prove the first assertion, the second one follows by similar argumentation. Firstly, assume that f is convex and let x, z 2 C. Since C is a convex set, x+a(z x) 2 C for all a 2 (0, 1). The convexity of f implies that f (x + a(z x)) af (z) + (1 a)f (x), Sim...

  59. [59]

    If r2f (x) is positive semi-definite for all x 2 C, then f is convex over C

  60. [60]

    If r2f (x) is positive definite for all x 2 C, then f is strictly convex over C

  61. [61]

    If C = Rd and f is convex, then r2f (x) is positive semi-definite for all x 2 C

  62. [62]

    Moreover, f is strictly convex if and only if Q is positive definite

    The quadratic function x 7! f (x) = x⊤Qx is convex if and only if Q is positive semi- definite. Moreover, f is strictly convex if and only if Q is positive definite. Proof. 1. Let x, y 2 C, then by the mean-value theorem there exists α 2 [0, 1] such that f (y) = f (x) + (y x)⊤rf (x) + 1 2 (y x)⊤r2f (x + α(y x))(y x) f (x) + (y x)⊤f (x), Simon Weissmann In...

  63. [63]

    Follows similarly as the proof of the first claim

  64. [64]

    By assumption we have that x 7! r 2f (x) is continuous, such that for ¯α > 0 small enough it holds true that z⊤r2f (x)z < 0, for α 2 [0, ¯α)

    Let f be convex and suppose that there exists x 2 Rd such that z⊤r2f (x)z < 0 for some z 2 Rd. By assumption we have that x 7! r 2f (x) is continuous, such that for ¯α > 0 small enough it holds true that z⊤r2f (x)z < 0, for α 2 [0, ¯α). For an arbitrary ˜α 2 [0, ¯α) we set ˜z = ˜αz. By the mean-value theorem there exists β 2 [0, 1] such that f (x + ˜z) = ...

  65. [65]

    The hessian of f is given by r2f (x) = 2 Q for all x 2 Rd. With 1. and 3. it follows that f is convex if and only if Q is positive semi-definite. If Q is positive definite, then convexity of f follows by the 3. assertion. It is left to prove that strict convexity of f implies that Q is positive definite. Assume that f is strictly convex, then by 3. we kno...

  66. [66]

    Exercise A.1.3

    A function f : C ! R is µ-strongly convex if and only if its hessian r2f (x) is uniformly positive definite with z⊤r2f (x)z µkzk2 for all x 2 C and all z 2 Rd. Exercise A.1.3. Prove Proposition A.1.10. A.2 Optimality conditions We recall a few basic optimality conditions that are used throughout the notes. The selection and presentation are partly guided ...

  67. [67]

    r2f (x∗) (strictly) positive definite. Then x∗ is a strict local minimum of f and there exist γ > 0, ε > 0 such that f (x) f (x∗) + γ 2 kx x∗k2 Simon Weissmann Introduction to stochastic gradient methods Page 134 for all x 2 B ε(x∗). Before proving Theorem A.2.4, we will need to prove the following auxiliary result. Lemma A.2.5. Let A 2 Rd×d be a symmetri...

  68. [68]

    λ1kzk2 z⊤Az λdkzk2 for all z 2 Rd

  69. [69]

    The matrix A is (strict) positive definite if and only if all eigenvalues are (strictly) positive. Proof. We start with the first assertion. Let z 2 Rd and write z = dX i=1 ξivi with coefficients ξi 2 R. Then we can write z⊤Az = dX i=1 ξ2 i hvi, Av i|{z} =λivi i = dX i=1 λiξ2 i kvik2 8 < : λ1 Pd i=1 ξ2 i kvik2 = λikzk2, λd Pd i=1 ξ2 i kvik2 = λdkzk2., whic...

  70. [70]

    Every local minimum of f over S is also a global minimum over S

  71. [71]

    If f is even strictly convex, then there exists at most one global minimum

  72. [72]

    Then the condition rf (x∗) = 0 is a sufficient and necessary condition for x∗ 2 S to be a global minimum of f over S

    Let S be an open set. Then the condition rf (x∗) = 0 is a sufficient and necessary condition for x∗ 2 S to be a global minimum of f over S. A.3 Lyapunov methods for optimization We will motivate the application of Lyapunov theory to optimization methods based on stability analysis of ordinary differential equations (ODEs). Lyapunov theory can be applied to...

  73. [73]

    stable, if for all ε > 0 there exists a δ = δ(ε) > 0 such that: kz(0)k < δ =) k z(t)k < ε for all t 0

  74. [74]

    unstable, if ¯z is not stable

  75. [75]

    locally asymptotically stable , if ¯z is stable and δ > 0 can be chosen such that: kz(0)k < δ =) lim t→∞ z(t) = 0

  76. [76]

    globally stable , if ¯z is stable and limε→∞ δ(ε) = 1

  77. [77]

    We consider a continuously differentiable function V : Rd ! R - in the following called candidate of a Lyapunov function - which will be used to be described along the trajectories

    globally asymptotically stable , if limt→∞ z(t) = 0 for all z0 2 Rd. We consider a continuously differentiable function V : Rd ! R - in the following called candidate of a Lyapunov function - which will be used to be described along the trajectories. With the help of chain rule, we can describe the dynamical behavior of V along the trajectory of z(t): dV ...

  78. [78]

    V (¯z) = 0 and V (z) > 0 for all z 2 Rd n f¯zg,

  79. [79]

    V (z) ! 1 for kzk ! 1 ,

  80. [80]

    Simon Weissmann Introduction to stochastic gradient methods Page 137 If W (z) 0 for all z 2 Rd, then ¯z is globally stable

    dV (z(t)) dt = hrzV (z(t)), g(z(t))i W (z(t)), for some continuous function W : Rd ! R. Simon Weissmann Introduction to stochastic gradient methods Page 137 If W (z) 0 for all z 2 Rd, then ¯z is globally stable. Moreover, if W (z) > 0 for all z 2 Rd nf ¯zg, then ¯z is even globally asymptotically stable. A function satisfying the above conditions is somet...

Showing first 80 references.