pith. sign in

arxiv: 2502.06719 · v3 · submitted 2025-02-10 · 📊 stat.ML · cs.LG· math.OC· math.PR· math.ST· stat.TH

Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

Pith reviewed 2026-05-23 03:27 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OCmath.PRmath.STstat.TH
keywords multiplier bootstrapstochastic gradient descentGaussian approximationnon-asymptotic boundsconvex distanceconfidence setscentral limit theorem
0
0 comments X

The pith

Multiplier bootstrap for SGD achieves non-asymptotic rates up to 1/sqrt(n) in convex distance without approximating limiting covariance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes the non-asymptotic validity of the multiplier bootstrap procedure for constructing confidence sets using the Stochastic Gradient Descent algorithm. Under appropriate regularity conditions, the approach achieves approximation rates in convex distance of order up to 1/sqrt(n), which can be faster than the rate provable from the Polyak-Juditsky central limit theorem. It provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. The analysis builds on Gaussian approximation results for nonlinear statistics of independent random variables. Readers would care because the method allows construction of confidence sets for SGD iterates without first approximating their limiting covariance.

Core claim

Under appropriate regularity conditions, the multiplier bootstrap procedure for SGD achieves approximation rates in convex distance of order up to 1/sqrt(n), which can be faster than the rate provable from the Polyak-Juditsky central limit theorem, and provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. The analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.

What carries the argument

Multiplier bootstrap procedure for SGD iterates, which constructs bootstrap distributions directly from the iterates to approximate their law in convex distance.

If this is right

  • Confidence sets for SGD can be formed without separately estimating the limiting covariance matrix.
  • The bootstrap approximation rate reaches order 1/sqrt(n) in convex distance under the regularity conditions.
  • This rate can exceed what is provable from the Polyak-Juditsky central limit theorem for the same iterates.
  • Fully non-asymptotic error bounds become available for the accuracy of bootstrap approximations in SGD.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach may extend to other first-order stochastic optimization methods that produce dependent iterates.
  • Practitioners could obtain finite-sample uncertainty quantification for SGD-trained models without relying on asymptotic normality.
  • The technique suggests bootstrap could replace covariance estimation in high-dimensional or online learning settings where CLT rates are slow.

Load-bearing premise

The unspecified regularity conditions on the SGD iterates and loss function must hold for the existing Gaussian approximation results to deliver the stated convex-distance bound.

What would settle it

A simulation on a simple strongly convex quadratic loss where the convex distance between the multiplier bootstrap distribution and the true distribution of normalized SGD iterates does not decrease at order 1/sqrt(n) for large n.

Figures

Figures reproduced from arXiv: 2502.06719 by Alexey Naumov, Denis Belomestny, Eric Moulines, Marina Sheshukova, Qi-Man Shao, Sergey Samsonov, Zhuo-Song Zhang.

Figure 1
Figure 1. Figure 1: Numerical verification of the lower bound given in Theorem [PITH_FULL_IMAGE:figures/full_fig_p032_1.png] view at source ↗
read the original abstract

In this paper, we establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing the confidence sets using the Stochastic Gradient Descent (SGD) algorithm. Under appropriate regularity conditions, our approach avoids the need to approximate the limiting covariance of Polyak-Ruppert SGD iterates, which allows us to derive approximation rates in convex distance of order up to $1/\sqrt{n}$. Notably, this rate can be faster than the one that can be proven in the Polyak-Juditsky central limit theorem. To our knowledge, this provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. Our analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing confidence sets with SGD iterates. Under appropriate regularity conditions, the approach yields convex-distance approximation rates of order up to 1/sqrt(n) without needing to approximate the limiting covariance from the Polyak-Ruppert averaging, a rate that can be faster than what follows from the Polyak-Juditsky CLT; the analysis builds on existing Gaussian approximation theorems for nonlinear statistics of independent random variables and is presented as the first fully non-asymptotic bound for bootstrap accuracy in SGD.

Significance. If the dependence structure of SGD can be reduced to the independent nonlinear statistics setting without degrading the rate and if the regularity conditions are verifiable in standard regimes, the result would supply a practically useful non-asymptotic tool for inference after SGD that improves on existing asymptotic guarantees. The explicit non-asymptotic focus and the attempt to obtain a rate faster than the classical CLT are strengths worth preserving.

major comments (2)
  1. [Main result / Theorem statement] The central claim that the 1/sqrt(n) convex-distance bound is achievable for SGD rests on applying Gaussian approximation results for nonlinear statistics of independent random variables to the dependent sequence produced by SGD. The manuscript must show explicitly (in the section containing the main theorem) how the dependence is handled—e.g., via martingale approximation, blocking, or coupling—without introducing an extra error term that would cancel the claimed improvement over the Polyak-Juditsky rate.
  2. [Assumptions / Regularity conditions] The abstract invokes 'appropriate regularity conditions' on the SGD iterates and loss function but does not list them. Because these conditions are load-bearing for both the validity of the external Gaussian approximation and the preservation of the 1/sqrt(n) rate, they must be stated precisely (with explicit dependence on dimension, step-size schedule, and strong-convexity parameters) so that readers can check whether the bound applies in the regimes where SGD is typically used.
minor comments (1)
  1. [Abstract] The phrase 'rates of order up to 1/sqrt(n)' should be replaced by an explicit big-O statement together with the precise conditions under which the upper bound is attained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify areas where the presentation can be strengthened for clarity without altering the core contributions. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Main result / Theorem statement] The central claim that the 1/sqrt(n) convex-distance bound is achievable for SGD rests on applying Gaussian approximation results for nonlinear statistics of independent random variables to the dependent sequence produced by SGD. The manuscript must show explicitly (in the section containing the main theorem) how the dependence is handled—e.g., via martingale approximation, blocking, or coupling—without introducing an extra error term that would cancel the claimed improvement over the Polyak-Juditsky rate.

    Authors: We agree that an explicit description of the dependence reduction must appear in the main theorem section. In the revised manuscript we will insert a short dedicated paragraph immediately after the statement of the main result that outlines the martingale approximation step, recalls the relevant error bound from the literature on SGD, and verifies that this error is of strictly lower order than the target 1/sqrt(n) convex-distance rate, thereby preserving the claimed improvement. revision: yes

  2. Referee: [Assumptions / Regularity conditions] The abstract invokes 'appropriate regularity conditions' on the SGD iterates and loss function but does not list them. Because these conditions are load-bearing for both the validity of the external Gaussian approximation and the preservation of the 1/sqrt(n) rate, they must be stated precisely (with explicit dependence on dimension, step-size schedule, and strong-convexity parameters) so that readers can check whether the bound applies in the regimes where SGD is typically used.

    Authors: We acknowledge that the current abstract and introduction refer to the conditions only generically. The revised version will contain a new, self-contained Assumptions section that lists every regularity condition with its explicit dependence on dimension d, step-size schedule, strong-convexity parameter μ, and other relevant quantities, together with a brief discussion of how these conditions are satisfied in standard strongly convex regimes. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external Gaussian approximation results

full rationale

The paper states that its analysis 'builds on the Gaussian approximation results for nonlinear statistics of independent random variables' and derives the 1/sqrt(n) convex-distance bound for the multiplier bootstrap under regularity conditions. No equation or claim reduces the target bound to a fitted input, self-citation chain, or definitional equivalence. The cited Gaussian results are treated as independent external inputs, and the SGD dependence is bridged only by the stated regularity conditions rather than by any self-referential step. This is the normal case of a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on (1) an external Gaussian approximation theorem for nonlinear statistics of independent random variables whose precise statement and assumptions are not reproduced here, and (2) a collection of regularity conditions on the SGD iterates and objective that are invoked but not enumerated. No free parameters, invented entities, or ad-hoc axioms are visible in the abstract.

axioms (2)
  • domain assumption Gaussian approximation results for nonlinear statistics of independent random variables hold with the required error bounds
    The paper states that its analysis builds on these results; they are treated as given background.
  • domain assumption Appropriate regularity conditions on the loss and SGD iterates are satisfied
    Invoked in the abstract as the setting under which the non-asymptotic bound holds.

pith-pipeline@v0.9.0 · 5679 in / 1652 out tokens · 24031 ms · 2026-05-23T03:27:08.020668+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation

    stat.ML 2026-05 unverdicted novelty 7.0

    Establishes non-asymptotic Gaussian approximation bounds for federated LSA with explicit communication-heterogeneity trade-offs and introduces an online multiplier bootstrap for last-iterate inference with validity gu...

  2. When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold

    math.ST 2026-04 unverdicted novelty 7.0

    Dynamic preconditioning preserves the Polyak-Ruppert CLT for averaged SGD if the preconditioner stabilizes at rate β > (α + 1)/2.

  3. Gaussian Approximation for Asynchronous Q-learning

    stat.ML 2026-04 unverdicted novelty 7.0

    Derived rates of order up to n^{-1/6} log^4(n S A) for the high-dimensional CLT of averaged asynchronous Q-learning iterates, plus a general martingale-difference CLT.

  4. Refining Covariance Matrix Estimation in Stochastic Gradient Descent Through Bias Reduction

    stat.ML 2026-04 unverdicted novelty 6.0

    A novel bias-reduced online covariance estimator for SGD achieves convergence rate n to the power (α-1)/2 times square root of log n without second-order derivatives.

  5. On Gaussian approximation for entropy-regularized Q-learning with function approximation

    stat.ML 2026-05 unverdicted novelty 5.0

    Establishes n^{-1/4} Gaussian approximation in convex distance for averaged entropy-regularized Q-learning with linear function approximation and polynomial stepsizes.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 5 Pith papers

  1. [1]

    High-dimensional central limit theorems for linear functionals of online least-squares sgd

    Bhavya Agrawalla, Krishnakumar Balasubramanian, and Promit Ghosal. High-dimensional central limit theorems for linear functionals of online least-squares sgd. arXiv preprint arXiv:2302.09727, 2023

  2. [2]

    Andreas Anastasiou, Krishnakumar Balasubramanian, and Murat A. Erdogdu. Normal ap- proximation for stochastic gradient descent via non-asymptotic rates of martingale CLT. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 115–137. PMLR, ...

  3. [3]

    G. D. Anderson and S.-L. Qiu. A monotoneity property of the gamma function. Proceedings of the American Mathematical Society, 125(11):3355–3362, 1997

  4. [4]

    The reverse isoperimetric problem for Gaussian measure

    Keith Ball. The reverse isoperimetric problem for Gaussian measure. Discrete & Computational Geometry, 10(4):411–420, October 1993

  5. [5]

    Barsov and V

    S. Barsov and V . Ulyanov. Estimates for the closeness of Gaussian measures.Dokl. Akad. Nauk SSSR, 291(2):273–277, 1986

  6. [6]

    V . Bentkus. On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402, 2003

  7. [7]

    Benveniste, M

    A. Benveniste, M. Métivier, and P. Priouret.Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012

  8. [8]

    Convex optimization: Algorithms and complexity

    Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015

  9. [9]

    Statistical inference for online decision making via stochastic gradient descent

    Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making via stochastic gradient descent. Journal of the American Statistical Association, 116(534):708–719, 2021

  10. [10]

    Online statistical inference for contextual bandits via stochastic gradient descent

    Xi Chen, Zehua Lai, He Li, and Yichen Zhang. Online statistical inference for contextual bandits via stochastic gradient descent. arXiv preprint arXiv:2212.14883, 2022

  11. [11]

    Lee, Xin T

    Xi Chen, Jason D. Lee, Xin T. Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251 – 273, 2020

  12. [12]

    SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives

    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014

  13. [13]

    The total variation distance between high-dimensional gaussians with the same mean,

    Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians with the same mean. arXiv preprint arXiv:1810.08693, 2018

  14. [14]

    Bridging the gap between constant step size stochastic gradient descent and Markov chains

    Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. The Annals of Statistics, 48(3):1348 – 1382, 2020

  15. [15]

    Bootstrap methods: another look at the jackknife

    Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics: Methodology and distribution, pages 569–593. Springer, 1992

  16. [16]

    Online bootstrap confidence intervals for the stochastic gradient descent estimator

    Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research, 19(78):1–21, 2018

  17. [17]

    M. V . Fedoryuk. Metod perevala. Izdat. “Nauka”, Moscow, 1977

  18. [18]

    Neue herleitung und explizite restabschätzung der riemann-siegel-formel

    Wolfgang Gabcke. Neue herleitung und explizite restabschätzung der riemann-siegel-formel. 2015

  19. [19]

    Large ball probabilities, Gaussian comparison and anti-concentration

    Friedrich Götze, Alexey Naumov, Vladimir Spokoiny, and Vladimir Ulyanov. Large ball probabilities, Gaussian comparison and anti-concentration. Bernoulli, 25(4A):2538–2563, 2019. 11

  20. [20]

    Tight analyses for non-smooth stochastic gradient descent

    Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019

  21. [21]

    Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization

    Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research , 15(1):2489–2512, 2014

  22. [22]

    Stochastic approximation and recursive algorithms and applications, volume 35

    Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003

  23. [23]

    First-order and Stochastic Optimization Methods for Machine Learning

    Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. 01 2020

  24. [24]

    Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm

    Chris Junchi Li, Wenlong Mou, Martin Wainwright, and Michael Jordan. Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 909–981. PMLR, 02–05 Jul 2022

  25. [25]

    Statistical estimation and online inference via local sgd

    Xiang Li, Jiadong Liang, Xiangyu Chang, and Zhihua Zhang. Statistical estimation and online inference via local sgd. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1613–1661. PMLR, 02–05 Jul 2022

  26. [26]

    Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning

    Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning. Advances in neural information processing systems, 24:451–459, 2011

  27. [27]

    Robust stochastic approximation approach to stochastic programming

    Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574– 1609, 2009

  28. [28]

    Nesterov.Introductory Lectures on Convex Optimization: A Basic Course

    Y . Nesterov.Introductory Lectures on Convex Optimization: A Basic Course. Applied Opti- mization. Springer, 2004

  29. [29]

    Frank W. J. Olver.Asymptotics and special functions. AKP Classics. A K Peters, Ltd., Wellesley, MA, 1997. Reprint of the 1974 original [Academic Press, New York; MR0435697 (55 #8655)]

  30. [30]

    Sharp martingale and semimartingale inequalities, volume 72

    Adam Osekowski. Sharp martingale and semimartingale inequalities, volume 72. Springer Science & Business Media, 2012

  31. [31]

    Acceleration of stochastic approximation by averaging

    Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992

  32. [32]

    On the momentum term in gradient descent learning algorithms

    Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999

  33. [33]

    Making gradient descent optimal for strongly convex stochastic optimization

    Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012

  34. [34]

    Efficient estimations from a slowly convergent Robbins-Monro process

    David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech- nical report, Cornell University Operations Research and Industrial Engineering, 1988

  35. [35]

    High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

    Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In International Conference on Machine Learning, pages 29563–29648. PMLR, 2023

  36. [36]

    Gaus- sian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning

    Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaus- sian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning. In Advances in Neural Information Process- ing Systems, volume 37, pages 12408–12460. Curran Associates, Inc., 2024. 12

  37. [37]

    Minimizing finite sums with the stochastic average gradient

    Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017

  38. [38]

    Mathematical statistics

    Jun Shao. Mathematical statistics. Springer Science & Business Media, 2003

  39. [39]

    The Jackknife and Bootstrap

    Jun Shao and Dongsheng Tu. The Jackknife and Bootstrap . Springer Series in Statistics. Springer New York, NY , 1 edition, 1995. Springer Book Archive,© Springer Science+Business Media New York 1995

  40. [40]

    Berry–Esseen bounds for multivariate nonlinear statis- tics with applications to M-estimators and stochastic gradient descent algorithms

    Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statis- tics with applications to M-estimators and stochastic gradient descent algorithms. Bernoulli, 28(3):1548–1576, 2022

  41. [41]

    Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation

    Marina Sheshukova, Denis Belomestny, Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation. arXiv preprint arXiv:2410.05106, 2024

  42. [42]

    Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning

    R Srikant. Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning. arXiv preprint arXiv:2401.15719, 2024

  43. [43]

    An introduction to matrix concentration inequalities

    Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015

  44. [44]

    Statistical Inference for Temporal Difference Learning with Linear Function Approximation

    Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation. arXiv preprint arXiv:2410.16106, 2024

  45. [45]

    Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator

    Yanjie Zhong, Todd Kuffner, and Soumendra Lahiri. Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator. arXiv preprint arXiv:2306.02205, 2023

  46. [46]

    Online Covariance Matrix Estimation in Stochastic Gradient Descent

    Wanrong Zhu, Xi Chen, and Wei Biao Wu. Online Covariance Matrix Estimation in Stochastic Gradient Descent. Journal of the American Statistical Association, 118(541):393–404, 2023. 13 Contents 1 Introduction 1 2 Main results 3 2.1 Non-asymptotic multiplier bootstrap validity . . . . . . . . . . . . . . . . . . . . . 5 2.2 Gaussian approximation in the real...

  47. [47]

    + M1,2σ2 4n1/2−γ + M1,3σ2n−γ/2, where M1,1 = CΣCQ T1( µc0 4 )(L2 + LH) max( p C4,1, p C1) + kγ 0 /c0 M1,2 = CΣCQLH p C4,2c0 1 1 − γ M1,3 = CΣCQL2 p C2 √c0 r 1 1 − γ , (33) where C4,1 and C4,2 are defined in Corollary 4, C1 and C2 are defined in Lemma 12 and T1(·) is defined in eq. (32). Proof. Using Minkowski’s inequality and the definition ofDn, we obtai...

  48. [48]

    We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]

    Combining inequalities above, we get n−1X i=1 E1/2[∥Dn,3 − D(i) n,3∥2] ≤ √ 2CΣCQL2√n s C1 + c2 0k−γ 0 R1R2T2 µc0 1 − γ T1 µc0 8 (∥θ0 − θ⋆∥ + σ2) + √ 2CΣCQL2√n s C2 + R1R3c0T2 µc0 1 − γ σ2 (n + k0 − 2)1−γ/2 − (k0 − 1)1−γ/2 1 − γ/2 . We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]. Using Minkowski’s inequality together with Lemma 3 and Lemma 17, we get...

  49. [49]

    For k > i , applying A7 and A1, we have E[∥θ(i) k − θk∥2|Fk−1] ≤ ∥θ(i) k−1 − θk−1∥2 − 2αk⟨θ(i) k−1 − θk−1, ∇f(θ(i) k−1) − ∇f(θk−1)⟩ + 2α2 k(L1 + L2)2∥θ(i) k−1 − θk−1∥2

    + (1 +C2L2)σ2 2 , where in (a) we used A7, and in (b) we used Lemma 12 and αk−1L2 ≤ 1. For k > i , applying A7 and A1, we have E[∥θ(i) k − θk∥2|Fk−1] ≤ ∥θ(i) k−1 − θk−1∥2 − 2αk⟨θ(i) k−1 − θk−1, ∇f(θ(i) k−1) − ∇f(θk−1)⟩ + 2α2 k(L1 + L2)2∥θ(i) k−1 − θk−1∥2 . Taking expectation from both sides and applying A1 with Lemma 15(a), we obtain E[∥θ(i) k − θk∥2] ≤ (...

  50. [50]

    For k > i we denote δ(i) k = ∥θ(i) k − θk∥, similar to (62), we obtain E[{δ(i) k }4|Fk−1] ≤ (1 − 4µαk + 4α2 k(L1 + L2)2(1 + 3c0(L1 + L2))2){δ(i) k−1}4

    + (1 +L2 2C4,2)σ4 4 . For k > i we denote δ(i) k = ∥θ(i) k − θk∥, similar to (62), we obtain E[{δ(i) k }4|Fk−1] ≤ (1 − 4µαk + 4α2 k(L1 + L2)2(1 + 3c0(L1 + L2))2){δ(i) k−1}4 . Using Lemma 15(a), we obtain E[{δ(i) k }4] ≤ exp 4(L1 + L2)2(1 + 3c0(L1 + L2))2) 2γ − 1 exp −4µ kX j=i+1 αj E[∥θ(i) i − θi∥4] . Combining the above inequalities completes the proof. ...

  51. [51]

    ≤ E[{Eb[∥Db∥p]}] n(M b 1,1e2/pp3/2n2/p−γ/2 + M b 2,1e1/ppn1/2+1/p−γ)p = E[∥Db∥p] n(M b 1,1e1/pp3/2n1/p−γ/2 + M b 2,1e2/ppn1/2+2/p−γ)p ≤ 1 n . Lemma 8. Assume A1-A5. Then for any p ≥ 2 it holds E1/p[∥Db 1∥p] ≤ M b 1,1e1/pp3/2n1/p−γ/2 , where M b 1,1 = 4CQ max(L1, L2)max(√K2, √K1)√c0(Wmax + 1)√ 2(1 − γ) , (45) and K1, K2 are defined in (64), (66), respectiv...

  52. [52]

    (53) By applying the recurrence (53), we obtain that E[∥θk − θ⋆∥2] ≤ A1,k∥θ0 − θ⋆∥2 + 2σ2 2A2,k , where we have set A1,k = kY i=1 (1 − (3/2)αiµ + 2α2 i L2

  53. [53]

    (54) Using the elementary bound 1 + t ≤ et for any t ∈ R, we get A1,k ≤ exp −(3/2)µ kX i=1 αi exp 2L2 2 kX i=1 α2 i

    , A2,k = kX i=1 kY j=i+1 (1 − (3/2)αjµ + 2α2 j L2 2)α2 i . (54) Using the elementary bound 1 + t ≤ et for any t ∈ R, we get A1,k ≤ exp −(3/2)µ kX i=1 αi exp 2L2 2 kX i=1 α2 i . Using Lemma 15, we obtain A1,k ≤ c1 exp − 3µc0 4(1 − γ)(k + k0)1−γ , where we have set c1 = exp 2c2 0L2 2 2γ − 1 + 3µc0 4(1 − γ) k1−γ 0 . (55) Now we estimate A2,k. Let k1 be the l...

  54. [54]

    = 1 2L2 2 k1X i=1 k1Y j=i (1 + 2α2 j L2

  55. [55]

    − k1Y j=i+1 (1 + 2α2 j L2 2) ≤ 1 2L2 2 k1Y j=1 (1 + 2α2 j L2

  56. [56]

    Note, that for k ≤ k1, αk ≥ µ/(4L2 2), hence, we have kY j=k1+1 (1 − αjµ) ≤ exp −µ kX i=1 αi exp µ k1X i=1 αi ≤ exp −µ kX i=1 αi exp 4L2 2 k1X i=1 α2 i

    ≤ 1 2L2 2 exp 2L2 2 k1X j=1 α2 j . Note, that for k ≤ k1, αk ≥ µ/(4L2 2), hence, we have kY j=k1+1 (1 − αjµ) ≤ exp −µ kX i=1 αi exp µ k1X i=1 αi ≤ exp −µ kX i=1 αi exp 4L2 2 k1X i=1 α2 i . 33 Moreover, for any m ∈ {1, . . . , k}, we obtain kX i=1 α2 i kY j=i+1 (1 − αjµ) = mX i=1 kY j=i+1 (1 − αjµ)α2 i + kX i=m+1 kY j=i+1 (1 − αjµ)α2 i ≤ kY j=m+1 (1 − αjµ)...

  57. [57]

    First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ2p k−1

  58. [58]

    Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩δ2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ⋆, ∇f(θk−1) − ∇f(θ⋆)⟩δ2(p−1) k−1 ≥ 2pµαkδ2p k−1

  59. [59]

    Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality |⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩)j| ≤ ∥ θk−1 − θ⋆∥j∥∇f(θk−1) + ζk∥j , moreover, applying A1 and A7(2p) together with the Lyapunov inequality, we get E[∥∇f(θk−1) + ζk∥2l+j|Fk−1] = E[∥∇f(θk−1) + g(θk−1, ξk) + η(ξk)∥2l+j|Fk−1] ≤ 22l+j−1((L1 + L2)2l+jδ2l+j k−1 + σ2l+j 2p ) . 35 Comb...

  60. [60]

    First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ′2p k−1. 38

  61. [61]

    Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1) + g(θk−1, ξk) − g(θ′ k−1, ξk)⟩δ′2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1)⟩δ′2(p−1) k−1 ≥ 2pµαkδ′2p k−1

  62. [62]

    It remains to note that E[∥θk − θ⋆∥2p] ≤ 22p−1E[∥θ′ k − θ⋆∥2p] + 22p−1E[∥θk − θ′ k∥2p] ≤ C2p,1 exp − pµc0 4 (k + k0)1−γ (∥θ0 − θ⋆∥2p + σ2p 2p) + C2p,2σ2p 2pαp k

    Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality together with A7 and A1 |⟨θk−1−θ′ k−1, ∇f(θk−1)−∇f(θ′ k−1)+g(θk−1, ξk)−g(θ′ k−1, ξk)⟩j| ≤ ∥ θk−1−θ′ k−1∥2j(L1+L2)j , Combining inequalities above, we obtain E[δ′2p k |Fk−1] ≤ (1 − 2pµαk + X i+j+l=p; i,j,l∈{0,...p}; j+2l≥2 p! i!j!l!2jαj+2l k (L1 + L2)j+2l)δ′2p k−1 (62) Simil...

  63. [63]

    Here C1 and C2 are defined in Lemma 12 and c2,4 = exp exp 10c0(L1 + L2) 16(L1 + L2)2 2γ − 1 + 1 exp 2µc0 1 − γ k1−γ 0 1 3γ − 1 , c2,5 = 21+2γ 2µc0

    + C4,2σ4 4α2 k , 39 with C4,1 = 23 C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2 0 + 1 c2,4 and C4,2 = 23C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2,5. Here C1 and C2 are defined in Lemma 12 and c2,4 = exp exp 10c0(L1 + L2) 16(L1 + L2)2 2γ − 1 + 1 exp 2µc0 1 − γ k1−γ 0 1 3γ − 1 , c2,5 = 21+2γ 2µc0 . Proof. The pro...