Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent
Pith reviewed 2026-05-23 03:27 UTC · model grok-4.3
The pith
Multiplier bootstrap for SGD achieves non-asymptotic rates up to 1/sqrt(n) in convex distance without approximating limiting covariance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under appropriate regularity conditions, the multiplier bootstrap procedure for SGD achieves approximation rates in convex distance of order up to 1/sqrt(n), which can be faster than the rate provable from the Polyak-Juditsky central limit theorem, and provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. The analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.
What carries the argument
Multiplier bootstrap procedure for SGD iterates, which constructs bootstrap distributions directly from the iterates to approximate their law in convex distance.
If this is right
- Confidence sets for SGD can be formed without separately estimating the limiting covariance matrix.
- The bootstrap approximation rate reaches order 1/sqrt(n) in convex distance under the regularity conditions.
- This rate can exceed what is provable from the Polyak-Juditsky central limit theorem for the same iterates.
- Fully non-asymptotic error bounds become available for the accuracy of bootstrap approximations in SGD.
Where Pith is reading between the lines
- The approach may extend to other first-order stochastic optimization methods that produce dependent iterates.
- Practitioners could obtain finite-sample uncertainty quantification for SGD-trained models without relying on asymptotic normality.
- The technique suggests bootstrap could replace covariance estimation in high-dimensional or online learning settings where CLT rates are slow.
Load-bearing premise
The unspecified regularity conditions on the SGD iterates and loss function must hold for the existing Gaussian approximation results to deliver the stated convex-distance bound.
What would settle it
A simulation on a simple strongly convex quadratic loss where the convex distance between the multiplier bootstrap distribution and the true distribution of normalized SGD iterates does not decrease at order 1/sqrt(n) for large n.
Figures
read the original abstract
In this paper, we establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing the confidence sets using the Stochastic Gradient Descent (SGD) algorithm. Under appropriate regularity conditions, our approach avoids the need to approximate the limiting covariance of Polyak-Ruppert SGD iterates, which allows us to derive approximation rates in convex distance of order up to $1/\sqrt{n}$. Notably, this rate can be faster than the one that can be proven in the Polyak-Juditsky central limit theorem. To our knowledge, this provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. Our analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing confidence sets with SGD iterates. Under appropriate regularity conditions, the approach yields convex-distance approximation rates of order up to 1/sqrt(n) without needing to approximate the limiting covariance from the Polyak-Ruppert averaging, a rate that can be faster than what follows from the Polyak-Juditsky CLT; the analysis builds on existing Gaussian approximation theorems for nonlinear statistics of independent random variables and is presented as the first fully non-asymptotic bound for bootstrap accuracy in SGD.
Significance. If the dependence structure of SGD can be reduced to the independent nonlinear statistics setting without degrading the rate and if the regularity conditions are verifiable in standard regimes, the result would supply a practically useful non-asymptotic tool for inference after SGD that improves on existing asymptotic guarantees. The explicit non-asymptotic focus and the attempt to obtain a rate faster than the classical CLT are strengths worth preserving.
major comments (2)
- [Main result / Theorem statement] The central claim that the 1/sqrt(n) convex-distance bound is achievable for SGD rests on applying Gaussian approximation results for nonlinear statistics of independent random variables to the dependent sequence produced by SGD. The manuscript must show explicitly (in the section containing the main theorem) how the dependence is handled—e.g., via martingale approximation, blocking, or coupling—without introducing an extra error term that would cancel the claimed improvement over the Polyak-Juditsky rate.
- [Assumptions / Regularity conditions] The abstract invokes 'appropriate regularity conditions' on the SGD iterates and loss function but does not list them. Because these conditions are load-bearing for both the validity of the external Gaussian approximation and the preservation of the 1/sqrt(n) rate, they must be stated precisely (with explicit dependence on dimension, step-size schedule, and strong-convexity parameters) so that readers can check whether the bound applies in the regimes where SGD is typically used.
minor comments (1)
- [Abstract] The phrase 'rates of order up to 1/sqrt(n)' should be replaced by an explicit big-O statement together with the precise conditions under which the upper bound is attained.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify areas where the presentation can be strengthened for clarity without altering the core contributions. We address each point below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Main result / Theorem statement] The central claim that the 1/sqrt(n) convex-distance bound is achievable for SGD rests on applying Gaussian approximation results for nonlinear statistics of independent random variables to the dependent sequence produced by SGD. The manuscript must show explicitly (in the section containing the main theorem) how the dependence is handled—e.g., via martingale approximation, blocking, or coupling—without introducing an extra error term that would cancel the claimed improvement over the Polyak-Juditsky rate.
Authors: We agree that an explicit description of the dependence reduction must appear in the main theorem section. In the revised manuscript we will insert a short dedicated paragraph immediately after the statement of the main result that outlines the martingale approximation step, recalls the relevant error bound from the literature on SGD, and verifies that this error is of strictly lower order than the target 1/sqrt(n) convex-distance rate, thereby preserving the claimed improvement. revision: yes
-
Referee: [Assumptions / Regularity conditions] The abstract invokes 'appropriate regularity conditions' on the SGD iterates and loss function but does not list them. Because these conditions are load-bearing for both the validity of the external Gaussian approximation and the preservation of the 1/sqrt(n) rate, they must be stated precisely (with explicit dependence on dimension, step-size schedule, and strong-convexity parameters) so that readers can check whether the bound applies in the regimes where SGD is typically used.
Authors: We acknowledge that the current abstract and introduction refer to the conditions only generically. The revised version will contain a new, self-contained Assumptions section that lists every regularity condition with its explicit dependence on dimension d, step-size schedule, strong-convexity parameter μ, and other relevant quantities, together with a brief discussion of how these conditions are satisfied in standard strongly convex regimes. revision: yes
Circularity Check
No circularity: derivation relies on external Gaussian approximation results
full rationale
The paper states that its analysis 'builds on the Gaussian approximation results for nonlinear statistics of independent random variables' and derives the 1/sqrt(n) convex-distance bound for the multiplier bootstrap under regularity conditions. No equation or claim reduces the target bound to a fitted input, self-citation chain, or definitional equivalence. The cited Gaussian results are treated as independent external inputs, and the SGD dependence is bridged only by the stated regularity conditions rather than by any self-referential step. This is the normal case of a self-contained derivation against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Gaussian approximation results for nonlinear statistics of independent random variables hold with the required error bounds
- domain assumption Appropriate regularity conditions on the loss and SGD iterates are satisfied
Forward citations
Cited by 5 Pith papers
-
Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation
Establishes non-asymptotic Gaussian approximation bounds for federated LSA with explicit communication-heterogeneity trade-offs and introduces an online multiplier bootstrap for last-iterate inference with validity gu...
-
When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold
Dynamic preconditioning preserves the Polyak-Ruppert CLT for averaged SGD if the preconditioner stabilizes at rate β > (α + 1)/2.
-
Gaussian Approximation for Asynchronous Q-learning
Derived rates of order up to n^{-1/6} log^4(n S A) for the high-dimensional CLT of averaged asynchronous Q-learning iterates, plus a general martingale-difference CLT.
-
Refining Covariance Matrix Estimation in Stochastic Gradient Descent Through Bias Reduction
A novel bias-reduced online covariance estimator for SGD achieves convergence rate n to the power (α-1)/2 times square root of log n without second-order derivatives.
-
On Gaussian approximation for entropy-regularized Q-learning with function approximation
Establishes n^{-1/4} Gaussian approximation in convex distance for averaged entropy-regularized Q-learning with linear function approximation and polynomial stepsizes.
Reference graph
Works this paper leans on
-
[1]
High-dimensional central limit theorems for linear functionals of online least-squares sgd
Bhavya Agrawalla, Krishnakumar Balasubramanian, and Promit Ghosal. High-dimensional central limit theorems for linear functionals of online least-squares sgd. arXiv preprint arXiv:2302.09727, 2023
-
[2]
Andreas Anastasiou, Krishnakumar Balasubramanian, and Murat A. Erdogdu. Normal ap- proximation for stochastic gradient descent via non-asymptotic rates of martingale CLT. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 115–137. PMLR, ...
work page 2019
-
[3]
G. D. Anderson and S.-L. Qiu. A monotoneity property of the gamma function. Proceedings of the American Mathematical Society, 125(11):3355–3362, 1997
work page 1997
-
[4]
The reverse isoperimetric problem for Gaussian measure
Keith Ball. The reverse isoperimetric problem for Gaussian measure. Discrete & Computational Geometry, 10(4):411–420, October 1993
work page 1993
-
[5]
S. Barsov and V . Ulyanov. Estimates for the closeness of Gaussian measures.Dokl. Akad. Nauk SSSR, 291(2):273–277, 1986
work page 1986
-
[6]
V . Bentkus. On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402, 2003
work page 2003
-
[7]
A. Benveniste, M. Métivier, and P. Priouret.Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012
work page 2012
-
[8]
Convex optimization: Algorithms and complexity
Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015
work page 2015
-
[9]
Statistical inference for online decision making via stochastic gradient descent
Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making via stochastic gradient descent. Journal of the American Statistical Association, 116(534):708–719, 2021
work page 2021
-
[10]
Online statistical inference for contextual bandits via stochastic gradient descent
Xi Chen, Zehua Lai, He Li, and Yichen Zhang. Online statistical inference for contextual bandits via stochastic gradient descent. arXiv preprint arXiv:2212.14883, 2022
-
[11]
Xi Chen, Jason D. Lee, Xin T. Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251 – 273, 2020
work page 2020
-
[12]
SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014
work page 2014
-
[13]
The total variation distance between high-dimensional gaussians with the same mean,
Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians with the same mean. arXiv preprint arXiv:1810.08693, 2018
-
[14]
Bridging the gap between constant step size stochastic gradient descent and Markov chains
Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. The Annals of Statistics, 48(3):1348 – 1382, 2020
work page 2020
-
[15]
Bootstrap methods: another look at the jackknife
Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics: Methodology and distribution, pages 569–593. Springer, 1992
work page 1992
-
[16]
Online bootstrap confidence intervals for the stochastic gradient descent estimator
Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research, 19(78):1–21, 2018
work page 2018
-
[17]
M. V . Fedoryuk. Metod perevala. Izdat. “Nauka”, Moscow, 1977
work page 1977
-
[18]
Neue herleitung und explizite restabschätzung der riemann-siegel-formel
Wolfgang Gabcke. Neue herleitung und explizite restabschätzung der riemann-siegel-formel. 2015
work page 2015
-
[19]
Large ball probabilities, Gaussian comparison and anti-concentration
Friedrich Götze, Alexey Naumov, Vladimir Spokoiny, and Vladimir Ulyanov. Large ball probabilities, Gaussian comparison and anti-concentration. Bernoulli, 25(4A):2538–2563, 2019. 11
work page 2019
-
[20]
Tight analyses for non-smooth stochastic gradient descent
Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019
work page 2019
-
[21]
Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research , 15(1):2489–2512, 2014
work page 2014
-
[22]
Stochastic approximation and recursive algorithms and applications, volume 35
Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003
work page 2003
-
[23]
First-order and Stochastic Optimization Methods for Machine Learning
Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. 01 2020
work page 2020
-
[24]
Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm
Chris Junchi Li, Wenlong Mou, Martin Wainwright, and Michael Jordan. Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 909–981. PMLR, 02–05 Jul 2022
work page 2022
-
[25]
Statistical estimation and online inference via local sgd
Xiang Li, Jiadong Liang, Xiangyu Chang, and Zhihua Zhang. Statistical estimation and online inference via local sgd. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1613–1661. PMLR, 02–05 Jul 2022
work page 2022
-
[26]
Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning
Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning. Advances in neural information processing systems, 24:451–459, 2011
work page 2011
-
[27]
Robust stochastic approximation approach to stochastic programming
Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574– 1609, 2009
work page 2009
-
[28]
Nesterov.Introductory Lectures on Convex Optimization: A Basic Course
Y . Nesterov.Introductory Lectures on Convex Optimization: A Basic Course. Applied Opti- mization. Springer, 2004
work page 2004
-
[29]
Frank W. J. Olver.Asymptotics and special functions. AKP Classics. A K Peters, Ltd., Wellesley, MA, 1997. Reprint of the 1974 original [Academic Press, New York; MR0435697 (55 #8655)]
work page 1997
-
[30]
Sharp martingale and semimartingale inequalities, volume 72
Adam Osekowski. Sharp martingale and semimartingale inequalities, volume 72. Springer Science & Business Media, 2012
work page 2012
-
[31]
Acceleration of stochastic approximation by averaging
Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992
work page 1992
-
[32]
On the momentum term in gradient descent learning algorithms
Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999
work page 1999
-
[33]
Making gradient descent optimal for strongly convex stochastic optimization
Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012
work page 2012
-
[34]
Efficient estimations from a slowly convergent Robbins-Monro process
David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech- nical report, Cornell University Operations Research and Industrial Engineering, 1988
work page 1988
-
[35]
Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In International Conference on Machine Learning, pages 29563–29648. PMLR, 2023
work page 2023
-
[36]
Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaus- sian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning. In Advances in Neural Information Process- ing Systems, volume 37, pages 12408–12460. Curran Associates, Inc., 2024. 12
work page 2024
-
[37]
Minimizing finite sums with the stochastic average gradient
Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017
work page 2017
-
[38]
Jun Shao. Mathematical statistics. Springer Science & Business Media, 2003
work page 2003
-
[39]
Jun Shao and Dongsheng Tu. The Jackknife and Bootstrap . Springer Series in Statistics. Springer New York, NY , 1 edition, 1995. Springer Book Archive,© Springer Science+Business Media New York 1995
work page 1995
-
[40]
Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statis- tics with applications to M-estimators and stochastic gradient descent algorithms. Bernoulli, 28(3):1548–1576, 2022
work page 2022
-
[41]
Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation
Marina Sheshukova, Denis Belomestny, Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation. arXiv preprint arXiv:2410.05106, 2024
-
[42]
R Srikant. Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning. arXiv preprint arXiv:2401.15719, 2024
-
[43]
An introduction to matrix concentration inequalities
Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015
work page 2015
-
[44]
Statistical Inference for Temporal Difference Learning with Linear Function Approximation
Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation. arXiv preprint arXiv:2410.16106, 2024
-
[45]
Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator
Yanjie Zhong, Todd Kuffner, and Soumendra Lahiri. Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator. arXiv preprint arXiv:2306.02205, 2023
-
[46]
Online Covariance Matrix Estimation in Stochastic Gradient Descent
Wanrong Zhu, Xi Chen, and Wei Biao Wu. Online Covariance Matrix Estimation in Stochastic Gradient Descent. Journal of the American Statistical Association, 118(541):393–404, 2023. 13 Contents 1 Introduction 1 2 Main results 3 2.1 Non-asymptotic multiplier bootstrap validity . . . . . . . . . . . . . . . . . . . . . 5 2.2 Gaussian approximation in the real...
work page 2023
-
[47]
+ M1,2σ2 4n1/2−γ + M1,3σ2n−γ/2, where M1,1 = CΣCQ T1( µc0 4 )(L2 + LH) max( p C4,1, p C1) + kγ 0 /c0 M1,2 = CΣCQLH p C4,2c0 1 1 − γ M1,3 = CΣCQL2 p C2 √c0 r 1 1 − γ , (33) where C4,1 and C4,2 are defined in Corollary 4, C1 and C2 are defined in Lemma 12 and T1(·) is defined in eq. (32). Proof. Using Minkowski’s inequality and the definition ofDn, we obtai...
-
[48]
We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]
Combining inequalities above, we get n−1X i=1 E1/2[∥Dn,3 − D(i) n,3∥2] ≤ √ 2CΣCQL2√n s C1 + c2 0k−γ 0 R1R2T2 µc0 1 − γ T1 µc0 8 (∥θ0 − θ⋆∥ + σ2) + √ 2CΣCQL2√n s C2 + R1R3c0T2 µc0 1 − γ σ2 (n + k0 − 2)1−γ/2 − (k0 − 1)1−γ/2 1 − γ/2 . We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]. Using Minkowski’s inequality together with Lemma 3 and Lemma 17, we get...
-
[49]
+ (1 +C2L2)σ2 2 , where in (a) we used A7, and in (b) we used Lemma 12 and αk−1L2 ≤ 1. For k > i , applying A7 and A1, we have E[∥θ(i) k − θk∥2|Fk−1] ≤ ∥θ(i) k−1 − θk−1∥2 − 2αk⟨θ(i) k−1 − θk−1, ∇f(θ(i) k−1) − ∇f(θk−1)⟩ + 2α2 k(L1 + L2)2∥θ(i) k−1 − θk−1∥2 . Taking expectation from both sides and applying A1 with Lemma 15(a), we obtain E[∥θ(i) k − θk∥2] ≤ (...
-
[50]
+ (1 +L2 2C4,2)σ4 4 . For k > i we denote δ(i) k = ∥θ(i) k − θk∥, similar to (62), we obtain E[{δ(i) k }4|Fk−1] ≤ (1 − 4µαk + 4α2 k(L1 + L2)2(1 + 3c0(L1 + L2))2){δ(i) k−1}4 . Using Lemma 15(a), we obtain E[{δ(i) k }4] ≤ exp 4(L1 + L2)2(1 + 3c0(L1 + L2))2) 2γ − 1 exp −4µ kX j=i+1 αj E[∥θ(i) i − θi∥4] . Combining the above inequalities completes the proof. ...
-
[51]
≤ E[{Eb[∥Db∥p]}] n(M b 1,1e2/pp3/2n2/p−γ/2 + M b 2,1e1/ppn1/2+1/p−γ)p = E[∥Db∥p] n(M b 1,1e1/pp3/2n1/p−γ/2 + M b 2,1e2/ppn1/2+2/p−γ)p ≤ 1 n . Lemma 8. Assume A1-A5. Then for any p ≥ 2 it holds E1/p[∥Db 1∥p] ≤ M b 1,1e1/pp3/2n1/p−γ/2 , where M b 1,1 = 4CQ max(L1, L2)max(√K2, √K1)√c0(Wmax + 1)√ 2(1 − γ) , (45) and K1, K2 are defined in (64), (66), respectiv...
-
[52]
(53) By applying the recurrence (53), we obtain that E[∥θk − θ⋆∥2] ≤ A1,k∥θ0 − θ⋆∥2 + 2σ2 2A2,k , where we have set A1,k = kY i=1 (1 − (3/2)αiµ + 2α2 i L2
-
[53]
, A2,k = kX i=1 kY j=i+1 (1 − (3/2)αjµ + 2α2 j L2 2)α2 i . (54) Using the elementary bound 1 + t ≤ et for any t ∈ R, we get A1,k ≤ exp −(3/2)µ kX i=1 αi exp 2L2 2 kX i=1 α2 i . Using Lemma 15, we obtain A1,k ≤ c1 exp − 3µc0 4(1 − γ)(k + k0)1−γ , where we have set c1 = exp 2c2 0L2 2 2γ − 1 + 3µc0 4(1 − γ) k1−γ 0 . (55) Now we estimate A2,k. Let k1 be the l...
-
[54]
= 1 2L2 2 k1X i=1 k1Y j=i (1 + 2α2 j L2
-
[55]
− k1Y j=i+1 (1 + 2α2 j L2 2) ≤ 1 2L2 2 k1Y j=1 (1 + 2α2 j L2
-
[56]
≤ 1 2L2 2 exp 2L2 2 k1X j=1 α2 j . Note, that for k ≤ k1, αk ≥ µ/(4L2 2), hence, we have kY j=k1+1 (1 − αjµ) ≤ exp −µ kX i=1 αi exp µ k1X i=1 αi ≤ exp −µ kX i=1 αi exp 4L2 2 k1X i=1 α2 i . 33 Moreover, for any m ∈ {1, . . . , k}, we obtain kX i=1 α2 i kY j=i+1 (1 − αjµ) = mX i=1 kY j=i+1 (1 − αjµ)α2 i + kX i=m+1 kY j=i+1 (1 − αjµ)α2 i ≤ kY j=m+1 (1 − αjµ)...
-
[57]
First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ2p k−1
-
[58]
Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩δ2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ⋆, ∇f(θk−1) − ∇f(θ⋆)⟩δ2(p−1) k−1 ≥ 2pµαkδ2p k−1
-
[59]
Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality |⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩)j| ≤ ∥ θk−1 − θ⋆∥j∥∇f(θk−1) + ζk∥j , moreover, applying A1 and A7(2p) together with the Lyapunov inequality, we get E[∥∇f(θk−1) + ζk∥2l+j|Fk−1] = E[∥∇f(θk−1) + g(θk−1, ξk) + η(ξk)∥2l+j|Fk−1] ≤ 22l+j−1((L1 + L2)2l+jδ2l+j k−1 + σ2l+j 2p ) . 35 Comb...
-
[60]
First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ′2p k−1. 38
-
[61]
Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1) + g(θk−1, ξk) − g(θ′ k−1, ξk)⟩δ′2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1)⟩δ′2(p−1) k−1 ≥ 2pµαkδ′2p k−1
-
[62]
Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality together with A7 and A1 |⟨θk−1−θ′ k−1, ∇f(θk−1)−∇f(θ′ k−1)+g(θk−1, ξk)−g(θ′ k−1, ξk)⟩j| ≤ ∥ θk−1−θ′ k−1∥2j(L1+L2)j , Combining inequalities above, we obtain E[δ′2p k |Fk−1] ≤ (1 − 2pµαk + X i+j+l=p; i,j,l∈{0,...p}; j+2l≥2 p! i!j!l!2jαj+2l k (L1 + L2)j+2l)δ′2p k−1 (62) Simil...
-
[63]
+ C4,2σ4 4α2 k , 39 with C4,1 = 23 C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2 0 + 1 c2,4 and C4,2 = 23C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2,5. Here C1 and C2 are defined in Lemma 12 and c2,4 = exp exp 10c0(L1 + L2) 16(L1 + L2)2 2γ − 1 + 1 exp 2µc0 1 − γ k1−γ 0 1 3γ − 1 , c2,5 = 21+2γ 2µc0 . Proof. The pro...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.