Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

Alexey Naumov; Denis Belomestny; Eric Moulines; Marina Sheshukova; Qi-Man Shao; Sergey Samsonov; Zhuo-Song Zhang

arxiv: 2502.06719 · v3 · submitted 2025-02-10 · 📊 stat.ML · cs.LG· math.OC· math.PR· math.ST· stat.TH

Gaussian Approximation and Multiplier Bootstrap for Stochastic Gradient Descent

Marina Sheshukova , Sergey Samsonov , Denis Belomestny , Eric Moulines , Qi-Man Shao , Zhuo-Song Zhang , Alexey Naumov This is my paper

Pith reviewed 2026-05-23 03:27 UTC · model grok-4.3

classification 📊 stat.ML cs.LGmath.OCmath.PRmath.STstat.TH

keywords multiplier bootstrapstochastic gradient descentGaussian approximationnon-asymptotic boundsconvex distanceconfidence setscentral limit theorem

0 comments

The pith

Multiplier bootstrap for SGD achieves non-asymptotic rates up to 1/sqrt(n) in convex distance without approximating limiting covariance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes the non-asymptotic validity of the multiplier bootstrap procedure for constructing confidence sets using the Stochastic Gradient Descent algorithm. Under appropriate regularity conditions, the approach achieves approximation rates in convex distance of order up to 1/sqrt(n), which can be faster than the rate provable from the Polyak-Juditsky central limit theorem. It provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. The analysis builds on Gaussian approximation results for nonlinear statistics of independent random variables. Readers would care because the method allows construction of confidence sets for SGD iterates without first approximating their limiting covariance.

Core claim

Under appropriate regularity conditions, the multiplier bootstrap procedure for SGD achieves approximation rates in convex distance of order up to 1/sqrt(n), which can be faster than the rate provable from the Polyak-Juditsky central limit theorem, and provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. The analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.

What carries the argument

Multiplier bootstrap procedure for SGD iterates, which constructs bootstrap distributions directly from the iterates to approximate their law in convex distance.

If this is right

Confidence sets for SGD can be formed without separately estimating the limiting covariance matrix.
The bootstrap approximation rate reaches order 1/sqrt(n) in convex distance under the regularity conditions.
This rate can exceed what is provable from the Polyak-Juditsky central limit theorem for the same iterates.
Fully non-asymptotic error bounds become available for the accuracy of bootstrap approximations in SGD.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other first-order stochastic optimization methods that produce dependent iterates.
Practitioners could obtain finite-sample uncertainty quantification for SGD-trained models without relying on asymptotic normality.
The technique suggests bootstrap could replace covariance estimation in high-dimensional or online learning settings where CLT rates are slow.

Load-bearing premise

The unspecified regularity conditions on the SGD iterates and loss function must hold for the existing Gaussian approximation results to deliver the stated convex-distance bound.

What would settle it

A simulation on a simple strongly convex quadratic loss where the convex distance between the multiplier bootstrap distribution and the true distribution of normalized SGD iterates does not decrease at order 1/sqrt(n) for large n.

Figures

Figures reproduced from arXiv: 2502.06719 by Alexey Naumov, Denis Belomestny, Eric Moulines, Marina Sheshukova, Qi-Man Shao, Sergey Samsonov, Zhuo-Song Zhang.

read the original abstract

In this paper, we establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing the confidence sets using the Stochastic Gradient Descent (SGD) algorithm. Under appropriate regularity conditions, our approach avoids the need to approximate the limiting covariance of Polyak-Ruppert SGD iterates, which allows us to derive approximation rates in convex distance of order up to $1/\sqrt{n}$. Notably, this rate can be faster than the one that can be proven in the Polyak-Juditsky central limit theorem. To our knowledge, this provides the first fully non-asymptotic bound on the accuracy of bootstrap approximations in SGD algorithms. Our analysis builds on the Gaussian approximation results for nonlinear statistics of independent random variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims a non-asymptotic 1/sqrt(n) multiplier bootstrap bound for SGD that can beat Polyak-Juditsky CLT rates, but the dependence reduction from independent nonlinear stats looks like the key unverified step.

read the letter

This paper claims the first fully non-asymptotic bound on multiplier bootstrap accuracy for SGD, with convex-distance rates up to 1/sqrt(n) that can improve on the Polyak-Juditsky CLT. The practical angle is clear: it sidesteps estimating the limiting covariance of the averaged iterates, which matters for confidence sets after SGD training in large models. That combination is what is new here, and the abstract states it directly without overclaiming prior results. The work builds on established Gaussian approximation theorems for nonlinear statistics of independent variables and applies them under regularity conditions on the loss and iterates. If those conditions hold cleanly, the result supplies a usable finite-sample tool without extra covariance estimation. The citation pattern is straightforward and points to external theorems rather than self-referential loops. The main soft spot is the dependence. SGD produces a dependent sequence through the recursive gradient steps, yet the analysis invokes results for independent random variables. The abstract leaves the bridging regularity conditions unspecified, so it is not obvious whether the reduction preserves the claimed rate or only works in regimes where the improvement over the CLT disappears. The stress-test concern lands on the abstract as written. Without the full derivation it is hard to judge how much extra approximation error enters. This is for readers working on non-asymptotic inference after optimization or bootstrap methods in statistical ML. Someone who needs finite-sample guarantees for SGD would find the claim worth checking. The paper shows clear engagement with the literature and a focused technical goal, so it deserves a serious referee to inspect the conditions and the dependence step, even if revisions follow.

Referee Report

2 major / 1 minor

Summary. The paper claims to establish the non-asymptotic validity of the multiplier bootstrap procedure for constructing confidence sets with SGD iterates. Under appropriate regularity conditions, the approach yields convex-distance approximation rates of order up to 1/sqrt(n) without needing to approximate the limiting covariance from the Polyak-Ruppert averaging, a rate that can be faster than what follows from the Polyak-Juditsky CLT; the analysis builds on existing Gaussian approximation theorems for nonlinear statistics of independent random variables and is presented as the first fully non-asymptotic bound for bootstrap accuracy in SGD.

Significance. If the dependence structure of SGD can be reduced to the independent nonlinear statistics setting without degrading the rate and if the regularity conditions are verifiable in standard regimes, the result would supply a practically useful non-asymptotic tool for inference after SGD that improves on existing asymptotic guarantees. The explicit non-asymptotic focus and the attempt to obtain a rate faster than the classical CLT are strengths worth preserving.

major comments (2)

[Main result / Theorem statement] The central claim that the 1/sqrt(n) convex-distance bound is achievable for SGD rests on applying Gaussian approximation results for nonlinear statistics of independent random variables to the dependent sequence produced by SGD. The manuscript must show explicitly (in the section containing the main theorem) how the dependence is handled—e.g., via martingale approximation, blocking, or coupling—without introducing an extra error term that would cancel the claimed improvement over the Polyak-Juditsky rate.
[Assumptions / Regularity conditions] The abstract invokes 'appropriate regularity conditions' on the SGD iterates and loss function but does not list them. Because these conditions are load-bearing for both the validity of the external Gaussian approximation and the preservation of the 1/sqrt(n) rate, they must be stated precisely (with explicit dependence on dimension, step-size schedule, and strong-convexity parameters) so that readers can check whether the bound applies in the regimes where SGD is typically used.

minor comments (1)

[Abstract] The phrase 'rates of order up to 1/sqrt(n)' should be replaced by an explicit big-O statement together with the precise conditions under which the upper bound is attained.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify areas where the presentation can be strengthened for clarity without altering the core contributions. We address each point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Main result / Theorem statement] The central claim that the 1/sqrt(n) convex-distance bound is achievable for SGD rests on applying Gaussian approximation results for nonlinear statistics of independent random variables to the dependent sequence produced by SGD. The manuscript must show explicitly (in the section containing the main theorem) how the dependence is handled—e.g., via martingale approximation, blocking, or coupling—without introducing an extra error term that would cancel the claimed improvement over the Polyak-Juditsky rate.

Authors: We agree that an explicit description of the dependence reduction must appear in the main theorem section. In the revised manuscript we will insert a short dedicated paragraph immediately after the statement of the main result that outlines the martingale approximation step, recalls the relevant error bound from the literature on SGD, and verifies that this error is of strictly lower order than the target 1/sqrt(n) convex-distance rate, thereby preserving the claimed improvement. revision: yes
Referee: [Assumptions / Regularity conditions] The abstract invokes 'appropriate regularity conditions' on the SGD iterates and loss function but does not list them. Because these conditions are load-bearing for both the validity of the external Gaussian approximation and the preservation of the 1/sqrt(n) rate, they must be stated precisely (with explicit dependence on dimension, step-size schedule, and strong-convexity parameters) so that readers can check whether the bound applies in the regimes where SGD is typically used.

Authors: We acknowledge that the current abstract and introduction refer to the conditions only generically. The revised version will contain a new, self-contained Assumptions section that lists every regularity condition with its explicit dependence on dimension d, step-size schedule, strong-convexity parameter μ, and other relevant quantities, together with a brief discussion of how these conditions are satisfied in standard strongly convex regimes. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external Gaussian approximation results

full rationale

The paper states that its analysis 'builds on the Gaussian approximation results for nonlinear statistics of independent random variables' and derives the 1/sqrt(n) convex-distance bound for the multiplier bootstrap under regularity conditions. No equation or claim reduces the target bound to a fitted input, self-citation chain, or definitional equivalence. The cited Gaussian results are treated as independent external inputs, and the SGD dependence is bridged only by the stated regularity conditions rather than by any self-referential step. This is the normal case of a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on (1) an external Gaussian approximation theorem for nonlinear statistics of independent random variables whose precise statement and assumptions are not reproduced here, and (2) a collection of regularity conditions on the SGD iterates and objective that are invoked but not enumerated. No free parameters, invented entities, or ad-hoc axioms are visible in the abstract.

axioms (2)

domain assumption Gaussian approximation results for nonlinear statistics of independent random variables hold with the required error bounds
The paper states that its analysis builds on these results; they are treated as given background.
domain assumption Appropriate regularity conditions on the loss and SGD iterates are satisfied
Invoked in the abstract as the setting under which the non-asymptotic bound holds.

pith-pipeline@v0.9.0 · 5679 in / 1652 out tokens · 24031 ms · 2026-05-23T03:27:08.020668+00:00 · methodology

discussion (0)

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Gaussian Approximation and Multiplier Bootstrap for Federated Linear Stochastic Approximation
stat.ML 2026-05 unverdicted novelty 7.0

Establishes non-asymptotic Gaussian approximation bounds for federated LSA with explicit communication-heterogeneity trade-offs and introduces an online multiplier bootstrap for last-iterate inference with validity gu...
When Does Dynamic Preconditioning Preserve the Polyak-Ruppert CLT? A Stabilization Threshold
math.ST 2026-04 unverdicted novelty 7.0

Dynamic preconditioning preserves the Polyak-Ruppert CLT for averaged SGD if the preconditioner stabilizes at rate β > (α + 1)/2.
Gaussian Approximation for Asynchronous Q-learning
stat.ML 2026-04 unverdicted novelty 7.0

Derived rates of order up to n^{-1/6} log^4(n S A) for the high-dimensional CLT of averaged asynchronous Q-learning iterates, plus a general martingale-difference CLT.
Refining Covariance Matrix Estimation in Stochastic Gradient Descent Through Bias Reduction
stat.ML 2026-04 unverdicted novelty 6.0

A novel bias-reduced online covariance estimator for SGD achieves convergence rate n to the power (α-1)/2 times square root of log n without second-order derivatives.
On Gaussian approximation for entropy-regularized Q-learning with function approximation
stat.ML 2026-05 unverdicted novelty 5.0

Establishes n^{-1/4} Gaussian approximation in convex distance for averaged entropy-regularized Q-learning with linear function approximation and polynomial stepsizes.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 5 Pith papers

[1]

High-dimensional central limit theorems for linear functionals of online least-squares sgd

Bhavya Agrawalla, Krishnakumar Balasubramanian, and Promit Ghosal. High-dimensional central limit theorems for linear functionals of online least-squares sgd. arXiv preprint arXiv:2302.09727, 2023

work page arXiv 2023
[2]

Andreas Anastasiou, Krishnakumar Balasubramanian, and Murat A. Erdogdu. Normal ap- proximation for stochastic gradient descent via non-asymptotic rates of martingale CLT. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 115–137. PMLR, ...

work page 2019
[3]

G. D. Anderson and S.-L. Qiu. A monotoneity property of the gamma function. Proceedings of the American Mathematical Society, 125(11):3355–3362, 1997

work page 1997
[4]

The reverse isoperimetric problem for Gaussian measure

Keith Ball. The reverse isoperimetric problem for Gaussian measure. Discrete & Computational Geometry, 10(4):411–420, October 1993

work page 1993
[5]

Barsov and V

S. Barsov and V . Ulyanov. Estimates for the closeness of Gaussian measures.Dokl. Akad. Nauk SSSR, 291(2):273–277, 1986

work page 1986
[6]

V . Bentkus. On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402, 2003

work page 2003
[7]

Benveniste, M

A. Benveniste, M. Métivier, and P. Priouret.Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012

work page 2012
[8]

Convex optimization: Algorithms and complexity

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015

work page 2015
[9]

Statistical inference for online decision making via stochastic gradient descent

Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making via stochastic gradient descent. Journal of the American Statistical Association, 116(534):708–719, 2021

work page 2021
[10]

Online statistical inference for contextual bandits via stochastic gradient descent

Xi Chen, Zehua Lai, He Li, and Yichen Zhang. Online statistical inference for contextual bandits via stochastic gradient descent. arXiv preprint arXiv:2212.14883, 2022

work page arXiv 2022
[11]

Lee, Xin T

Xi Chen, Jason D. Lee, Xin T. Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251 – 273, 2020

work page 2020
[12]

SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014

work page 2014
[13]

The total variation distance between high-dimensional gaussians with the same mean,

Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians with the same mean. arXiv preprint arXiv:1810.08693, 2018

work page arXiv 2018
[14]

Bridging the gap between constant step size stochastic gradient descent and Markov chains

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. The Annals of Statistics, 48(3):1348 – 1382, 2020

work page 2020
[15]

Bootstrap methods: another look at the jackknife

Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics: Methodology and distribution, pages 569–593. Springer, 1992

work page 1992
[16]

Online bootstrap confidence intervals for the stochastic gradient descent estimator

Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research, 19(78):1–21, 2018

work page 2018
[17]

M. V . Fedoryuk. Metod perevala. Izdat. “Nauka”, Moscow, 1977

work page 1977
[18]

Neue herleitung und explizite restabschätzung der riemann-siegel-formel

Wolfgang Gabcke. Neue herleitung und explizite restabschätzung der riemann-siegel-formel. 2015

work page 2015
[19]

Large ball probabilities, Gaussian comparison and anti-concentration

Friedrich Götze, Alexey Naumov, Vladimir Spokoiny, and Vladimir Ulyanov. Large ball probabilities, Gaussian comparison and anti-concentration. Bernoulli, 25(4A):2538–2563, 2019. 11

work page 2019
[20]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019

work page 2019
[21]

Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research , 15(1):2489–2512, 2014

work page 2014
[22]

Stochastic approximation and recursive algorithms and applications, volume 35

Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003

work page 2003
[23]

First-order and Stochastic Optimization Methods for Machine Learning

Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. 01 2020

work page 2020
[24]

Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm

Chris Junchi Li, Wenlong Mou, Martin Wainwright, and Michael Jordan. Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 909–981. PMLR, 02–05 Jul 2022

work page 2022
[25]

Statistical estimation and online inference via local sgd

Xiang Li, Jiadong Liang, Xiangyu Chang, and Zhihua Zhang. Statistical estimation and online inference via local sgd. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1613–1661. PMLR, 02–05 Jul 2022

work page 2022
[26]

Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning. Advances in neural information processing systems, 24:451–459, 2011

work page 2011
[27]

Robust stochastic approximation approach to stochastic programming

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574– 1609, 2009

work page 2009
[28]

Nesterov.Introductory Lectures on Convex Optimization: A Basic Course

Y . Nesterov.Introductory Lectures on Convex Optimization: A Basic Course. Applied Opti- mization. Springer, 2004

work page 2004
[29]

Frank W. J. Olver.Asymptotics and special functions. AKP Classics. A K Peters, Ltd., Wellesley, MA, 1997. Reprint of the 1974 original [Academic Press, New York; MR0435697 (55 #8655)]

work page 1997
[30]

Sharp martingale and semimartingale inequalities, volume 72

Adam Osekowski. Sharp martingale and semimartingale inequalities, volume 72. Springer Science & Business Media, 2012

work page 2012
[31]

Acceleration of stochastic approximation by averaging

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992

work page 1992
[32]

On the momentum term in gradient descent learning algorithms

Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999

work page 1999
[33]

Making gradient descent optimal for strongly convex stochastic optimization

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012

work page 2012
[34]

Efficient estimations from a slowly convergent Robbins-Monro process

David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech- nical report, Cornell University Operations Research and Industrial Engineering, 1988

work page 1988
[35]

High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In International Conference on Machine Learning, pages 29563–29648. PMLR, 2023

work page 2023
[36]

Gaus- sian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning

Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaus- sian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning. In Advances in Neural Information Process- ing Systems, volume 37, pages 12408–12460. Curran Associates, Inc., 2024. 12

work page 2024
[37]

Minimizing finite sums with the stochastic average gradient

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017

work page 2017
[38]

Mathematical statistics

Jun Shao. Mathematical statistics. Springer Science & Business Media, 2003

work page 2003
[39]

The Jackknife and Bootstrap

Jun Shao and Dongsheng Tu. The Jackknife and Bootstrap . Springer Series in Statistics. Springer New York, NY , 1 edition, 1995. Springer Book Archive,© Springer Science+Business Media New York 1995

work page 1995
[40]

Berry–Esseen bounds for multivariate nonlinear statis- tics with applications to M-estimators and stochastic gradient descent algorithms

Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statis- tics with applications to M-estimators and stochastic gradient descent algorithms. Bernoulli, 28(3):1548–1576, 2022

work page 2022
[41]

Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation

Marina Sheshukova, Denis Belomestny, Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation. arXiv preprint arXiv:2410.05106, 2024

work page arXiv 2024
[42]

Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning

R Srikant. Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning. arXiv preprint arXiv:2401.15719, 2024

work page arXiv 2024
[43]

An introduction to matrix concentration inequalities

Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015

work page 2015
[44]

Statistical Inference for Temporal Difference Learning with Linear Function Approximation

Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation. arXiv preprint arXiv:2410.16106, 2024

work page arXiv 2024
[45]

Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator

Yanjie Zhong, Todd Kuffner, and Soumendra Lahiri. Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator. arXiv preprint arXiv:2306.02205, 2023

work page arXiv 2023
[46]

Online Covariance Matrix Estimation in Stochastic Gradient Descent

Wanrong Zhu, Xi Chen, and Wei Biao Wu. Online Covariance Matrix Estimation in Stochastic Gradient Descent. Journal of the American Statistical Association, 118(541):393–404, 2023. 13 Contents 1 Introduction 1 2 Main results 3 2.1 Non-asymptotic multiplier bootstrap validity . . . . . . . . . . . . . . . . . . . . . 5 2.2 Gaussian approximation in the real...

work page 2023
[47]

+ M1,2σ2 4n1/2−γ + M1,3σ2n−γ/2, where M1,1 = CΣCQ T1( µc0 4 )(L2 + LH) max( p C4,1, p C1) + kγ 0 /c0 M1,2 = CΣCQLH p C4,2c0 1 1 − γ M1,3 = CΣCQL2 p C2 √c0 r 1 1 − γ , (33) where C4,1 and C4,2 are defined in Corollary 4, C1 and C2 are defined in Lemma 12 and T1(·) is defined in eq. (32). Proof. Using Minkowski’s inequality and the definition ofDn, we obtai...

work page
[48]

We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]

Combining inequalities above, we get n−1X i=1 E1/2[∥Dn,3 − D(i) n,3∥2] ≤ √ 2CΣCQL2√n s C1 + c2 0k−γ 0 R1R2T2 µc0 1 − γ T1 µc0 8 (∥θ0 − θ⋆∥ + σ2) + √ 2CΣCQL2√n s C2 + R1R3c0T2 µc0 1 − γ σ2 (n + k0 − 2)1−γ/2 − (k0 − 1)1−γ/2 1 − γ/2 . We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]. Using Minkowski’s inequality together with Lemma 3 and Lemma 17, we get...

work page
[49]

For k > i , applying A7 and A1, we have E[∥θ(i) k − θk∥2|Fk−1] ≤ ∥θ(i) k−1 − θk−1∥2 − 2αk⟨θ(i) k−1 − θk−1, ∇f(θ(i) k−1) − ∇f(θk−1)⟩ + 2α2 k(L1 + L2)2∥θ(i) k−1 − θk−1∥2

+ (1 +C2L2)σ2 2 , where in (a) we used A7, and in (b) we used Lemma 12 and αk−1L2 ≤ 1. For k > i , applying A7 and A1, we have E[∥θ(i) k − θk∥2|Fk−1] ≤ ∥θ(i) k−1 − θk−1∥2 − 2αk⟨θ(i) k−1 − θk−1, ∇f(θ(i) k−1) − ∇f(θk−1)⟩ + 2α2 k(L1 + L2)2∥θ(i) k−1 − θk−1∥2 . Taking expectation from both sides and applying A1 with Lemma 15(a), we obtain E[∥θ(i) k − θk∥2] ≤ (...

work page
[50]

For k > i we denote δ(i) k = ∥θ(i) k − θk∥, similar to (62), we obtain E[{δ(i) k }4|Fk−1] ≤ (1 − 4µαk + 4α2 k(L1 + L2)2(1 + 3c0(L1 + L2))2){δ(i) k−1}4

+ (1 +L2 2C4,2)σ4 4 . For k > i we denote δ(i) k = ∥θ(i) k − θk∥, similar to (62), we obtain E[{δ(i) k }4|Fk−1] ≤ (1 − 4µαk + 4α2 k(L1 + L2)2(1 + 3c0(L1 + L2))2){δ(i) k−1}4 . Using Lemma 15(a), we obtain E[{δ(i) k }4] ≤ exp 4(L1 + L2)2(1 + 3c0(L1 + L2))2) 2γ − 1 exp −4µ kX j=i+1 αj E[∥θ(i) i − θi∥4] . Combining the above inequalities completes the proof. ...

work page
[51]

≤ E[{Eb[∥Db∥p]}] n(M b 1,1e2/pp3/2n2/p−γ/2 + M b 2,1e1/ppn1/2+1/p−γ)p = E[∥Db∥p] n(M b 1,1e1/pp3/2n1/p−γ/2 + M b 2,1e2/ppn1/2+2/p−γ)p ≤ 1 n . Lemma 8. Assume A1-A5. Then for any p ≥ 2 it holds E1/p[∥Db 1∥p] ≤ M b 1,1e1/pp3/2n1/p−γ/2 , where M b 1,1 = 4CQ max(L1, L2)max(√K2, √K1)√c0(Wmax + 1)√ 2(1 − γ) , (45) and K1, K2 are defined in (64), (66), respectiv...

work page
[52]

(53) By applying the recurrence (53), we obtain that E[∥θk − θ⋆∥2] ≤ A1,k∥θ0 − θ⋆∥2 + 2σ2 2A2,k , where we have set A1,k = kY i=1 (1 − (3/2)αiµ + 2α2 i L2

work page
[53]

(54) Using the elementary bound 1 + t ≤ et for any t ∈ R, we get A1,k ≤ exp −(3/2)µ kX i=1 αi exp 2L2 2 kX i=1 α2 i

, A2,k = kX i=1 kY j=i+1 (1 − (3/2)αjµ + 2α2 j L2 2)α2 i . (54) Using the elementary bound 1 + t ≤ et for any t ∈ R, we get A1,k ≤ exp −(3/2)µ kX i=1 αi exp 2L2 2 kX i=1 α2 i . Using Lemma 15, we obtain A1,k ≤ c1 exp − 3µc0 4(1 − γ)(k + k0)1−γ , where we have set c1 = exp 2c2 0L2 2 2γ − 1 + 3µc0 4(1 − γ) k1−γ 0 . (55) Now we estimate A2,k. Let k1 be the l...

work page
[54]

= 1 2L2 2 k1X i=1 k1Y j=i (1 + 2α2 j L2

work page
[55]

− k1Y j=i+1 (1 + 2α2 j L2 2) ≤ 1 2L2 2 k1Y j=1 (1 + 2α2 j L2

work page
[56]

Note, that for k ≤ k1, αk ≥ µ/(4L2 2), hence, we have kY j=k1+1 (1 − αjµ) ≤ exp −µ kX i=1 αi exp µ k1X i=1 αi ≤ exp −µ kX i=1 αi exp 4L2 2 k1X i=1 α2 i

≤ 1 2L2 2 exp 2L2 2 k1X j=1 α2 j . Note, that for k ≤ k1, αk ≥ µ/(4L2 2), hence, we have kY j=k1+1 (1 − αjµ) ≤ exp −µ kX i=1 αi exp µ k1X i=1 αi ≤ exp −µ kX i=1 αi exp 4L2 2 k1X i=1 α2 i . 33 Moreover, for any m ∈ {1, . . . , k}, we obtain kX i=1 α2 i kY j=i+1 (1 − αjµ) = mX i=1 kY j=i+1 (1 − αjµ)α2 i + kX i=m+1 kY j=i+1 (1 − αjµ)α2 i ≤ kY j=m+1 (1 − αjµ)...

work page
[57]

First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ2p k−1

work page
[58]

Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩δ2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ⋆, ∇f(θk−1) − ∇f(θ⋆)⟩δ2(p−1) k−1 ≥ 2pµαkδ2p k−1

work page
[59]

Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality |⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩)j| ≤ ∥ θk−1 − θ⋆∥j∥∇f(θk−1) + ζk∥j , moreover, applying A1 and A7(2p) together with the Lyapunov inequality, we get E[∥∇f(θk−1) + ζk∥2l+j|Fk−1] = E[∥∇f(θk−1) + g(θk−1, ξk) + η(ξk)∥2l+j|Fk−1] ≤ 22l+j−1((L1 + L2)2l+jδ2l+j k−1 + σ2l+j 2p ) . 35 Comb...

work page
[60]

First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ′2p k−1. 38

work page
[61]

Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1) + g(θk−1, ξk) − g(θ′ k−1, ξk)⟩δ′2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1)⟩δ′2(p−1) k−1 ≥ 2pµαkδ′2p k−1

work page
[62]

It remains to note that E[∥θk − θ⋆∥2p] ≤ 22p−1E[∥θ′ k − θ⋆∥2p] + 22p−1E[∥θk − θ′ k∥2p] ≤ C2p,1 exp − pµc0 4 (k + k0)1−γ (∥θ0 − θ⋆∥2p + σ2p 2p) + C2p,2σ2p 2pαp k

Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality together with A7 and A1 |⟨θk−1−θ′ k−1, ∇f(θk−1)−∇f(θ′ k−1)+g(θk−1, ξk)−g(θ′ k−1, ξk)⟩j| ≤ ∥ θk−1−θ′ k−1∥2j(L1+L2)j , Combining inequalities above, we obtain E[δ′2p k |Fk−1] ≤ (1 − 2pµαk + X i+j+l=p; i,j,l∈{0,...p}; j+2l≥2 p! i!j!l!2jαj+2l k (L1 + L2)j+2l)δ′2p k−1 (62) Simil...

work page
[63]

Here C1 and C2 are defined in Lemma 12 and c2,4 = exp exp 10c0(L1 + L2) 16(L1 + L2)2 2γ − 1 + 1 exp 2µc0 1 − γ k1−γ 0 1 3γ − 1 , c2,5 = 21+2γ 2µc0

+ C4,2σ4 4α2 k , 39 with C4,1 = 23 C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2 0 + 1 c2,4 and C4,2 = 23C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2,5. Here C1 and C2 are defined in Lemma 12 and c2,4 = exp exp 10c0(L1 + L2) 16(L1 + L2)2 2γ − 1 + 1 exp 2µc0 1 − γ k1−γ 0 1 3γ − 1 , c2,5 = 21+2γ 2µc0 . Proof. The pro...

work page

[1] [1]

High-dimensional central limit theorems for linear functionals of online least-squares sgd

Bhavya Agrawalla, Krishnakumar Balasubramanian, and Promit Ghosal. High-dimensional central limit theorems for linear functionals of online least-squares sgd. arXiv preprint arXiv:2302.09727, 2023

work page arXiv 2023

[2] [2]

Andreas Anastasiou, Krishnakumar Balasubramanian, and Murat A. Erdogdu. Normal ap- proximation for stochastic gradient descent via non-asymptotic rates of martingale CLT. In Alina Beygelzimer and Daniel Hsu, editors, Proceedings of the Thirty-Second Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 115–137. PMLR, ...

work page 2019

[3] [3]

G. D. Anderson and S.-L. Qiu. A monotoneity property of the gamma function. Proceedings of the American Mathematical Society, 125(11):3355–3362, 1997

work page 1997

[4] [4]

The reverse isoperimetric problem for Gaussian measure

Keith Ball. The reverse isoperimetric problem for Gaussian measure. Discrete & Computational Geometry, 10(4):411–420, October 1993

work page 1993

[5] [5]

Barsov and V

S. Barsov and V . Ulyanov. Estimates for the closeness of Gaussian measures.Dokl. Akad. Nauk SSSR, 291(2):273–277, 1986

work page 1986

[6] [6]

V . Bentkus. On the dependence of the berry–esseen bound on dimension. Journal of Statistical Planning and Inference, 113(2):385–402, 2003

work page 2003

[7] [7]

Benveniste, M

A. Benveniste, M. Métivier, and P. Priouret.Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media, 2012

work page 2012

[8] [8]

Convex optimization: Algorithms and complexity

Sébastien Bubeck et al. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015

work page 2015

[9] [9]

Statistical inference for online decision making via stochastic gradient descent

Haoyu Chen, Wenbin Lu, and Rui Song. Statistical inference for online decision making via stochastic gradient descent. Journal of the American Statistical Association, 116(534):708–719, 2021

work page 2021

[10] [10]

Online statistical inference for contextual bandits via stochastic gradient descent

Xi Chen, Zehua Lai, He Li, and Yichen Zhang. Online statistical inference for contextual bandits via stochastic gradient descent. arXiv preprint arXiv:2212.14883, 2022

work page arXiv 2022

[11] [11]

Lee, Xin T

Xi Chen, Jason D. Lee, Xin T. Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. The Annals of Statistics, 48(1):251 – 273, 2020

work page 2020

[12] [12]

SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradi- ent method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014

work page 2014

[13] [13]

The total variation distance between high-dimensional gaussians with the same mean,

Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians with the same mean. arXiv preprint arXiv:1810.08693, 2018

work page arXiv 2018

[14] [14]

Bridging the gap between constant step size stochastic gradient descent and Markov chains

Aymeric Dieuleveut, Alain Durmus, and Francis Bach. Bridging the gap between constant step size stochastic gradient descent and Markov chains. The Annals of Statistics, 48(3):1348 – 1382, 2020

work page 2020

[15] [15]

Bootstrap methods: another look at the jackknife

Bradley Efron. Bootstrap methods: another look at the jackknife. In Breakthroughs in statistics: Methodology and distribution, pages 569–593. Springer, 1992

work page 1992

[16] [16]

Online bootstrap confidence intervals for the stochastic gradient descent estimator

Yixin Fang, Jinfeng Xu, and Lei Yang. Online bootstrap confidence intervals for the stochastic gradient descent estimator. Journal of Machine Learning Research, 19(78):1–21, 2018

work page 2018

[17] [17]

M. V . Fedoryuk. Metod perevala. Izdat. “Nauka”, Moscow, 1977

work page 1977

[18] [18]

Neue herleitung und explizite restabschätzung der riemann-siegel-formel

Wolfgang Gabcke. Neue herleitung und explizite restabschätzung der riemann-siegel-formel. 2015

work page 2015

[19] [19]

Large ball probabilities, Gaussian comparison and anti-concentration

Friedrich Götze, Alexey Naumov, Vladimir Spokoiny, and Vladimir Ulyanov. Large ball probabilities, Gaussian comparison and anti-concentration. Bernoulli, 25(4A):2538–2563, 2019. 11

work page 2019

[20] [20]

Tight analyses for non-smooth stochastic gradient descent

Nicholas JA Harvey, Christopher Liaw, Yaniv Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. In Conference on Learning Theory, pages 1579–1613. PMLR, 2019

work page 2019

[21] [21]

Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization

Elad Hazan and Satyen Kale. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. The Journal of Machine Learning Research , 15(1):2489–2512, 2014

work page 2014

[22] [22]

Stochastic approximation and recursive algorithms and applications, volume 35

Harold Kushner and G George Yin. Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media, 2003

work page 2003

[23] [23]

First-order and Stochastic Optimization Methods for Machine Learning

Guanghui Lan. First-order and Stochastic Optimization Methods for Machine Learning. 01 2020

work page 2020

[24] [24]

Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm

Chris Junchi Li, Wenlong Mou, Martin Wainwright, and Michael Jordan. Root-sgd: Sharp nonasymptotics and asymptotic efficiency in a single algorithm. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 909–981. PMLR, 02–05 Jul 2022

work page 2022

[25] [25]

Statistical estimation and online inference via local sgd

Xiang Li, Jiadong Liang, Xiangyu Chang, and Zhihua Zhang. Statistical estimation and online inference via local sgd. In Po-Ling Loh and Maxim Raginsky, editors, Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 of Proceedings of Machine Learning Research, pages 1613–1661. PMLR, 02–05 Jul 2022

work page 2022

[26] [26]

Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning

Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algo- rithms for machine learning. Advances in neural information processing systems, 24:451–459, 2011

work page 2011

[27] [27]

Robust stochastic approximation approach to stochastic programming

Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19(4):1574– 1609, 2009

work page 2009

[28] [28]

Nesterov.Introductory Lectures on Convex Optimization: A Basic Course

Y . Nesterov.Introductory Lectures on Convex Optimization: A Basic Course. Applied Opti- mization. Springer, 2004

work page 2004

[29] [29]

Frank W. J. Olver.Asymptotics and special functions. AKP Classics. A K Peters, Ltd., Wellesley, MA, 1997. Reprint of the 1974 original [Academic Press, New York; MR0435697 (55 #8655)]

work page 1997

[30] [30]

Sharp martingale and semimartingale inequalities, volume 72

Adam Osekowski. Sharp martingale and semimartingale inequalities, volume 72. Springer Science & Business Media, 2012

work page 2012

[31] [31]

Acceleration of stochastic approximation by averaging

Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992

work page 1992

[32] [32]

On the momentum term in gradient descent learning algorithms

Ning Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999

work page 1999

[33] [33]

Making gradient descent optimal for strongly convex stochastic optimization

Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012

work page 2012

[34] [34]

Efficient estimations from a slowly convergent Robbins-Monro process

David Ruppert. Efficient estimations from a slowly convergent Robbins-Monro process. Tech- nical report, Cornell University Operations Research and Industrial Engineering, 1988

work page 1988

[35] [35]

High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance

Abdurakhmon Sadiev, Marina Danilova, Eduard Gorbunov, Samuel Horváth, Gauthier Gidel, Pavel Dvurechensky, Alexander Gasnikov, and Peter Richtárik. High-probability bounds for stochastic optimization and variational inequalities: the case of unbounded variance. In International Conference on Machine Learning, pages 29563–29648. PMLR, 2023

work page 2023

[36] [36]

Gaus- sian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning

Sergey Samsonov, Eric Moulines, Qi-Man Shao, Zhuo-Song Zhang, and Alexey Naumov. Gaus- sian Approximation and Multiplier Bootstrap for Polyak-Ruppert Averaged Linear Stochastic Approximation with Applications to TD Learning. In Advances in Neural Information Process- ing Systems, volume 37, pages 12408–12460. Curran Associates, Inc., 2024. 12

work page 2024

[37] [37]

Minimizing finite sums with the stochastic average gradient

Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162:83–112, 2017

work page 2017

[38] [38]

Mathematical statistics

Jun Shao. Mathematical statistics. Springer Science & Business Media, 2003

work page 2003

[39] [39]

The Jackknife and Bootstrap

Jun Shao and Dongsheng Tu. The Jackknife and Bootstrap . Springer Series in Statistics. Springer New York, NY , 1 edition, 1995. Springer Book Archive,© Springer Science+Business Media New York 1995

work page 1995

[40] [40]

Berry–Esseen bounds for multivariate nonlinear statis- tics with applications to M-estimators and stochastic gradient descent algorithms

Qi-Man Shao and Zhuo-Song Zhang. Berry–Esseen bounds for multivariate nonlinear statis- tics with applications to M-estimators and stochastic gradient descent algorithms. Bernoulli, 28(3):1548–1576, 2022

work page 2022

[41] [41]

Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation

Marina Sheshukova, Denis Belomestny, Alain Durmus, Eric Moulines, Alexey Naumov, and Sergey Samsonov. Nonasymptotic analysis of stochastic gradient descent with the richardson- romberg extrapolation. arXiv preprint arXiv:2410.05106, 2024

work page arXiv 2024

[42] [42]

Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning

R Srikant. Rates of Convergence in the Central Limit Theorem for Markov Chains, with an Application to TD learning. arXiv preprint arXiv:2401.15719, 2024

work page arXiv 2024

[43] [43]

An introduction to matrix concentration inequalities

Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015

work page 2015

[44] [44]

Statistical Inference for Temporal Difference Learning with Linear Function Approximation

Weichen Wu, Gen Li, Yuting Wei, and Alessandro Rinaldo. Statistical Inference for Temporal Difference Learning with Linear Function Approximation. arXiv preprint arXiv:2410.16106, 2024

work page arXiv 2024

[45] [45]

Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator

Yanjie Zhong, Todd Kuffner, and Soumendra Lahiri. Online Bootstrap Inference with Noncon- vex Stochastic Gradient Descent Estimator. arXiv preprint arXiv:2306.02205, 2023

work page arXiv 2023

[46] [46]

Online Covariance Matrix Estimation in Stochastic Gradient Descent

Wanrong Zhu, Xi Chen, and Wei Biao Wu. Online Covariance Matrix Estimation in Stochastic Gradient Descent. Journal of the American Statistical Association, 118(541):393–404, 2023. 13 Contents 1 Introduction 1 2 Main results 3 2.1 Non-asymptotic multiplier bootstrap validity . . . . . . . . . . . . . . . . . . . . . 5 2.2 Gaussian approximation in the real...

work page 2023

[47] [47]

+ M1,2σ2 4n1/2−γ + M1,3σ2n−γ/2, where M1,1 = CΣCQ T1( µc0 4 )(L2 + LH) max( p C4,1, p C1) + kγ 0 /c0 M1,2 = CΣCQLH p C4,2c0 1 1 − γ M1,3 = CΣCQL2 p C2 √c0 r 1 1 − γ , (33) where C4,1 and C4,2 are defined in Corollary 4, C1 and C2 are defined in Lemma 12 and T1(·) is defined in eq. (32). Proof. Using Minkowski’s inequality and the definition ofDn, we obtai...

work page

[48] [48]

We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]

Combining inequalities above, we get n−1X i=1 E1/2[∥Dn,3 − D(i) n,3∥2] ≤ √ 2CΣCQL2√n s C1 + c2 0k−γ 0 R1R2T2 µc0 1 − γ T1 µc0 8 (∥θ0 − θ⋆∥ + σ2) + √ 2CΣCQL2√n s C2 + R1R3c0T2 µc0 1 − γ σ2 (n + k0 − 2)1−γ/2 − (k0 − 1)1−γ/2 1 − γ/2 . We now proceed withPn−1 i=1 E1/2[∥Dn,2 − D(i) n,2∥2]. Using Minkowski’s inequality together with Lemma 3 and Lemma 17, we get...

work page

[49] [49]

For k > i , applying A7 and A1, we have E[∥θ(i) k − θk∥2|Fk−1] ≤ ∥θ(i) k−1 − θk−1∥2 − 2αk⟨θ(i) k−1 − θk−1, ∇f(θ(i) k−1) − ∇f(θk−1)⟩ + 2α2 k(L1 + L2)2∥θ(i) k−1 − θk−1∥2

+ (1 +C2L2)σ2 2 , where in (a) we used A7, and in (b) we used Lemma 12 and αk−1L2 ≤ 1. For k > i , applying A7 and A1, we have E[∥θ(i) k − θk∥2|Fk−1] ≤ ∥θ(i) k−1 − θk−1∥2 − 2αk⟨θ(i) k−1 − θk−1, ∇f(θ(i) k−1) − ∇f(θk−1)⟩ + 2α2 k(L1 + L2)2∥θ(i) k−1 − θk−1∥2 . Taking expectation from both sides and applying A1 with Lemma 15(a), we obtain E[∥θ(i) k − θk∥2] ≤ (...

work page

[50] [50]

For k > i we denote δ(i) k = ∥θ(i) k − θk∥, similar to (62), we obtain E[{δ(i) k }4|Fk−1] ≤ (1 − 4µαk + 4α2 k(L1 + L2)2(1 + 3c0(L1 + L2))2){δ(i) k−1}4

+ (1 +L2 2C4,2)σ4 4 . For k > i we denote δ(i) k = ∥θ(i) k − θk∥, similar to (62), we obtain E[{δ(i) k }4|Fk−1] ≤ (1 − 4µαk + 4α2 k(L1 + L2)2(1 + 3c0(L1 + L2))2){δ(i) k−1}4 . Using Lemma 15(a), we obtain E[{δ(i) k }4] ≤ exp 4(L1 + L2)2(1 + 3c0(L1 + L2))2) 2γ − 1 exp −4µ kX j=i+1 αj E[∥θ(i) i − θi∥4] . Combining the above inequalities completes the proof. ...

work page

[51] [51]

≤ E[{Eb[∥Db∥p]}] n(M b 1,1e2/pp3/2n2/p−γ/2 + M b 2,1e1/ppn1/2+1/p−γ)p = E[∥Db∥p] n(M b 1,1e1/pp3/2n1/p−γ/2 + M b 2,1e2/ppn1/2+2/p−γ)p ≤ 1 n . Lemma 8. Assume A1-A5. Then for any p ≥ 2 it holds E1/p[∥Db 1∥p] ≤ M b 1,1e1/pp3/2n1/p−γ/2 , where M b 1,1 = 4CQ max(L1, L2)max(√K2, √K1)√c0(Wmax + 1)√ 2(1 − γ) , (45) and K1, K2 are defined in (64), (66), respectiv...

work page

[52] [52]

(53) By applying the recurrence (53), we obtain that E[∥θk − θ⋆∥2] ≤ A1,k∥θ0 − θ⋆∥2 + 2σ2 2A2,k , where we have set A1,k = kY i=1 (1 − (3/2)αiµ + 2α2 i L2

work page

[53] [53]

(54) Using the elementary bound 1 + t ≤ et for any t ∈ R, we get A1,k ≤ exp −(3/2)µ kX i=1 αi exp 2L2 2 kX i=1 α2 i

, A2,k = kX i=1 kY j=i+1 (1 − (3/2)αjµ + 2α2 j L2 2)α2 i . (54) Using the elementary bound 1 + t ≤ et for any t ∈ R, we get A1,k ≤ exp −(3/2)µ kX i=1 αi exp 2L2 2 kX i=1 α2 i . Using Lemma 15, we obtain A1,k ≤ c1 exp − 3µc0 4(1 − γ)(k + k0)1−γ , where we have set c1 = exp 2c2 0L2 2 2γ − 1 + 3µc0 4(1 − γ) k1−γ 0 . (55) Now we estimate A2,k. Let k1 be the l...

work page

[54] [54]

= 1 2L2 2 k1X i=1 k1Y j=i (1 + 2α2 j L2

work page

[55] [55]

− k1Y j=i+1 (1 + 2α2 j L2 2) ≤ 1 2L2 2 k1Y j=1 (1 + 2α2 j L2

work page

[56] [56]

Note, that for k ≤ k1, αk ≥ µ/(4L2 2), hence, we have kY j=k1+1 (1 − αjµ) ≤ exp −µ kX i=1 αi exp µ k1X i=1 αi ≤ exp −µ kX i=1 αi exp 4L2 2 k1X i=1 α2 i

≤ 1 2L2 2 exp 2L2 2 k1X j=1 α2 j . Note, that for k ≤ k1, αk ≥ µ/(4L2 2), hence, we have kY j=k1+1 (1 − αjµ) ≤ exp −µ kX i=1 αi exp µ k1X i=1 αi ≤ exp −µ kX i=1 αi exp 4L2 2 k1X i=1 α2 i . 33 Moreover, for any m ∈ {1, . . . , k}, we obtain kX i=1 α2 i kY j=i+1 (1 − αjµ) = mX i=1 kY j=i+1 (1 − αjµ)α2 i + kX i=m+1 kY j=i+1 (1 − αjµ)α2 i ≤ kY j=m+1 (1 − αjµ)...

work page

[57] [57]

First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ2p k−1

work page

[58] [58]

Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩δ2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ⋆, ∇f(θk−1) − ∇f(θ⋆)⟩δ2(p−1) k−1 ≥ 2pµαkδ2p k−1

work page

[59] [59]

Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality |⟨θk−1 − θ⋆, ∇f(θk−1) + ζk⟩)j| ≤ ∥ θk−1 − θ⋆∥j∥∇f(θk−1) + ζk∥j , moreover, applying A1 and A7(2p) together with the Lyapunov inequality, we get E[∥∇f(θk−1) + ζk∥2l+j|Fk−1] = E[∥∇f(θk−1) + g(θk−1, ξk) + η(ξk)∥2l+j|Fk−1] ≤ 22l+j−1((L1 + L2)2l+jδ2l+j k−1 + σ2l+j 2p ) . 35 Comb...

work page

[60] [60]

First, for i = p, j = 0, l = 0, the corresponding term in the sum equals δ′2p k−1. 38

work page

[61] [61]

Second, for i = p − 1, j = 1, l = 0, we obtain, applying A1, that 2pαkE[⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1) + g(θk−1, ξk) − g(θ′ k−1, ξk)⟩δ′2(p−1) k−1 |Fk−1] = 2pαk⟨θk−1 − θ′ k−1, ∇f(θk−1) − ∇f(θ′ k−1)⟩δ′2(p−1) k−1 ≥ 2pµαkδ′2p k−1

work page

[62] [62]

It remains to note that E[∥θk − θ⋆∥2p] ≤ 22p−1E[∥θ′ k − θ⋆∥2p] + 22p−1E[∥θk − θ′ k∥2p] ≤ C2p,1 exp − pµc0 4 (k + k0)1−γ (∥θ0 − θ⋆∥2p + σ2p 2p) + C2p,2σ2p 2pαp k

Third, for l ≥ 1 or j ≥ 2 (that is, 2l + j ≥ 2), we use Cauchy-Schwartz inequality together with A7 and A1 |⟨θk−1−θ′ k−1, ∇f(θk−1)−∇f(θ′ k−1)+g(θk−1, ξk)−g(θ′ k−1, ξk)⟩j| ≤ ∥ θk−1−θ′ k−1∥2j(L1+L2)j , Combining inequalities above, we obtain E[δ′2p k |Fk−1] ≤ (1 − 2pµαk + X i+j+l=p; i,j,l∈{0,...p}; j+2l≥2 p! i!j!l!2jαj+2l k (L1 + L2)j+2l)δ′2p k−1 (62) Simil...

work page

[63] [63]

Here C1 and C2 are defined in Lemma 12 and c2,4 = exp exp 10c0(L1 + L2) 16(L1 + L2)2 2γ − 1 + 1 exp 2µc0 1 − γ k1−γ 0 1 3γ − 1 , c2,5 = 21+2γ 2µc0

+ C4,2σ4 4α2 k , 39 with C4,1 = 23 C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2 0 + 1 c2,4 and C4,2 = 23C1(1/c0 + C2) 4γ (1 − γ)µc0e γ/(1−γ) 4c1/2 0 2γ/2 + 2γ + 4c0 2 c2,5. Here C1 and C2 are defined in Lemma 12 and c2,4 = exp exp 10c0(L1 + L2) 16(L1 + L2)2 2γ − 1 + 1 exp 2µc0 1 − γ k1−γ 0 1 3γ − 1 , c2,5 = 21+2γ 2µc0 . Proof. The pro...

work page