Statistical Inference for Stochastic Gradient Descent Beyond Finite Variance

Jose Blanchet; Peter Glynn; Wenhao Yang

arxiv: 2605.26000 · v1 · pith:CMKT67U4new · submitted 2026-05-25 · 📊 stat.ML · cs.LG · stat.ME

Pith reviewed 2026-06-29 20:09 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.ME

keywords: stochastic gradient descent, statistical inference, confidence regions, infinite variance, self-normalized statistics, subsampling, Polyak-Ruppert averaging, stochastic optimization

The pith

A self-normalized statistic from SGD trajectories yields valid confidence regions even when stochastic gradients have infinite variance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a model-agnostic procedure that constructs asymptotically valid confidence regions for Polyak-Ruppert averaged SGD estimators. The method rests on a joint weak convergence result between the averaged iterates and an empirical second-moment normalizer built from the stochastic gradients observed along the trajectory. This joint limit produces a self-normalized statistic whose limiting distribution no longer depends on the unknown tail scaling that appears in infinite-variance regimes. A subsampling scheme then estimates the critical values directly from the data, avoiding any explicit estimation of tail indices or stable-law parameters. The resulting regions remain valid under both finite- and infinite-second-moment conditions.
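The two ingredients named above, the averaged iterate and a second-moment normalizer built from observed gradients, can be sketched on a toy problem. This is only an illustration under stated assumptions, not the paper's construction: the objective is a 1-D quadratic with unit curvature (so no Hessian correction appears), the noise is Student-t with 1.5 degrees of freedom (mean zero, infinite variance), and the function name, step-size constants `c` and `rho`, and the exact form of the statistic are hypothetical choices made here.

```python
import numpy as np


def sgd_self_normalized(theta_star, n, rng, c=0.5, rho=0.7, df=1.5):
    """Toy 1-D SGD on f(x) = 0.5 * (x - theta_star)**2 with heavy-tailed noise.

    Returns the Polyak-Ruppert average, a second-moment normalizer built from
    the observed stochastic gradients, and their self-normalized ratio.
    All modelling choices here are assumptions made for illustration.
    """
    x = 0.0
    iterate_sum = 0.0
    grad_sq_sum = 0.0
    for k in range(1, n + 1):
        noise = rng.standard_t(df)          # infinite variance when 1 < df <= 2
        grad = (x - theta_star) + noise     # noisy gradient at the current iterate
        grad_sq_sum += grad * grad          # accumulate the empirical second moment
        x -= c * k ** (-rho) * grad         # SGD step with polynomially decaying step size
        iterate_sum += x
    x_bar = iterate_sum / n                 # Polyak-Ruppert averaged iterate
    v_n = np.sqrt(grad_sq_sum)              # normalizer: root of summed squared gradients
    t_n = n * (x_bar - theta_star) / v_n    # self-normalized statistic (needs the true theta_star)
    return x_bar, v_n, t_n


x_bar, v_n, t_n = sgd_self_normalized(theta_star=1.0, n=5000, rng=np.random.default_rng(0))
print(f"averaged iterate={x_bar:.4f}, normalizer={v_n:.2f}, statistic={t_n:.3f}")
```

In this toy the averaged error behaves roughly like the negative mean of the noise terms, so the ratio resembles a classical self-normalized sum. The statistic uses the true minimizer, so it serves as a pivot to be calibrated rather than a quantity one can compute directly from data.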

Core claim

The central claim is that the joint weak convergence of the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer constructed from stochastic gradients produces a self-normalized statistic whose limiting distribution is free of tail-dependent nuisance parameters, allowing asymptotically valid confidence regions via subsampling calibration in both finite- and infinite-second-moment regimes.

What carries the argument

Joint weak convergence of the averaged estimator and the empirical second-moment normalizer, which produces a self-normalized statistic whose tail-dependent terms cancel.
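The cancellation can be illustrated with the classical i.i.d. analogue of self-normalized sums (the setting of the Logan, Mallows, Rice and Shepp reference), which is not the paper's SGD theorem. The symbols $X_i$, $a_n$, $S$ and $V$ are introduced here for illustration only, under the assumption that the $X_i$ are i.i.d., centered, and in the domain of attraction of an $\alpha$-stable law with $1 < \alpha < 2$.

```latex
\[
S_n = \sum_{i=1}^{n} X_i, \qquad
V_n^2 = \sum_{i=1}^{n} X_i^2, \qquad
\left( \frac{S_n}{a_n},\ \frac{V_n^2}{a_n^2} \right) \Rightarrow \left( S,\ V^2 \right),
\]
\[
\frac{S_n}{V_n}
  = \frac{S_n / a_n}{\sqrt{V_n^2 / a_n^2}}
  \Rightarrow \frac{S}{V}.
\]
```

The unknown scaling $a_n = n^{1/\alpha} L(n)$, with $L$ slowly varying, divides out of the ratio. The law of $S/V$ still depends on the tail through $\alpha$, which is consistent with the critical values being calibrated from the data by subsampling rather than read from a fixed table.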

If this is right

The procedure applies to SGD trajectories without requiring knowledge of whether variance is finite or infinite.
Confidence regions remain asymptotically valid in both regimes without separate handling.
Subsampling calibration avoids explicit estimation of tail indices or stable-law parameters.
Implementation requires only the SGD iterates and the stochastic gradients observed along the path, and is therefore straightforward for practitioners.
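Subsampling calibration can be sketched in the simpler i.i.d. mean setting, in the spirit of the Romano and Wolf reference. This is not the paper's SGD procedure: the studentized mean stands in for the self-normalized statistic, and the function name `subsampling_ci` and the block length `b` are assumptions made here. The point is that the critical value comes from block statistics recentered at the full-sample estimate, with no tail index estimated.

```python
import numpy as np


def subsampling_ci(x, b, level=0.95):
    """Subsampling confidence interval for a mean using a studentized statistic.

    Illustrative i.i.d. sketch: quantiles of |t| over overlapping blocks of
    length b replace a normal or stable-law critical value.
    """
    x = np.asarray(x, dtype=float)
    n = x.size
    mean_n = x.mean()
    sd_n = x.std(ddof=1)
    # All overlapping contiguous blocks of length b, shape (n - b + 1, b).
    windows = np.lib.stride_tricks.sliding_window_view(x, b)
    block_means = windows.mean(axis=1)
    block_sds = windows.std(axis=1, ddof=1)
    # Block statistics are centered at the full-sample mean, not the unknown truth.
    t_blocks = np.sqrt(b) * (block_means - mean_n) / block_sds
    q = np.quantile(np.abs(t_blocks), level)   # data-driven critical value
    half_width = q * sd_n / np.sqrt(n)
    return mean_n - half_width, mean_n + half_width


rng = np.random.default_rng(0)
sample = rng.standard_t(1.5, size=5000)        # mean zero, infinite variance
lo, hi = subsampling_ci(sample, b=100)
print(f"interval: [{lo:.3f}, {hi:.3f}]")
```

The block length must grow with the sample size while remaining a vanishing fraction of it; the value 100 above is an arbitrary choice for the demonstration.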

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same joint-convergence device might be tested on other averaged stochastic approximation schemes that produce similar trajectory data.
One could examine whether the self-normalized statistic continues to work when the step-size schedule deviates from the paper's assumptions.
The approach suggests a route to inference in optimization problems where the noise distribution changes over iterations.
Practitioners facing heavy-tailed loss surfaces could apply the regions directly to judge solution reliability without first clipping or transforming gradients.

Load-bearing premise

The joint weak convergence of the averaged estimator and the empirical second-moment normalizer holds under the paper's conditions on the SGD process.

What would settle it

If the constructed confidence regions exhibit coverage rates materially below the nominal level in repeated simulations or real data sets whose gradients display infinite variance, the validity claim would be refuted.
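A check of this kind amounts to a Monte Carlo coverage estimate. The harness below is a minimal sketch with hypothetical names (`empirical_coverage`, `normal_interval`); the interval builder is a plain normal-approximation placeholder, not the paper's region, and would be swapped for the method under test. The data-generating process is Student-t with 1.5 degrees of freedom, which has mean zero and infinite variance.

```python
import numpy as np


def empirical_coverage(simulate, interval, truth, reps, seed=0):
    """Fraction of simulated data sets whose interval contains the truth."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        lo, hi = interval(simulate(rng))
        hits += int(lo <= truth <= hi)
    return hits / reps


def normal_interval(x, z=1.96):
    """Placeholder: textbook normal-approximation interval for the mean."""
    half_width = z * x.std(ddof=1) / np.sqrt(x.size)
    return x.mean() - half_width, x.mean() + half_width


coverage = empirical_coverage(
    simulate=lambda rng: rng.standard_t(1.5, size=2000),
    interval=normal_interval,
    truth=0.0,
    reps=200,
)
print(f"empirical coverage at nominal 95%: {coverage:.3f}")
```

Comparing the printed rate with the nominal level, across tail indices and sample sizes, is the kind of evidence that would support or refute the validity claim.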

Original abstract

Stochastic gradient descent (SGD) is a foundational algorithm for large-scale statistical learning and stochastic optimization. However, statistical inference based on SGD iterates remains challenging when stochastic gradients have infinite variance, as the relevant limiting distributions depend on unknown nuisance parameters. In this paper, we develop an efficient, model-agnostic methodology for constructing confidence regions from SGD trajectories that applies in both finite- and infinite-variance regimes. The procedure is based on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer constructed from stochastic gradients along the SGD trajectory. This joint limit yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel. We then use a subsampling calibration scheme to estimate the relevant critical values, avoiding explicit estimation of tail indices, slowly varying functions, or stable-law parameters. The resulting confidence regions are straightforward to implement and are asymptotically valid under both the finite- and infinite-second-moment regimes. Simulation studies show reliable coverage in various settings, supporting the proposed method as a practical tool for uncertainty quantification in stochastic optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a self-normalized subsampling method for SGD confidence regions that stays valid under infinite-variance gradients without estimating tail parameters.

The core new piece is the joint weak convergence between the Polyak-Ruppert averaged iterate and an empirical second-moment normalizer built from the gradient sequence. This makes the self-normalized statistic pivotal so the leading stable-law scaling cancels, and subsampling then supplies critical values without fitting tail indices or stable parameters. That construction covers both finite- and infinite-second-moment regimes in one procedure.

It is a practical step for uncertainty quantification when SGD noise is heavy-tailed, which happens in some real problems. The model-agnostic framing and avoidance of explicit nuisance estimation are genuine advantages over standard CLT or stable-limit approaches.

The main soft spot is that everything rests on the joint convergence result holding under the paper's conditions. If those conditions turn out to require stronger moment or step-size restrictions than the abstract suggests, the method's reach shrinks. The simulation evidence is cited as supportive, but without the exact designs and coverage tables it is hard to judge how demanding the infinite-variance cases were.

This is for people who already work on statistical inference for stochastic optimization and want a tool that does not break when variance is infinite. A reader focused on robust post-SGD inference would find it worth reading. It is coherent on its own terms and deserves a serious referee.

Referee Report

0 major / 3 minor

Summary. The paper develops an efficient, model-agnostic methodology for constructing confidence regions from SGD trajectories valid in both finite- and infinite-variance regimes. It relies on a joint weak convergence result for the Polyak-Ruppert averaged estimator and an empirical second-moment normalizer from stochastic gradients along the trajectory; this produces a self-normalized statistic in which tail-dependent scaling terms cancel. Critical values are obtained via subsampling, avoiding explicit estimation of tail indices, slowly varying functions, or stable-law parameters. Asymptotic validity is claimed under both regimes, supported by simulation studies showing reliable coverage.

Significance. If the joint convergence and subsampling calibration hold under the stated conditions, the result would provide a practical, nuisance-parameter-free tool for uncertainty quantification in SGD under heavy-tailed gradients, extending inference methods beyond the standard finite-variance setting common in machine learning. The self-normalized construction and model-agnostic nature are strengths that could see adoption in stochastic optimization applications.

minor comments (3)

[Abstract] The abstract states that the joint weak convergence 'yields a self-normalized statistic in which the leading tail-dependent scaling terms cancel,' but the precise form of the normalizer and the cancellation mechanism should be stated explicitly in the main theorem (likely Theorem 3.1 or equivalent) with the relevant scaling sequences identified.
Simulation studies are described only at a high level; the manuscript should include a table or section detailing the specific distributions (e.g., stable laws with varying indices), step-size schedules, and coverage probabilities across finite- and infinite-variance cases to allow reproducibility.
Notation for the empirical second-moment normalizer and the subsampling scheme should be introduced with a clear definition before the main convergence result to improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work, the recognition of its potential impact, and the recommendation for minor revision. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity identified

The derivation rests on a joint weak convergence theorem for the Polyak-Ruppert average and an empirical second-moment normalizer that produces a self-normalized pivotal limit, followed by subsampling to obtain critical values. This construction is stated to hold under the paper's regularity conditions in both finite- and infinite-variance regimes and does not reduce by definition or by fitted-parameter renaming to the target statistic itself. No load-bearing self-citation chain, ansatz smuggling, or uniqueness theorem imported from the authors' prior work is invoked to force the result; the central claim remains an independent asymptotic statement whose validity can be checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, preventing identification of specific free parameters, axioms, or invented entities; no details on assumptions or derivations are provided.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Statistical Inference on Gradient Flows
math.ST · 2026-05 · unverdicted · novelty 7.0

Proves uniform CLT for gradient flows in ERM and constructs an algorithm-aware, inversion-free covariance estimator for asymptotically valid time-uniform confidence intervals.

Reference graph

Works this paper leans on

42 extracted references · 9 canonical work pages · cited by 1 Pith paper

[1] Krishna B Athreya. Bootstrap of the mean in the infinite variance case. The annals of statistics, pages 724–731, 1987.
[2] Shuyang Bai, Murad S Taqqu, and Ting Zhang. A unified approach to self-normalized block sampling. Stochastic Processes and their Applications, 126(8):2465–2493, 2016.
[3] Melih Barsbey, Milad Sefidgaran, Murat A Erdogdu, Gael Richard, and Umut Simsekli. Heavy tails in sgd and compressibility of overparametrized neural networks. Advances in neural information processing systems, 34:29364–29378, 2021.
[4] Dimitri P Bertsekas and John N Tsitsiklis. Gradient convergence in gradient methods with errors. SIAM Journal on Optimization, 10(3):627–642, 2000.
[5] Jose Blanchet, Aleksandar Mijatović, and Wenhao Yang. Limit theorems for stochastic gradient descent with infinite variance. arXiv preprint arXiv:2410.16340, 2024.
[6] Xi Chen, Jason D Lee, Xin T Tong, and Yichen Zhang. Statistical inference for model parameters in stochastic gradient descent. 2020.
[7] Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. Advances in Neural Information Processing Systems, 34:4883–4895, 2021.
[8] Ewa Damek and Sebastian Mentemeier. Analysing heavy-tail properties of stochastic gradient descent by means of stochastic recurrence equations. arXiv preprint arXiv:2403.13868, 2024.
[9] CA Goodsell and DL Hanson. Almost sure convergence for the robbins-monro process. The Annals of Probability, pages 890–901, 1976.
[10] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in sgd. In International Conference on Machine Learning, pages 3964–3975. PMLR, 2021.
[11] Harold J Kushner and G George Yin. Stochastic approximation and recursive algorithm and applications. Application of Mathematics, 35(10), 1997.
[12] Liam Hodgkinson and Michael Mahoney. Multiplicative noise and heavy tails in stochastic optimization. In International Conference on Machine Learning, pages 4262–4274. PMLR, 2021.
[13] Zhezhe Jiao and Martin Keller-Ressel. Emergence of heavy tails in homogenized stochastic gradient descent. Advances in Neural Information Processing Systems, 37:14066–14092, 2024.
[14] Tatiana Pavlovna Krasulina. On stochastic approximation processes with infinite variance. Theory of Probability & Its Applications, 14(3):522–526, 1969.
[15] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25, 2012.
[16] Sokbae Lee, Yuan Liao, Myung Hwan Seo, and Youngki Shin. Fast and robust online inference with stochastic gradient descent via random scaling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 7381–7389, 2022.
[17] Gang Li. Almost sure convergence of stochastic approximation procedures. Statistica Sinica, pages 361–372, 1994.
[18] Zijian Liu and Zhengyuan Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises: High-probability bound, in-expectation rate and initial distance adaptation. arXiv preprint arXiv:2303.12277, 2023.
[19] Benjamin F Logan, CL Mallows, SO Rice, and Larry A Shepp. Limit distributions of self-normalized sums. The Annals of Probability, 1(5):788–809, 1973.
[20] Michael Mahoney and Charles Martin. Traditional and heavy tailed self regularization in neural network models. In International Conference on Machine Learning, pages 4284–4293. PMLR, 2019.
[21] Mark M Meerschaert and Hans-Peter Scheffler. Sample covariance matrix for random vectors with heavy tails. Journal of Theoretical Probability, 12(3):821–838, 1999.
[22] Wenlong Mou, Koulik Khamaru, Martin J Wainwright, Peter L Bartlett, and Michael I Jordan. Optimal variance-reduced stochastic approximation in banach spaces. arXiv preprint arXiv:2201.08518, 2022.
[23] Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011.
[24] John P Nolan. Multivariate elliptically contoured stable distributions: theory and estimation. Computational statistics, 28(5):2067–2089, 2013.
[25] Mariane Pelletier. On the almost sure asymptotic behaviour of stochastic algorithms. Stochastic processes and their applications, 78(2):217–244, 1998.
[26] Mariane Pelletier. Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, pages 10–44, 1998.
[27] Boris T Polyak and Anatoli B Juditsky. Acceleration of stochastic approximation by averaging. SIAM journal on control and optimization, 30(4):838–855, 1992.
[28] Sidney I Resnick. Heavy-tail phenomena: probabilistic and statistical modeling. Springer Science & Business Media, 2007.
[29] Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
[30] Herbert Robbins and David Siegmund. A convergence theorem for non negative almost supermartingales and some applications. In Optimizing methods in statistics, pages 233–257. Elsevier, 1971.
[31] Joseph P Romano and Michael Wolf. Subsampling inference for the mean in the heavy-tailed case. Metrika, 50(1):55–69, 1999.
[32] Matt Rosenzweig. Lp spaces for 0 < p < 1.
[33] Jerome Sacks. Asymptotic distribution of stochastic approximation procedures. The Annals of Mathematical Statistics, 29(2):373–405, 1958.
[34] Adrien Schertzer and Loucas Pillaud-Vivien. Stochastic differential equations models for least-squares stochastic gradient descent. arXiv preprint arXiv:2407.02322, 2024.
[35] Umut Simsekli, Mert Gürbüzbalaban, Thanh Huy Nguyen, Gaël Richard, and Levent Sagun. On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv preprint arXiv:1912.00018, 2019.
[36] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In International Conference on Machine Learning, pages 5827–5837. PMLR, 2019.
[37] Umut Simsekli, Ozan Sener, George Deligiannidis, and Murat A Erdogdu. Hausdorff dimension, heavy tails, and generalization in neural networks. Advances in Neural Information Processing Systems, 33:5138–5151, 2020.
[38] Hongjian Wang, Mert Gurbuzbalaban, Lingjiong Zhu, Umut Simsekli, and Murat A Erdogdu. Convergence rates of stochastic gradient descent under infinite noise variance. Advances in Neural Information Processing Systems, 34:18866–18877, 2021.
[39] Xingyu Wang, Sewoong Oh, and Chang-Han Rhee. Eliminating sharp minima from sgd with truncated heavy-tailed noise. arXiv preprint arXiv:2102.04297, 2021.
[40] Yanjie Zhong, Todd Kuffner, and Soumendra Lahiri. Online bootstrap inference with nonconvex stochastic gradient descent estimator. arXiv preprint arXiv:2306.02205, 2023.
[41] Wanrong Zhu, Zhipeng Lou, Ziyang Wei, and Wei Biao Wu. High confidence level inference is almost free using parallel stochastic optimization. arXiv preprint arXiv:2401.09346, 2024.
[42] Yi Zhu and Jing Dong. On constructing confidence region for model parameters in stochastic gradient descent via batch means. In 2021 Winter Simulation Conference (WSC), pages 1–12. IEEE, 2021.
