pith · machine review for the scientific record

arxiv: 2605.07184 · v1 · submitted 2026-05-08 · 🧮 math.PR

Recognition: no theorem link

Convergence of Stochastic Gradient Descent with mini-batching and infinite variance

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:38 UTC · model grok-4.3

classification 🧮 math.PR
keywords stochastic gradient descent · mini-batching · heavy-tailed noise · alpha-stable distributions · Ornstein-Uhlenbeck process · convergence in probability · Polyak-Ruppert averaging

The pith

Increasing mini-batch sizes let SGD converge in probability with constant step sizes under heavy-tailed α-stable noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic gradient descent with growing mini-batches when gradient noise lies in the domain of attraction of an α-stable law for α between 1 and 2. It derives L^p moment bounds showing that larger batches improve convergence rates compared with the non-batched case. The analysis further proves that properly scaled SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by α-stable Lévy noise. A similar stable limit theorem is obtained for Polyak-Ruppert averaged iterates, with the normalizing sequence made explicit in terms of the batch-size schedule. These extensions matter because heavy-tailed gradients appear in large-scale training, where standard finite-variance theory does not apply.
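
The scaling that drives these results can be seen directly: the mean of b i.i.d. symmetric α-stable draws is again α-stable, with scale shrunk by b^(1/α−1) rather than the familiar b^(−1/2). A minimal numpy sketch (illustrative only — the paper's domain-of-attraction setting is more general than exact symmetric stable noise, and the sampler and parameter choices here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.5  # tail index in (1, 2), as in the paper's setting

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

b = 100
single = sas_rvs(50_000)                       # one noisy gradient per step
batched = sas_rvs((50_000, b)).mean(axis=1)    # mini-batch averaged noise

# The batch mean is exactly alpha-stable with scale b**(1/alpha - 1),
# so the medians of |noise| shrink by that same factor:
ratio = np.median(np.abs(batched)) / np.median(np.abs(single))
print(ratio, b ** (1 / alpha - 1))  # both close to ~0.215
```

For α = 1.5 and b = 100 the batch mean's scale is 100^(−1/3) ≈ 0.215, versus 0.1 in the finite-variance case: batching still tames the noise, just more slowly than square-root intuition suggests.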

Core claim

When the gradient noise belongs to the domain of attraction of an α-stable law with 1 < α < 2, SGD with an increasing batch-size schedule produces L^p error bounds with faster rates, achieves convergence in probability even for constant step size, and yields iterates whose properly normalized versions converge in law to the stationary distribution of an Ornstein-Uhlenbeck process driven by an α-stable Lévy process; the averaged iterates obey a stable limit theorem whose normalization depends explicitly on the batch-size schedule.

What carries the argument

The Ornstein-Uhlenbeck process driven by an α-stable Lévy process, whose stationary law is the limit of the batch-size-dependent normalized and centered SGD recursion.
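
As a hedged illustration of this limiting object, one can Euler-discretize dX_t = −θX_t dt + dL_t with a symmetric α-stable driver and check that the long-run samples have the known stationary scale (1/(αθ))^(1/α). The parameter values and helper names below are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta, h = 1.5, 1.0, 0.01  # tail index, mean reversion, step size

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

n = 300_000
noise = h ** (1 / alpha) * sas_rvs(n)  # Levy increments scale like h**(1/alpha)
x = np.empty(n)
x[0] = 0.0
for k in range(n - 1):
    # Euler step of dX = -theta * X dt + dL_t
    x[k + 1] = x[k] - theta * x[k] * h + noise[k]

stationary = x[n // 2:]  # discard burn-in
# The stationary law is symmetric alpha-stable with scale (1/(alpha*theta))**(1/alpha),
# so the median of |X| should match that scale times the median of a unit-scale draw:
target_scale = (1.0 / (alpha * theta)) ** (1 / alpha)
ratio = np.median(np.abs(stationary)) / np.median(np.abs(sas_rvs(n)))
print(ratio, target_scale)
```

The empirical ratio should sit near the theoretical scale (≈ 0.76 for these choices), up to O(h) discretization bias and Monte Carlo error.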

If this is right

  • L^p moment bounds improve with faster batch growth, producing convergence rates better than those available without batching.
  • Convergence in probability holds for any fixed positive step size provided the batch sizes increase sufficiently fast.
  • The limiting distribution of the iterates depends on the batch-size schedule, so the schedule can be chosen to control the asymptotic spread.
  • For averaged iterates the normalizing sequence in the stable limit theorem is an explicit function of the batch sizes, enabling direct comparison across schedules.
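
The first two bullets can be probed on a toy quadratic. The sketch below runs constant-stepsize SGD with a fixed batch and with a linearly growing batch, exploiting the exact stable scaling of the batch-mean noise instead of averaging b_k draws; it is a caricature of the paper's setting under our own parameter choices, not its actual assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, gamma, steps, reps = 1.5, 0.1, 300, 2000

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

def run(batch_schedule):
    # SGD on f(x) = x^2 / 2 with additive SaS gradient noise, constant step gamma.
    # The mean of b iid SaS(1) draws is exactly SaS(b**(1/alpha - 1)), so one
    # scaled draw stands in for the whole mini-batch average.
    x = np.full(reps, 5.0)  # independent replicas, common starting point
    for k in range(steps):
        scale = batch_schedule(k) ** (1 / alpha - 1)
        x = x - gamma * (x + scale * sas_rvs(reps))
    return np.median(np.abs(x))

fixed = run(lambda k: 1)        # constant batch size: error plateaus
growing = run(lambda k: k + 1)  # linearly growing batch size: error keeps shrinking
print(fixed, growing)
```

With a fixed batch the iterates settle into a heavy-tailed stationary spread set by the step size, while the growing schedule drives the median error well below it, consistent with the constant-stepsize claim.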

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could monitor empirical tail indices of gradients and adjust batch growth on the fly to reach a target convergence rate without retuning the step size.
  • The same reduction-to-OU technique may apply to other stochastic approximation recursions that encounter heavy-tailed increments, such as temporal-difference learning.
  • Synthetic experiments with known α-stable driving noise could directly measure the rate at which the normalized iterates approach the predicted stationary law.
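
The first of these extensions presumes the tail index can be estimated from observed gradients. One standard tool (hypothetical here — the paper does not propose it) is the Hill estimator applied to the largest absolute values:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 1.5

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

def hill_tail_index(sample, k=500):
    # Hill estimator: 1/alpha_hat is the mean of log(X_(i) / X_(k+1))
    # over the k largest order statistics of |sample|.
    s = np.sort(np.abs(sample))[::-1]
    return 1.0 / np.mean(np.log(s[:k] / s[k]))

grads = sas_rvs(100_000)           # stand-in for a stream of gradient coordinates
alpha_hat = hill_tail_index(grads)
print(alpha_hat)                   # close to the true tail index 1.5
```

The choice of k trades bias against variance, and the Hill estimator is known to be biased for stable laws with α near 2, so any on-the-fly schedule adjustment would need a conservative margin.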

Load-bearing premise

The gradient noise must belong to the domain of attraction of an α-stable law with index α between 1 and 2.

What would settle it

A numerical simulation or real-data experiment in which the empirical distribution of the normalized SGD iterates fails to approach the stationary measure of the corresponding α-stable Ornstein-Uhlenbeck process, or in which convergence in probability fails under a constant step size despite growing batches.

read the original abstract

Stochastic gradient descent (SGD) with mini-batching is a standard tool in large-scale optimization, yet its theoretical properties under heavy-tailed gradient noise remain largely unexplored. In this paper we study SGD with increasing batch sizes when the gradient noise belongs to the domain of attraction of an $\alpha$-stable law with $\alpha\in(1,2)$. Building on existing results for the finite-variance regime and for heavy-tailed SGD without batching, we establish three main results. First, we derive $L^p$ moment bounds for the SGD error and show that increasing batch sizes lead to faster convergence rates. In particular, batching enables convergence in probability even for a constant stepsize. Second, we prove that the properly normalized SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by an $\alpha$-stable L\'evy process. Third, for Polyak-Ruppert averaging we obtain a stable limit theorem with a normalization that explicitly depends on the batch-size schedule.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript analyzes the convergence of mini-batched SGD under gradient noise in the domain of attraction of an α-stable law (α∈(1,2)). It derives L^p moment bounds showing improved rates with growing batch sizes, including in-probability convergence for constant stepsizes; establishes weak convergence of suitably normalized iterates to the stationary measure of an α-stable driven OU process; and provides a limit theorem for Polyak-Ruppert averages with normalization depending on the batch schedule. These build on prior work for finite-variance and non-batched heavy-tailed SGD under standard regularity conditions on the objective.

Significance. If the results hold, this work provides a valuable extension of SGD theory to the mini-batching regime under infinite-variance noise, which is practically relevant. The explicit batch-size dependence in the moment bounds and limit theorems offers concrete guidance for algorithm tuning. The approach appropriately leverages established techniques for stable processes and Lévy-driven SDEs, and the three families of results (moments, distributional convergence, averaged limits) are consistent with the literature. The paper ships clear statements of the central claims and assumptions.

minor comments (4)
  1. The introduction should more explicitly delineate the novel arguments required for the mini-batching regime from the inherited results for the non-batched and finite-variance cases.
  2. Clarify the precise growth conditions on the batch-size schedule b_n that are needed for each of the three main results (e.g., for constant-stepsize convergence in probability).
  3. In the statements of the distributional limits, explicitly recall or reference the definition of the stationary law of the α-stable-driven OU process to ensure the limit is well-defined for α∈(1,2).
  4. Minor notational inconsistencies in the indexing of the batch-size schedule across the moment bounds and averaging sections should be unified.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our manuscript, accurate description of the contributions, and recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivations extend external results independently

full rationale

The paper explicitly builds on existing results for the finite-variance regime and for heavy-tailed SGD without batching, then introduces new arguments for the mini-batching regime under α-stable noise. The three main results (L^p moment bounds with faster rates from increasing batches, including convergence in probability for a constant stepsize; distributional convergence to the stationary law of the α-stable OU process; and a batch-dependent normalization for Polyak-Ruppert averaging) are presented as extensions relying on standard regularity conditions and domain-of-attraction assumptions. No self-definitional reductions, fitted parameters renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the provided abstract or claims. The derivation chain rests on external benchmarks rather than on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain-of-attraction assumption for the gradient noise and on standard regularity conditions for SGD analysis that are inherited from earlier works. No free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)
  • domain assumption Gradient noise belongs to the domain of attraction of an α-stable law with α∈(1,2)
    Explicitly stated as the setting of the study.
  • domain assumption Standard assumptions on the objective function, step-size schedule, and batch-size growth that enable the cited finite-variance and non-batched results
    Implicit in the statement that the work builds on existing results.

pith-pipeline@v0.9.0 · 5476 in / 1540 out tokens · 45178 ms · 2026-05-11T02:38:07.517919+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. [1]

    Billingsley, P. (1968). Convergence of probability measures. New York: Wiley

  2. [2]

    Bingham, N. H., Goldie, C. M., and Teugels, J. L. (1989). Regular variation. Cambridge University Press, Cambridge

  3. [3]

    Blanchet, J., Mijatović, A., and Yang, W. (2026). Limit theorems for stochastic gradient descent with infinite variance. Annals of Applied Probability

  4. [4]

    Buraczewski, D., Damek, E., and Mikosch, T. (2016). Stochastic models with power-law tails: The equation X = AX + B. Springer Series in Operations Research and Financial Engineering. Springer, Cham

  5. [5]

    Chung, K. L. (1954). On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463--483

  6. [6]

    Şimşekli, U., Gürbüzbalaban, M., Nguyen, T. H., Richard, G., and Sagun, L. (2019). On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv:1912.00018v1

  7. [7]

    Gadat, S. and Gavra, I. (2022). Asymptotic study of stochastic adaptive algorithms in non-convex landscape. Journal of Machine Learning Research, 23

  8. [8]

    Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. (2021). The heavy-tail phenomenon in SGD. In Proceedings of the 38th International Conference on Machine Learning, PMLR

  9. [9]

    Ilandarideva, S., Juditsky, A., Lan, G., and Li, T. (2024). Accelerated stochastic approximation with state-dependent noise. arXiv:2307.01497v3

  10. [10]

    Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, pages 315--323

  11. [11]

    Kamo, K. and Iiduka, H. (2025). Increasing batch size improves convergence of stochastic gradient descent with momentum. arXiv:2501.08883v2

  12. [12]

    Krasulina, T. P. (1969). On stochastic approximation processes with infinite variance. Theory of Probability & Its Applications, 14(3):522--526

  13. [13]

    Kulik, R. and Soulier, P. (2020). Heavy-tailed time series. Springer

  14. [14]

    Kushner, H. J. and Yin, G. G. (1997). Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York

  15. [15]

    Le Roux, N., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672--2680

  16. [16]

    Li, M., Zhang, T., Chen, Y., and Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pages 661--670, New York, NY, USA. Association for Computing Machinery

  17. [17]

    Ma, S., Chen, Z., Zhou, Y., Ji, K., and Liang, Y. (2022). Data sampling affects the complexity of online SGD over dependent data. In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, volume 180, pages 1296--1305

  18. [18]

    Moulines, E. and Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24

  19. [19]

    Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574--1609

  20. [20]

    Pavasovic, K. L., Durmus, A., and Şimşekli, U. (2023). Approximate heavy tails in offline (multi-pass) stochastic gradient descent. arXiv:2310.18455v1

  21. [21]

    Pelletier, M. (1998). Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, pages 10--44

  22. [22]

    Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization

  23. [23]

    Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400--407

  24. [24]

    Umeda, H. and Iiduka, H. (2025). Increasing both batch size and learning rate accelerates stochastic gradient descent. Transactions on Machine Learning Research

  25. [25]

    Wang, H., Gürbüzbalaban, M., Zhu, L., Şimşekli, U., and Erdogdu, M. A. (2021). Convergence rates of stochastic gradient descent under infinite noise variance. arXiv:2102.10346v1