Recognition: no theorem link
Convergence of Stochastic Gradient Descent with mini-batching and infinite variance
Pith reviewed 2026-05-11 02:38 UTC · model grok-4.3
The pith
Increasing mini-batch sizes let SGD converge in probability with constant step sizes under heavy-tailed α-stable noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the gradient noise belongs to the domain of attraction of an α-stable law with 1 < α < 2, SGD with an increasing batch-size schedule produces L^p error bounds with faster rates, achieves convergence in probability even for constant step size, and yields iterates whose properly normalized versions converge in law to the stationary distribution of an Ornstein-Uhlenbeck process driven by an α-stable Lévy process; the averaged iterates obey a stable limit theorem whose normalization depends explicitly on the batch-size schedule.
What carries the argument
The Ornstein-Uhlenbeck process driven by an α-stable Lévy process, whose stationary law is the limit of the batch-size-dependent normalized and centered SGD recursion.
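As a schematic of this carrying object (the drift θ > 0, scale σ > 0, and one-dimensional setting are illustrative choices, not taken from the paper), the Lévy-driven OU dynamics read:

```latex
\mathrm{d}X_t = -\theta X_t \,\mathrm{d}t + \sigma \,\mathrm{d}L^{(\alpha)}_t,
\qquad \alpha \in (1,2),\ \theta > 0,
```

where $L^{(\alpha)}$ is an α-stable Lévy process. For θ > 0 this SDE admits a unique stationary law, and the paper's second result identifies it as the distributional limit of the properly normalized and centered iterates.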
If this is right
- L^p moment bounds improve with faster batch growth, producing convergence rates better than those available without batching.
- Convergence in probability holds for any fixed positive step size provided the batch sizes increase sufficiently fast.
- The limiting distribution of the iterates depends on the batch-size schedule, so the schedule can be chosen to control the asymptotic spread.
- For averaged iterates the normalizing sequence in the stable limit theorem is an explicit function of the batch sizes, enabling direct comparison across schedules.
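The first two bullets can be illustrated with a minimal simulation, hedged as follows: the quadratic objective, the symmetric Pareto noise model (which lies in the domain of attraction of an α-stable law for α < 2), and all schedules and constants below are our own assumptions, not the paper's setup.

```python
import numpy as np

ALPHA = 1.5  # assumed tail index in (1, 2): infinite variance, finite mean

def noisy_grad(x, batch, rng):
    """Gradient of f(x) = x^2 / 2 plus the mean of `batch` symmetric
    Pareto noise draws (heavy-tailed, infinite variance for ALPHA < 2)."""
    noise = rng.pareto(ALPHA, size=batch) * rng.choice([-1.0, 1.0], size=batch)
    return x + noise.mean()

def sgd(batch_schedule, step=0.1, x0=5.0, seed=1):
    """Run constant-stepsize SGD with a prescribed batch-size schedule."""
    rng = np.random.default_rng(seed)
    x = x0
    for b in batch_schedule:
        x -= step * noisy_grad(x, b, rng)
    return x

n = 2000
x_const = sgd([8] * n)                   # fixed batch size, constant step
x_grow = sgd([8 + k for k in range(n)])  # linearly growing batches b_k = 8 + k
print(f"constant batches: final |x| = {abs(x_const):.4f}")
print(f"growing batches : final |x| = {abs(x_grow):.4f}")
```

With growing batches the averaged noise shrinks over iterations, so the iterate typically settles much closer to the minimizer at the same constant step size; with a fixed batch it keeps fluctuating with heavy-tailed excursions.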
Where Pith is reading between the lines
- Practitioners could monitor empirical tail indices of gradients and adjust batch growth on the fly to reach a target convergence rate without retuning the step size.
- The same reduction-to-OU technique may apply to other stochastic approximation recursions that encounter heavy-tailed increments, such as temporal-difference learning.
- Synthetic experiments with known α-stable driving noise could directly measure the rate at which the normalized iterates approach the predicted stationary law.
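On the first bullet above, a standard way to monitor an empirical tail index is the Hill estimator; here is a sketch on synthetic Pareto noise (the estimator choice, sample size, and order-statistic count k are illustrative, not prescribed by the paper).

```python
import numpy as np

def hill_estimator(samples, k):
    """Hill estimate of the tail index alpha from the k largest
    order statistics of |samples| (a standard heavy-tail diagnostic)."""
    x = np.sort(np.abs(np.asarray(samples)))[::-1]  # descending order
    logs = np.log(x[:k + 1])
    # alpha_hat = 1 / mean of log(X_(i) / X_(k+1)) over the top k points
    return 1.0 / np.mean(logs[:k] - logs[k])

rng = np.random.default_rng(0)
alpha_true = 1.5
# classical Pareto samples with tail index 1.5, a proxy for gradient noise
samples = rng.pareto(alpha_true, size=100_000) + 1.0
alpha_hat = hill_estimator(samples, k=2_000)
print(f"true alpha = {alpha_true}, Hill estimate = {alpha_hat:.2f}")
```

In a training loop one would apply this to recent stochastic-gradient norms and adjust the batch-growth rate when the estimate drifts; note that `numpy`'s `pareto` draws Lomax variates, so the `+ 1.0` shift yields a classical Pareto law.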
Load-bearing premise
The gradient noise must belong to the domain of attraction of an α-stable law with index α between 1 and 2.
What would settle it
Numerical simulation or real-data experiment in which the empirical distribution of the normalized SGD iterates fails to approach the stationary measure of the corresponding α-stable Ornstein-Uhlenbeck process, or in which convergence in probability fails for constant step size despite growing batches.
Original abstract
Stochastic gradient descent (SGD) with mini-batching is a standard tool in large-scale optimization, yet its theoretical properties under heavy-tailed gradient noise remain largely unexplored. In this paper we study SGD with increasing batch sizes when the gradient noise belongs to the domain of attraction of an $\alpha$-stable law with $\alpha\in(1,2)$. Building on existing results for the finite-variance regime and for heavy-tailed SGD without batching, we establish three main results. First, we derive $L^p$ moment bounds for the SGD error and show that increasing batch sizes lead to faster convergence rates. In particular, batching enables convergence in probability even for a constant stepsize. Second, we prove that the properly normalized SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by an $\alpha$-stable L\'evy process. Third, for Polyak-Ruppert averaging we obtain a stable limit theorem with a normalization that explicitly depends on the batch-size schedule.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes the convergence of mini-batched SGD under gradient noise in the domain of attraction of an α-stable law (α∈(1,2)). It derives L^p moment bounds showing improved rates with growing batch sizes, including in-probability convergence for constant stepsizes; establishes weak convergence of suitably normalized iterates to the stationary measure of an α-stable driven OU process; and provides a limit theorem for Polyak-Ruppert averages with normalization depending on the batch schedule. These build on prior work for finite-variance and non-batched heavy-tailed SGD under standard regularity conditions on the objective.
Significance. If the results hold, this work provides a valuable extension of SGD theory to the mini-batching regime under infinite-variance noise, which is practically relevant. The explicit batch-size dependence in the moment bounds and limit theorems offers concrete guidance for algorithm tuning. The approach appropriately leverages established techniques for stable processes and Lévy-driven SDEs, and the three families of results (moments, distributional convergence, averaged limits) are consistent with the literature. The paper states its central claims and assumptions clearly.
Minor comments (4)
- The introduction should more explicitly delineate the novel arguments required for the mini-batching regime from the inherited results for the non-batched and finite-variance cases.
- Clarify the precise growth conditions on the batch-size schedule b_n that are needed for each of the three main results (e.g., for constant-stepsize convergence in probability).
- In the statements of the distributional limits, explicitly recall or reference the definition of the stationary law of the α-stable-driven OU process to ensure the limit is well-defined for α∈(1,2).
- Minor notational inconsistencies in the indexing of the batch-size schedule across the moment bounds and averaging sections should be unified.
Simulated Author's Rebuttal
We thank the referee for their positive summary of our manuscript, accurate description of the contributions, and recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity; derivations extend external results independently
Full rationale
The paper explicitly builds on existing results for the finite-variance regime and for heavy-tailed SGD without batching, then introduces new arguments for the mini-batching regime under α-stable noise. The three main results (L^p moment bounds with faster rates from increasing batches, including convergence in probability for constant stepsize; distributional convergence to the stationary law of the α-stable OU process; and batch-dependent normalization for Polyak-Ruppert averaging) are presented as extensions relying on standard regularity conditions and domain-of-attraction assumptions. No self-definitional reductions, fitted parameters renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the provided abstract or claims. The derivation chain remains anchored to external results rather than circular self-reference.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: gradient noise belongs to the domain of attraction of an α-stable law with α∈(1,2).
- Domain assumption: standard conditions on the objective function, step-size schedule, and batch-size growth that enable the cited finite-variance and non-batched results.
Reference graph
Works this paper leans on
- [1] Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
- [2] Bingham, N. H., Goldie, C. M., and Teugels, J. L. (1989). Regular Variation. Cambridge University Press, Cambridge.
- [3] Blanchet, J., Mijatović, A., and Yang, W. (2026). Limit theorems for stochastic gradient descent with infinite variance. Annals of Applied Probability.
- [4] Buraczewski, D., Damek, E., and Mikosch, T. (2016). Stochastic Models with Power-Law Tails: The Equation X = AX + B. Springer Series in Operations Research and Financial Engineering. Springer, Cham.
- [5] Chung, K. L. (1954). On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463–483.
- [6] Şimşekli, U., Gürbüzbalaban, M., Nguyen, T. H., Richard, G., and Sagun, L. (2019). On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv:1912.00018v1.
- [7] Gadat, S. and Gavra, I. (2022). Asymptotic study of stochastic adaptive algorithms in non-convex landscape. Journal of Machine Learning Research, 23.
- [8] Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. (2021). The heavy-tail phenomenon in SGD. In Proceedings of the 38th International Conference on Machine Learning, PMLR.
- [9]
- [10] Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, pages 315–323.
- [11] Kamo, K. and Iiduka, H. (2025). Increasing batch size improves convergence of stochastic gradient descent with momentum. arXiv:2501.08883v2.
- [12] Krasulina, T. P. (1969). On stochastic approximation processes with infinite variance. Theory of Probability & Its Applications, 14(3):522–526.
- [13]
- [14] Kushner, H. J. and Yin, G. G. (1997). Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York.
- [15] Le Roux, N., Schmidt, M., Bach, F., et al. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680.
- [16] Li, M., Zhang, T., Chen, Y., and Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pages 661–670, New York, NY, USA. Association for Computing Machinery.
- [17] Ma, S., Chen, Z., Zhou, Y., Ji, K., and Liang, Y. (2022). Data sampling affects the complexity of online SGD over dependent data. In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, volume 180, pages 1296–1305.
- [18] Moulines, E. and Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24.
- [19] Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609.
- [20] Pavasovic, K. L., Durmus, A., and Şimşekli, U. (2023). Approximate heavy tails in offline (multi-pass) stochastic gradient descent. arXiv:2310.18455v1.
- [21] Pelletier, M. (1998). Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, pages 10–44.
- [22] Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization.
- [23] Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407.
- [24] Umeda, H. and Iiduka, H. (2025). Increasing both batch size and learning rate accelerates stochastic gradient descent. Transactions on Machine Learning Research.
- [25]