pith · machine review for the scientific record

arxiv: 2605.07184 · v1 · submitted 2026-05-08 · 🧮 math.PR

Recognition: no theorem link

Convergence of Stochastic Gradient Descent with mini-batching and infinite variance

Authors on Pith · no claims yet

Pith reviewed 2026-05-11 02:38 UTC · model grok-4.3

classification 🧮 math.PR
keywords stochastic gradient descent · mini-batching · heavy-tailed noise · alpha-stable distributions · Ornstein-Uhlenbeck process · convergence in probability · Polyak-Ruppert averaging

The pith

Increasing mini-batch sizes let SGD converge in probability with constant step sizes under heavy-tailed α-stable noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic gradient descent with growing mini-batches when gradient noise lies in the domain of attraction of an α-stable law for α between 1 and 2. It derives L^p moment bounds showing that larger batches improve convergence rates compared with the non-batched case. The analysis further proves that properly scaled SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by α-stable Lévy noise. A similar stable limit theorem is obtained for Polyak-Ruppert averaged iterates, with the normalizing sequence made explicit in terms of the batch-size schedule. These extensions matter because heavy-tailed gradients appear in large-scale training, where standard finite-variance theory does not apply.
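
The scaling that drives these results can be seen directly: the mean of b i.i.d. symmetric α-stable draws is again α-stable, with scale shrunk by b^(1/α−1) rather than the familiar b^(−1/2). A minimal numpy sketch (illustrative only — the paper's domain-of-attraction setting is more general than exact symmetric stable noise, and the sampler and parameter choices here are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 1.5  # tail index in (1, 2), as in the paper's setting

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

b = 100
single = sas_rvs(50_000)                       # one noisy gradient per step
batched = sas_rvs((50_000, b)).mean(axis=1)    # mini-batch averaged noise

# The batch mean is exactly alpha-stable with scale b**(1/alpha - 1),
# so the medians of |noise| shrink by that same factor:
ratio = np.median(np.abs(batched)) / np.median(np.abs(single))
print(ratio, b ** (1 / alpha - 1))  # both close to ~0.215
```

For α = 1.5 and b = 100 the batch mean's scale is 100^(−1/3) ≈ 0.215, versus 0.1 in the finite-variance case: batching still tames the noise, just more slowly than square-root intuition suggests.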

Core claim

When the gradient noise belongs to the domain of attraction of an α-stable law with 1 < α < 2, SGD with an increasing batch-size schedule produces L^p error bounds with faster rates, achieves convergence in probability even for constant step size, and yields iterates whose properly normalized versions converge in law to the stationary distribution of an Ornstein-Uhlenbeck process driven by an α-stable Lévy process; the averaged iterates obey a stable limit theorem whose normalization depends explicitly on the batch-size schedule.

What carries the argument

The Ornstein-Uhlenbeck process driven by an α-stable Lévy process, whose stationary law is the limit of the batch-size-dependent normalized and centered SGD recursion.
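
As a hedged illustration of this limiting object, one can Euler-discretize dX_t = −θX_t dt + dL_t with a symmetric α-stable driver and check that the long-run samples have the known stationary scale (1/(αθ))^(1/α). The parameter values and helper names below are ours, not the paper's:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, theta, h = 1.5, 1.0, 0.01  # tail index, mean reversion, step size

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

n = 300_000
noise = h ** (1 / alpha) * sas_rvs(n)  # Levy increments scale like h**(1/alpha)
x = np.empty(n)
x[0] = 0.0
for k in range(n - 1):
    # Euler step of dX = -theta * X dt + dL_t
    x[k + 1] = x[k] - theta * x[k] * h + noise[k]

stationary = x[n // 2:]  # discard burn-in
# The stationary law is symmetric alpha-stable with scale (1/(alpha*theta))**(1/alpha),
# so the median of |X| should match that scale times the median of a unit-scale draw:
target_scale = (1.0 / (alpha * theta)) ** (1 / alpha)
ratio = np.median(np.abs(stationary)) / np.median(np.abs(sas_rvs(n)))
print(ratio, target_scale)
```

The empirical ratio should sit near the theoretical scale (≈ 0.76 for these choices), up to O(h) discretization bias and Monte Carlo error.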

If this is right

  • L^p moment bounds improve with faster batch growth, producing convergence rates better than those available without batching.
  • Convergence in probability holds for any fixed positive step size provided the batch sizes increase sufficiently fast.
  • The limiting distribution of the iterates depends on the batch-size schedule, so the schedule can be chosen to control the asymptotic spread.
  • For averaged iterates the normalizing sequence in the stable limit theorem is an explicit function of the batch sizes, enabling direct comparison across schedules.
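
The first two bullets can be probed on a toy quadratic. The sketch below runs constant-stepsize SGD with a fixed batch and with a linearly growing batch, exploiting the exact stable scaling of the batch-mean noise instead of averaging b_k draws; it is a caricature of the paper's setting under our own parameter choices, not its actual assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, gamma, steps, reps = 1.5, 0.1, 300, 2000

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

def run(batch_schedule):
    # SGD on f(x) = x^2 / 2 with additive SaS gradient noise, constant step gamma.
    # The mean of b iid SaS(1) draws is exactly SaS(b**(1/alpha - 1)), so one
    # scaled draw stands in for the whole mini-batch average.
    x = np.full(reps, 5.0)  # independent replicas, common starting point
    for k in range(steps):
        scale = batch_schedule(k) ** (1 / alpha - 1)
        x = x - gamma * (x + scale * sas_rvs(reps))
    return np.median(np.abs(x))

fixed = run(lambda k: 1)        # constant batch size: error plateaus
growing = run(lambda k: k + 1)  # linearly growing batch size: error keeps shrinking
print(fixed, growing)
```

With a fixed batch the iterates settle into a heavy-tailed stationary spread set by the step size, while the growing schedule drives the median error well below it, consistent with the constant-stepsize claim.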

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Practitioners could monitor empirical tail indices of gradients and adjust batch growth on the fly to reach a target convergence rate without retuning the step size.
  • The same reduction-to-OU technique may apply to other stochastic approximation recursions that encounter heavy-tailed increments, such as temporal-difference learning.
  • Synthetic experiments with known α-stable driving noise could directly measure the rate at which the normalized iterates approach the predicted stationary law.
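
The first of these extensions presumes the tail index can be estimated from observed gradients. One standard tool (hypothetical here — the paper does not propose it) is the Hill estimator applied to the largest absolute values:

```python
import numpy as np

rng = np.random.default_rng(3)
alpha = 1.5

def sas_rvs(size):
    # Symmetric alpha-stable samples via the Chambers-Mallows-Stuck method
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)
    W = rng.exponential(1.0, size)
    return (np.sin(alpha * V) / np.cos(V) ** (1 / alpha)
            * (np.cos((1 - alpha) * V) / W) ** ((1 - alpha) / alpha))

def hill_tail_index(sample, k=500):
    # Hill estimator: 1/alpha_hat is the mean of log(X_(i) / X_(k+1))
    # over the k largest order statistics of |sample|.
    s = np.sort(np.abs(sample))[::-1]
    return 1.0 / np.mean(np.log(s[:k] / s[k]))

grads = sas_rvs(100_000)           # stand-in for a stream of gradient coordinates
alpha_hat = hill_tail_index(grads)
print(alpha_hat)                   # close to the true tail index 1.5
```

The choice of k trades bias against variance, and the Hill estimator is known to be biased for stable laws with α near 2, so any on-the-fly schedule adjustment would need a conservative margin.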

Load-bearing premise

The gradient noise must belong to the domain of attraction of an α-stable law with index α between 1 and 2.

What would settle it

A numerical simulation or real-data experiment in which the empirical distribution of the normalized SGD iterates fails to approach the stationary measure of the corresponding α-stable Ornstein-Uhlenbeck process, or in which convergence in probability fails under a constant step size despite growing batches.

read the original abstract

Stochastic gradient descent (SGD) with mini-batching is a standard tool in large-scale optimization, yet its theoretical properties under heavy-tailed gradient noise remain largely unexplored. In this paper we study SGD with increasing batch sizes when the gradient noise belongs to the domain of attraction of an $\alpha$-stable law with $\alpha\in(1,2)$. Building on existing results for the finite-variance regime and for heavy-tailed SGD without batching, we establish three main results. First, we derive $L^p$ moment bounds for the SGD error and show that increasing batch sizes lead to faster convergence rates. In particular, batching enables convergence in probability even for a constant stepsize. Second, we prove that the properly normalized SGD iterates converge in distribution to the stationary law of an Ornstein-Uhlenbeck process driven by an $\alpha$-stable L\'evy process. Third, for Polyak-Ruppert averaging we obtain a stable limit theorem with a normalization that explicitly depends on the batch-size schedule.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The manuscript analyzes the convergence of mini-batched SGD under gradient noise in the domain of attraction of an α-stable law (α∈(1,2)). It derives L^p moment bounds showing improved rates with growing batch sizes, including in-probability convergence for constant stepsizes; establishes weak convergence of suitably normalized iterates to the stationary measure of an α-stable driven OU process; and provides a limit theorem for Polyak-Ruppert averages with normalization depending on the batch schedule. These build on prior work for finite-variance and non-batched heavy-tailed SGD under standard regularity conditions on the objective.

Significance. If the results hold, this work provides a valuable extension of SGD theory to the mini-batching regime under infinite-variance noise, which is practically relevant. The explicit batch-size dependence in the moment bounds and limit theorems offers concrete guidance for algorithm tuning. The approach appropriately leverages established techniques for stable processes and Lévy-driven SDEs, and the three families of results (moments, distributional convergence, averaged limits) are consistent with the literature. The paper ships clear statements of the central claims and assumptions.

minor comments (4)
  1. The introduction should more explicitly delineate the novel arguments required for the mini-batching regime from the inherited results for the non-batched and finite-variance cases.
  2. Clarify the precise growth conditions on the batch-size schedule b_n that are needed for each of the three main results (e.g., for constant-stepsize convergence in probability).
  3. In the statements of the distributional limits, explicitly recall or reference the definition of the stationary law of the α-stable-driven OU process to ensure the limit is well-defined for α∈(1,2).
  4. Minor notational inconsistencies in the indexing of the batch-size schedule across the moment bounds and averaging sections should be unified.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our manuscript, accurate description of the contributions, and recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivations extend external results independently

full rationale

The paper explicitly builds on existing results for the finite-variance regime and for heavy-tailed SGD without batching, then introduces new arguments for the mini-batching regime under α-stable noise. The three main results (L^p moment bounds with faster rates from increasing batches, including convergence in probability for a constant stepsize; distributional convergence to the stationary law of the α-stable OU process; and a batch-dependent normalization for Polyak-Ruppert averaging) are presented as extensions relying on standard regularity conditions and domain-of-attraction assumptions. No self-definitional reductions, fitted parameters renamed as predictions, load-bearing self-citations, or ansatz smuggling appear in the provided abstract or claims. The derivation chain rests on external benchmarks rather than on itself.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the domain-of-attraction assumption for the gradient noise and on standard regularity conditions for SGD analysis that are inherited from earlier works. No free parameters, new entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)
  • domain assumption Gradient noise belongs to the domain of attraction of an α-stable law with α∈(1,2)
    Explicitly stated as the setting of the study.
  • domain assumption Standard assumptions on the objective function, step-size schedule, and batch-size growth that enable the cited finite-variance and non-batched results
    Implicit in the statement that the work builds on existing results.

pith-pipeline@v0.9.0 · 5476 in / 1540 out tokens · 45178 ms · 2026-05-11T02:38:07.517919+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages

  1. [1]

    Billingsley, P. (1968). Convergence of probability measures. New York: Wiley

  2. [2]

    Bingham, N. H., Goldie, C. M., and Teugels, J. L. (1989). Regular variation. Cambridge University Press, Cambridge

  3. [3]

    Blanchet, J., Mijatović, A., and Yang, W. (2026). Limit theorems for stochastic gradient descent with infinite variance. Annals of Applied Probability

  4. [4]

    Buraczewski, D., Damek, E., and Mikosch, T. (2016). Stochastic models with power-law tails: The equation X = AX + B. Springer Series in Operations Research and Financial Engineering. Springer, Cham

  5. [5]

    Chung, K. L. (1954). On a stochastic approximation method. The Annals of Mathematical Statistics, 25(3):463--483

  6. [6]

    Şimşekli, U., Gürbüzbalaban, M., Nguyen, T. H., Richard, G., and Sagun, L. (2019). On the heavy-tailed theory of stochastic gradient descent for deep neural networks. arXiv:1912.00018v1

  7. [7]

    Gadat, S. and Gavra, I. (2022). Asymptotic study of stochastic adaptive algorithms in non-convex landscape. Journal of Machine Learning Research, 23

  8. [8]

    Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. (2021). The heavy-tail phenomenon in SGD. In Proceedings of the 38th International Conference on Machine Learning, PMLR

  9. [9]

    Ilandarideva, S., Juditsky, A., Lan, G., and Li, T. (2024). Accelerated stochastic approximation with state-dependent noise. arXiv:2307.01497v3

  10. [10]

    Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, volume 26, pages 315--323

  11. [11]

    Kamo, K. and Iiduka, H. (2025). Increasing batch size improves convergence of stochastic gradient descent with momentum. arXiv:2501.08883v2

  12. [12]

    Krasulina, T. P. (1969). On stochastic approximation processes with infinite variance. Theory of Probability & Its Applications, 14(3):522--526

  13. [13]

    Kulik, R. and Soulier, P. (2020). Heavy-tailed time series. Springer

  14. [14]

    Kushner, H. J. and Yin, G. G. (1997). Stochastic Approximation and Recursive Algorithms and Applications. Springer, New York

  15. [15]

    Le Roux, N., Schmidt, M., and Bach, F. (2012). A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672--2680

  16. [16]

    Li, M., Zhang, T., Chen, Y., and Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '14), pages 661--670, New York, NY, USA. Association for Computing Machinery

  17. [17]

    Ma, S., Chen, Z., Zhou, Y., Ji, K., and Liang, Y. (2022). Data sampling affects the complexity of online SGD over dependent data. In Proceedings of the 38th Conference on Uncertainty in Artificial Intelligence, volume 180, pages 1296--1305

  18. [18]

    Moulines, E. and Bach, F. (2011). Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, volume 24

  19. [19]

    Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. (2009). Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574--1609

  20. [20]

    Pavasovic, K. L., Durmus, A., and Şimşekli, U. (2023). Approximate heavy tails in offline (multi-pass) stochastic gradient descent. arXiv:2310.18455v1

  21. [21]

    Pelletier, M. (1998). Weak convergence rates for stochastic approximation with application to multiple targets and simulated annealing. Annals of Applied Probability, pages 10--44

  22. [22]

    Polyak, B. T. and Juditsky, A. B. (1992). Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization

  23. [23]

    Robbins, H. and Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400--407

  24. [24]

    Umeda, H. and Iiduka, H. (2025). Increasing both batch size and learning rate accelerates stochastic gradient descent. Transactions on Machine Learning Research

  25. [25]

    Wang, H., Gürbüzbalaban, M., Zhu, L., Şimşekli, U., and Erdogdu, M. A. (2021). Convergence rates of stochastic gradient descent under infinite noise variance. arXiv:2102.10346v1