pith. sign in

arxiv: 2311.11841 · v4 · submitted 2023-11-20 · 🧮 math.OC · cs.LG

High Probability Guarantees for Random Reshuffling

Pith reviewed 2026-05-24 05:15 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords random reshufflingstochastic gradient descenthigh probability boundsnonconvex optimizationsample complexitystopping criterionergodic convergence
0
0 comments X

The pith

Random reshuffling attains high-probability ergodic complexity for epsilon-stationary points matching best in-expectation bounds up to a log factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the random reshuffling stochastic gradient method achieves high probability convergence to an epsilon-stationary point for smooth nonconvex problems. This result is ergodic, meaning it applies to the average of iterates, and holds without taking an expectation over the random permutations. The derived sample complexity is the same order as the best known expected complexity results, differing only by a logarithmic term, and it uses only standard assumptions on smoothness and gradient variance. The authors also develop a stopping criterion that makes the guarantee apply to the final returned iterate after a finite number of steps.

Core claim

Under standard assumptions, random reshuffling admits a high-probability bound on the ergodic sample complexity for reaching an epsilon-stationary point that matches the best in-expectation bound up to a logarithmic factor, and a computable stopping test extends this guarantee to the last iterate without changing the algorithm.

What carries the argument

A newly established concentration property for the random reshuffling process that provides tail bounds on the deviation of reshuffled stochastic gradients from their mean over epochs.

Load-bearing premise

The derived concentration property for random reshuffling holds true under the smoothness and bounded variance conditions used in the analysis.

What would settle it

A counter-example function that is smooth and has bounded gradient variance but where random reshuffling fails to satisfy the claimed high-probability ergodic bound with the stated probability.

Figures

Figures reproduced from arXiv: 2311.11841 by Hengxu Yu, Xiao Li.

Figure 1
Figure 1. Figure 1: Flowchart of p-RR. Our method is denoted as p-RR and is displayed in Algorithm 3. In particular, whenever a stationary point is detected, p-RR introduces a randomized perturbation to the iterate in Line 12 and performs at most Te escaping RR steps. Line 15 is to detect whether the iterates have moved a sufficient distance within Te iterations. We will establish subsequently that such substantial movement s… view at source ↗
Figure 2
Figure 2. Figure 2: Verification of the Lipschitz gradient and Hessian conditions used in our theoretical developments. [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of performance between RR and SGD. 10 11 12 13 14 15 16 17 18 19 20 21 Number of iterations / epochs 0 10 20 30 40 Count RR SGD (a) Histogram of iterations / epochs t for achieving ∥∇f(xt)∥ ≤ 10−2 over 100 independent runs. 48 55 60 65 74 80 85 90 95 Number of iterations / epochs 0 1 2 3 4 5 6 7 8 9 10 Count RR SGD (b) Histogram of iterations / epochs t for achieving ∥∇f(xt)∥ ≤ 10−3 over 100 ind… view at source ↗
Figure 4
Figure 4. Figure 4: Statistics of iterations / epochs t of RR and SGD for achieving an ε-stationary point (i.e., ∥∇f(xt)∥ ≤ ε) with varying ε. problem. In this section, we conduct experiments to partly verify them along the trajectory of RR for this specific training problem. For verifying the Lipschitz gradient condition, we note that the actual Lipschitz gradient condition we used in Lemma 2.4 can be verified if the estimat… view at source ↗
Figure 5
Figure 5. Figure 5: Evolution of ∥gt∥, ∥∇f(xt)∥, and test accuracy of RR. We display the gradient norm statistics for the last iterate in both algorithms in Figure 3a. It can observed that RR not only tends to yield a smaller gradient norm of the last iterate, but also exhibits a superior concentration property. This empirical observation corroborates our theoretical findings that the gradient norm in RR converges with high p… view at source ↗
read the original abstract

We consider the stochastic gradient method with random reshuffling ($\mathsf{RR}$) for tackling smooth nonconvex optimization problems. $\mathsf{RR}$ finds broad applications in practice, notably in training neural networks. In this work, we provide high probability complexity guarantees for this method. First, we establish a high probability ergodic sample complexity result (without taking expectation) for finding an $\varepsilon$-stationary point. Our derived complexity matches the best existing in-expectation one up to a logarithmic term while imposing no additional assumptions nor modifying $\mathsf{RR}$'s updating rule. Second, building on this analysis, we propose a simple stopping criterion embedded with a computable stopping test for $\mathsf{RR}$ (denoted as $\mathsf{RR}$-$\mathsf{sc}$). This criterion is guaranteed to be triggered after a finite number of iterations, enabling us to prove the same order high probability complexity for the returned last iterate. The fundamental ingredient in deriving the aforementioned results is a new concentration property for random reshuffling, which could be of independent interest. Finally, we conduct numerical experiments on small neural network training to support our theoretical findings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims high-probability ergodic sample complexity bounds for random reshuffling (RR) on smooth nonconvex problems that match the best known in-expectation rates up to a logarithmic factor, without extra assumptions or changes to the RR update. It further introduces a computable stopping criterion (RR-sc) that triggers after finitely many steps and yields the same-order high-probability guarantee for the returned last iterate. Both results rest on a new concentration inequality for the without-replacement sampling structure of RR, presented as the fundamental technical ingredient and potentially of independent interest. Numerical experiments on small neural-network training are included to support the theory.

Significance. If the new concentration property holds under the stated L-smoothness and bounded-variance assumptions, the work would be significant: it supplies the first high-probability (non-expectation) guarantees for unmodified RR that are essentially as strong as the best in-expectation results, which matters for the practical use of RR in neural-network training. The concentration inequality itself could be reusable in other analyses of shuffling-based methods.

major comments (2)
  1. [Section deriving the concentration property (likely §3 or §4)] The high-probability ergodic bound and the RR-sc guarantee both collapse if the new concentration property (the 'fundamental ingredient') fails to deliver the claimed tail decay under only L-smoothness and bounded variance. The derivation must be checked for the handling of epoch-wise dependence induced by without-replacement sampling; any gap here is load-bearing for the central claims.
  2. [Main ergodic complexity theorem] Theorem stating the ergodic high-probability complexity: confirm that the logarithmic overhead is the only difference from the best in-expectation rate and that no hidden constants or additional assumptions (e.g., on gradient norms) are introduced in the passage from expectation to high probability.
minor comments (2)
  1. [Definition of RR-sc] Notation for the stopping test in RR-sc should be clarified so that the computability of the test is immediate from the displayed expressions.
  2. [Experiments] The numerical experiments section would benefit from reporting the observed iteration counts against the theoretical bounds for the tested networks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We are pleased that the significance of the work is recognized. We address each major comment below.

read point-by-point responses
  1. Referee: [Section deriving the concentration property (likely §3 or §4)] The high-probability ergodic bound and the RR-sc guarantee both collapse if the new concentration property (the 'fundamental ingredient') fails to deliver the claimed tail decay under only L-smoothness and bounded variance. The derivation must be checked for the handling of epoch-wise dependence induced by without-replacement sampling; any gap here is load-bearing for the central claims.

    Authors: We confirm that the derivation in the relevant section properly handles the epoch-wise dependence through a carefully constructed martingale argument that respects the without-replacement sampling within each epoch. The concentration inequality is derived under precisely the stated assumptions of L-smoothness and bounded variance, delivering the claimed tail decay. This ensures the high-probability bounds hold as stated. We are happy to elaborate on specific steps if the referee identifies a particular concern. revision: no

  2. Referee: [Main ergodic complexity theorem] Theorem stating the ergodic high-probability complexity: confirm that the logarithmic overhead is the only difference from the best in-expectation rate and that no hidden constants or additional assumptions (e.g., on gradient norms) are introduced in the passage from expectation to high probability.

    Authors: The main ergodic complexity theorem establishes a high-probability bound that differs from the best known in-expectation rate solely by a logarithmic factor in 1/δ (where δ is the failure probability). The transition from expectation to high probability relies exclusively on the new concentration inequality and does not introduce any hidden constants or additional assumptions, such as bounds on gradient norms beyond the problem's standard assumptions. This is explicitly verified in the proof. revision: no

Circularity Check

0 steps flagged

No circularity; new concentration property derived independently under standard assumptions

full rationale

The paper's high-probability ergodic bound and RR-sc stopping criterion are derived from a newly stated concentration inequality for random reshuffling, presented explicitly as an independent technical contribution under L-smoothness and bounded variance. No equations reduce a claimed prediction or result to a fitted parameter or self-citation by construction; the complexity statements are obtained by applying the new inequality to the RR iterates without renaming or smuggling prior ansatzes. The derivation chain remains self-contained against external benchmarks, with the concentration property serving as a verifiable primitive rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view supplies no explicit free parameters, axioms, or invented entities; the analysis is described as relying on standard smoothness and a newly derived concentration inequality whose details are unavailable.

pith-pipeline@v0.9.0 · 5721 in / 1085 out tokens · 22864 ms · 2026-05-24T05:15:16.223733+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

  1. [1]

    Incremental proximal methods for large scale convex optimization

    Dimitri P Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Program- ming, 129(2):163–195, 2011

  2. [2]

    Curiously fast convergence of some stochastic gradient descent algorithms

    Léon Bottou. Curiously fast convergence of some stochastic gradient descent algorithms. In Proceedings of the symposium on learning and data science, volume 8, pages 2624–2633, 2009

  3. [3]

    Stochastic gradient descent tricks

    Léon Bottou. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade, pages 421–436. Springer, 2012

  4. [4]

    Optimization methods for large-scale machine learning

    Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018

  5. [5]

    Tighter lower bounds for shuffling sgd: Random permutations and beyond

    Jaeyoung Cha, Jaewook Lee, and Chulhee Yun. Tighter lower bounds for shuffling sgd: Random permutations and beyond. International Conference on Machine Learning, 2023

  6. [6]

    Nonconvex optimization meets low-rank matrix factorization: An overview

    Yuejie Chi, Yue M Lu, and Yuxin Chen. Nonconvex optimization meets low-rank matrix factorization: An overview. IEEE Transactions on Signal Processing, 67(20):5239–5269, 2019

  7. [7]

    High-probability bounds for non-convex stochastic optimization with heavy tails

    Ashok Cutkosky and Harsh Mehta. High-probability bounds for non-convex stochastic optimization with heavy tails. Adv. in Neural Information Processing Systems, 34:4883–4895, 2021

  8. [8]

    Escaping from saddle points—online stochastic gradient for tensor decomposition

    Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points—online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015

  9. [9]

    Stochastic first- and zeroth-order methods for nonconvex stochastic program- ming

    Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic program- ming. SIAM Journal on Optimization, 2013

  10. [10]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings, 2010

  11. [11]

    Stochastic optimization with heavy-tailed noise via accelerated gradient clipping

    Eduard Gorbunov, Marina Danilova, and Alexander Gasnikov. Stochastic optimization with heavy-tailed noise via accelerated gradient clipping. Adv. in Neural Info. Processing Systems, 2020

  12. [12]

    Note on sampling without replacing from a finite collection of matrices

    David Gross and Vincent Nesme. Note on sampling without replacing from a finite collection of matrices. arXiv preprint arXiv:1001.2738, 2010

  13. [13]

    Convergence rate of incremental gradient and incremental Newton methods

    M Gürbüzbalaban, A Ozdaglar, and Pablo A Parrilo. Convergence rate of incremental gradient and incremental Newton methods. SIAM Journal on Optimization, 29(4):2542–2565, 2019

  14. [14]

    Why random reshuffling beats stochastic gradient descent

    Mert Gürbüzbalaban, Asu Ozdaglar, and PA Parrilo. Why random reshuffling beats stochastic gradient descent. Mathematical Programming, 186(1-2):49–84, 2021

  15. [15]

    Random shuffling beats SGD after finite epochs

    Jeff Haochen and Suvrit Sra. Random shuffling beats SGD after finite epochs. In International Conference on Machine Learning, pages 2624–2633, 2019

  16. [16]

    Nicholas J. A. Harvey, Christopher Liaw, Y . Plan, and Sikander Randhawa. Tight analyses for non-smooth stochastic gradient descent. Annual Conference Computational Learning Theory, 2018

  17. [17]

    Probability inequalities for sums of bounded random variables

    Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963

  18. [18]

    An improved analysis and rates for variance reduction under without-replacement sampling orders

    Xinmeng Huang, Kun Yuan, Xianghui Mao, and Wotao Yin. An improved analysis and rates for variance reduction under without-replacement sampling orders. Advances in Neural Information Processing Systems, 34, 2021

  19. [19]

    How to escape saddle points efficiently

    Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pages 1724–1732, 2017

  20. [20]

    On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points

    Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. On nonconvex optimization for machine learning: Gradients, stochasticity, and saddle points. Journal of the ACM, 68(2):1–29, 2021

  21. [21]

    Better theory for sgd in the nonconvex world

    Ahmed Khaled and Peter Richtárik. Better theory for sgd in the nonconvex world. Transactions on Machine Learning Research, 2022

  22. [22]

    Using statistics to automate stochastic optimization

    Hunter Lang, Lin Xiao, and Pengchuan Zhang. Using statistics to automate stochastic optimization. In Advances in Neural Information Processing Systems, 2019

  23. [23]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  24. [24]

    A unified convergence theorem for stochastic optimization methods

    Xiao Li and Andre Milzarek. A unified convergence theorem for stochastic optimization methods. In Advances in Neural Information Processing Systems, volume 35, pages 33107–33119, 2022. 23 High Probability Guarantees for Random Reshuffling A PREPRINT

  25. [25]

    Convergence of random reshuffling under the kurdyka-łojasiewicz inequality

    Xiao Li, Andre Milzarek, and Junwen Qiu. Convergence of random reshuffling under the kurdyka-łojasiewicz inequality. SIAM Journal on Optimization, 33(2):1092–1120, 2023

  26. [26]

    GraB: Finding provably better data permutations than random reshuffling

    Yucheng Lu, Wentao Guo, and Christopher De Sa. GraB: Finding provably better data permutations than random reshuffling. Neural Information Processing Systems, 2022

  27. [27]

    Random reshuffling with variance reduction: New analysis and better rates

    Grigory Malinovsky, Alibek Sailanbayev, and Peter Richtárik. Random reshuffling with variance reduction: New analysis and better rates. Conference on Uncertainty in Arti. Intell., 2021

  28. [28]

    Random reshuffling: Simple analysis with vast improvements

    Konstantin Mishchenko, Ahmed Khaled, and Peter Richtárik. Random reshuffling: Simple analysis with vast improvements. Advances in Neural Information Processing Systems, 2020

  29. [29]

    Surpassing gradient descent provably: A cyclic incremental method with linear convergence rate

    Aryan Mokhtari, Mert Gürbüzbalaban, and Alejandro Ribeiro. Surpassing gradient descent provably: A cyclic incremental method with linear convergence rate. SIAM Journal on Optimization, 28(2):1420–1447, 2018

  30. [30]

    Incremental subgradient methods for nondifferentiable optimization

    Angelia Nedi´c and Dimitri P Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138, 2001

  31. [31]

    Introductory lectures on convex optimization: A basic course, volume 87

    Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2003

  32. [32]

    A unified convergence analysis for shuffling-type gradient methods

    Lam M Nguyen, Quoc Tran-Dinh, Dzung T Phan, Phuong Ha Nguyen, and Marten Van Dijk. A unified convergence analysis for shuffling-type gradient methods. The Journal of Machine Learning Research, 22(1):9397–9440, 2021

  33. [33]

    Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions

    Vivak Patel. Stopping criteria for, and strong convergence of, stochastic gradient descent on Bottou-Curtis-Nocedal functions. Mathematical Programming, 195(1-2):693–734, 2022

  34. [34]

    Closing the convergence gap of SGD without replacement

    Shashank Rajput, Anant Gupta, and Dimitris Papailiopoulos. Closing the convergence gap of SGD without replacement. International Conference On Machine Learning, 2020

  35. [35]

    Parallel stochastic gradient algorithms for large-scale matrix completion

    Benjamin Recht and Christopher Ré. Parallel stochastic gradient algorithms for large-scale matrix completion. Mathematical Programming Computation, 5(2):201–226, 2013

  36. [36]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951

  37. [37]

    How good is SGD with random shuffling? In Conference on Learning Theory, volume 125, pages 3250–3284, 2020

    Itay Safran and Ohad Shamir. How good is SGD with random shuffling? In Conference on Learning Theory, volume 125, pages 3250–3284, 2020

  38. [38]

    Optimization for deep learning: An overview

    Ruo-Yu Sun. Optimization for deep learning: An overview. Journal of the Operations Research Society of China, 8(2):249–294, 2020

  39. [39]

    Joel A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015

  40. [40]

    A stopping rule for the Robbins-Monro method

    George Yin. A stopping rule for the Robbins-Monro method. Journal of Optimization Theory and Applications, 67(1):151–173, 1990

  41. [41]

    Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond

    Chulhee Yun, Shashank Rajput, and Suvrit Sra. Minibatch vs local SGD with shuffling: Tight convergence bounds and beyond. In International Conference on Learning Representations, 2022

  42. [42]

    Why are adaptive methods good for attention models? Adv

    Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Adv. in Neu. Info. Process. Systems, 2020. 24