pith. sign in

arxiv: 2605.14800 · v2 · pith:IMD7S2TInew · submitted 2026-05-14 · 🧮 math.OC

Avoiding Bias in Clipped SGD for Overparameterized Models under Generalized Smoothness

Pith reviewed 2026-06-30 20:11 UTC · model grok-4.3

classification 🧮 math.OC
keywords clipped SGDnormalized SGDoverparameterized modelsinterpolationgeneralized smoothnessconvergence ratesgradient bias
0
0 comments X

The pith

Clipped and normalized SGD converge without bias at deterministic rates in overparameterized interpolating models under generalized smoothness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when models are overparameterized and achieve zero training loss, clipping or normalizing gradients in SGD avoids the bias that normally appears. Under the (L0,L1)-smoothness condition and a mild batch-size requirement, these methods match the convergence speed of full-gradient methods. The result supplies a theoretical reason for the practical success of clipping in large neural nets. The analysis also handles heavy-tailed noise, a strictly weaker (H0,H1)-smoothness condition, and the deterministic case while improving prior rate bounds.

Core claim

Under overparameterization and the interpolation condition, both clipped SGD and normalized SGD reach the same convergence rates as their deterministic counterparts without incurring the bias that clipping typically introduces. The proof uses the (L0,L1)-smoothness assumption, which relaxes standard Lipschitz-gradient requirements, and yields faster rates than earlier work on generalized smoothness; the same framework covers heavy-tailed noise, (H0,H1)-smoothness, and the deterministic regime.

What carries the argument

The (L0,L1)-smoothness condition together with overparameterization and interpolation, which jointly remove the clipping bias term from the convergence analysis.

If this is right

  • Clipped SGD matches deterministic convergence rates under the stated conditions.
  • Normalized SGD behaves identically with respect to bias elimination.
  • The analysis applies directly to heavy-tailed noise distributions.
  • Rates improve on the best previously known bounds for generalized smoothness.
  • The same guarantees hold in the fully deterministic regime.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Clipping can be used in large-scale training without paying an asymptotic rate penalty once interpolation is reached.
  • The mild batch-size condition suggests a practical threshold that grows slowly with dimension or noise level.
  • The bias-elimination mechanism may extend to other first-order methods that truncate or rescale gradients.
  • Empirical checks on modern architectures that reach near-zero training loss would directly test the prediction.

Load-bearing premise

The model must be overparameterized enough to achieve exact zero training loss on the data.

What would settle it

Clipped SGD exhibiting a slower rate or measurable bias relative to unclipped SGD when training an overparameterized model to zero loss with batches larger than the stated mild threshold would falsify the claim.

Figures

Figures reproduced from arXiv: 2605.14800 by Aleksandr Lobanov, Anastasia Koloskova.

Figure 1
Figure 1. Figure 1: Illustration of the values of ρ obtained from NSGD runs with different learning rates. 5 Discussion and Limitation First, due to space constraints in the main text, we refer the reader to Appendix C. There, we extend the analysis of Section 4 to convex (H0, H1)-smooth (which is strictly weaker than standard assumptions in optimization literature) functions and provide an improved analysis of GD with a warm… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of ClipSGD, NSGD and SGD with tuned hyperparameters. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
read the original abstract

Modern machine learning is dominated by complex, overparameterized architectures capable of interpolating data and achieving zero training loss. For such models, we investigate the convergence properties of two popular modifications to standard SGD: clipped SGD and normalized SGD. We show that under overparameterization and a mild assumption on batch size, both clipped and normalized SGD do not suffer from the bias typically introduced by clipping, converging effectively at the same rate as their deterministic counterparts. This provides a rigorous theoretical justification for the empirical success of gradient clipping methods. In our analysis, we employ the $(L_0,L_1)$-smoothness condition, under which we obtain convergence rates that improve upon the best known results in prior work. Furthermore, we extend our analysis to specific challenging regimes, including heavy-tailed noise, $(H_0,H_1)$-smoothness (which is strictly weaker than standard assumptions in optimization literature) and the deterministic regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that for overparameterized interpolating models (zero training loss), clipped SGD and normalized SGD avoid the usual clipping-induced bias under a mild batch-size assumption. Using the (L0,L1)-smoothness condition, both methods achieve convergence rates matching their deterministic counterparts and improve on prior results; the analysis is extended to heavy-tailed noise, the weaker (H0,H1)-smoothness, and the deterministic regime.

Significance. If the derivations hold, the work supplies a rigorous explanation for the empirical effectiveness of gradient clipping in modern overparameterized networks and advances the state of the art on convergence under generalized smoothness. The key technical contribution is showing that interpolation plus overparameterization eliminates bias without additional assumptions beyond a batch-size lower bound.

minor comments (2)
  1. [Abstract] Abstract: the statement that the new rates 'improve upon the best known results in prior work' would be strengthened by an explicit comparison (e.g., dependence on L0, L1, or dimension) to the rates cited in the introduction.
  2. [Main theorems] The manuscript invokes a 'mild assumption on batch size' to remove bias; a precise statement of this assumption (in terms of n, batch size B, or variance parameters) should appear in the main theorem statements rather than only in the abstract.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their summary of the manuscript and for recognizing its potential significance in providing a rigorous justification for gradient clipping in overparameterized interpolating models under (L0,L1)-smoothness. We note that the report lists no specific major comments, so we have no points to address point-by-point at this time. We remain available to provide clarifications or revisions should additional feedback be supplied.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained under stated assumptions

full rationale

The provided abstract and description present convergence rates as derived from explicit assumptions (overparameterization + interpolation, (L0,L1)-smoothness, mild batch-size condition) that eliminate clipping bias. No equations, self-citations, or steps are quoted that reduce a claimed prediction to a fitted input or prior self-result by construction. The central claim is an implication under those conditions rather than a renaming or self-referential definition. This matches the default expectation for theoretical papers whose results remain falsifiable against the listed assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The central claim rests on domain assumptions standard in optimization theory plus the specific generalized smoothness conditions employed in the analysis.

axioms (3)
  • domain assumption Overparameterization with interpolation (zero training loss)
    Required for the no-bias result under clipping.
  • domain assumption (L0,L1)-smoothness condition
    Milder than standard L-smoothness; used to obtain the stated convergence rates.
  • domain assumption Mild assumption on batch size
    Necessary to eliminate the usual clipping bias.

pith-pipeline@v0.9.1-grok · 5686 in / 1393 out tokens · 54705 ms · 2026-06-30T20:11:26.287171+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

70 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Goodfellow, H

    Martin Abadi, Andy Chu, Ian Goodfellow, H. Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy http://dx.doi.org/10.1145/2976749.2978318. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS’16, page 308–318. ACM, October 2016

  2. [2]

    Why Do We Need Warm-up? A Theoretical Perspective

    Foivos Alimisis, Rustem Islamov, and Aurelien Lucchi. Why do we need warm-up? a theoretical perspective. arXiv preprint arXiv:2510.03164, 2025

  3. [3]

    A convergence theory for deep learning via over-parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International conference on machine learning, pages 242--252. PMLR, 2019

  4. [4]

    Learning theory from first principles

    Francis Bach. Learning theory from first principles. MIT press, 2024

  5. [5]

    Convexity, classification, and risk bounds

    Peter L Bartlett, Michael I Jordan, and Jon D McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101 0 (473): 0 138--156, 2006

  6. [6]

    On the linear convergence of the stochastic gradient method with constant step-size

    Volkan Cevher and Bang C \^o ng V \ u . On the linear convergence of the stochastic gradient method with constant step-size. Optimization Letters, 13 0 (5): 0 1177--1187, 2019

  7. [7]

    Optimal mean estimation without a variance

    Yeshwanth Cherapanamjeri, Nilesh Tripuraneni, Peter Bartlett, and Michael Jordan. Optimal mean estimation without a variance. In Conference on Learning Theory, pages 356--357. PMLR, 2022

  8. [8]

    Convergence of clipped-sgd for convex (l\_0, l\_1) -smooth optimization with heavy-tailed noise

    Savelii Chezhegov, Aleksandr Beznosikov, Samuel Horv \'a th, and Eduard Gorbunov. Convergence of clipped-sgd for convex (l\_0, l\_1) -smooth optimization with heavy-tailed noise. arXiv preprint arXiv:2505.20817, 2025

  9. [9]

    Global minima of overparameterized neural networks

    Yaim Cooper. Global minima of overparameterized neural networks. SIAM Journal on Mathematics of Data Science, 3 0 (2): 0 676--691, 2021

  10. [10]

    Note on a martingale inequality of pisier

    David C Cox. Note on a martingale inequality of pisier. In Mathematical Proceedings of the Cambridge Philosophical Society, volume 92, pages 163--165. Cambridge University Press, 1982

  11. [11]

    Momentum improves normalized sgd

    Ashok Cutkosky and Harsh Mehta. Momentum improves normalized sgd. In International conference on machine learning, pages 2260--2268. PMLR, 2020

  12. [12]

    On the ineffectiveness of variance reduced optimization for deep learning

    Aaron Defazio and L \'e on Bottou. On the ineffectiveness of variance reduced optimization for deep learning. Advances in Neural Information Processing Systems, 32, 2019

  13. [13]

    Saga: A fast incremental gradient method with support for non-strongly convex composite objectives

    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014

  14. [14]

    Convergence of clipped SGD on convex \ (l\_0,l\_1)\ -smooth functions

    Ofir Gaash, Kfir Yehuda Levy, and Yair Carmon. Convergence of clipped SGD on convex \ (l\_0,l\_1)\ -smooth functions. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  15. [15]

    Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework

    Saeed Ghadimi and Guanghui Lan. Optimal stochastic approximation algorithms for strongly convex stochastic composite optimization i: A generic algorithmic framework. SIAM Journal on Optimization, 22 0 (4): 0 1469--1492, 2012

  16. [16]

    Stochastic first-and zeroth-order methods for nonconvex stochastic programming

    Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization, 23 0 (4): 0 2341--2368, 2013

  17. [17]

    Methods for convex \ (l\_0,l\_1)\ -smooth optimization: Clipping, acceleration, and adaptivity

    Eduard Gorbunov, Nazarii Tupitsa, Sayantan Choudhury, Alen Aliev, Peter Richt \'a rik, Samuel Horv \'a th, and Martin Tak \'a c . Methods for convex \ (l\_0,l\_1)\ -smooth optimization: Clipping, acceleration, and adaptivity. In The Thirteenth International Conference on Learning Representations, 2025

  18. [18]

    Sgd for structured nonconvex functions: Learning rates, minibatching and interpolation

    Robert Gower, Othmane Sebbouh, and Nicolas Loizou. Sgd for structured nonconvex functions: Learning rates, minibatching and interpolation. In International Conference on Artificial Intelligence and Statistics, pages 1315--1323. PMLR, 2021

  19. [19]

    Sgd: General analysis and improved rates

    Robert Mansel Gower, Nicolas Loizou, Xun Qian, Alibek Sailanbayev, Egor Shulgin, and Peter Richt \'a rik. Sgd: General analysis and improved rates. In International conference on machine learning, pages 5200--5209. PMLR, 2019

  20. [20]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770--778, 2016

  21. [21]

    Gradient correlation is a key ingredient to accelerate sgd with momentum

    Julien Hermant, Marien Renaud, Jean-Fran c ois Aujol, Charles Dossal, and Aude Rondepierre. Gradient correlation is a key ingredient to accelerate sgd with momentum. In International Conference on Learning Representations, 2025, 2025

  22. [22]

    Parameter-agnostic optimization under relaxed smoothness

    Florian H \"u bler, Junchi Yang, Xiang Li, and Niao He. Parameter-agnostic optimization under relaxed smoothness. In International Conference on Artificial Intelligence and Statistics, pages 4861--4869. PMLR, 2024

  23. [23]

    From gradient clipping to normalization for heavy tailed sgd

    Florian H \"u bler, Ilyas Fatkhullin, and Niao He. From gradient clipping to normalization for heavy tailed sgd. In International Conference on Artificial Intelligence and Statistics, pages 2413--2421. PMLR, 2025

  24. [24]

    Double momentum and error feedback for clipping with fast rates and differential privacy

    Rustem Islamov, Samuel Horvath, Aurelien Lucchi, Peter Richtarik, and Eduard Gorbunov. Double momentum and error feedback for clipping with fast rates and differential privacy. arXiv preprint arXiv:2502.11682, 2025

  25. [25]

    Accelerating stochastic gradient descent using predictive variance reduction

    Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013

  26. [26]

    First order methods for nonsmooth convex large-scale optimization, ii: utilizing problems structure

    Anatoli Juditsky, Arkadi Nemirovski, et al. First order methods for nonsmooth convex large-scale optimization, ii: utilizing problems structure. Optimization for Machine Learning, 30 0 (9): 0 149--183, 2011

  27. [27]

    Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition

    Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition. In Joint European conference on machine learning and knowledge discovery in databases, pages 795--811. Springer, 2016

  28. [28]

    Scaffold: Stochastic controlled averaging for federated learning

    Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Theertha Suresh. Scaffold: Stochastic controlled averaging for federated learning. In International conference on machine learning, pages 5132--5143. PMLR, 2020

  29. [29]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  30. [30]

    A unified theory of decentralized sgd with changing topology and local updates

    Anastasia Koloskova, Nicolas Loizou, Sadra Boreiri, Martin Jaggi, and Sebastian Stich. A unified theory of decentralized sgd with changing topology and local updates. In International conference on machine learning, pages 5381--5393. PMLR, 2020

  31. [31]

    Revisiting gradient clipping: Stochastic bias and tight convergence guarantees

    Anastasia Koloskova, Hadrien Hendrikx, and Sebastian U Stich. Revisiting gradient clipping: Stochastic bias and tight convergence guarantees. In International Conference on Machine Learning, pages 17343--17363. PMLR, 2023

  32. [32]

    Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance

    Nikita Kornilov, Ohad Shamir, Aleksandr Lobanov, Darina Dvinskikh, Alexander Gasnikov, Innokentiy Shibaev, Eduard Gorbunov, and Samuel Horv \'a th. Accelerated zeroth-order method for non-smooth stochastic convex optimization problem with infinite variance. Advances in Neural Information Processing Systems, 36: 0 64083--64102, 2023

  33. [33]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. In Toronto, ON, Canada, 2009

  34. [34]

    An optimal method for stochastic composite optimization

    Guanghui Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133 0 (1): 0 365--397, 2012

  35. [35]

    First-order and stochastic optimization methods for machine learning, volume 1

    Guanghui Lan. First-order and stochastic optimization methods for machine learning, volume 1. Springer, 2020

  36. [36]

    Theoretical analysis on how learning rate warmup accelerates convergence

    Yuxing Liu, Yuze Ge, Rui Pan, An Kang, and Tong Zhang. Theoretical analysis on how learning rate warmup accelerates convergence. arXiv preprint arXiv:2509.07972, 2025

  37. [37]

    Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping https://openreview.net/forum?id=NKotdPUc3L

    Zijian Liu and Zhengyuan Zhou. Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping https://openreview.net/forum?id=NKotdPUc3L. In The Thirteenth International Conference on Learning Representations, 2025

  38. [38]

    Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise

    Zijian Liu, Jiawei Zhang, and Zhengyuan Zhou. Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise. In The Thirty Sixth Annual Conference on Learning Theory, pages 2266--2290. PMLR, 2023

  39. [39]

    Power of generalized smoothness in stochastic convex optimization: First-and zero-order algorithms

    Aleksandr Lobanov and Alexander Gasnikov. Power of generalized smoothness in stochastic convex optimization: First-and zero-order algorithms. arXiv preprint arXiv:2501.18198, 2025

  40. [40]

    black-box

    Aleksandr Lobanov, Nail Bashirov, and Alexander Gasnikov. The “black-box” optimization problem: Zero-order accelerated stochastic method via kernel approximation. Journal of Optimization Theory and Applications, 203 0 (3): 0 2451--2486, 2024 a

  41. [41]

    Linear convergence rate in convex setup is possible! gradient descent method variants under (l\_0, l\_1) -smoothness

    Aleksandr Lobanov, Alexander Gasnikov, Eduard Gorbunov, and Martin Tak \'a c . Linear convergence rate in convex setup is possible! gradient descent method variants under (l\_0, l\_1) -smoothness. arXiv preprint arXiv:2412.17050, 2024 b

  42. [42]

    Highly smooth zeroth-order methods for solving optimization problems under the pl condition

    Aleksandr Lobanov, Alexander Gasnikov, and Fedor Stonyakin. Highly smooth zeroth-order methods for solving optimization problems under the pl condition. Computational Mathematics and Mathematical Physics, 64 0 (4): 0 739--770, 2024 c

  43. [43]

    The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning

    Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325--3334. PMLR, 2018

  44. [44]

    Statistical language models based on neural networks

    Tom \'a s Mikolov et al. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April, 80 0 (26), 2012

  45. [45]

    Non-asymptotic analysis of stochastic approximation algorithms for machine learning

    Eric Moulines and Francis Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. Advances in neural information processing systems, 24, 2011

  46. [46]

    Deep double descent: Where bigger models and more data hurt

    Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, and Ilya Sutskever. Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021 0 (12): 0 124003, 2021

  47. [47]

    Problem complexity and method efficiency in optimization

    Arkadi Nemirovski and D Yudin. Problem complexity and method efficiency in optimization. Wiley, 1983

  48. [48]

    Robust stochastic approximation approach to stochastic programming

    Arkadi Nemirovski, Anatoli Juditsky, Guanghui Lan, and Alexander Shapiro. Robust stochastic approximation approach to stochastic programming. SIAM Journal on optimization, 19 0 (4): 0 1574--1609, 2009

  49. [49]

    Introductory lectures on convex optimization: A basic course, volume 87

    Yurii Nesterov. Introductory lectures on convex optimization: A basic course, volume 87. Springer Science & Business Media, 2013

  50. [50]

    Minimization methods for nonsmooth convex and quasiconvex functions

    Yurii E Nesterov. Minimization methods for nonsmooth convex and quasiconvex functions. Matekon, 29 0 (3): 0 519--531, 1984

  51. [51]

    Martingales with values in uniformly convex spaces

    Gilles Pisier. Martingales with values in uniformly convex spaces. Israel Journal of Mathematics, 20 0 (3): 0 326--350, 1975

  52. [52]

    Gradient methods for the minimisation of functionals

    Boris T Polyak. Gradient methods for the minimisation of functionals. USSR Computational Mathematics and Mathematical Physics, 3 0 (4): 0 864--878, 1963

  53. [53]

    Stochastic variance reduction for nonconvex optimization

    Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314--323. PMLR, 2016

  54. [54]

    A stochastic approximation method

    Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400--407, 1951

  55. [55]

    A stochastic gradient method with an exponential convergence rate for finite training sets

    Nicolas Roux, Mark Schmidt, and Francis Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. Advances in neural information processing systems, 25, 2012

  56. [56]

    Minimizing finite sums with the stochastic average gradient

    Mark Schmidt, Nicolas Le Roux, and Francis Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162 0 (1): 0 83--112, 2017

  57. [57]

    Sparsified sgd with memory

    Sebastian U Stich, Jean-Baptiste Cordonnier, and Martin Jaggi. Sparsified sgd with memory. Advances in neural information processing systems, 31, 2018

  58. [58]

    Optimizing \ (l\_0, l\_1)\ -smooth functions by gradient methods https://openreview.net/forum?id=GQ1Tc3vHbt

    Daniil Vankov, Anton Rodomanov, Angelia Nedich, Lalitha Sankar, and Sebastian U Stich. Optimizing \ (l\_0, l\_1)\ -smooth functions by gradient methods https://openreview.net/forum?id=GQ1Tc3vHbt. In The Thirteenth International Conference on Learning Representations, 2025

  59. [59]

    Armijo line-search can make (stochastic) gradient descent provably faster

    Sharan Vaswani and Reza Babanezhad. Armijo line-search can make (stochastic) gradient descent provably faster. arXiv preprint arXiv:2503.00229, 2025

  60. [60]

    Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron

    Sharan Vaswani, Francis Bach, and Mark Schmidt. Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron. In The 22nd international conference on artificial intelligence and statistics, pages 1195--1204. PMLR, 2019 a

  61. [61]

    Painless stochastic gradient: Interpolation, line-search, and convergence rates

    Sharan Vaswani, Aaron Mishkin, Issam Laradji, Mark Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. Advances in neural information processing systems, 32, 2019 b

  62. [62]

    Inequalities for the rth absolute moment of a sum of random variables, 1 r 2

    Bengt von Bahr and Carl-Gustav Esseen. Inequalities for the rth absolute moment of a sum of random variables, 1 r 2 . The Annals of Mathematical Statistics, pages 299--303, 1965

  63. [63]

    An even more optimal stochastic optimization algorithm: minibatching and interpolation learning

    Blake E Woodworth and Nathan Srebro. An even more optimal stochastic optimization algorithm: minibatching and interpolation learning. Advances in neural information processing systems, 34: 0 7333--7345, 2021

  64. [64]

    Faster single-loop algorithms for minimax optimization without strong concavity

    Junchi Yang, Antonio Orvieto, Aurelien Lucchi, and Niao He. Faster single-loop algorithms for minimax optimization without strong concavity. In International conference on artificial intelligence and statistics, pages 5485--5517. PMLR, 2022

  65. [65]

    On the lower bound of minimizing polyak- ojasiewicz functions

    Pengyun Yue, Cong Fang, and Zhouchen Lin. On the lower bound of minimizing polyak- ojasiewicz functions. In The Thirty Sixth Annual Conference on Learning Theory, pages 2948--2968. PMLR, 2023

  66. [66]

    Improved analysis of clipping algorithms for non-convex optimization

    Bohang Zhang, Jikai Jin, Cong Fang, and Liwei Wang. Improved analysis of clipping algorithms for non-convex optimization. Advances in Neural Information Processing Systems, 33: 0 15511--15521, 2020 a

  67. [67]

    Why gradient clipping accelerates training: A theoretical justification for adaptivity

    Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. In International Conference on Learning Representations, 2020 b

  68. [68]

    Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33: 0 15383--15393, 2020 c

    Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33: 0 15383--15393, 2020 c

  69. [69]

    On the convergence and improvement of stochastic normalized gradient descent

    Shen-Yi Zhao, Yin-Peng Xie, and Wu-Jun Li. On the convergence and improvement of stochastic normalized gradient descent. Science China Information Sciences, 64 0 (3): 0 132103, 2021

  70. [70]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...