A stochastic gradient algorithm for non-separable optimization with convergence guarantee

Ruofan Wu; Yingzhou Li

arxiv: 2606.10383 · v1 · pith:JDWNFNBLnew · submitted 2026-06-09 · 🧮 math.OC · cs.NA· math.NA

A stochastic gradient algorithm for non-separable optimization with convergence guarantee

Yingzhou Li , Ruofan Wu This is my paper

Pith reviewed 2026-06-27 12:35 UTC · model grok-4.3

classification 🧮 math.OC cs.NAmath.NA

keywords stochastic gradient descentnon-separable optimizationconvergence analysislocal linear convergencefixed step sizebatch gradientsurrogate gradient

0 comments

The pith

An SGD-style method for non-separable losses proves local linear convergence under local strong convexity using fixed step sizes in explicitly bounded ranges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a stochastic gradient framework for optimization problems where the loss depends on dataset-level quantities rather than individual samples. It introduces two gradient constructs, an ideal batch gradient G and a cached surrogate H, that allow handling non-separable cases while reducing to standard SGD when separability holds. The core result is a unified convergence theory showing local linear rates under strong convexity and sublinear O(1/k) rates under convexity, all with fixed step sizes whose allowable ranges are explicitly bounded in terms of problem parameters like batch size and cache staleness. This matters because many practical objectives, such as those involving global statistics, fall outside the separable assumption that standard SGD theory requires.

Core claim

The central discovery is that both the G-driven and H-driven updates converge locally at linear rate when the objective is locally strongly convex, and at O(1/k) rate when locally convex, provided the step size is fixed within ranges derived from smoothness and Jacobian bounds; these rates hold uniformly for the two update types and explicitly quantify the effects of staleness and approximation error.

What carries the argument

The pair of batch-gradient constructs G (ideal per-batch gradient) and H (cached surrogate), which enable the algorithm to handle non-separable terms while preserving SGD-like updates.

If this is right

Fixed step sizes suffice for the stated local rates without needing diminishing schedules.
The allowable step-size range shrinks with increased cache staleness or surrogate error.
In the separable case the method recovers the standard mini-batch SGD convergence guarantees.
Batch size directly enters the bounds on convergence constants and step-size limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach could extend to other non-separable settings like federated learning where global aggregates appear in the loss.
The explicit dependence on Jacobian bounds suggests that tighter control on data heterogeneity might enlarge the usable step-size range.
Testing on problems with known local strong convexity, such as certain regularized empirical risks with global features, would directly check the linear rate prediction.

Load-bearing premise

The objective must satisfy local smoothness and have bounded Jacobians so that the gradient approximations remain controlled.

What would settle it

Observe whether the method fails to converge linearly on a locally strongly convex non-separable problem when the step size is set inside the paper's derived range but the Jacobian bound is violated.

Figures

Figures reproduced from arXiv: 2606.10383 by Ruofan Wu, Yingzhou Li.

**Figure 5.1.** Figure 5.1: Convergence contrast under different convexity assu [PITH_FULL_IMAGE:figures/full_fig_p015_5_1.png] view at source ↗

**Figure 5.2.** Figure 5.2: Full objective gap F(Y ) − F ∗ and gradient norm k∇Y FkF (log scale) for the three schemes G/H/naive. This composite figure shows the evolution of the full objective gap and the corresponding gradient magnitude over training. 5.2 Experiment 2: Gram objective We consider the coupled-sample objective F(Y ) = 1 2 k 1 d Y Y ⊤ − Sk 2 F, with d = 8, Y ∈ R 200×8 , batch size 16. We parameterize each row y ⊤ i o… view at source ↗

**Figure 5.3.** Figure 5.3: Diagnostics for methods G (left) and H (right). Log–log plot of training loss versus epochs. Late-stage log–log fits yield slopes ≈ −1.467 (R2 ≈ 0.9854) for G and ≈ −1.483 (R2 ≈ 0.9871) for H, indicating Fk ∝ k −1.467 and Fk ∝ k −1.483 respectively. estimator; the gradient-norm plots give the same ordering. Furthermore, in the experiments, late-stage log–log diagnostics reveal sublinear decay of the gap … view at source ↗

read the original abstract

We study non-separable objectives in which the loss depend on dataset-level quantities. We introduce an SGD-style framework that employs two batch-gradient constructs: the ideal per-batch gradient `$G$' and a cached surrogate `$H$' for cases where full-data terms are expensive. Notably, in the sample-wise separable case, our method reduces to standard mini-batch SGD. Our main contribution is a unified local convergence theory: under mild smoothness and Jacobian-boundedness assumptions, we prove local linear convergence under local strong convexity and local $O(1/k)$ sublinear convergence under local convexity for both `$G$'-driven and `$H$'-driven updates. Crucially, these guarantees hold for fixed step sizes within explicitly characterized ranges; we provide explicit bounds showing how cache staleness, surrogate approximation error, batch size, and step size influence the convergence constants and allowable step-size ranges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives explicit fixed-stepsize local rates for SGD with cached surrogates on non-separable objectives, but the O(1/k) claim under convexity rests on an unverified neighborhood invariance.

read the letter

The main takeaway is that this work extends stochastic gradient methods to losses that depend on dataset-level quantities by introducing an ideal batch gradient G and a cached surrogate H, then proving local linear convergence under strong convexity and local O(1/k) under convexity for both update types.

What is new is the unified treatment that tracks cache staleness and surrogate error explicitly and gives concrete intervals for the fixed step size in terms of those quantities plus batch size. When the objective is separable the method collapses to ordinary mini-batch SGD, which is a clean sanity check.

The analysis is careful about stating the mild smoothness and Jacobian-boundedness assumptions and about deriving the allowable step-size ranges. That level of explicitness is useful.

The soft spot is exactly the one flagged in the stress-test note. The Jacobian bound is local. Under local convexity alone there is no contraction to keep the iterates inside the region where the bound holds, so the O(1/k) guarantee for the H-driven recursion is not automatically closed. The linear-rate case is safer because strong convexity supplies the contraction. If the full proofs contain an extra invariance argument that keeps the iterates inside the neighborhood, the gap disappears; the abstract does not show it.

This paper is for optimization researchers who need to handle non-separable terms with expensive full-data gradients and who value explicit constants over asymptotic statements. A reader looking for practical step-size rules will find something usable.

It deserves a serious referee. The framework is concrete and the claims are checkable once the proofs are examined.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces an SGD-style framework for non-separable objectives depending on dataset-level quantities. It defines an ideal per-batch gradient G and a cached surrogate H, notes that the method reduces to standard mini-batch SGD in the separable case, and claims a unified local convergence theory: under mild smoothness and Jacobian-boundedness assumptions, both G-driven and H-driven updates achieve local linear convergence under local strong convexity and local O(1/k) sublinear convergence under local convexity, with these guarantees holding for fixed step sizes in explicitly characterized ranges that incorporate cache staleness, surrogate error, batch size, and step size.

Significance. If the stated local convergence results and explicit step-size intervals can be rigorously established, the work supplies a useful theoretical extension of SGD analysis to non-separable problems that arise in practice, together with concrete guidance on allowable step sizes and the effect of staleness.

major comments (1)

[Abstract] Abstract: the local O(1/k) claim for the H-driven recursion under mere local convexity rests on the Jacobian-boundedness assumption remaining valid for all subsequent iterates. Local convexity supplies no contraction that would keep the sequence inside the neighborhood where the bound holds; once H is formed from a stale cache the effective perturbation can grow, so the invariance needed to invoke the bound indefinitely is not automatic and must be established separately before the explicit step-size interval can be asserted.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for highlighting this subtlety in the local-convergence argument. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the local O(1/k) claim for the H-driven recursion under mere local convexity rests on the Jacobian-boundedness assumption remaining valid for all subsequent iterates. Local convexity supplies no contraction that would keep the sequence inside the neighborhood where the bound holds; once H is formed from a stale cache the effective perturbation can grow, so the invariance needed to invoke the bound indefinitely is not automatic and must be established separately before the explicit step-size interval can be asserted.

Authors: We agree that the invariance of the neighborhood under the H-driven recursion with only local convexity is not automatic and must be shown explicitly before the step-size interval can be asserted. The manuscript already derives explicit step-size bounds that incorporate the surrogate error and cache staleness; however, these bounds were obtained under the standing assumption that iterates remain inside the local region. To close the gap we will add a short supporting lemma (in the revised Section 4) proving that, under the stated step-size restriction and the local Lipschitz/Jacobian bounds, the H-driven sequence cannot escape the ball in which the Jacobian bound is valid. The abstract claim will be qualified accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity; convergence derived from stated assumptions

full rationale

The paper derives local linear and O(1/k) convergence rates for G-driven and H-driven updates directly from explicit assumptions (mild smoothness, Jacobian-boundedness, local strong convexity or convexity) and provides explicit step-size ranges. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the claimed derivation. The analysis is self-contained against the listed assumptions and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proofs rely on standard assumptions from optimization theory for local convergence analysis.

axioms (2)

domain assumption mild smoothness assumption
Used to establish the convergence rates for the algorithm.
domain assumption Jacobian-boundedness assumption
Invoked to bound the error in the surrogate gradient H.

pith-pipeline@v0.9.1-grok · 5681 in / 1018 out tokens · 27601 ms · 2026-06-27T12:35:51.576558+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

44 extracted references · 11 canonical work pages

[1]

467 and Fk ∝ k− 1

9871) for H, indicating Fk ∝ k− 1. 467 and Fk ∝ k− 1. 483 respectively. estimator; the gradient-norm plots give the same ordering. Furth ermore, in the experiments, late-stage log–log diagnostics reveal sublinear decay of the gap F − F ∗, consistent with the O(1/k ) convergence behavior established theoretically. Collectively, the two experiments conﬁrm t...
[2]

Adaptive subgradien t methods for online learning and stochastic optimization

John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradien t methods for online learning and stochastic optimization”. In: Journal of Machine Learning Research 12.7 (2011), pp. 2121–2159. url: http://jmlr.org/papers/v12/duchi11a.html

2011
[3]

ADADELTA: An adaptive learning rate method

Matthew D. Zeiler. “ADADELTA: An adaptive learning rate method ”. In: arXiv preprint (2012). arXiv: 1212.5701 [cs.LG]

Pith/arXiv arXiv 2012
[4]

Adam: A method for stochastic o ptimization

Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic o ptimization”. In: arXiv preprint (2014). arXiv: 1412.6980 [cs.LG]

Pith/arXiv arXiv 2014
[5]

Math Program

Mark Schmidt, Nicolas Le Roux, and Francis Bach. “Minimizing ﬁnite s ums with the stochas- tic average gradient”. In: Mathematical Programming 162.1 (2017), pp. 83–112. doi: 10.1007/s10107-

work page doi:10.1007/s10107- 2017
[6]

Accelerating stochastic gradient descent using predictive vari- ance reduction

Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive vari- ance reduction”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 26. 2013, pp. 315–323. url: https://proceedings.neurips.cc/paper/2013/hash/ac1dd209cbcc5e5d1c6

2013
[7]

SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. “SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives”. In: Advances in Neural Information Processing Systems (NeurIPS) . Vol. 27. 2014, pp. 1646–1654. url: https://proceedings.neurips.cc/paper/2014/hash/f7cade80b7cc92b991cf4d2806d6bd78-Abst

2014
[8]

Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions

Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. “ Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc., 201 4. url: https://proceedings.neurips.cc/paper_files/paper/2014/file/9638ddfc7e3d56a611292c157

2014
[9]

Debiased Contrastive Learning

Ching-Yao Chuang et al. “Debiased Contrastive Learning”. In: Advances in Neural Infor- mation Processing Systems . Vol. 33. Curran Associates, Inc., 2020, pp. 8765–8775. url: https://neurips.cc/virtual/2020/public/poster_63c3ddcc7b23daa1e42dc41f9a44a873.html

2020
[10]

Variance Reduced Stochastic Proximal Algorithm for AUC Maximization

Soham Dan and Dushyant Sahoo. “Variance Reduced Stochastic Proximal Algorithm for AUC Maximization”. In: Machine Learning and Knowledge Discovery in Databases. Res earch Track. Springer International Publishing, 2021, pp. 184–199. isbn: 9783030865238. doi: 10.1007/978-3-030-86523-8_12 . url: http://dx.doi.org/10.1007/978-3-030-86523-8_12

work page doi:10.1007/978-3-030-86523-8_12 2021
[11]

E., and Nocedal, J

L´ eon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimizat ion methods for large-scale machine learning”. In: SIAM Review 60.2 (2018), pp. 223–311. doi: 10.1137/16M1080173

work page doi:10.1137/16m1080173 2018
[12]

Robust stochastic approximation approach to stochastic programming

Arkadi Nemirovski et al. “Robust stochastic approximation ap proach to stochastic program- ming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609. doi: 10.1137/070704277

work page doi:10.1137/070704277 2009
[13]

Gradient methods for minimizing composite fun ctions

Yurii Nesterov. “Gradient methods for minimizing composite fun ctions”. In: Mathematical Programming 140.1 (2013), pp. 125–161. doi: 10.1007/s10107-012-0629-5

work page doi:10.1007/s10107-012-0629-5 2013
[14]

On the importance of initialization and mome ntum in deep learn- ing

Ilya Sutskever et al. “On the importance of initialization and mome ntum in deep learn- ing”. In: Proceedings of the 30th International Conference on Machin e Learning (ICML) . Vol. 28. Proceedings of Machine Learning Research. PMLR, 2013, p p. 1139–1147. url: https://proceedings.mlr.press/v28/sutskever13.html

2013
[15]

On Acceleration with Noise- Corrupted Gradients

Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. “On Acceleration with Noise- Corrupted Gradients”. In: Proceedings of the 35th International Conference on Machin e Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of M achine Learning Research. 10–15 July. PMLR, July 2018, pp. 1019–1028. url: https://proceedings.mlr.press/v80/co

2018
[16]

Katyusha: The ﬁrst direct acceleration of stochastic gradient methods

Zeyuan Allen-Zhu. “Katyusha: The ﬁrst direct acceleration of stochastic gradient methods”. In: Journal of Machine Learning Research 18.221 (2018), pp. 1–51. url: http://jmlr.org/papers/v18 32

2018
[17]

A Universal Cat alyst for First-Order Op- timization

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. “A Universal Cat alyst for First-Order Op- timization”. In: Advances in Neural Information Processing Systems . Ed. by C. Cortes et al. Vol. 28. Curran Associates, Inc., 2015. url: https://proceedings.neurips.cc/paper_files/paper

2015
[18]

Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization

Roy Frostig et al. “Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization”. In: International Conference on Machine Learn- ing. Vol. 37. Proceedings of Machine Learning Research. PMLR, 2015, pp. 2540–2548. url: https://proceedings.mlr.press/v37/frostig15.html

2015
[19]

Stochastic gradient descent for nonconve x learning without bounded gradient assumptions

Yunwen Lei et al. “Stochastic gradient descent for nonconve x learning without bounded gradient assumptions”. In: IEEE Transactions on Neural Networks and Learning Systems 31.10 (2019), pp. 4394–4400. doi: 10.1109/TNNLS.2019.2952217

work page doi:10.1109/tnnls.2019.2952217 2019
[20]

On the convergence of st ochastic gradient descent with adaptive stepsizes

Xiaoyu Li and Francesco Orabona. “On the convergence of st ochastic gradient descent with adaptive stepsizes”. In: International Conference on Artiﬁcial Intelligence and St atis- tics. Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 983–992. url: https://proceedings.mlr.press/v89/li19c.html

2019
[21]

2021 , issue_date =

Chiyuan Zhang et al. “Understanding deep learning (still) require s rethinking generaliza- tion”. In: Communications of the ACM 64.3 (2021), pp. 107–115. doi: 10.1145/3446776

work page doi:10.1145/3446776 2021
[22]

Eﬃcientnet: Rethinking model sca ling for convolutional neural networks

Mingxing Tan and Quoc V. Le. “Eﬃcientnet: Rethinking model sca ling for convolutional neural networks”. In: Proceedings of the 36th International Conference on Machin e Learning (ICML). Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 6105–6114. url: https://proceedings.mlr.press/v97/tan19a.html

2019
[23]

Gpipe: Eﬃcient training of giant neural net works using pipeline par- allelism

Yanping Huang et al. “Gpipe: Eﬃcient training of giant neural net works using pipeline par- allelism”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 32. 2019, pp. 103–112. url: https://proceedings.neurips.cc/paper/2019/hash/093f65e080a295f8076b1c

2019
[24]

Big transfer (bit): General visual representation learning

Alexander Kolesnikov et al. “Big transfer (bit): General visual representation learning”. In: Computer Vision – ECCV 2020: 16th European Conference . Vol. 12350. Lecture Notes in Computer Science. Springer, 2020, pp. 491–507. doi: 10.1007/978-3-030-58558-7_29

work page doi:10.1007/978-3-030-58558-7_29 2020
[25]

Reconciling modern machine-learning practice and the classical bias–variance trade-off

Mikhail Belkin et al. “Reconciling modern machine-learning practice and the classical bias– variance trade-oﬀ”. In: Proceedings of the National Academy of Sciences 116.32 (2019), pp. 15849–15854. doi: 10.1073/pnas.1903070116

work page doi:10.1073/pnas.1903070116 2019
[26]

Densely Connected Convolutional Networks,

Gao Huang et al. “Densely connected convolutional networks” . In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVP R). 2017, pp. 4700–4708. doi: 10.1109/CVPR.2017.243

work page doi:10.1109/cvpr.2017.243 2017
[27]

Empirical analysis of the Hessian of over-p arametrized neural networks

Levent Sagun et al. “Empirical analysis of the Hessian of over-p arametrized neural networks”. In: arXiv preprint (2017). arXiv: 1706.04454 [cs.LG]

Pith/arXiv arXiv 2017
[28]

On exponential convergence of SGD in non- convex over-parametrized learning

Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non- convex over-parametrized learning. Tech. rep. arXiv, 2018. arXiv: 1802.06904 [cs.LG]

arXiv 2018
[29]

Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron

Sharan Vaswani, Francis Bach, and Mark Schmidt. “Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron” . In: International Conference on Artiﬁcial Intelligence and Statistics . Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 1195–1204. url: https://proceedings.mlr.press/v89/vaswani19a.html

2019
[30]

Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems

Chaoyue Liu et al. “Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems”. In: Advances in Neural Information Processing Systems . Ed. by A. Oh et al. Vol. 36. Curran Associates, Inc., 2023, pp. 60748–60767 . url: https://proceedings.neurip

2023
[31]

Better theory for SGD in t he nonconvex world

Ahmed Khaled and Peter Richt´ arik. “Better theory for SGD in t he nonconvex world”. In: arXiv preprint (2020). arXiv: 2002.03329 [cs.LG] . 33

arXiv 2020
[32]

SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation

Robert M. Gower, Othmane Sebbouh, and Nicolas Loizou. “SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation”. In : International Conference on Artiﬁcial Intelligence and Statistics . Vol. 130. Proceedings of Machine Learning Research. PMLR, 2021, pp. 1311–1319. url: https://proceedings.mlr.press/v130/gower21a.html

2021
[33]

Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball

Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. “Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball”. In: Conference on Learning Theory . Vol. 134. Proceedings of Machine Learning Research. PMLR, 2021, pp. 3932–3972. url: https://proceedings.mlr.press/v134/sebbouh21a.html

2021
[34]

Accelerated st ochastic optimization methods under quasar-convexity

Qiang Fu, Dongchu Xu, and Ashia Camage Wilson. “Accelerated st ochastic optimization methods under quasar-convexity”. In: International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 1053 5–10562. url: https://proceeding

2023
[35]

On the convergence of ﬁrst order methods for qua sar-convex optimization

Jikai Jin. “On the convergence of ﬁrst order methods for qua sar-convex optimization”. In: arXiv preprint (2020). arXiv: 2010.04937 [cs.LG]

arXiv 2020
[36]

On information and sufficiency,

Herbert Robbins and Sutton Monro. “A Stochastic Approximat ion Method”. In: The Annals of Mathematical Statistics 22.3 (Sept. 1951), pp. 400–407. issn: 0003-4851. doi: 10.1214/aoms/1177729 url: http://dx.doi.org/10.1214/aoms/1177729586

work page doi:10.1214/aoms/1177729 1951
[37]

Large-Scale Machine Learning with Stochastic Gradient Descent

L´ eon Bottou. “Large-Scale Machine Learning with Stochastic Gradient Descent”. In: Pro- ceedings of COMPSTAT’2010. Ed. by Yves Lechevallier and Gilbert Saporta. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186. isbn: 978-3-7908-2604-3

2010
[38]

Large Scale Distributed Deep Networks

Jeﬀrey Dean et al. “Large Scale Distributed Deep Networks”. I n: Advances in Neural In- formation Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran Associates, Inc., 2012. url: https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815

2012
[39]

Deep Sets

Manzil Zaheer et al. “Deep Sets”. In: Advances in Neural Information Processing Systems . Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. url: https://proceedings.neurips.cc/p

2017
[40]

On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima”. In: International Conference on Learning Representations . 2017. url: https://openreview.net/forum?id=H1oyRlYgg

2017
[41]

An overview of gradient descent optimizatio n algorithms

Sebastian Ruder. “An overview of gradient descent optimizatio n algorithms”. In: CoRR abs/1609.04747 (2016). arXiv: 1609.04747. url: http://arxiv.org/abs/1609.04747

Pith/arXiv arXiv 2016
[42]

SpecNet2: Orthog onalization-free spectral embedding by neural networks

Ziyu Chen, Yingzhou Li, and Xiuyuan Cheng. “SpecNet2: Orthog onalization-free spectral embedding by neural networks”. In: Proceedings of Mathematical and Scientiﬁc Machine Learning. Vol. 190. Proceedings of Machine Learning Research. PMLR, 2022

2022
[43]

Neura l tangent kernel: Convergence and generalization in neural networks

Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. “Neura l tangent kernel: Convergence and generalization in neural networks”. In: Advances in Neural Information Processing Sys- tems. Vol. 31. 2018

2018
[44]

A convergence theory for deep learning via over-parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. “A convergence theory for deep learning via over-parameterization”. In: International Conference on Machine Learning . Vol. 97. Pro- ceedings of Machine Learning Research. PMLR, 2019, pp. 242–252 . 34

2019

[1] [1]

467 and Fk ∝ k− 1

9871) for H, indicating Fk ∝ k− 1. 467 and Fk ∝ k− 1. 483 respectively. estimator; the gradient-norm plots give the same ordering. Furth ermore, in the experiments, late-stage log–log diagnostics reveal sublinear decay of the gap F − F ∗, consistent with the O(1/k ) convergence behavior established theoretically. Collectively, the two experiments conﬁrm t...

[2] [2]

Adaptive subgradien t methods for online learning and stochastic optimization

John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradien t methods for online learning and stochastic optimization”. In: Journal of Machine Learning Research 12.7 (2011), pp. 2121–2159. url: http://jmlr.org/papers/v12/duchi11a.html

2011

[3] [3]

ADADELTA: An adaptive learning rate method

Matthew D. Zeiler. “ADADELTA: An adaptive learning rate method ”. In: arXiv preprint (2012). arXiv: 1212.5701 [cs.LG]

Pith/arXiv arXiv 2012

[4] [4]

Adam: A method for stochastic o ptimization

Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic o ptimization”. In: arXiv preprint (2014). arXiv: 1412.6980 [cs.LG]

Pith/arXiv arXiv 2014

[5] [5]

Math Program

Mark Schmidt, Nicolas Le Roux, and Francis Bach. “Minimizing ﬁnite s ums with the stochas- tic average gradient”. In: Mathematical Programming 162.1 (2017), pp. 83–112. doi: 10.1007/s10107-

work page doi:10.1007/s10107- 2017

[6] [6]

Accelerating stochastic gradient descent using predictive vari- ance reduction

Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive vari- ance reduction”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 26. 2013, pp. 315–323. url: https://proceedings.neurips.cc/paper/2013/hash/ac1dd209cbcc5e5d1c6

2013

[7] [7]

SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives

Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. “SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives”. In: Advances in Neural Information Processing Systems (NeurIPS) . Vol. 27. 2014, pp. 1646–1654. url: https://proceedings.neurips.cc/paper/2014/hash/f7cade80b7cc92b991cf4d2806d6bd78-Abst

2014

[8] [8]

Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions

Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. “ Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc., 201 4. url: https://proceedings.neurips.cc/paper_files/paper/2014/file/9638ddfc7e3d56a611292c157

2014

[9] [9]

Debiased Contrastive Learning

Ching-Yao Chuang et al. “Debiased Contrastive Learning”. In: Advances in Neural Infor- mation Processing Systems . Vol. 33. Curran Associates, Inc., 2020, pp. 8765–8775. url: https://neurips.cc/virtual/2020/public/poster_63c3ddcc7b23daa1e42dc41f9a44a873.html

2020

[10] [10]

Variance Reduced Stochastic Proximal Algorithm for AUC Maximization

Soham Dan and Dushyant Sahoo. “Variance Reduced Stochastic Proximal Algorithm for AUC Maximization”. In: Machine Learning and Knowledge Discovery in Databases. Res earch Track. Springer International Publishing, 2021, pp. 184–199. isbn: 9783030865238. doi: 10.1007/978-3-030-86523-8_12 . url: http://dx.doi.org/10.1007/978-3-030-86523-8_12

work page doi:10.1007/978-3-030-86523-8_12 2021

[11] [11]

E., and Nocedal, J

L´ eon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimizat ion methods for large-scale machine learning”. In: SIAM Review 60.2 (2018), pp. 223–311. doi: 10.1137/16M1080173

work page doi:10.1137/16m1080173 2018

[12] [12]

Robust stochastic approximation approach to stochastic programming

Arkadi Nemirovski et al. “Robust stochastic approximation ap proach to stochastic program- ming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609. doi: 10.1137/070704277

work page doi:10.1137/070704277 2009

[13] [13]

Gradient methods for minimizing composite fun ctions

Yurii Nesterov. “Gradient methods for minimizing composite fun ctions”. In: Mathematical Programming 140.1 (2013), pp. 125–161. doi: 10.1007/s10107-012-0629-5

work page doi:10.1007/s10107-012-0629-5 2013

[14] [14]

On the importance of initialization and mome ntum in deep learn- ing

Ilya Sutskever et al. “On the importance of initialization and mome ntum in deep learn- ing”. In: Proceedings of the 30th International Conference on Machin e Learning (ICML) . Vol. 28. Proceedings of Machine Learning Research. PMLR, 2013, p p. 1139–1147. url: https://proceedings.mlr.press/v28/sutskever13.html

2013

[15] [15]

On Acceleration with Noise- Corrupted Gradients

Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. “On Acceleration with Noise- Corrupted Gradients”. In: Proceedings of the 35th International Conference on Machin e Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of M achine Learning Research. 10–15 July. PMLR, July 2018, pp. 1019–1028. url: https://proceedings.mlr.press/v80/co

2018

[16] [16]

Katyusha: The ﬁrst direct acceleration of stochastic gradient methods

Zeyuan Allen-Zhu. “Katyusha: The ﬁrst direct acceleration of stochastic gradient methods”. In: Journal of Machine Learning Research 18.221 (2018), pp. 1–51. url: http://jmlr.org/papers/v18 32

2018

[17] [17]

A Universal Cat alyst for First-Order Op- timization

Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. “A Universal Cat alyst for First-Order Op- timization”. In: Advances in Neural Information Processing Systems . Ed. by C. Cortes et al. Vol. 28. Curran Associates, Inc., 2015. url: https://proceedings.neurips.cc/paper_files/paper

2015

[18] [18]

Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization

Roy Frostig et al. “Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization”. In: International Conference on Machine Learn- ing. Vol. 37. Proceedings of Machine Learning Research. PMLR, 2015, pp. 2540–2548. url: https://proceedings.mlr.press/v37/frostig15.html

2015

[19] [19]

Stochastic gradient descent for nonconve x learning without bounded gradient assumptions

Yunwen Lei et al. “Stochastic gradient descent for nonconve x learning without bounded gradient assumptions”. In: IEEE Transactions on Neural Networks and Learning Systems 31.10 (2019), pp. 4394–4400. doi: 10.1109/TNNLS.2019.2952217

work page doi:10.1109/tnnls.2019.2952217 2019

[20] [20]

On the convergence of st ochastic gradient descent with adaptive stepsizes

Xiaoyu Li and Francesco Orabona. “On the convergence of st ochastic gradient descent with adaptive stepsizes”. In: International Conference on Artiﬁcial Intelligence and St atis- tics. Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 983–992. url: https://proceedings.mlr.press/v89/li19c.html

2019

[21] [21]

2021 , issue_date =

Chiyuan Zhang et al. “Understanding deep learning (still) require s rethinking generaliza- tion”. In: Communications of the ACM 64.3 (2021), pp. 107–115. doi: 10.1145/3446776

work page doi:10.1145/3446776 2021

[22] [22]

Eﬃcientnet: Rethinking model sca ling for convolutional neural networks

Mingxing Tan and Quoc V. Le. “Eﬃcientnet: Rethinking model sca ling for convolutional neural networks”. In: Proceedings of the 36th International Conference on Machin e Learning (ICML). Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 6105–6114. url: https://proceedings.mlr.press/v97/tan19a.html

2019

[23] [23]

Gpipe: Eﬃcient training of giant neural net works using pipeline par- allelism

Yanping Huang et al. “Gpipe: Eﬃcient training of giant neural net works using pipeline par- allelism”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 32. 2019, pp. 103–112. url: https://proceedings.neurips.cc/paper/2019/hash/093f65e080a295f8076b1c

2019

[24] [24]

Big transfer (bit): General visual representation learning

Alexander Kolesnikov et al. “Big transfer (bit): General visual representation learning”. In: Computer Vision – ECCV 2020: 16th European Conference . Vol. 12350. Lecture Notes in Computer Science. Springer, 2020, pp. 491–507. doi: 10.1007/978-3-030-58558-7_29

work page doi:10.1007/978-3-030-58558-7_29 2020

[25] [25]

Reconciling modern machine-learning practice and the classical bias–variance trade-off

Mikhail Belkin et al. “Reconciling modern machine-learning practice and the classical bias– variance trade-oﬀ”. In: Proceedings of the National Academy of Sciences 116.32 (2019), pp. 15849–15854. doi: 10.1073/pnas.1903070116

work page doi:10.1073/pnas.1903070116 2019

[26] [26]

Densely Connected Convolutional Networks,

Gao Huang et al. “Densely connected convolutional networks” . In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVP R). 2017, pp. 4700–4708. doi: 10.1109/CVPR.2017.243

work page doi:10.1109/cvpr.2017.243 2017

[27] [27]

Empirical analysis of the Hessian of over-p arametrized neural networks

Levent Sagun et al. “Empirical analysis of the Hessian of over-p arametrized neural networks”. In: arXiv preprint (2017). arXiv: 1706.04454 [cs.LG]

Pith/arXiv arXiv 2017

[28] [28]

On exponential convergence of SGD in non- convex over-parametrized learning

Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non- convex over-parametrized learning. Tech. rep. arXiv, 2018. arXiv: 1802.06904 [cs.LG]

arXiv 2018

[29] [29]

Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron

Sharan Vaswani, Francis Bach, and Mark Schmidt. “Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron” . In: International Conference on Artiﬁcial Intelligence and Statistics . Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 1195–1204. url: https://proceedings.mlr.press/v89/vaswani19a.html

2019

[30] [30]

Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems

Chaoyue Liu et al. “Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems”. In: Advances in Neural Information Processing Systems . Ed. by A. Oh et al. Vol. 36. Curran Associates, Inc., 2023, pp. 60748–60767 . url: https://proceedings.neurip

2023

[31] [31]

Better theory for SGD in t he nonconvex world

Ahmed Khaled and Peter Richt´ arik. “Better theory for SGD in t he nonconvex world”. In: arXiv preprint (2020). arXiv: 2002.03329 [cs.LG] . 33

arXiv 2020

[32] [32]

SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation

Robert M. Gower, Othmane Sebbouh, and Nicolas Loizou. “SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation”. In : International Conference on Artiﬁcial Intelligence and Statistics . Vol. 130. Proceedings of Machine Learning Research. PMLR, 2021, pp. 1311–1319. url: https://proceedings.mlr.press/v130/gower21a.html

2021

[33] [33]

Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball

Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. “Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball”. In: Conference on Learning Theory . Vol. 134. Proceedings of Machine Learning Research. PMLR, 2021, pp. 3932–3972. url: https://proceedings.mlr.press/v134/sebbouh21a.html

2021

[34] [34]

Accelerated st ochastic optimization methods under quasar-convexity

Qiang Fu, Dongchu Xu, and Ashia Camage Wilson. “Accelerated st ochastic optimization methods under quasar-convexity”. In: International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 1053 5–10562. url: https://proceeding

2023

[35] [35]

On the convergence of ﬁrst order methods for qua sar-convex optimization

Jikai Jin. “On the convergence of ﬁrst order methods for qua sar-convex optimization”. In: arXiv preprint (2020). arXiv: 2010.04937 [cs.LG]

arXiv 2020

[36] [36]

On information and sufficiency,

Herbert Robbins and Sutton Monro. “A Stochastic Approximat ion Method”. In: The Annals of Mathematical Statistics 22.3 (Sept. 1951), pp. 400–407. issn: 0003-4851. doi: 10.1214/aoms/1177729 url: http://dx.doi.org/10.1214/aoms/1177729586

work page doi:10.1214/aoms/1177729 1951

[37] [37]

Large-Scale Machine Learning with Stochastic Gradient Descent

L´ eon Bottou. “Large-Scale Machine Learning with Stochastic Gradient Descent”. In: Pro- ceedings of COMPSTAT’2010. Ed. by Yves Lechevallier and Gilbert Saporta. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186. isbn: 978-3-7908-2604-3

2010

[38] [38]

Large Scale Distributed Deep Networks

Jeﬀrey Dean et al. “Large Scale Distributed Deep Networks”. I n: Advances in Neural In- formation Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran Associates, Inc., 2012. url: https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815

2012

[39] [39]

Deep Sets

Manzil Zaheer et al. “Deep Sets”. In: Advances in Neural Information Processing Systems . Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. url: https://proceedings.neurips.cc/p

2017

[40] [40]

On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima”. In: International Conference on Learning Representations . 2017. url: https://openreview.net/forum?id=H1oyRlYgg

2017

[41] [41]

An overview of gradient descent optimizatio n algorithms

Sebastian Ruder. “An overview of gradient descent optimizatio n algorithms”. In: CoRR abs/1609.04747 (2016). arXiv: 1609.04747. url: http://arxiv.org/abs/1609.04747

Pith/arXiv arXiv 2016

[42] [42]

SpecNet2: Orthog onalization-free spectral embedding by neural networks

Ziyu Chen, Yingzhou Li, and Xiuyuan Cheng. “SpecNet2: Orthog onalization-free spectral embedding by neural networks”. In: Proceedings of Mathematical and Scientiﬁc Machine Learning. Vol. 190. Proceedings of Machine Learning Research. PMLR, 2022

2022

[43] [43]

Neura l tangent kernel: Convergence and generalization in neural networks

Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. “Neura l tangent kernel: Convergence and generalization in neural networks”. In: Advances in Neural Information Processing Sys- tems. Vol. 31. 2018

2018

[44] [44]

A convergence theory for deep learning via over-parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. “A convergence theory for deep learning via over-parameterization”. In: International Conference on Machine Learning . Vol. 97. Pro- ceedings of Machine Learning Research. PMLR, 2019, pp. 242–252 . 34

2019