A stochastic gradient algorithm for non-separable optimization with convergence guarantee
Pith reviewed 2026-06-27 12:35 UTC · model grok-4.3
The pith
An SGD-style method for non-separable losses proves local linear convergence under local strong convexity using fixed step sizes in explicitly bounded ranges.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that both the G-driven and H-driven updates converge locally at linear rate when the objective is locally strongly convex, and at O(1/k) rate when locally convex, provided the step size is fixed within ranges derived from smoothness and Jacobian bounds; these rates hold uniformly for the two update types and explicitly quantify the effects of staleness and approximation error.
What carries the argument
The pair of batch-gradient constructs G (ideal per-batch gradient) and H (cached surrogate), which enable the algorithm to handle non-separable terms while preserving SGD-like updates.
If this is right
- Fixed step sizes suffice for the stated local rates without needing diminishing schedules.
- The allowable step-size range shrinks with increased cache staleness or surrogate error.
- In the separable case the method recovers the standard mini-batch SGD convergence guarantees.
- Batch size directly enters the bounds on convergence constants and step-size limits.
Where Pith is reading between the lines
- This approach could extend to other non-separable settings like federated learning where global aggregates appear in the loss.
- The explicit dependence on Jacobian bounds suggests that tighter control on data heterogeneity might enlarge the usable step-size range.
- Testing on problems with known local strong convexity, such as certain regularized empirical risks with global features, would directly check the linear rate prediction.
Load-bearing premise
The objective must satisfy local smoothness and have bounded Jacobians so that the gradient approximations remain controlled.
What would settle it
Observe whether the method fails to converge linearly on a locally strongly convex non-separable problem when the step size is set inside the paper's derived range but the Jacobian bound is violated.
Figures
read the original abstract
We study non-separable objectives in which the loss depend on dataset-level quantities. We introduce an SGD-style framework that employs two batch-gradient constructs: the ideal per-batch gradient `$G$' and a cached surrogate `$H$' for cases where full-data terms are expensive. Notably, in the sample-wise separable case, our method reduces to standard mini-batch SGD. Our main contribution is a unified local convergence theory: under mild smoothness and Jacobian-boundedness assumptions, we prove local linear convergence under local strong convexity and local $O(1/k)$ sublinear convergence under local convexity for both `$G$'-driven and `$H$'-driven updates. Crucially, these guarantees hold for fixed step sizes within explicitly characterized ranges; we provide explicit bounds showing how cache staleness, surrogate approximation error, batch size, and step size influence the convergence constants and allowable step-size ranges.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces an SGD-style framework for non-separable objectives depending on dataset-level quantities. It defines an ideal per-batch gradient G and a cached surrogate H, notes that the method reduces to standard mini-batch SGD in the separable case, and claims a unified local convergence theory: under mild smoothness and Jacobian-boundedness assumptions, both G-driven and H-driven updates achieve local linear convergence under local strong convexity and local O(1/k) sublinear convergence under local convexity, with these guarantees holding for fixed step sizes in explicitly characterized ranges that incorporate cache staleness, surrogate error, batch size, and step size.
Significance. If the stated local convergence results and explicit step-size intervals can be rigorously established, the work supplies a useful theoretical extension of SGD analysis to non-separable problems that arise in practice, together with concrete guidance on allowable step sizes and the effect of staleness.
major comments (1)
- [Abstract] Abstract: the local O(1/k) claim for the H-driven recursion under mere local convexity rests on the Jacobian-boundedness assumption remaining valid for all subsequent iterates. Local convexity supplies no contraction that would keep the sequence inside the neighborhood where the bound holds; once H is formed from a stale cache the effective perturbation can grow, so the invariance needed to invoke the bound indefinitely is not automatic and must be established separately before the explicit step-size interval can be asserted.
Simulated Author's Rebuttal
We thank the referee for the careful review and for highlighting this subtlety in the local-convergence argument. We address the single major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract: the local O(1/k) claim for the H-driven recursion under mere local convexity rests on the Jacobian-boundedness assumption remaining valid for all subsequent iterates. Local convexity supplies no contraction that would keep the sequence inside the neighborhood where the bound holds; once H is formed from a stale cache the effective perturbation can grow, so the invariance needed to invoke the bound indefinitely is not automatic and must be established separately before the explicit step-size interval can be asserted.
Authors: We agree that the invariance of the neighborhood under the H-driven recursion with only local convexity is not automatic and must be shown explicitly before the step-size interval can be asserted. The manuscript already derives explicit step-size bounds that incorporate the surrogate error and cache staleness; however, these bounds were obtained under the standing assumption that iterates remain inside the local region. To close the gap we will add a short supporting lemma (in the revised Section 4) proving that, under the stated step-size restriction and the local Lipschitz/Jacobian bounds, the H-driven sequence cannot escape the ball in which the Jacobian bound is valid. The abstract claim will be qualified accordingly. revision: yes
Circularity Check
No circularity; convergence derived from stated assumptions
full rationale
The paper derives local linear and O(1/k) convergence rates for G-driven and H-driven updates directly from explicit assumptions (mild smoothness, Jacobian-boundedness, local strong convexity or convexity) and provides explicit step-size ranges. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the claimed derivation. The analysis is self-contained against the listed assumptions and does not reduce any result to its inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption mild smoothness assumption
- domain assumption Jacobian-boundedness assumption
Reference graph
Works this paper leans on
-
[1]
467 and Fk ∝ k− 1
9871) for H, indicating Fk ∝ k− 1. 467 and Fk ∝ k− 1. 483 respectively. estimator; the gradient-norm plots give the same ordering. Furth ermore, in the experiments, late-stage log–log diagnostics reveal sublinear decay of the gap F − F ∗, consistent with the O(1/k ) convergence behavior established theoretically. Collectively, the two experiments confirm t...
-
[2]
Adaptive subgradien t methods for online learning and stochastic optimization
John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradien t methods for online learning and stochastic optimization”. In: Journal of Machine Learning Research 12.7 (2011), pp. 2121–2159. url: http://jmlr.org/papers/v12/duchi11a.html
2011
-
[3]
ADADELTA: An adaptive learning rate method
Matthew D. Zeiler. “ADADELTA: An adaptive learning rate method ”. In: arXiv preprint (2012). arXiv: 1212.5701 [cs.LG]
Pith/arXiv arXiv 2012
-
[4]
Adam: A method for stochastic o ptimization
Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic o ptimization”. In: arXiv preprint (2014). arXiv: 1412.6980 [cs.LG]
Pith/arXiv arXiv 2014
-
[5]
Mark Schmidt, Nicolas Le Roux, and Francis Bach. “Minimizing finite s ums with the stochas- tic average gradient”. In: Mathematical Programming 162.1 (2017), pp. 83–112. doi: 10.1007/s10107-
-
[6]
Accelerating stochastic gradient descent using predictive vari- ance reduction
Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive vari- ance reduction”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 26. 2013, pp. 315–323. url: https://proceedings.neurips.cc/paper/2013/hash/ac1dd209cbcc5e5d1c6
2013
-
[7]
SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives
Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. “SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives”. In: Advances in Neural Information Processing Systems (NeurIPS) . Vol. 27. 2014, pp. 1646–1654. url: https://proceedings.neurips.cc/paper/2014/hash/f7cade80b7cc92b991cf4d2806d6bd78-Abst
2014
-
[8]
Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions
Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. “ Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc., 201 4. url: https://proceedings.neurips.cc/paper_files/paper/2014/file/9638ddfc7e3d56a611292c157
2014
-
[9]
Debiased Contrastive Learning
Ching-Yao Chuang et al. “Debiased Contrastive Learning”. In: Advances in Neural Infor- mation Processing Systems . Vol. 33. Curran Associates, Inc., 2020, pp. 8765–8775. url: https://neurips.cc/virtual/2020/public/poster_63c3ddcc7b23daa1e42dc41f9a44a873.html
2020
-
[10]
Variance Reduced Stochastic Proximal Algorithm for AUC Maximization
Soham Dan and Dushyant Sahoo. “Variance Reduced Stochastic Proximal Algorithm for AUC Maximization”. In: Machine Learning and Knowledge Discovery in Databases. Res earch Track. Springer International Publishing, 2021, pp. 184–199. isbn: 9783030865238. doi: 10.1007/978-3-030-86523-8_12 . url: http://dx.doi.org/10.1007/978-3-030-86523-8_12
-
[11]
L´ eon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimizat ion methods for large-scale machine learning”. In: SIAM Review 60.2 (2018), pp. 223–311. doi: 10.1137/16M1080173
-
[12]
Robust stochastic approximation approach to stochastic programming
Arkadi Nemirovski et al. “Robust stochastic approximation ap proach to stochastic program- ming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609. doi: 10.1137/070704277
-
[13]
Gradient methods for minimizing composite fun ctions
Yurii Nesterov. “Gradient methods for minimizing composite fun ctions”. In: Mathematical Programming 140.1 (2013), pp. 125–161. doi: 10.1007/s10107-012-0629-5
-
[14]
On the importance of initialization and mome ntum in deep learn- ing
Ilya Sutskever et al. “On the importance of initialization and mome ntum in deep learn- ing”. In: Proceedings of the 30th International Conference on Machin e Learning (ICML) . Vol. 28. Proceedings of Machine Learning Research. PMLR, 2013, p p. 1139–1147. url: https://proceedings.mlr.press/v28/sutskever13.html
2013
-
[15]
On Acceleration with Noise- Corrupted Gradients
Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. “On Acceleration with Noise- Corrupted Gradients”. In: Proceedings of the 35th International Conference on Machin e Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of M achine Learning Research. 10–15 July. PMLR, July 2018, pp. 1019–1028. url: https://proceedings.mlr.press/v80/co
2018
-
[16]
Katyusha: The first direct acceleration of stochastic gradient methods
Zeyuan Allen-Zhu. “Katyusha: The first direct acceleration of stochastic gradient methods”. In: Journal of Machine Learning Research 18.221 (2018), pp. 1–51. url: http://jmlr.org/papers/v18 32
2018
-
[17]
A Universal Cat alyst for First-Order Op- timization
Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. “A Universal Cat alyst for First-Order Op- timization”. In: Advances in Neural Information Processing Systems . Ed. by C. Cortes et al. Vol. 28. Curran Associates, Inc., 2015. url: https://proceedings.neurips.cc/paper_files/paper
2015
-
[18]
Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization
Roy Frostig et al. “Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization”. In: International Conference on Machine Learn- ing. Vol. 37. Proceedings of Machine Learning Research. PMLR, 2015, pp. 2540–2548. url: https://proceedings.mlr.press/v37/frostig15.html
2015
-
[19]
Stochastic gradient descent for nonconve x learning without bounded gradient assumptions
Yunwen Lei et al. “Stochastic gradient descent for nonconve x learning without bounded gradient assumptions”. In: IEEE Transactions on Neural Networks and Learning Systems 31.10 (2019), pp. 4394–4400. doi: 10.1109/TNNLS.2019.2952217
-
[20]
On the convergence of st ochastic gradient descent with adaptive stepsizes
Xiaoyu Li and Francesco Orabona. “On the convergence of st ochastic gradient descent with adaptive stepsizes”. In: International Conference on Artificial Intelligence and St atis- tics. Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 983–992. url: https://proceedings.mlr.press/v89/li19c.html
2019
-
[21]
Chiyuan Zhang et al. “Understanding deep learning (still) require s rethinking generaliza- tion”. In: Communications of the ACM 64.3 (2021), pp. 107–115. doi: 10.1145/3446776
-
[22]
Efficientnet: Rethinking model sca ling for convolutional neural networks
Mingxing Tan and Quoc V. Le. “Efficientnet: Rethinking model sca ling for convolutional neural networks”. In: Proceedings of the 36th International Conference on Machin e Learning (ICML). Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 6105–6114. url: https://proceedings.mlr.press/v97/tan19a.html
2019
-
[23]
Gpipe: Efficient training of giant neural net works using pipeline par- allelism
Yanping Huang et al. “Gpipe: Efficient training of giant neural net works using pipeline par- allelism”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 32. 2019, pp. 103–112. url: https://proceedings.neurips.cc/paper/2019/hash/093f65e080a295f8076b1c
2019
-
[24]
Big transfer (bit): General visual representation learning
Alexander Kolesnikov et al. “Big transfer (bit): General visual representation learning”. In: Computer Vision – ECCV 2020: 16th European Conference . Vol. 12350. Lecture Notes in Computer Science. Springer, 2020, pp. 491–507. doi: 10.1007/978-3-030-58558-7_29
-
[25]
Reconciling modern machine-learning practice and the classical bias–variance trade-off
Mikhail Belkin et al. “Reconciling modern machine-learning practice and the classical bias– variance trade-off”. In: Proceedings of the National Academy of Sciences 116.32 (2019), pp. 15849–15854. doi: 10.1073/pnas.1903070116
-
[26]
Densely Connected Convolutional Networks,
Gao Huang et al. “Densely connected convolutional networks” . In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVP R). 2017, pp. 4700–4708. doi: 10.1109/CVPR.2017.243
-
[27]
Empirical analysis of the Hessian of over-p arametrized neural networks
Levent Sagun et al. “Empirical analysis of the Hessian of over-p arametrized neural networks”. In: arXiv preprint (2017). arXiv: 1706.04454 [cs.LG]
Pith/arXiv arXiv 2017
-
[28]
On exponential convergence of SGD in non- convex over-parametrized learning
Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non- convex over-parametrized learning. Tech. rep. arXiv, 2018. arXiv: 1802.06904 [cs.LG]
arXiv 2018
-
[29]
Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron
Sharan Vaswani, Francis Bach, and Mark Schmidt. “Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron” . In: International Conference on Artificial Intelligence and Statistics . Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 1195–1204. url: https://proceedings.mlr.press/v89/vaswani19a.html
2019
-
[30]
Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems
Chaoyue Liu et al. “Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems”. In: Advances in Neural Information Processing Systems . Ed. by A. Oh et al. Vol. 36. Curran Associates, Inc., 2023, pp. 60748–60767 . url: https://proceedings.neurip
2023
-
[31]
Better theory for SGD in t he nonconvex world
Ahmed Khaled and Peter Richt´ arik. “Better theory for SGD in t he nonconvex world”. In: arXiv preprint (2020). arXiv: 2002.03329 [cs.LG] . 33
arXiv 2020
-
[32]
SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation
Robert M. Gower, Othmane Sebbouh, and Nicolas Loizou. “SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation”. In : International Conference on Artificial Intelligence and Statistics . Vol. 130. Proceedings of Machine Learning Research. PMLR, 2021, pp. 1311–1319. url: https://proceedings.mlr.press/v130/gower21a.html
2021
-
[33]
Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball
Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. “Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball”. In: Conference on Learning Theory . Vol. 134. Proceedings of Machine Learning Research. PMLR, 2021, pp. 3932–3972. url: https://proceedings.mlr.press/v134/sebbouh21a.html
2021
-
[34]
Accelerated st ochastic optimization methods under quasar-convexity
Qiang Fu, Dongchu Xu, and Ashia Camage Wilson. “Accelerated st ochastic optimization methods under quasar-convexity”. In: International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 1053 5–10562. url: https://proceeding
2023
-
[35]
On the convergence of first order methods for qua sar-convex optimization
Jikai Jin. “On the convergence of first order methods for qua sar-convex optimization”. In: arXiv preprint (2020). arXiv: 2010.04937 [cs.LG]
arXiv 2020
-
[36]
On information and sufficiency,
Herbert Robbins and Sutton Monro. “A Stochastic Approximat ion Method”. In: The Annals of Mathematical Statistics 22.3 (Sept. 1951), pp. 400–407. issn: 0003-4851. doi: 10.1214/aoms/1177729 url: http://dx.doi.org/10.1214/aoms/1177729586
-
[37]
Large-Scale Machine Learning with Stochastic Gradient Descent
L´ eon Bottou. “Large-Scale Machine Learning with Stochastic Gradient Descent”. In: Pro- ceedings of COMPSTAT’2010. Ed. by Yves Lechevallier and Gilbert Saporta. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186. isbn: 978-3-7908-2604-3
2010
-
[38]
Large Scale Distributed Deep Networks
Jeffrey Dean et al. “Large Scale Distributed Deep Networks”. I n: Advances in Neural In- formation Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran Associates, Inc., 2012. url: https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815
2012
-
[39]
Deep Sets
Manzil Zaheer et al. “Deep Sets”. In: Advances in Neural Information Processing Systems . Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. url: https://proceedings.neurips.cc/p
2017
-
[40]
On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima
Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima”. In: International Conference on Learning Representations . 2017. url: https://openreview.net/forum?id=H1oyRlYgg
2017
-
[41]
An overview of gradient descent optimizatio n algorithms
Sebastian Ruder. “An overview of gradient descent optimizatio n algorithms”. In: CoRR abs/1609.04747 (2016). arXiv: 1609.04747. url: http://arxiv.org/abs/1609.04747
Pith/arXiv arXiv 2016
-
[42]
SpecNet2: Orthog onalization-free spectral embedding by neural networks
Ziyu Chen, Yingzhou Li, and Xiuyuan Cheng. “SpecNet2: Orthog onalization-free spectral embedding by neural networks”. In: Proceedings of Mathematical and Scientific Machine Learning. Vol. 190. Proceedings of Machine Learning Research. PMLR, 2022
2022
-
[43]
Neura l tangent kernel: Convergence and generalization in neural networks
Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. “Neura l tangent kernel: Convergence and generalization in neural networks”. In: Advances in Neural Information Processing Sys- tems. Vol. 31. 2018
2018
-
[44]
A convergence theory for deep learning via over-parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. “A convergence theory for deep learning via over-parameterization”. In: International Conference on Machine Learning . Vol. 97. Pro- ceedings of Machine Learning Research. PMLR, 2019, pp. 242–252 . 34
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.