pith. sign in

arxiv: 2606.10383 · v1 · pith:JDWNFNBLnew · submitted 2026-06-09 · 🧮 math.OC · cs.NA· math.NA

A stochastic gradient algorithm for non-separable optimization with convergence guarantee

Pith reviewed 2026-06-27 12:35 UTC · model grok-4.3

classification 🧮 math.OC cs.NAmath.NA
keywords stochastic gradient descentnon-separable optimizationconvergence analysislocal linear convergencefixed step sizebatch gradientsurrogate gradient
0
0 comments X

The pith

An SGD-style method for non-separable losses proves local linear convergence under local strong convexity using fixed step sizes in explicitly bounded ranges.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a stochastic gradient framework for optimization problems where the loss depends on dataset-level quantities rather than individual samples. It introduces two gradient constructs, an ideal batch gradient G and a cached surrogate H, that allow handling non-separable cases while reducing to standard SGD when separability holds. The core result is a unified convergence theory showing local linear rates under strong convexity and sublinear O(1/k) rates under convexity, all with fixed step sizes whose allowable ranges are explicitly bounded in terms of problem parameters like batch size and cache staleness. This matters because many practical objectives, such as those involving global statistics, fall outside the separable assumption that standard SGD theory requires.

Core claim

The central discovery is that both the G-driven and H-driven updates converge locally at linear rate when the objective is locally strongly convex, and at O(1/k) rate when locally convex, provided the step size is fixed within ranges derived from smoothness and Jacobian bounds; these rates hold uniformly for the two update types and explicitly quantify the effects of staleness and approximation error.

What carries the argument

The pair of batch-gradient constructs G (ideal per-batch gradient) and H (cached surrogate), which enable the algorithm to handle non-separable terms while preserving SGD-like updates.

If this is right

  • Fixed step sizes suffice for the stated local rates without needing diminishing schedules.
  • The allowable step-size range shrinks with increased cache staleness or surrogate error.
  • In the separable case the method recovers the standard mini-batch SGD convergence guarantees.
  • Batch size directly enters the bounds on convergence constants and step-size limits.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could extend to other non-separable settings like federated learning where global aggregates appear in the loss.
  • The explicit dependence on Jacobian bounds suggests that tighter control on data heterogeneity might enlarge the usable step-size range.
  • Testing on problems with known local strong convexity, such as certain regularized empirical risks with global features, would directly check the linear rate prediction.

Load-bearing premise

The objective must satisfy local smoothness and have bounded Jacobians so that the gradient approximations remain controlled.

What would settle it

Observe whether the method fails to converge linearly on a locally strongly convex non-separable problem when the step size is set inside the paper's derived range but the Jacobian bound is violated.

Figures

Figures reproduced from arXiv: 2606.10383 by Ruofan Wu, Yingzhou Li.

Figure 5.1
Figure 5.1. Figure 5.1: Convergence contrast under different convexity assu [PITH_FULL_IMAGE:figures/full_fig_p015_5_1.png] view at source ↗
Figure 5.2
Figure 5.2. Figure 5.2: Full objective gap F(Y ) − F ∗ and gradient norm k∇Y FkF (log scale) for the three schemes G/H/naive. This composite figure shows the evolution of the full objective gap and the corresponding gradient magnitude over training. 5.2 Experiment 2: Gram objective We consider the coupled-sample objective F(Y ) = 1 2 k 1 d Y Y ⊤ − Sk 2 F, with d = 8, Y ∈ R 200×8 , batch size 16. We parameterize each row y ⊤ i o… view at source ↗
Figure 5.3
Figure 5.3. Figure 5.3: Diagnostics for methods G (left) and H (right). Log–log plot of training loss versus epochs. Late-stage log–log fits yield slopes ≈ −1.467 (R2 ≈ 0.9854) for G and ≈ −1.483 (R2 ≈ 0.9871) for H, indicating Fk ∝ k −1.467 and Fk ∝ k −1.483 respectively. estimator; the gradient-norm plots give the same ordering. Furthermore, in the experiments, late-stage log–log diagnostics reveal sublinear decay of the gap … view at source ↗
read the original abstract

We study non-separable objectives in which the loss depend on dataset-level quantities. We introduce an SGD-style framework that employs two batch-gradient constructs: the ideal per-batch gradient `$G$' and a cached surrogate `$H$' for cases where full-data terms are expensive. Notably, in the sample-wise separable case, our method reduces to standard mini-batch SGD. Our main contribution is a unified local convergence theory: under mild smoothness and Jacobian-boundedness assumptions, we prove local linear convergence under local strong convexity and local $O(1/k)$ sublinear convergence under local convexity for both `$G$'-driven and `$H$'-driven updates. Crucially, these guarantees hold for fixed step sizes within explicitly characterized ranges; we provide explicit bounds showing how cache staleness, surrogate approximation error, batch size, and step size influence the convergence constants and allowable step-size ranges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces an SGD-style framework for non-separable objectives depending on dataset-level quantities. It defines an ideal per-batch gradient G and a cached surrogate H, notes that the method reduces to standard mini-batch SGD in the separable case, and claims a unified local convergence theory: under mild smoothness and Jacobian-boundedness assumptions, both G-driven and H-driven updates achieve local linear convergence under local strong convexity and local O(1/k) sublinear convergence under local convexity, with these guarantees holding for fixed step sizes in explicitly characterized ranges that incorporate cache staleness, surrogate error, batch size, and step size.

Significance. If the stated local convergence results and explicit step-size intervals can be rigorously established, the work supplies a useful theoretical extension of SGD analysis to non-separable problems that arise in practice, together with concrete guidance on allowable step sizes and the effect of staleness.

major comments (1)
  1. [Abstract] Abstract: the local O(1/k) claim for the H-driven recursion under mere local convexity rests on the Jacobian-boundedness assumption remaining valid for all subsequent iterates. Local convexity supplies no contraction that would keep the sequence inside the neighborhood where the bound holds; once H is formed from a stale cache the effective perturbation can grow, so the invariance needed to invoke the bound indefinitely is not automatic and must be established separately before the explicit step-size interval can be asserted.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful review and for highlighting this subtlety in the local-convergence argument. We address the single major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the local O(1/k) claim for the H-driven recursion under mere local convexity rests on the Jacobian-boundedness assumption remaining valid for all subsequent iterates. Local convexity supplies no contraction that would keep the sequence inside the neighborhood where the bound holds; once H is formed from a stale cache the effective perturbation can grow, so the invariance needed to invoke the bound indefinitely is not automatic and must be established separately before the explicit step-size interval can be asserted.

    Authors: We agree that the invariance of the neighborhood under the H-driven recursion with only local convexity is not automatic and must be shown explicitly before the step-size interval can be asserted. The manuscript already derives explicit step-size bounds that incorporate the surrogate error and cache staleness; however, these bounds were obtained under the standing assumption that iterates remain inside the local region. To close the gap we will add a short supporting lemma (in the revised Section 4) proving that, under the stated step-size restriction and the local Lipschitz/Jacobian bounds, the H-driven sequence cannot escape the ball in which the Jacobian bound is valid. The abstract claim will be qualified accordingly. revision: yes

Circularity Check

0 steps flagged

No circularity; convergence derived from stated assumptions

full rationale

The paper derives local linear and O(1/k) convergence rates for G-driven and H-driven updates directly from explicit assumptions (mild smoothness, Jacobian-boundedness, local strong convexity or convexity) and provides explicit step-size ranges. No self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the claimed derivation. The analysis is self-contained against the listed assumptions and does not reduce any result to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The proofs rely on standard assumptions from optimization theory for local convergence analysis.

axioms (2)
  • domain assumption mild smoothness assumption
    Used to establish the convergence rates for the algorithm.
  • domain assumption Jacobian-boundedness assumption
    Invoked to bound the error in the surrogate gradient H.

pith-pipeline@v0.9.1-grok · 5681 in / 1018 out tokens · 27601 ms · 2026-06-27T12:35:51.576558+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

44 extracted references · 11 canonical work pages

  1. [1]

    467 and Fk ∝ k− 1

    9871) for H, indicating Fk ∝ k− 1. 467 and Fk ∝ k− 1. 483 respectively. estimator; the gradient-norm plots give the same ordering. Furth ermore, in the experiments, late-stage log–log diagnostics reveal sublinear decay of the gap F − F ∗, consistent with the O(1/k ) convergence behavior established theoretically. Collectively, the two experiments confirm t...

  2. [2]

    Adaptive subgradien t methods for online learning and stochastic optimization

    John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradien t methods for online learning and stochastic optimization”. In: Journal of Machine Learning Research 12.7 (2011), pp. 2121–2159. url: http://jmlr.org/papers/v12/duchi11a.html

  3. [3]

    ADADELTA: An adaptive learning rate method

    Matthew D. Zeiler. “ADADELTA: An adaptive learning rate method ”. In: arXiv preprint (2012). arXiv: 1212.5701 [cs.LG]

  4. [4]

    Adam: A method for stochastic o ptimization

    Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic o ptimization”. In: arXiv preprint (2014). arXiv: 1412.6980 [cs.LG]

  5. [5]

    Math Program

    Mark Schmidt, Nicolas Le Roux, and Francis Bach. “Minimizing finite s ums with the stochas- tic average gradient”. In: Mathematical Programming 162.1 (2017), pp. 83–112. doi: 10.1007/s10107-

  6. [6]

    Accelerating stochastic gradient descent using predictive vari- ance reduction

    Rie Johnson and Tong Zhang. “Accelerating stochastic gradient descent using predictive vari- ance reduction”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 26. 2013, pp. 315–323. url: https://proceedings.neurips.cc/paper/2013/hash/ac1dd209cbcc5e5d1c6

  7. [7]

    SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives

    Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. “SAGA: A fast incremental gra- dient method with support for non-strongly convex composite obj ectives”. In: Advances in Neural Information Processing Systems (NeurIPS) . Vol. 27. 2014, pp. 1646–1654. url: https://proceedings.neurips.cc/paper/2014/hash/f7cade80b7cc92b991cf4d2806d6bd78-Abst

  8. [8]

    Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions

    Purushottam Kar, Harikrishna Narasimhan, and Prateek Jain. “ Online and Stochastic Gra- dient Methods for Non-decomposable Loss Functions”. In: Advances in Neural Information Processing Systems. Ed. by Z. Ghahramani et al. Vol. 27. Curran Associates, Inc., 201 4. url: https://proceedings.neurips.cc/paper_files/paper/2014/file/9638ddfc7e3d56a611292c157

  9. [9]

    Debiased Contrastive Learning

    Ching-Yao Chuang et al. “Debiased Contrastive Learning”. In: Advances in Neural Infor- mation Processing Systems . Vol. 33. Curran Associates, Inc., 2020, pp. 8765–8775. url: https://neurips.cc/virtual/2020/public/poster_63c3ddcc7b23daa1e42dc41f9a44a873.html

  10. [10]

    Variance Reduced Stochastic Proximal Algorithm for AUC Maximization

    Soham Dan and Dushyant Sahoo. “Variance Reduced Stochastic Proximal Algorithm for AUC Maximization”. In: Machine Learning and Knowledge Discovery in Databases. Res earch Track. Springer International Publishing, 2021, pp. 184–199. isbn: 9783030865238. doi: 10.1007/978-3-030-86523-8_12 . url: http://dx.doi.org/10.1007/978-3-030-86523-8_12

  11. [11]

    E., and Nocedal, J

    L´ eon Bottou, Frank E. Curtis, and Jorge Nocedal. “Optimizat ion methods for large-scale machine learning”. In: SIAM Review 60.2 (2018), pp. 223–311. doi: 10.1137/16M1080173

  12. [12]

    Robust stochastic approximation approach to stochastic programming

    Arkadi Nemirovski et al. “Robust stochastic approximation ap proach to stochastic program- ming”. In: SIAM Journal on Optimization 19.4 (2009), pp. 1574–1609. doi: 10.1137/070704277

  13. [13]

    Gradient methods for minimizing composite fun ctions

    Yurii Nesterov. “Gradient methods for minimizing composite fun ctions”. In: Mathematical Programming 140.1 (2013), pp. 125–161. doi: 10.1007/s10107-012-0629-5

  14. [14]

    On the importance of initialization and mome ntum in deep learn- ing

    Ilya Sutskever et al. “On the importance of initialization and mome ntum in deep learn- ing”. In: Proceedings of the 30th International Conference on Machin e Learning (ICML) . Vol. 28. Proceedings of Machine Learning Research. PMLR, 2013, p p. 1139–1147. url: https://proceedings.mlr.press/v28/sutskever13.html

  15. [15]

    On Acceleration with Noise- Corrupted Gradients

    Michael Cohen, Jelena Diakonikolas, and Lorenzo Orecchia. “On Acceleration with Noise- Corrupted Gradients”. In: Proceedings of the 35th International Conference on Machin e Learning. Ed. by Jennifer Dy and Andreas Krause. Vol. 80. Proceedings of M achine Learning Research. 10–15 July. PMLR, July 2018, pp. 1019–1028. url: https://proceedings.mlr.press/v80/co

  16. [16]

    Katyusha: The first direct acceleration of stochastic gradient methods

    Zeyuan Allen-Zhu. “Katyusha: The first direct acceleration of stochastic gradient methods”. In: Journal of Machine Learning Research 18.221 (2018), pp. 1–51. url: http://jmlr.org/papers/v18 32

  17. [17]

    A Universal Cat alyst for First-Order Op- timization

    Hongzhou Lin, Julien Mairal, and Zaid Harchaoui. “A Universal Cat alyst for First-Order Op- timization”. In: Advances in Neural Information Processing Systems . Ed. by C. Cortes et al. Vol. 28. Curran Associates, Inc., 2015. url: https://proceedings.neurips.cc/paper_files/paper

  18. [18]

    Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization

    Roy Frostig et al. “Un-regularizing: Approximate proximal point and faster stochastic al- gorithms for empirical risk minimization”. In: International Conference on Machine Learn- ing. Vol. 37. Proceedings of Machine Learning Research. PMLR, 2015, pp. 2540–2548. url: https://proceedings.mlr.press/v37/frostig15.html

  19. [19]

    Stochastic gradient descent for nonconve x learning without bounded gradient assumptions

    Yunwen Lei et al. “Stochastic gradient descent for nonconve x learning without bounded gradient assumptions”. In: IEEE Transactions on Neural Networks and Learning Systems 31.10 (2019), pp. 4394–4400. doi: 10.1109/TNNLS.2019.2952217

  20. [20]

    On the convergence of st ochastic gradient descent with adaptive stepsizes

    Xiaoyu Li and Francesco Orabona. “On the convergence of st ochastic gradient descent with adaptive stepsizes”. In: International Conference on Artificial Intelligence and St atis- tics. Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 983–992. url: https://proceedings.mlr.press/v89/li19c.html

  21. [21]

    2021 , issue_date =

    Chiyuan Zhang et al. “Understanding deep learning (still) require s rethinking generaliza- tion”. In: Communications of the ACM 64.3 (2021), pp. 107–115. doi: 10.1145/3446776

  22. [22]

    Efficientnet: Rethinking model sca ling for convolutional neural networks

    Mingxing Tan and Quoc V. Le. “Efficientnet: Rethinking model sca ling for convolutional neural networks”. In: Proceedings of the 36th International Conference on Machin e Learning (ICML). Vol. 97. Proceedings of Machine Learning Research. PMLR, 2019, pp. 6105–6114. url: https://proceedings.mlr.press/v97/tan19a.html

  23. [23]

    Gpipe: Efficient training of giant neural net works using pipeline par- allelism

    Yanping Huang et al. “Gpipe: Efficient training of giant neural net works using pipeline par- allelism”. In: Advances in Neural Information Processing Systems (NeurIP S). Vol. 32. 2019, pp. 103–112. url: https://proceedings.neurips.cc/paper/2019/hash/093f65e080a295f8076b1c

  24. [24]

    Big transfer (bit): General visual representation learning

    Alexander Kolesnikov et al. “Big transfer (bit): General visual representation learning”. In: Computer Vision – ECCV 2020: 16th European Conference . Vol. 12350. Lecture Notes in Computer Science. Springer, 2020, pp. 491–507. doi: 10.1007/978-3-030-58558-7_29

  25. [25]

    Reconciling modern machine-learning practice and the classical bias–variance trade-off

    Mikhail Belkin et al. “Reconciling modern machine-learning practice and the classical bias– variance trade-off”. In: Proceedings of the National Academy of Sciences 116.32 (2019), pp. 15849–15854. doi: 10.1073/pnas.1903070116

  26. [26]

    Densely Connected Convolutional Networks,

    Gao Huang et al. “Densely connected convolutional networks” . In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVP R). 2017, pp. 4700–4708. doi: 10.1109/CVPR.2017.243

  27. [27]

    Empirical analysis of the Hessian of over-p arametrized neural networks

    Levent Sagun et al. “Empirical analysis of the Hessian of over-p arametrized neural networks”. In: arXiv preprint (2017). arXiv: 1706.04454 [cs.LG]

  28. [28]

    On exponential convergence of SGD in non- convex over-parametrized learning

    Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non- convex over-parametrized learning. Tech. rep. arXiv, 2018. arXiv: 1802.06904 [cs.LG]

  29. [29]

    Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron

    Sharan Vaswani, Francis Bach, and Mark Schmidt. “Fast and fa ster convergence of SGD for over-parameterized models and an accelerated perceptron” . In: International Conference on Artificial Intelligence and Statistics . Vol. 89. Proceedings of Machine Learning Research. PMLR, 2019, pp. 1195–1204. url: https://proceedings.mlr.press/v89/vaswani19a.html

  30. [30]

    Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems

    Chaoyue Liu et al. “Aiming towards the minimizers: fast converge nce of SGD for over- parametrized problems”. In: Advances in Neural Information Processing Systems . Ed. by A. Oh et al. Vol. 36. Curran Associates, Inc., 2023, pp. 60748–60767 . url: https://proceedings.neurip

  31. [31]

    Better theory for SGD in t he nonconvex world

    Ahmed Khaled and Peter Richt´ arik. “Better theory for SGD in t he nonconvex world”. In: arXiv preprint (2020). arXiv: 2002.03329 [cs.LG] . 33

  32. [32]

    SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation

    Robert M. Gower, Othmane Sebbouh, and Nicolas Loizou. “SGD f or structured noncon- vex functions: Learning rates, minibatching and interpolation”. In : International Conference on Artificial Intelligence and Statistics . Vol. 130. Proceedings of Machine Learning Research. PMLR, 2021, pp. 1311–1319. url: https://proceedings.mlr.press/v130/gower21a.html

  33. [33]

    Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball

    Othmane Sebbouh, Robert M. Gower, and Aaron Defazio. “Almo st sure convergence rates for stochastic gradient descent and stochastic heavy ball”. In: Conference on Learning Theory . Vol. 134. Proceedings of Machine Learning Research. PMLR, 2021, pp. 3932–3972. url: https://proceedings.mlr.press/v134/sebbouh21a.html

  34. [34]

    Accelerated st ochastic optimization methods under quasar-convexity

    Qiang Fu, Dongchu Xu, and Ashia Camage Wilson. “Accelerated st ochastic optimization methods under quasar-convexity”. In: International Conference on Machine Learning. Vol. 202. Proceedings of Machine Learning Research. PMLR, 2023, pp. 1053 5–10562. url: https://proceeding

  35. [35]

    On the convergence of first order methods for qua sar-convex optimization

    Jikai Jin. “On the convergence of first order methods for qua sar-convex optimization”. In: arXiv preprint (2020). arXiv: 2010.04937 [cs.LG]

  36. [36]

    On information and sufficiency,

    Herbert Robbins and Sutton Monro. “A Stochastic Approximat ion Method”. In: The Annals of Mathematical Statistics 22.3 (Sept. 1951), pp. 400–407. issn: 0003-4851. doi: 10.1214/aoms/1177729 url: http://dx.doi.org/10.1214/aoms/1177729586

  37. [37]

    Large-Scale Machine Learning with Stochastic Gradient Descent

    L´ eon Bottou. “Large-Scale Machine Learning with Stochastic Gradient Descent”. In: Pro- ceedings of COMPSTAT’2010. Ed. by Yves Lechevallier and Gilbert Saporta. Heidelberg: Physica-Verlag HD, 2010, pp. 177–186. isbn: 978-3-7908-2604-3

  38. [38]

    Large Scale Distributed Deep Networks

    Jeffrey Dean et al. “Large Scale Distributed Deep Networks”. I n: Advances in Neural In- formation Processing Systems. Ed. by F. Pereira et al. Vol. 25. Curran Associates, Inc., 2012. url: https://proceedings.neurips.cc/paper_files/paper/2012/file/6aca97005c68f1206823815

  39. [39]

    Deep Sets

    Manzil Zaheer et al. “Deep Sets”. In: Advances in Neural Information Processing Systems . Ed. by I. Guyon et al. Vol. 30. Curran Associates, Inc., 2017. url: https://proceedings.neurips.cc/p

  40. [40]

    On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima

    Nitish Shirish Keskar et al. “On Large-Batch Training for Deep Le arning: Generalization Gap and Sharp Minima”. In: International Conference on Learning Representations . 2017. url: https://openreview.net/forum?id=H1oyRlYgg

  41. [41]

    An overview of gradient descent optimizatio n algorithms

    Sebastian Ruder. “An overview of gradient descent optimizatio n algorithms”. In: CoRR abs/1609.04747 (2016). arXiv: 1609.04747. url: http://arxiv.org/abs/1609.04747

  42. [42]

    SpecNet2: Orthog onalization-free spectral embedding by neural networks

    Ziyu Chen, Yingzhou Li, and Xiuyuan Cheng. “SpecNet2: Orthog onalization-free spectral embedding by neural networks”. In: Proceedings of Mathematical and Scientific Machine Learning. Vol. 190. Proceedings of Machine Learning Research. PMLR, 2022

  43. [43]

    Neura l tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Cl´ ement Hongler. “Neura l tangent kernel: Convergence and generalization in neural networks”. In: Advances in Neural Information Processing Sys- tems. Vol. 31. 2018

  44. [44]

    A convergence theory for deep learning via over-parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. “A convergence theory for deep learning via over-parameterization”. In: International Conference on Machine Learning . Vol. 97. Pro- ceedings of Machine Learning Research. PMLR, 2019, pp. 242–252 . 34