Convergence of Continual Learning in Homogeneous Deep Networks

Daniel Soudry; Gon Buzaglo; Itay Evron; Matan Schliserman

arxiv: 2606.30559 · v1 · pith:525CCPXXnew · submitted 2026-06-29 · 💻 cs.LG · cs.NA· math.NA· math.OC· stat.ML

Convergence of Continual Learning in Homogeneous Deep Networks

Matan Schliserman , Gon Buzaglo , Itay Evron , Daniel Soudry This is my paper

Pith reviewed 2026-06-30 07:02 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NAmath.OCstat.ML

keywords continual learninghomogeneous networksconvergence analysisprojection theorydeep neural networksregularizationclassificationregression

0 comments

The pith

Weakly regularized continual classification in homogeneous models reduces to sequential projections onto task margin sets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper characterizes continual classification in homogeneous deep networks under weak regularization as a process of sequential projections onto each task's margin set. This unifies previous separate analyses of single-task deep models and continual linear models. The characterization explains why global convergence generally does not occur, even in simple cases, but local linear convergence can be guaranteed using nonconvex projection theory for certain task sequences. The framework is also extended to continual regression.

Core claim

We characterize weakly regularized continual classification in homogeneous models as sequential projections onto task margin sets. This generalizes prior analyses. Global convergence generally fails, even for simple models linear in data but nonlinear in parameters. Regularity properties guarantee local linear convergence under random and cyclic task sequences. The analysis extends to continual regression.

What carries the argument

Sequential projections onto task margin sets that describe the parameter updates across successive tasks in the continual learning process.

If this is right

Global convergence generally fails even for models that are linear in the data but nonlinear in the parameters.
Local linear convergence is guaranteed under regularity properties of the homogeneous networks for random and cyclic task sequences.
The projection-based framework unifies the treatment of continual classification and continual regression in homogeneous models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Task ordering could be optimized using the projection geometry to achieve faster local convergence.
Similar projection characterizations might apply to other continual optimization problems beyond classification and regression.
The reliance on nonconvex projection theory opens the door to importing more results from that field for tighter bounds.

Load-bearing premise

The models are homogeneous, so scaling all parameters by a positive constant does not change the direction of the network output, and the regularization strength is low enough that the dynamics remain projection-like.

What would settle it

Finding a homogeneous deep network trained with weak regularization on a sequence of tasks where the learned parameters do not correspond to the predicted sequence of projections onto the task margin sets would falsify the main characterization.

Figures

Figures reproduced from arXiv: 2606.30559 by Daniel Soudry, Gon Buzaglo, Itay Evron, Matan Schliserman.

**Figure 1.** Figure 1: Feasible sets under homogeneous models are not necessarily convex. In the two-parameter spaces depicted, only the linear model yields convex feasible sets and, consequently, unique projections. 3. Convergence Analysis from a Projection Perspective The established projection perspective (Theorem 1) facilitates analyzing continual learning through the lens of projection theory. Indeed, prior work has used cl… view at source ↗

read the original abstract

We characterize weakly regularized continual classification in homogeneous models as sequential projections onto task margin sets. This result generalizes prior analyses restricted to either stationary (single-task) deep models or continual linear models. We show that global convergence generally fails, even for simple models linear in data but nonlinear in parameters. Nevertheless, by leveraging results from nonconvex projection theory, we identify regularity properties of homogeneous deep networks that guarantee local linear convergence under random and cyclic task sequences. Finally, we extend our analysis to continual regression, unifying the framework for homogeneous models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper frames continual learning in homogeneous deep nets as sequential projections onto margin sets and gives local convergence conditions that extend prior linear and stationary cases.

read the letter

The core contribution is a characterization of weakly regularized continual classification in homogeneous models as sequential projections onto task margin sets. This unifies earlier work on single-task deep models and continual linear models by applying nonconvex projection theory, and it extends the same view to regression. The paper also shows that global convergence typically fails, even for models linear in data but nonlinear in parameters, while identifying regularity conditions that deliver local linear convergence for random and cyclic task sequences.

The generalization step looks solid on the surface because homogeneity lets the analysis treat the network output direction separately from its scale, which matches many practical architectures. The local convergence claims rest on existing projection results rather than new machinery, which keeps the argument contained. The explicit note that global convergence fails is a useful negative result that clarifies limits.

The main limitation is the homogeneity plus weak-regularization premise; results will not transfer directly to models where scaling changes the output direction or to stronger regularization regimes. Without the full proofs it is hard to judge how tight the regularity conditions are or whether they cover common training setups. The abstract gives no experimental validation, so the practical reach remains open.

This is for theorists working on continual learning and nonconvex optimization who want a projection lens on the problem. It is not a broad empirical study. The work shows clear engagement with the literature and produces falsifiable claims about convergence, so it merits peer review even if the proofs need tightening.

Referee Report

0 major / 3 minor

Summary. The paper characterizes weakly regularized continual classification in homogeneous deep networks as sequential projections onto task margin sets. This generalizes prior analyses restricted to stationary single-task deep models or continual linear models. It shows that global convergence generally fails even for models linear in data but nonlinear in parameters, but identifies regularity properties guaranteeing local linear convergence under random and cyclic task sequences via nonconvex projection theory, and extends the framework to continual regression.

Significance. If the central characterization and local-convergence results hold, the work supplies a unified projection-based view of continual learning that bridges linear and homogeneous nonlinear models. The explicit use of nonconvex projection theory to obtain local linear rates under random/cyclic sequences, together with the regression extension, constitutes a substantive theoretical contribution with potential implications for algorithm analysis in non-stationary settings.

minor comments (3)

[§4] The abstract states that global convergence 'generally fails' for simple nonlinear-in-parameter models; the corresponding counter-example construction (presumably in §4) should include an explicit parameter trajectory that violates global convergence while satisfying the homogeneity and weak-regularization premises.
[§5] Theorem statements on local linear convergence should list the precise regularity conditions (e.g., margin separation, Lipschitz constants of the projection operators) required by the nonconvex projection results invoked.
Notation for the task margin sets and the projection operator should be introduced once and used consistently; several passages reuse similar symbols for the regularized and unregularized cases.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and recommendation of minor revision. The report accurately captures the paper's contributions on the projection-based characterization of continual learning in homogeneous networks, the failure of global convergence, the local linear rates under regularity conditions, and the regression extension. No specific major comments were provided in the report.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper characterizes weakly regularized continual classification in homogeneous models as sequential projections onto task margin sets, generalizing prior analyses of stationary deep models or continual linear models. This rests on the stated premises of homogeneity and weak regularization, with local convergence derived via nonconvex projection theory. No self-definitional steps, fitted inputs renamed as predictions, load-bearing self-citations, uniqueness theorems imported from the authors' prior work, smuggled ansatzes, or renamings of known results appear in the abstract or described claims. The derivation chain is presented as externally supported by projection theory and is self-contained against the given premises without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; ledger is empty pending full text.

pith-pipeline@v0.9.1-grok · 5629 in / 1058 out tokens · 31727 ms · 2026-06-30T07:02:19.734073+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

299 extracted references · 19 canonical work pages · 11 internal anchors

[1]

The Annals of Mathematical Statistics , volume=

Adjustment of an inverse matrix corresponding to a change in one element of a given matrix , author=. The Annals of Mathematical Statistics , volume=. 1950 , publisher=

1950
[2]

Mathematics of Operations Research , volume=

Linear convergence of projection algorithms , author=. Mathematics of Operations Research , volume=. 2019 , publisher=

2019
[3]

arXiv preprint arXiv:1906.05890 , year=

Gradient descent maximizes the margin of homogeneous neural networks , author=. arXiv preprint arXiv:1906.05890 , year=

work page arXiv 1906
[4]

Mathematics of Operations Research , volume=

Alternating projections on manifolds , author=. Mathematics of Operations Research , volume=. 2008 , publisher=

2008
[5]

A handbook of

Braides, Andrea , booktitle=. A handbook of. 2006 , publisher=

2006
[6]

Conference on Learning Theory , pages=

Kernel and rich regimes in overparametrized models , author=. Conference on Learning Theory , pages=. 2020 , organization=

2020
[7]

On the Theory of Continual Learning with Gradient Descent for Neural Networks

On the Theory of Continual Learning with Gradient Descent for Neural Networks , author=. arXiv preprint arXiv:2510.05573 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[8]

arXiv preprint arXiv:2502.05668 , year=

The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks , author=. arXiv preprint arXiv:2502.05668 , year=

work page arXiv
[9]

Journal of Machine Learning Research , volume=

High probability convergence bounds for non-convex stochastic gradient descent with sub-weibull noise , author=. Journal of Machine Learning Research , volume=
[10]

Conference on Learning Theory , pages=

Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball , author=. Conference on Learning Theory , pages=. 2021 , organization=

2021
[11]

Foundations of Computational Mathematics , volume=

Local linear convergence for alternating and averaged nonconvex projections , author=. Foundations of Computational Mathematics , volume=. 2009 , publisher=

2009
[12]

Foundations of computational mathematics , volume=

Stochastic subgradient method converges on tame functions , author=. Foundations of computational mathematics , volume=. 2020 , publisher=

2020
[13]

1998 , publisher=

Variational analysis , author=. 1998 , publisher=

1998
[14]

, title =

Navakkode, Sheeja and Kennedy, Brian K. , title =. Frontiers in Aging Neuroscience , volume =. 2024 , month = aug, doi =

2024
[15]

The 22nd international conference on artificial intelligence and statistics , pages=

A continuous-time view of early stopping for least squares regression , author=. The 22nd international conference on artificial intelligence and statistics , pages=. 2019 , organization=

2019
[16]

and Vandemark, Katherine M

Voglewede, Ryan L. and Vandemark, Katherine M. and Davidson, Adam M. and DeWitt, Andrea R. and Heffler, Matthew D. and Trimmer, Ethan H. and Mostany, Ricardo , title =. Neurobiology of Aging , volume =. 2019 , month = sep, doi =

2019
[17]

Frontiers in Neural Circuits , volume =

Huang, Li and Zhou, Hao and Chen, Kai and Chen, Xin and Yang, Guangwei , title =. Frontiers in Neural Circuits , volume =. 2020 , month = nov, doi =

2020
[18]

SIAM Journal on Matrix Analysis and Applications , volume=

Randomized iterative methods for linear systems , author=. SIAM Journal on Matrix Analysis and Applications , volume=. 2015 , publisher=

2015
[19]

Accessed: May , url=

Last iterate of sgd converges (even in unbounded domains), 2020 , author=. Accessed: May , url=

2020
[20]

Conference on Learning Theory , pages=

Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences , author=. Conference on Learning Theory , pages=. 2012 , organization=

2012
[21]

How good is

Safran, Itay and Shamir, Ohad , booktitle=. How good is. 2020 , organization=

2020
[22]

Closing the convergence gap of

Rajput, Shashank and Gupta, Anant and Papailiopoulos, Dimitris , booktitle=. Closing the convergence gap of. 2020 , organization=

2020
[23]

Advances in Neural Information Processing Systems , volume=

Random reshuffling: Simple analysis with vast improvements , author=. Advances in Neural Information Processing Systems , volume=
[24]

International Conference on Machine Learning , pages=

Tighter lower bounds for shuffling SGD: Random permutations and beyond , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[25]

arXiv preprint arXiv:2306.12498 , year=

Empirical risk minimization with shuffled SGD: a primal-dual perspective and improved bounds , author=. arXiv preprint arXiv:2306.12498 , year=

work page arXiv
[26]

Conference on Learning Theory , pages=

Making the last iterate of sgd information theoretically optimal , author=. Conference on Learning Theory , pages=. 2019 , organization=

2019
[27]

arXiv preprint arXiv:2307.11134 , year=

Exact convergence rate of the last iterate in subgradient methods , author=. arXiv preprint arXiv:2307.11134 , year=

work page arXiv
[28]

arXiv preprint arXiv:2310.07831 , year=

Optimal Linear Decay Learning Rate Schedules and Further Refinements , author=. arXiv preprint arXiv:2310.07831 , year=

work page arXiv
[29]

International Conference on Machine Learning , pages=

Train faster, generalize better: Stability of stochastic gradient descent , author=. International Conference on Machine Learning , pages=. 2016 , organization=

2016
[30]

The 22nd international conference on artificial intelligence and statistics , pages=

Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron , author=. The 22nd international conference on artificial intelligence and statistics , pages=. 2019 , organization=

2019
[31]

Lecture notes , volume=

Introductory lectures on convex programming volume i: Basic course , author=. Lecture notes , volume=
[32]

Mathematical Programming , volume=

An optimal method for stochastic composite optimization , author=. Mathematical Programming , volume=. 2012 , publisher=

2012
[33]

Conference on Learning Theory , pages=

Benign overfitting of constant-stepsize sgd for linear regression , author=. Conference on Learning Theory , pages=. 2021 , organization=

2021
[34]

Advances in Neural Information Processing Systems , volume=

Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model , author=. Advances in Neural Information Processing Systems , volume=
[35]

2020 , publisher=

First-order and stochastic optimization methods for machine learning , author=. 2020 , publisher=

2020
[36]

Proceedings of the 37th International Conference on Machine Learning , pages =

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent , author =. Proceedings of the 37th International Conference on Machine Learning , pages =. 2020 , editor =

2020
[37]

The Twelfth International Conference on Learning Representations , year=

Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods , author=. The Twelfth International Conference on Learning Representations , year=
[38]

Advances in neural information processing systems , volume=

Smoothness, low noise and fast rates , author=. Advances in neural information processing systems , volume=
[39]

Advances in neural information processing systems , volume=

Non-strongly-convex smooth stochastic approximation with convergence rate O (1/n) , author=. Advances in neural information processing systems , volume=
[40]

Advances in Neural Information Processing Systems , volume=

Optimal rates for random order online optimization , author=. Advances in Neural Information Processing Systems , volume=
[41]

Advances in Neural Information Processing Systems , volume=

Benign underfitting of stochastic gradient descent , author=. Advances in Neural Information Processing Systems , volume=
[42]

International Conference on Machine Learning , pages=

Sgd without replacement: Sharper rates for general smooth convex functions , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019
[43]

The Journal of Machine Learning Research , volume=

Learnability, stability and uniform convergence , author=. The Journal of Machine Learning Research , volume=. 2010 , publisher=

2010
[44]

The Journal of Machine Learning Research , volume=

Stability and generalization , author=. The Journal of Machine Learning Research , volume=. 2002 , publisher=

2002
[45]

Proceedings of the 30th International Conference on Machine Learning , pages =

Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes , author =. Proceedings of the 30th International Conference on Machine Learning , pages =. 2013 , editor =

2013
[46]

ICLR , year=

The Implicit Bias of Gradient Descent on Separable Data , author=. ICLR , year=
[47]

The Implicit Bias of Gradient Descent on Separable Data , archivePrefix = "arXiv", arxivId =

Soudry, Daniel and Hoffer, Elad and. The Implicit Bias of Gradient Descent on Separable Data , archivePrefix = "arXiv", arxivId =. arXiv:1710.10345v3 , journal=

work page arXiv
[48]

AISTATS , year=

Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate , author=. AISTATS , year=
[49]

Journal of Global Optimization , volume=

Accelerated sampling Kaczmarz Motzkin algorithm for the linear feasibility problem , author=. Journal of Global Optimization , volume=. 2020 , publisher=

2020
[50]

IEEE Transactions on Signal Processing , volume=

On the convergence behavior of the LMS and the normalized LMS algorithms , author=. IEEE Transactions on Signal Processing , volume=. 1993 , publisher=

1993
[51]

AISTATS , year=

Convergence of gradient descent on separable data , author=. AISTATS , year=
[52]

ICML , year=

Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models , author=. ICML , year=
[53]

Regularization Matters: Generalization and Optimization of Neural Nets v.s

Wei, Colin and Lee, Jason D and Liu, Qiang and Ma, Tengyu , booktitle =. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel , volume =
[54]

Conference on Learning Theory (COLT) , year=

Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences , author=. Conference on Learning Theory (COLT) , year=
[55]

Implicit Bias of Gradient Descent on Linear Convolutional Networks , volume =

Gunasekar, Suriya and Lee, Jason D and Soudry, Daniel and Srebro, Nati , booktitle =. Implicit Bias of Gradient Descent on Linear Convolutional Networks , volume =
[56]

and Bassily, R

Ma, S. and Bassily, R. and Belkin, M. , eprint =
[57]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Goyal, P. and Doll. arXiv , arxivId =:1706.02677 , title =

work page internal anchor Pith review Pith/arXiv arXiv
[58]

Smith, S. L. and Kindermans, P. and Ying, C. and Le, Q. V. , booktitle =
[59]

ICLR, Spotlight

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks? , author=. ICLR, Spotlight. , year=
[60]

Nar, Kamil and Sastry, S Shankar , booktitle =
[61]

Wu, Lei and Ma, Chao and E, Weinan , booktitle =
[62]

COLT , year=

How do infinite width bounded norm networks look in function space? , author=. COLT , year=
[63]

, author=

The perceptron: a probabilistic model for information storage and organization in the brain. , author=. Psychological review , year=
[64]

Advances in Neural Information Processing Systems , pages=

Train longer, generalize better: closing the generalization gap in large batch training of neural networks , author=. Advances in Neural Information Processing Systems , pages=
[66]

Bertsekas, Dimitri P. , doi =. Mathematical Programming , keywords =. arXiv , arxivId =:1507.01030 , file =

work page internal anchor Pith review Pith/arXiv arXiv
[67]

arXiv , pages =

Smith, Leslie N , eprint =. arXiv , pages =
[68]

International Conference of Artificial Neural Networks (ICANN) , year=
[69]

ICML , year=

Characterizing implicit bias in terms of optimization geometry , author=. ICML , year=
[70]

AISTATS , title =

Nacson, Mor Shpigel and Lee, Jason and Gunasekar, Suriya and Srebro, Nathan and Soudry, Daniel , eprint =. AISTATS , title =
[71]

Journal of the ACM (JACM) , volume=

Sublinear optimization for machine learning , author=. Journal of the ACM (JACM) , volume=. 2012 , publisher=

2012
[72]

doi:10.1137/16M1080173 , eprint =

Bottou, L. doi:10.1137/16M1080173 , eprint =

work page doi:10.1137/16m1080173
[73]

2012 , publisher=

Boosting: Foundations and algorithms , author=. 2012 , publisher=

2012
[74]

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Hoffer, Elad and Hubara, Itay and Soudry, Daniel , booktitle =. arXiv , arxivId =:1705.08741 , title =

work page internal anchor Pith review Pith/arXiv arXiv
[75]

, publisher =

Bertsekas, D. , publisher =
[76]

The Annals of Mathematical Statistics , number =

Robbins, Herbert and Monro, Sutton , doi =. The Annals of Mathematical Statistics , number =
[77]

Foundations and Trends

Bubeck, S. Foundations and Trends
[78]

Ben-David, Shai and Shalev-Shwartz, Shai , doi =
[79]

Ghadimi, Saeed and Lan, Guanghui and Zhang, Hongchao , eprint =. Math. Prog. , keywords =
[80]

Wu, Yuhuai and Ren, Mengye and Liao, Renjie and Grosse, Roger , journal =
[81]

and Bertsekas, D.P

Geary, A. and Bertsekas, D.P. , doi =. Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No.99CH36304) , keywords =

Showing first 80 references.

[1] [1]

The Annals of Mathematical Statistics , volume=

Adjustment of an inverse matrix corresponding to a change in one element of a given matrix , author=. The Annals of Mathematical Statistics , volume=. 1950 , publisher=

1950

[2] [2]

Mathematics of Operations Research , volume=

Linear convergence of projection algorithms , author=. Mathematics of Operations Research , volume=. 2019 , publisher=

2019

[3] [3]

arXiv preprint arXiv:1906.05890 , year=

Gradient descent maximizes the margin of homogeneous neural networks , author=. arXiv preprint arXiv:1906.05890 , year=

work page arXiv 1906

[4] [4]

Mathematics of Operations Research , volume=

Alternating projections on manifolds , author=. Mathematics of Operations Research , volume=. 2008 , publisher=

2008

[5] [5]

A handbook of

Braides, Andrea , booktitle=. A handbook of. 2006 , publisher=

2006

[6] [6]

Conference on Learning Theory , pages=

Kernel and rich regimes in overparametrized models , author=. Conference on Learning Theory , pages=. 2020 , organization=

2020

[7] [7]

On the Theory of Continual Learning with Gradient Descent for Neural Networks

On the Theory of Continual Learning with Gradient Descent for Neural Networks , author=. arXiv preprint arXiv:2510.05573 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

arXiv preprint arXiv:2502.05668 , year=

The late-stage training dynamics of (stochastic) subgradient descent on homogeneous neural networks , author=. arXiv preprint arXiv:2502.05668 , year=

work page arXiv

[9] [9]

Journal of Machine Learning Research , volume=

High probability convergence bounds for non-convex stochastic gradient descent with sub-weibull noise , author=. Journal of Machine Learning Research , volume=

[10] [10]

Conference on Learning Theory , pages=

Almost sure convergence rates for stochastic gradient descent and stochastic heavy ball , author=. Conference on Learning Theory , pages=. 2021 , organization=

2021

[11] [11]

Foundations of Computational Mathematics , volume=

Local linear convergence for alternating and averaged nonconvex projections , author=. Foundations of Computational Mathematics , volume=. 2009 , publisher=

2009

[12] [12]

Foundations of computational mathematics , volume=

Stochastic subgradient method converges on tame functions , author=. Foundations of computational mathematics , volume=. 2020 , publisher=

2020

[13] [13]

1998 , publisher=

Variational analysis , author=. 1998 , publisher=

1998

[14] [14]

, title =

Navakkode, Sheeja and Kennedy, Brian K. , title =. Frontiers in Aging Neuroscience , volume =. 2024 , month = aug, doi =

2024

[15] [15]

The 22nd international conference on artificial intelligence and statistics , pages=

A continuous-time view of early stopping for least squares regression , author=. The 22nd international conference on artificial intelligence and statistics , pages=. 2019 , organization=

2019

[16] [16]

and Vandemark, Katherine M

Voglewede, Ryan L. and Vandemark, Katherine M. and Davidson, Adam M. and DeWitt, Andrea R. and Heffler, Matthew D. and Trimmer, Ethan H. and Mostany, Ricardo , title =. Neurobiology of Aging , volume =. 2019 , month = sep, doi =

2019

[17] [17]

Frontiers in Neural Circuits , volume =

Huang, Li and Zhou, Hao and Chen, Kai and Chen, Xin and Yang, Guangwei , title =. Frontiers in Neural Circuits , volume =. 2020 , month = nov, doi =

2020

[18] [18]

SIAM Journal on Matrix Analysis and Applications , volume=

Randomized iterative methods for linear systems , author=. SIAM Journal on Matrix Analysis and Applications , volume=. 2015 , publisher=

2015

[19] [19]

Accessed: May , url=

Last iterate of sgd converges (even in unbounded domains), 2020 , author=. Accessed: May , url=

2020

[20] [20]

Conference on Learning Theory , pages=

Toward a noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences , author=. Conference on Learning Theory , pages=. 2012 , organization=

2012

[21] [21]

How good is

Safran, Itay and Shamir, Ohad , booktitle=. How good is. 2020 , organization=

2020

[22] [22]

Closing the convergence gap of

Rajput, Shashank and Gupta, Anant and Papailiopoulos, Dimitris , booktitle=. Closing the convergence gap of. 2020 , organization=

2020

[23] [23]

Advances in Neural Information Processing Systems , volume=

Random reshuffling: Simple analysis with vast improvements , author=. Advances in Neural Information Processing Systems , volume=

[24] [24]

International Conference on Machine Learning , pages=

Tighter lower bounds for shuffling SGD: Random permutations and beyond , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[25] [25]

arXiv preprint arXiv:2306.12498 , year=

Empirical risk minimization with shuffled SGD: a primal-dual perspective and improved bounds , author=. arXiv preprint arXiv:2306.12498 , year=

work page arXiv

[26] [26]

Conference on Learning Theory , pages=

Making the last iterate of sgd information theoretically optimal , author=. Conference on Learning Theory , pages=. 2019 , organization=

2019

[27] [27]

arXiv preprint arXiv:2307.11134 , year=

Exact convergence rate of the last iterate in subgradient methods , author=. arXiv preprint arXiv:2307.11134 , year=

work page arXiv

[28] [28]

arXiv preprint arXiv:2310.07831 , year=

Optimal Linear Decay Learning Rate Schedules and Further Refinements , author=. arXiv preprint arXiv:2310.07831 , year=

work page arXiv

[29] [29]

International Conference on Machine Learning , pages=

Train faster, generalize better: Stability of stochastic gradient descent , author=. International Conference on Machine Learning , pages=. 2016 , organization=

2016

[30] [30]

The 22nd international conference on artificial intelligence and statistics , pages=

Fast and faster convergence of sgd for over-parameterized models and an accelerated perceptron , author=. The 22nd international conference on artificial intelligence and statistics , pages=. 2019 , organization=

2019

[31] [31]

Lecture notes , volume=

Introductory lectures on convex programming volume i: Basic course , author=. Lecture notes , volume=

[32] [32]

Mathematical Programming , volume=

An optimal method for stochastic composite optimization , author=. Mathematical Programming , volume=. 2012 , publisher=

2012

[33] [33]

Conference on Learning Theory , pages=

Benign overfitting of constant-stepsize sgd for linear regression , author=. Conference on Learning Theory , pages=. 2021 , organization=

2021

[34] [34]

Advances in Neural Information Processing Systems , volume=

Tight nonparametric convergence rates for stochastic gradient descent under the noiseless linear model , author=. Advances in Neural Information Processing Systems , volume=

[35] [35]

2020 , publisher=

First-order and stochastic optimization methods for machine learning , author=. 2020 , publisher=

2020

[36] [36]

Proceedings of the 37th International Conference on Machine Learning , pages =

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent , author =. Proceedings of the 37th International Conference on Machine Learning , pages =. 2020 , editor =

2020

[37] [37]

The Twelfth International Conference on Learning Representations , year=

Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods , author=. The Twelfth International Conference on Learning Representations , year=

[38] [38]

Advances in neural information processing systems , volume=

Smoothness, low noise and fast rates , author=. Advances in neural information processing systems , volume=

[39] [39]

Advances in neural information processing systems , volume=

Non-strongly-convex smooth stochastic approximation with convergence rate O (1/n) , author=. Advances in neural information processing systems , volume=

[40] [40]

Advances in Neural Information Processing Systems , volume=

Optimal rates for random order online optimization , author=. Advances in Neural Information Processing Systems , volume=

[41] [41]

Advances in Neural Information Processing Systems , volume=

Benign underfitting of stochastic gradient descent , author=. Advances in Neural Information Processing Systems , volume=

[42] [42]

International Conference on Machine Learning , pages=

Sgd without replacement: Sharper rates for general smooth convex functions , author=. International Conference on Machine Learning , pages=. 2019 , organization=

2019

[43] [43]

The Journal of Machine Learning Research , volume=

Learnability, stability and uniform convergence , author=. The Journal of Machine Learning Research , volume=. 2010 , publisher=

2010

[44] [44]

The Journal of Machine Learning Research , volume=

Stability and generalization , author=. The Journal of Machine Learning Research , volume=. 2002 , publisher=

2002

[45] [45]

Proceedings of the 30th International Conference on Machine Learning , pages =

Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes , author =. Proceedings of the 30th International Conference on Machine Learning , pages =. 2013 , editor =

2013

[46] [46]

ICLR , year=

The Implicit Bias of Gradient Descent on Separable Data , author=. ICLR , year=

[47] [47]

The Implicit Bias of Gradient Descent on Separable Data , archivePrefix = "arXiv", arxivId =

Soudry, Daniel and Hoffer, Elad and. The Implicit Bias of Gradient Descent on Separable Data , archivePrefix = "arXiv", arxivId =. arXiv:1710.10345v3 , journal=

work page arXiv

[48] [48]

AISTATS , year=

Stochastic gradient descent on separable data: Exact convergence with a fixed learning rate , author=. AISTATS , year=

[49] [49]

Journal of Global Optimization , volume=

Accelerated sampling Kaczmarz Motzkin algorithm for the linear feasibility problem , author=. Journal of Global Optimization , volume=. 2020 , publisher=

2020

[50] [50]

IEEE Transactions on Signal Processing , volume=

On the convergence behavior of the LMS and the normalized LMS algorithms , author=. IEEE Transactions on Signal Processing , volume=. 1993 , publisher=

1993

[51] [51]

AISTATS , year=

Convergence of gradient descent on separable data , author=. AISTATS , year=

[52] [52]

ICML , year=

Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models , author=. ICML , year=

[53] [53]

Regularization Matters: Generalization and Optimization of Neural Nets v.s

Wei, Colin and Lee, Jason D and Liu, Qiang and Ma, Tengyu , booktitle =. Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel , volume =

[54] [54]

Conference on Learning Theory (COLT) , year=

Beneath the valley of the noncommutative arithmetic-geometric mean inequality: conjectures, case-studies, and consequences , author=. Conference on Learning Theory (COLT) , year=

[55] [55]

Implicit Bias of Gradient Descent on Linear Convolutional Networks , volume =

Gunasekar, Suriya and Lee, Jason D and Soudry, Daniel and Srebro, Nati , booktitle =. Implicit Bias of Gradient Descent on Linear Convolutional Networks , volume =

[56] [56]

and Bassily, R

Ma, S. and Bassily, R. and Belkin, M. , eprint =

[57] [57]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Goyal, P. and Doll. arXiv , arxivId =:1706.02677 , title =

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

Smith, S. L. and Kindermans, P. and Ying, C. and Le, Q. V. , booktitle =

[59] [59]

ICLR, Spotlight

At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks? , author=. ICLR, Spotlight. , year=

[60] [60]

Nar, Kamil and Sastry, S Shankar , booktitle =

[61] [61]

Wu, Lei and Ma, Chao and E, Weinan , booktitle =

[62] [62]

COLT , year=

How do infinite width bounded norm networks look in function space? , author=. COLT , year=

[63] [63]

, author=

The perceptron: a probabilistic model for information storage and organization in the brain. , author=. Psychological review , year=

[64] [64]

Advances in Neural Information Processing Systems , pages=

Train longer, generalize better: closing the generalization gap in large batch training of neural networks , author=. Advances in Neural Information Processing Systems , pages=

[65] [66]

Bertsekas, Dimitri P. , doi =. Mathematical Programming , keywords =. arXiv , arxivId =:1507.01030 , file =

work page internal anchor Pith review Pith/arXiv arXiv

[66] [67]

arXiv , pages =

Smith, Leslie N , eprint =. arXiv , pages =

[67] [68]

International Conference of Artificial Neural Networks (ICANN) , year=

[68] [69]

ICML , year=

Characterizing implicit bias in terms of optimization geometry , author=. ICML , year=

[69] [70]

AISTATS , title =

Nacson, Mor Shpigel and Lee, Jason and Gunasekar, Suriya and Srebro, Nathan and Soudry, Daniel , eprint =. AISTATS , title =

[70] [71]

Journal of the ACM (JACM) , volume=

Sublinear optimization for machine learning , author=. Journal of the ACM (JACM) , volume=. 2012 , publisher=

2012

[71] [72]

doi:10.1137/16M1080173 , eprint =

Bottou, L. doi:10.1137/16M1080173 , eprint =

work page doi:10.1137/16m1080173

[72] [73]

2012 , publisher=

Boosting: Foundations and algorithms , author=. 2012 , publisher=

2012

[73] [74]

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

Hoffer, Elad and Hubara, Itay and Soudry, Daniel , booktitle =. arXiv , arxivId =:1705.08741 , title =

work page internal anchor Pith review Pith/arXiv arXiv

[74] [75]

, publisher =

Bertsekas, D. , publisher =

[75] [76]

The Annals of Mathematical Statistics , number =

Robbins, Herbert and Monro, Sutton , doi =. The Annals of Mathematical Statistics , number =

[76] [77]

Foundations and Trends

Bubeck, S. Foundations and Trends

[77] [78]

Ben-David, Shai and Shalev-Shwartz, Shai , doi =

[78] [79]

Ghadimi, Saeed and Lan, Guanghui and Zhang, Hongchao , eprint =. Math. Prog. , keywords =

[79] [80]

Wu, Yuhuai and Ren, Mengye and Liao, Renjie and Grosse, Roger , journal =

[80] [81]

and Bertsekas, D.P

Geary, A. and Bertsekas, D.P. , doi =. Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No.99CH36304) , keywords =