Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima

Mert Gurbuzbalaban; Rishabh Dixit; Waheed U. Bajwa

arxiv: 2307.07030 · v3 · submitted 2023-07-13 · 🧮 math.OC · cs.LG· cs.SY· eess.SY

Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima

Rishabh Dixit , Mert Gurbuzbalaban , Waheed U. Bajwa This is my paper

Pith reviewed 2026-05-24 07:48 UTC · model grok-4.3

classification 🧮 math.OC cs.LGcs.SYeess.SY

keywords accelerated gradient methodsnonconvex optimizationstrict saddle pointsNesterov accelerated gradientescape trajectoriesconvergence to local minimamomentum methodsdynamical systems

0 comments

The pith

Nesterov's accelerated gradient method with variable momentum avoids strict saddle points almost surely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes a broad class of Nesterov-type accelerated gradient methods for smooth nonconvex objective functions. It establishes that these methods escape strict saddle points and reach local minima through both asymptotic and non-asymptotic arguments. The work resolves an open question on almost-sure avoidance for the variable-momentum version of NAG while also supplying linear exit-time bounds from saddle neighborhoods. Readers care because nonconvex problems arise throughout machine learning and engineering, where saddle points can trap standard methods away from useful solutions.

Core claim

The paper claims that Nesterov's accelerated gradient method with a variable momentum parameter avoids strict saddle points almost surely, shown via asymptotic analysis of the associated discrete dynamical system. It further derives linear exit-time estimates for trajectories leaving strict saddle neighborhoods under necessary conditions on the parameters, develops two metrics for asymptotic rates of convergence and divergence near saddles, and identifies a sub-class of these methods that converges at a near-optimal rate to a local minimum inside convex neighborhoods while escaping saddles more effectively than standard NAG.

What carries the argument

The discrete dynamical system that encodes the accelerated gradient updates with momentum parameters, used to track trajectories near strict saddle points.

If this is right

Trajectories exit strict saddle neighborhoods after a number of steps that scales linearly with the inverse of the smallest negative eigenvalue.
A sub-class of the methods reaches a local minimum at a near-optimal rate once inside a convex neighborhood.
Standard methods such as NAG and constant-momentum NCM exhibit quantifiable rates of divergence away from saddles under the developed metrics.
Necessary conditions on momentum parameters must hold for escape trajectories to exist.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same dynamical-system approach could be used to compare escape behavior across a wider family of momentum-based first-order methods.
Parameter choices that improve saddle escape might be combined with adaptive schemes to maintain performance across mixed convex-nonconvex landscapes.
The linear exit-time result suggests that practical implementations could monitor gradient norms or function values to detect and accelerate passage through saddle regions.

Load-bearing premise

The objective function is twice continuously differentiable with Lipschitz continuous gradients, and the chosen momentum parameters produce well-defined trajectories near saddle points.

What would settle it

A simulation in which a trajectory generated by NAG with variable momentum remains inside a strict saddle neighborhood for arbitrarily long time with positive probability would falsify the almost-sure escape result.

Figures

Figures reproduced from arXiv: 2307.07030 by Mert Gurbuzbalaban, Rishabh Dixit, Waheed U. Bajwa.

**Figure 1.** Figure 1: Asymptotic divergence metric M⋆ (f) (Theorem 3.25) and the exit time Kexit bound (65) as a function of asymptotic momentum β and step size h. Clearly the asymptotic divergence metric M⋆ (f) and the exit time Kexit bound vary inversely with respect to each another. Remark 5.3. Since the exit time from (65) was computed for the case when inf i∈NUS |zi | = ∥Λ∥2 in the 34 [PITH_FULL_IMAGE:figures/full_fig_p03… view at source ↗

**Figure 2.** Figure 2: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p044_2.png] view at source ↗

**Figure 3.** Figure 3: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p045_3.png] view at source ↗

**Figure 4.** Figure 4: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p046_4.png] view at source ↗

**Figure 5.** Figure 5: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p047_5.png] view at source ↗

**Figure 6.** Figure 6: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p048_6.png] view at source ↗

read the original abstract

This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on smooth nonconvex functions. Motivated by some recent works that have proposed effective algorithms, based on Polyak's heavy ball method and the Nesterov accelerated gradient method, to achieve convergence to a local minimum of nonconvex functions, this work proposes a broad class of Nesterov-type accelerated methods and puts forth a rigorous study of these methods encompassing the escape from saddle points and convergence to local minima through both an asymptotic and a non-asymptotic analysis. In the asymptotic regime, this paper answers an open question of whether Nesterov's accelerated gradient method (NAG) with variable momentum parameter avoids strict saddle points almost surely. This work also develops two metrics of asymptotic rates of convergence and divergence, and evaluates these two metrics for several popular standard accelerated methods such as the NAG and Nesterov's accelerated gradient with constant momentum (NCM) near strict saddle points. In the non-asymptotic regime, this work provides an analysis that leads to the "linear" exit time estimates from strict saddle neighborhoods for trajectories of these accelerated methods as well the necessary conditions for the existence of such trajectories. Finally, this work studies a sub-class of accelerated methods that can converge in convex neighborhoods of nonconvex functions with a near optimal rate to a local minimum and at the same time this sub-class offers superior saddle-escape behavior compared to that of NAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper resolves the stated open question on almost-sure saddle avoidance for variable-momentum NAG and supplies new escape-time metrics, but the non-autonomous linearization step needs explicit verification against possible neutral modes.

read the letter

The core result is that NAG with the usual t/(t+3) momentum schedule avoids strict saddles almost surely, together with linear exit-time bounds from saddle neighborhoods and a subclass that improves both escape and local convergence. That directly answers the open question they cite and adds two new asymptotic rate metrics that they evaluate on NAG and NCM. The non-asymptotic exit-time analysis and the necessary conditions for escape trajectories are also new relative to the cited prior work. The subclass they isolate is a concrete design suggestion that looks useful on its face. These pieces are the actual additions. The twice-differentiable Lipschitz-gradient setting is standard, and the paper performs an independent analysis rather than fitting parameters to existing results. The stress-test concern about the time-varying momentum in the linearized recurrence is worth checking in the proofs. If the argument only approximates with a frozen momentum or does not bound the product of the successive Jacobians, directions that repel for constant momentum could become neutral or weakly attracting for the actual schedule, which would affect the measure-zero claim for the stable manifold. The abstract and the open-question framing suggest they attempt a direct non-autonomous treatment, but that step is load-bearing and needs to be inspected. Minor points include whether the new rate metrics are robust to small perturbations in the momentum schedule and how the constants in the linear exit times scale with dimension and condition number. Overall the work is aimed at theorists who study global convergence of first-order methods on nonconvex problems. It engages the literature honestly and supplies falsifiable claims, so it deserves a serious referee even if the non-autonomous details require tightening.

Referee Report

1 major / 0 minor

Summary. The paper proposes a broad class of Nesterov-type accelerated gradient methods for smooth nonconvex optimization. In the asymptotic regime it claims to resolve an open question by proving that NAG with variable momentum (e.g., the standard schedule) avoids strict saddle points almost surely; it introduces two metrics of asymptotic convergence/divergence rates and evaluates them for NAG and NCM near saddles. In the non-asymptotic regime it derives linear exit-time bounds from saddle neighborhoods together with necessary conditions for such trajectories to exist. Finally it identifies a subclass that converges at near-optimal rates inside convex neighborhoods while exhibiting improved saddle-escape behavior relative to standard NAG.

Significance. If the central claims hold, the work supplies the first rigorous almost-sure escape guarantee for variable-momentum NAG and supplies both asymptotic rate metrics and explicit linear exit-time estimates, which are load-bearing for any global-convergence theory of accelerated methods on nonconvex landscapes. The identification of a subclass that simultaneously enjoys near-optimal local convergence and superior escape properties is a concrete algorithmic contribution.

major comments (1)

[Abstract (asymptotic regime)] The almost-sure escape claim for variable-momentum NAG rests on the local analysis of a non-autonomous linearization around a strict saddle. The provided abstract does not indicate whether the proof controls the infinite product of the time-dependent Jacobians or rules out the emergence of neutral/contracting modes that the skeptic note identifies; without such control the stable manifold may fail to have measure zero. This is load-bearing for the open-question resolution stated in the abstract.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying a point where the abstract could better reflect the technical content of the proof. We address the comment below.

read point-by-point responses

Referee: [Abstract (asymptotic regime)] The almost-sure escape claim for variable-momentum NAG rests on the local analysis of a non-autonomous linearization around a strict saddle. The provided abstract does not indicate whether the proof controls the infinite product of the time-dependent Jacobians or rules out the emergence of neutral/contracting modes that the skeptic note identifies; without such control the stable manifold may fail to have measure zero. This is load-bearing for the open-question resolution stated in the abstract.

Authors: We agree that the abstract should more explicitly signal the key technical controls. The manuscript's asymptotic analysis (Section 3) does control the infinite product of the time-dependent Jacobians by deriving uniform bounds on the operator norms along the trajectory and showing that the product of the linearized maps contracts to zero on the stable subspace while expanding on the unstable directions. The variable-momentum schedule is used to ensure that the time-varying eigenvalues remain bounded away from the unit circle, thereby excluding persistent neutral modes; the strict-saddle assumption then implies that the stable manifold has Lebesgue measure zero in a neighborhood of the saddle. We will revise the abstract to state that the proof establishes these controls on the infinite product and rules out neutral/contracting modes. revision: yes

Circularity Check

0 steps flagged

No circularity; independent analysis of existing accelerated methods

full rationale

The paper performs a self-contained asymptotic and non-asymptotic analysis of a broad class of Nesterov-type accelerated methods on twice continuously differentiable functions with Lipschitz gradients. It resolves an open question on almost-sure escape from strict saddles for variable-momentum NAG by direct study of the non-autonomous recurrence and linearization, without reducing any claimed prediction or rate to a fitted parameter, self-definition, or load-bearing self-citation. Standard external assumptions (smoothness, well-defined trajectories) are stated explicitly and do not embed the target escape or convergence claims. No equations or steps are shown to be equivalent to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard smoothness assumptions for the objective and on the existence of trajectories under chosen momentum schedules; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption The objective function is smooth (twice continuously differentiable with Lipschitz gradient).
Required for gradient-based methods and saddle-point analysis to be well-defined.

pith-pipeline@v0.9.0 · 5815 in / 1218 out tokens · 31468 ms · 2026-05-24T07:48:31.611201+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Convergence of difference inclusions via a diameter criterion
math.OC 2026-05 unverdicted novelty 7.0

A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 1 Pith paper · 9 internal anchors

[1]

In: Advances in Neural Information Processing Systems, pp

Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than sgd. In: Advances in Neural Information Processing Systems, pp. 2675–2686 (2018)

work page 2018
[2]

In: Advances in Neural Information Processing Systems, pp

Allen-Zhu, Z., Li, Y.: Neon2: Finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, pp. 3716–3726 (2018)

work page 2018
[3]

Mathematical Programming 180(1-2), 137–156 (2020)

Apidopoulos, V., Aujol, J.F., Dossal, C.: Convergence rate of inertial forward–backward algorithm beyond nesterov’s rule. Mathematical Programming 180(1-2), 137–156 (2020)

work page 2020
[4]

Mathematical Programming 168, 123–175 (2018)

Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algo- rithms with asymptotic vanishing viscosity. Mathematical Programming 168, 123–175 (2018)

work page 2018
[5]

ESAIM: Control, Optimisation and Calculus of Variations 25, 2 (2019)

Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the nesterov accelerated gradient method in the subcritical case α ≤ 3. ESAIM: Control, Optimisation and Calculus of Variations 25, 2 (2019)

work page 2019
[6]

Aujol, J., Dossal, C.: Optimal rate of convergence of an ode associated to the fast gradient descent schemes for b> 0 (2017)

work page 2017
[7]

arXiv preprint arXiv:1911.07596 (2019)

Barakat, A., Bianchi, P.: Convergence analysis of a momentum algorithm with adaptive step size for non convex optimization. arXiv preprint arXiv:1911.07596 (2019)

work page arXiv 1911
[8]

Springer Nature (2021) 59The matrix C will depend on the value of β, n and the eigenvalues in the EU S subspace of the matrix I − h∇2f(x∗)

Braun, P., Gr¨ une, L., Kellett, C.M.: (In-) Stability of Differential Inclusions: Notions, Equivalences, and Lyapunov-like Characterizations. Springer Nature (2021) 59The matrix C will depend on the value of β, n and the eigenvalues in the EU S subspace of the matrix I − h∇2f(x∗) . 103

work page 2021
[9]

Brezis, H., Br´ ezis, H.: Functional analysis, Sobolev spaces and partial differential equations, vol. 2. Springer (2011)

work page 2011
[10]

arXiv preprint arXiv:2204.11292 (2022)

Can, B., Gurbuzbalaban, M.: Entropic risk-averse generalized momentum methods. arXiv preprint arXiv:2204.11292 (2022)

work page arXiv 2022
[11]

In: International Conference on Machine Learning, pp

Can, B., Gurbuzbalaban, M., Zhu, L.: Accelerated linear convergence of stochastic momentum methods in wasserstein distances. In: International Conference on Machine Learning, pp. 891–901. PMLR (2019)

work page 2019
[12]

IEEE Transactions on Information Theory 61(4), 1985–2007 (2015)

Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory 61(4), 1985–2007 (2015)

work page 1985
[13]

SIAM Journal on Optimization 28(2), 1751–1772 (2018)

Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28(2), 1751–1772 (2018)

work page 2018
[14]

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[15]

Mathematical Programming 176(1), 5–37 (2019)

Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming 176(1), 5–37 (2019)

work page 2019
[16]

Springer (1999)

Chicone, C.C.: Ordinary Differential Equations With Applications. Springer (1999)

work page 1999
[17]

Bulletin of the Belgian Mathematical Society-Simon Stevin 22(1), 71–75 (2015)

Conejero, J.A., Mu˜ noz-Fern´ andez, G.A., Arcila, M.M., Seoane-Sep´ ulveda, J.B.: Smooth functions with uncountably many zeros. Bulletin of the Belgian Mathematical Society-Simon Stevin 22(1), 71–75 (2015)

work page 2015
[18]

Ad- vances in Computational mathematics 5(1), 329–359 (1996)

Corless, R.M., Gonnet, G.H., Hare, D.E., Jeffrey, D.J., Knuth, D.E.: On the lambertw function. Ad- vances in Computational mathematics 5(1), 329–359 (1996)

work page 1996
[19]

arXiv preprint arXiv:2012.04061 (2020)

Das, R., Acharya, A., Hashemi, A., Sanghavi, S., Dhillon, I.S., Topcu, U.: Faster non-convex federated learning via global and local momentum. arXiv preprint arXiv:2012.04061 (2020)

work page arXiv 2012
[20]

OUP Oxford (1994)

Davidson, J.: Stochastic limit theory: An introduction for econometricians. OUP Oxford (1994)

work page 1994
[21]

arXiv preprint arXiv:2108.11832 (2021)

Davis, D., Drusvyatskiy, D., Jiang, L.: Subgradient methods near active manifolds: saddle point avoid- ance, local convergence, and asymptotic normality. arXiv preprint arXiv:2108.11832 (2021)

work page arXiv 2021
[22]

Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

De, S., Mukherjee, A., Ullah, E.: Convergence guarantees for rmsprop and adam in non-convex opti- mization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[23]

Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes

Dereich, S., Kassing, S.: Convergence of stochastic gradient descent schemes for lojasiewicz-landscapes. arXiv preprint arXiv:2102.09385 (2021)

work page arXiv 2021
[24]

IEEE Transactions on Information Theory pp

Dixit, R., G¨ urb¨ uzbalaban, M., Bajwa, W.U.: Boundary conditions for linear exit time gradient trajec- tories around saddle points: Analysis and algorithm. IEEE Transactions on Information Theory pp. 1–1 (2022). DOI 10.1109/TIT.2022.3213607

work page doi:10.1109/tit.2022.3213607 2022
[25]

Information and Inference: A Journal of the IMA (2022)

Dixit, R., G¨ urb¨ uzbalaban, M., Bajwa, W.U.: Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points. Information and Inference: A Journal of the IMA (2022). DOI 10.1093/imaiai/iaac025. URL https://doi.org/10.1093/imaiai/iaac025. Iaac025

work page doi:10.1093/imaiai/iaac025 2022
[26]

Dozat, T.: Incorporating nesterov momentum into adam (2016)

work page 2016
[27]

Dunford, N., Schwartz, J.T.: Linear operators, part 1: general theory, vol. 10. John Wiley & Sons (1988)

work page 1988
[28]

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex sgd escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019) 104

work page internal anchor Pith review Pith/arXiv arXiv 1902
[29]

SIAM Journal on Optimization 28(3), 2654–2689 (2018)

Fazlyab, M., Ribeiro, A., Morari, M., Preciado, V.M.: Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM Journal on Optimization 28(3), 2654–2689 (2018)

work page 2018
[30]

Electronic Journal of Statistics 12(1), 461–529 (2018)

Gadat, S., Panloup, F., Saadane, S.: Stochastic heavy ball. Electronic Journal of Statistics 12(1), 461–529 (2018)

work page 2018
[31]

Advances in Neural Information Processing Systems 33, 17850–17862 (2020)

Gao, X., Gurbuzbalaban, M., Zhu, L.: Breaking reversibility accelerates langevin dynamics for non- convex optimization. Advances in Neural Information Processing Systems 33, 17850–17862 (2020)

work page 2020
[32]

Operations Research 70(5), 2931–2947 (2022)

Gao, X., G¨ urb¨ uzbalaban, M., Zhu, L.: Global convergence of stochastic gradient hamiltonian monte carlo for nonconvex stochastic optimization: Nonasymptotic performance bounds and momentum-based acceleration. Operations Research 70(5), 2931–2947 (2022)

work page 2022
[33]

In: 2015 European control conference (ECC), pp

Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 European control conference (ECC), pp. 310–315. IEEE (2015)

work page 2015
[34]

Mathematical Programming 156(1), 59–99 (2016)

Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic program- ming. Mathematical Programming 156(1), 59–99 (2016)

work page 2016
[35]

Advances in Neural Information Processing Systems 32 (2019)

Gitman, I., Lang, H., Zhang, P., Xiao, L.: Understanding the role of momentum in stochastic gradient methods. Advances in Neural Information Processing Systems 32 (2019)

work page 2019
[36]

Hahn, W., et al.: Stability of motion, vol. 138. Springer (1967)

work page 1967
[37]

Hirsch, M., Pugh, C., Shub, M.: Invariant manifolds (lecture notes in mathematics, 583) (1977)

work page 1977
[38]

http://math.huji.ac.il/~mhochman/courses/ fractals-2012/convergence-of-sets-and-measures.pdf

Hochman, M.: Convergence of sets and measures. http://math.huji.ac.il/~mhochman/courses/ fractals-2012/convergence-of-sets-and-measures.pdf . Accessed: 2022-12-22

work page 2012
[39]

http://math.huji.ac.il/~mhochman/courses/ fractals-2012/

Hochman, M.: Convergence of sets and measures. http://math.huji.ac.il/~mhochman/courses/ fractals-2012/. Accessed: 2022-12-22

work page 2012
[40]

IEEE Journal of selected topics in signal processing 10(4), 770–781 (2016)

Jaganathan, K., Eldar, Y.C., Hassibi, B.: Stft phase retrieval: Uniqueness guarantees and recovery algorithms. IEEE Journal of selected topics in signal processing 10(4), 770–781 (2016)

work page 2016
[41]

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[42]

Courier Dover Publications (2017)

Kelley, J.L.: General topology. Courier Dover Publications (2017)

work page 2017
[43]

SIAM (2000)

Kinderlehrer, D., Stampacchia, G.: An introduction to variational inequalities and their applications. SIAM (2000)

work page 2000
[44]

Springer Science & Business Media (2012)

Kirillov, A.A., Gvishiani, A.D.: Theorems and problems in functional analysis. Springer Science & Business Media (2012)

work page 2012
[45]

Fundamenta Mathe- maticae 22(1), 77–108 (1934)

Kirszbraun, M.: ¨Uber die zusammenziehende und lipschitzsche transformationen. Fundamenta Mathe- maticae 22(1), 77–108 (1934)

work page 1934
[46]

In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Koppel, A., Mokhtari, A., Ribeiro, A.: Parallel stochastic successive convex approximation method for large-scale dictionary learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2771–2775. IEEE (2018)

work page 2018
[47]

Computational Mathematics and Mathematical Physics 61(7), 1162–1168 (2021)

Kurochkin, S.V.: Neural network with smooth activation functions and without bottlenecks is almost surely a morse function. Computational Mathematics and Mathematical Physics 61(7), 1162–1168 (2021)

work page 2021
[48]

Mathematical programming 176(1), 311–337 (2019) 105

Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid strict saddle points. Mathematical programming 176(1), 311–337 (2019) 105

work page 2019
[49]

Gradient Descent Converges to Minimizers

Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[50]

In: Introduction to Smooth Manifolds, pp

Lee, J.M.: Smooth manifolds. In: Introduction to Smooth Manifolds, pp. 1–31. Springer (2013)

work page 2013
[51]

IEEE Control Systems Magazine 42(3), 58–72 (2022)

Lessard, L.: The analysis of optimization algorithms: A dissipativity approach. IEEE Control Systems Magazine 42(3), 58–72 (2022)

work page 2022
[52]

Letov, A.: Stability of nonlinear control systems

work page
[53]

Advances in Neural Information Processing Systems 33, 18261–18271 (2020)

Liu, Y., Gao, Y., Yin, W.: An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems 33, 18261–18271 (2020)

work page 2020
[54]

Aggregated Momentum: Stability Through Passive Damping

Lucas, J., Sun, S., Zemel, R., Grosse, R.: Aggregated momentum: Stability through passive damping. arXiv preprint arXiv:1804.00325 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018
[55]

Luenberger, D.G., Ye, Y., et al.: Linear and nonlinear programming, vol. 2. Springer (1984)

work page 1984
[56]

International journal of control 55(3), 531–534 (1992)

Lyapunov, A.M.: The general problem of the stability of motion. International journal of control 55(3), 531–534 (1992)

work page 1992
[57]

Foundations of Computational Mathematics 20(3) (2020)

Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics 20(3) (2020)

work page 2020
[58]

In: International Confer- ence on Learning Representations (2018)

Ma, J., Yarats, D.: Quasi-hyperbolic momentum and adam for deep learning. In: International Confer- ence on Learning Representations (2018)

work page 2018
[59]

Matsumoto, Y.: An introduction to Morse theory, vol. 208. American Mathematical Soc. (2002)

work page 2002
[60]

Megginson, R.E.: An introduction to Banach space theory, vol. 183. Springer Science & Business Media (2012)

work page 2012
[61]

The Annals of Statistics 46(6A), 2747–2774 (2018)

Mei, S., Bai, Y., Montanari, A.: The landscape of empirical risk for nonconvex losses. The Annals of Statistics 46(6A), 2747–2774 (2018)

work page 2018
[62]

In: The 22nd International Conference on Artificial Intelligence and Statistics, pp

Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Efficient nonconvex empirical risk minimization via adaptive sample size methods. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2485–2494. PMLR (2019)

work page 2019
[63]

Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media (2003)

work page 2003
[64]

In: Dokl

Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate o (1/kˆ 2). In: Dokl. akad. nauk Sssr, vol. 269, pp. 543–547 (1983)

work page 1983
[65]

Cambridge university press (2002)

Ott, E.: Chaos in dynamical systems. Cambridge university press (2002)

work page 2002
[66]

Mathematical Programming 176(1-2), 403–427 (2019)

O’Neill, M., Wright, S.J.: Behavior of accelerated gradient methods near critical points of nonconvex functions. Mathematical Programming 176(1-2), 403–427 (2019)

work page 2019
[67]

Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions

Panageas, I., Piliouras, G.: Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016
[68]

Journal of Inequalities and Appli- cations 2005, 1–14 (2005)

Papi, M.: On the domain of the implicit function and applications. Journal of Inequalities and Appli- cations 2005, 1–14 (2005)

work page 2005
[69]

IEEE Transactions on Signal Processing 66(4), 982–991 (2017)

Pauwels, E.J.R., Beck, A., Eldar, Y.C., Sabach, S.: On fienup methods for sparse phase retrieval. IEEE Transactions on Signal Processing 66(4), 982–991 (2017)

work page 2017
[70]

USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964) 106

Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964) 106

work page 1964
[71]

In: Conference on Learning Theory, pp

Raginsky, M., Rakhlin, A., Telgarsky, M.: Non-convex learning via stochastic gradient langevin dynam- ics: a nonasymptotic analysis. In: Conference on Learning Theory, pp. 1674–1703. PMLR (2017)

work page 2017
[72]

On the Convergence of Adam and Beyond

Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1904
[73]

A Generic Approach for Escaping Saddle points

Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017
[74]

Rudin, W., et al.: Principles of mathematical analysis, vol. 3. McGraw-hill New York (1976)

work page 1976
[75]

Schwartz, J.T.: Nonlinear functional analysis, vol. 4. CRC Press (1969)

work page 1969
[76]

Springer Science & Business Media (2013)

Shub, M.: Global stability of dynamical systems. Springer Science & Business Media (2013)

work page 2013
[77]

Bulletin of the American mathematical Society 73(6), 747–817 (1967)

Smale, S.: Differentiable dynamical systems. Bulletin of the American mathematical Society 73(6), 747–817 (1967)

work page 1967
[78]

Advances in neural information processing systems 27 (2014)

Su, W., Boyd, S., Candes, E.: A differential equation for modeling nesterov’s accelerated gradient method: theory and insights. Advances in neural information processing systems 27 (2014)

work page 2014
[79]

In: International conference on machine learning, pp

Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International conference on machine learning, pp. 1139–1147. PMLR (2013)

work page 2013
[80]

Wiley-Interscience (1989)

Tabor, M.: Chaos and integrability in nonlinear dynamics: an introduction. Wiley-Interscience (1989)

work page 1989

Showing first 80 references.

[1] [1]

In: Advances in Neural Information Processing Systems, pp

Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than sgd. In: Advances in Neural Information Processing Systems, pp. 2675–2686 (2018)

work page 2018

[2] [2]

In: Advances in Neural Information Processing Systems, pp

Allen-Zhu, Z., Li, Y.: Neon2: Finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, pp. 3716–3726 (2018)

work page 2018

[3] [3]

Mathematical Programming 180(1-2), 137–156 (2020)

Apidopoulos, V., Aujol, J.F., Dossal, C.: Convergence rate of inertial forward–backward algorithm beyond nesterov’s rule. Mathematical Programming 180(1-2), 137–156 (2020)

work page 2020

[4] [4]

Mathematical Programming 168, 123–175 (2018)

Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algo- rithms with asymptotic vanishing viscosity. Mathematical Programming 168, 123–175 (2018)

work page 2018

[5] [5]

ESAIM: Control, Optimisation and Calculus of Variations 25, 2 (2019)

Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the nesterov accelerated gradient method in the subcritical case α ≤ 3. ESAIM: Control, Optimisation and Calculus of Variations 25, 2 (2019)

work page 2019

[6] [6]

Aujol, J., Dossal, C.: Optimal rate of convergence of an ode associated to the fast gradient descent schemes for b> 0 (2017)

work page 2017

[7] [7]

arXiv preprint arXiv:1911.07596 (2019)

Barakat, A., Bianchi, P.: Convergence analysis of a momentum algorithm with adaptive step size for non convex optimization. arXiv preprint arXiv:1911.07596 (2019)

work page arXiv 1911

[8] [8]

Springer Nature (2021) 59The matrix C will depend on the value of β, n and the eigenvalues in the EU S subspace of the matrix I − h∇2f(x∗)

Braun, P., Gr¨ une, L., Kellett, C.M.: (In-) Stability of Differential Inclusions: Notions, Equivalences, and Lyapunov-like Characterizations. Springer Nature (2021) 59The matrix C will depend on the value of β, n and the eigenvalues in the EU S subspace of the matrix I − h∇2f(x∗) . 103

work page 2021

[9] [9]

Brezis, H., Br´ ezis, H.: Functional analysis, Sobolev spaces and partial differential equations, vol. 2. Springer (2011)

work page 2011

[10] [10]

arXiv preprint arXiv:2204.11292 (2022)

Can, B., Gurbuzbalaban, M.: Entropic risk-averse generalized momentum methods. arXiv preprint arXiv:2204.11292 (2022)

work page arXiv 2022

[11] [11]

In: International Conference on Machine Learning, pp

Can, B., Gurbuzbalaban, M., Zhu, L.: Accelerated linear convergence of stochastic momentum methods in wasserstein distances. In: International Conference on Machine Learning, pp. 891–901. PMLR (2019)

work page 2019

[12] [12]

IEEE Transactions on Information Theory 61(4), 1985–2007 (2015)

Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory 61(4), 1985–2007 (2015)

work page 1985

[13] [13]

SIAM Journal on Optimization 28(2), 1751–1772 (2018)

Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28(2), 1751–1772 (2018)

work page 2018

[14] [14]

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[15] [15]

Mathematical Programming 176(1), 5–37 (2019)

Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming 176(1), 5–37 (2019)

work page 2019

[16] [16]

Springer (1999)

Chicone, C.C.: Ordinary Differential Equations With Applications. Springer (1999)

work page 1999

[17] [17]

Bulletin of the Belgian Mathematical Society-Simon Stevin 22(1), 71–75 (2015)

Conejero, J.A., Mu˜ noz-Fern´ andez, G.A., Arcila, M.M., Seoane-Sep´ ulveda, J.B.: Smooth functions with uncountably many zeros. Bulletin of the Belgian Mathematical Society-Simon Stevin 22(1), 71–75 (2015)

work page 2015

[18] [18]

Ad- vances in Computational mathematics 5(1), 329–359 (1996)

Corless, R.M., Gonnet, G.H., Hare, D.E., Jeffrey, D.J., Knuth, D.E.: On the lambertw function. Ad- vances in Computational mathematics 5(1), 329–359 (1996)

work page 1996

[19] [19]

arXiv preprint arXiv:2012.04061 (2020)

Das, R., Acharya, A., Hashemi, A., Sanghavi, S., Dhillon, I.S., Topcu, U.: Faster non-convex federated learning via global and local momentum. arXiv preprint arXiv:2012.04061 (2020)

work page arXiv 2012

[20] [20]

OUP Oxford (1994)

Davidson, J.: Stochastic limit theory: An introduction for econometricians. OUP Oxford (1994)

work page 1994

[21] [21]

arXiv preprint arXiv:2108.11832 (2021)

Davis, D., Drusvyatskiy, D., Jiang, L.: Subgradient methods near active manifolds: saddle point avoid- ance, local convergence, and asymptotic normality. arXiv preprint arXiv:2108.11832 (2021)

work page arXiv 2021

[22] [22]

Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

De, S., Mukherjee, A., Ullah, E.: Convergence guarantees for rmsprop and adam in non-convex opti- mization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes

Dereich, S., Kassing, S.: Convergence of stochastic gradient descent schemes for lojasiewicz-landscapes. arXiv preprint arXiv:2102.09385 (2021)

work page arXiv 2021

[24] [24]

IEEE Transactions on Information Theory pp

Dixit, R., G¨ urb¨ uzbalaban, M., Bajwa, W.U.: Boundary conditions for linear exit time gradient trajec- tories around saddle points: Analysis and algorithm. IEEE Transactions on Information Theory pp. 1–1 (2022). DOI 10.1109/TIT.2022.3213607

work page doi:10.1109/tit.2022.3213607 2022

[25] [25]

Information and Inference: A Journal of the IMA (2022)

Dixit, R., G¨ urb¨ uzbalaban, M., Bajwa, W.U.: Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points. Information and Inference: A Journal of the IMA (2022). DOI 10.1093/imaiai/iaac025. URL https://doi.org/10.1093/imaiai/iaac025. Iaac025

work page doi:10.1093/imaiai/iaac025 2022

[26] [26]

Dozat, T.: Incorporating nesterov momentum into adam (2016)

work page 2016

[27] [27]

Dunford, N., Schwartz, J.T.: Linear operators, part 1: general theory, vol. 10. John Wiley & Sons (1988)

work page 1988

[28] [28]

Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex sgd escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019) 104

work page internal anchor Pith review Pith/arXiv arXiv 1902

[29] [29]

SIAM Journal on Optimization 28(3), 2654–2689 (2018)

Fazlyab, M., Ribeiro, A., Morari, M., Preciado, V.M.: Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM Journal on Optimization 28(3), 2654–2689 (2018)

work page 2018

[30] [30]

Electronic Journal of Statistics 12(1), 461–529 (2018)

Gadat, S., Panloup, F., Saadane, S.: Stochastic heavy ball. Electronic Journal of Statistics 12(1), 461–529 (2018)

work page 2018

[31] [31]

Advances in Neural Information Processing Systems 33, 17850–17862 (2020)

Gao, X., Gurbuzbalaban, M., Zhu, L.: Breaking reversibility accelerates langevin dynamics for non- convex optimization. Advances in Neural Information Processing Systems 33, 17850–17862 (2020)

work page 2020

[32] [32]

Operations Research 70(5), 2931–2947 (2022)

Gao, X., G¨ urb¨ uzbalaban, M., Zhu, L.: Global convergence of stochastic gradient hamiltonian monte carlo for nonconvex stochastic optimization: Nonasymptotic performance bounds and momentum-based acceleration. Operations Research 70(5), 2931–2947 (2022)

work page 2022

[33] [33]

In: 2015 European control conference (ECC), pp

Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 European control conference (ECC), pp. 310–315. IEEE (2015)

work page 2015

[34] [34]

Mathematical Programming 156(1), 59–99 (2016)

Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic program- ming. Mathematical Programming 156(1), 59–99 (2016)

work page 2016

[35] [35]

Advances in Neural Information Processing Systems 32 (2019)

Gitman, I., Lang, H., Zhang, P., Xiao, L.: Understanding the role of momentum in stochastic gradient methods. Advances in Neural Information Processing Systems 32 (2019)

work page 2019

[36] [36]

Hahn, W., et al.: Stability of motion, vol. 138. Springer (1967)

work page 1967

[37] [37]

Hirsch, M., Pugh, C., Shub, M.: Invariant manifolds (lecture notes in mathematics, 583) (1977)

work page 1977

[38] [38]

http://math.huji.ac.il/~mhochman/courses/ fractals-2012/convergence-of-sets-and-measures.pdf

Hochman, M.: Convergence of sets and measures. http://math.huji.ac.il/~mhochman/courses/ fractals-2012/convergence-of-sets-and-measures.pdf . Accessed: 2022-12-22

work page 2012

[39] [39]

http://math.huji.ac.il/~mhochman/courses/ fractals-2012/

Hochman, M.: Convergence of sets and measures. http://math.huji.ac.il/~mhochman/courses/ fractals-2012/. Accessed: 2022-12-22

work page 2012

[40] [40]

IEEE Journal of selected topics in signal processing 10(4), 770–781 (2016)

Jaganathan, K., Eldar, Y.C., Hassibi, B.: Stft phase retrieval: Uniqueness guarantees and recovery algorithms. IEEE Journal of selected topics in signal processing 10(4), 770–781 (2016)

work page 2016

[41] [41]

Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[42] [42]

Courier Dover Publications (2017)

Kelley, J.L.: General topology. Courier Dover Publications (2017)

work page 2017

[43] [43]

SIAM (2000)

Kinderlehrer, D., Stampacchia, G.: An introduction to variational inequalities and their applications. SIAM (2000)

work page 2000

[44] [44]

Springer Science & Business Media (2012)

Kirillov, A.A., Gvishiani, A.D.: Theorems and problems in functional analysis. Springer Science & Business Media (2012)

work page 2012

[45] [45]

Fundamenta Mathe- maticae 22(1), 77–108 (1934)

Kirszbraun, M.: ¨Uber die zusammenziehende und lipschitzsche transformationen. Fundamenta Mathe- maticae 22(1), 77–108 (1934)

work page 1934

[46] [46]

In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

Koppel, A., Mokhtari, A., Ribeiro, A.: Parallel stochastic successive convex approximation method for large-scale dictionary learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2771–2775. IEEE (2018)

work page 2018

[47] [47]

Computational Mathematics and Mathematical Physics 61(7), 1162–1168 (2021)

Kurochkin, S.V.: Neural network with smooth activation functions and without bottlenecks is almost surely a morse function. Computational Mathematics and Mathematical Physics 61(7), 1162–1168 (2021)

work page 2021

[48] [48]

Mathematical programming 176(1), 311–337 (2019) 105

Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid strict saddle points. Mathematical programming 176(1), 311–337 (2019) 105

work page 2019

[49] [49]

Gradient Descent Converges to Minimizers

Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[50] [50]

In: Introduction to Smooth Manifolds, pp

Lee, J.M.: Smooth manifolds. In: Introduction to Smooth Manifolds, pp. 1–31. Springer (2013)

work page 2013

[51] [51]

IEEE Control Systems Magazine 42(3), 58–72 (2022)

Lessard, L.: The analysis of optimization algorithms: A dissipativity approach. IEEE Control Systems Magazine 42(3), 58–72 (2022)

work page 2022

[52] [52]

Letov, A.: Stability of nonlinear control systems

work page

[53] [53]

Advances in Neural Information Processing Systems 33, 18261–18271 (2020)

Liu, Y., Gao, Y., Yin, W.: An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems 33, 18261–18271 (2020)

work page 2020

[54] [54]

Aggregated Momentum: Stability Through Passive Damping

Lucas, J., Sun, S., Zemel, R., Grosse, R.: Aggregated momentum: Stability through passive damping. arXiv preprint arXiv:1804.00325 (2018)

work page internal anchor Pith review Pith/arXiv arXiv 2018

[55] [55]

Luenberger, D.G., Ye, Y., et al.: Linear and nonlinear programming, vol. 2. Springer (1984)

work page 1984

[56] [56]

International journal of control 55(3), 531–534 (1992)

Lyapunov, A.M.: The general problem of the stability of motion. International journal of control 55(3), 531–534 (1992)

work page 1992

[57] [57]

Foundations of Computational Mathematics 20(3) (2020)

Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics 20(3) (2020)

work page 2020

[58] [58]

In: International Confer- ence on Learning Representations (2018)

Ma, J., Yarats, D.: Quasi-hyperbolic momentum and adam for deep learning. In: International Confer- ence on Learning Representations (2018)

work page 2018

[59] [59]

Matsumoto, Y.: An introduction to Morse theory, vol. 208. American Mathematical Soc. (2002)

work page 2002

[60] [60]

Megginson, R.E.: An introduction to Banach space theory, vol. 183. Springer Science & Business Media (2012)

work page 2012

[61] [61]

The Annals of Statistics 46(6A), 2747–2774 (2018)

Mei, S., Bai, Y., Montanari, A.: The landscape of empirical risk for nonconvex losses. The Annals of Statistics 46(6A), 2747–2774 (2018)

work page 2018

[62] [62]

In: The 22nd International Conference on Artificial Intelligence and Statistics, pp

Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Efficient nonconvex empirical risk minimization via adaptive sample size methods. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2485–2494. PMLR (2019)

work page 2019

[63] [63]

Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media (2003)

work page 2003

[64] [64]

In: Dokl

Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate o (1/kˆ 2). In: Dokl. akad. nauk Sssr, vol. 269, pp. 543–547 (1983)

work page 1983

[65] [65]

Cambridge university press (2002)

Ott, E.: Chaos in dynamical systems. Cambridge university press (2002)

work page 2002

[66] [66]

Mathematical Programming 176(1-2), 403–427 (2019)

O’Neill, M., Wright, S.J.: Behavior of accelerated gradient methods near critical points of nonconvex functions. Mathematical Programming 176(1-2), 403–427 (2019)

work page 2019

[67] [67]

Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions

Panageas, I., Piliouras, G.: Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405 (2016)

work page internal anchor Pith review Pith/arXiv arXiv 2016

[68] [68]

Journal of Inequalities and Appli- cations 2005, 1–14 (2005)

Papi, M.: On the domain of the implicit function and applications. Journal of Inequalities and Appli- cations 2005, 1–14 (2005)

work page 2005

[69] [69]

IEEE Transactions on Signal Processing 66(4), 982–991 (2017)

Pauwels, E.J.R., Beck, A., Eldar, Y.C., Sabach, S.: On fienup methods for sparse phase retrieval. IEEE Transactions on Signal Processing 66(4), 982–991 (2017)

work page 2017

[70] [70]

USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964) 106

Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964) 106

work page 1964

[71] [71]

In: Conference on Learning Theory, pp

Raginsky, M., Rakhlin, A., Telgarsky, M.: Non-convex learning via stochastic gradient langevin dynam- ics: a nonasymptotic analysis. In: Conference on Learning Theory, pp. 1674–1703. PMLR (2017)

work page 2017

[72] [72]

On the Convergence of Adam and Beyond

Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237 (2019)

work page internal anchor Pith review Pith/arXiv arXiv 1904

[73] [73]

A Generic Approach for Escaping Saddle points

Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)

work page internal anchor Pith review Pith/arXiv arXiv 2017

[74] [74]

Rudin, W., et al.: Principles of mathematical analysis, vol. 3. McGraw-hill New York (1976)

work page 1976

[75] [75]

Schwartz, J.T.: Nonlinear functional analysis, vol. 4. CRC Press (1969)

work page 1969

[76] [76]

Springer Science & Business Media (2013)

Shub, M.: Global stability of dynamical systems. Springer Science & Business Media (2013)

work page 2013

[77] [77]

Bulletin of the American mathematical Society 73(6), 747–817 (1967)

Smale, S.: Differentiable dynamical systems. Bulletin of the American mathematical Society 73(6), 747–817 (1967)

work page 1967

[78] [78]

Advances in neural information processing systems 27 (2014)

Su, W., Boyd, S., Candes, E.: A differential equation for modeling nesterov’s accelerated gradient method: theory and insights. Advances in neural information processing systems 27 (2014)

work page 2014

[79] [79]

In: International conference on machine learning, pp

Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International conference on machine learning, pp. 1139–1147. PMLR (2013)

work page 2013

[80] [80]

Wiley-Interscience (1989)

Tabor, M.: Chaos and integrability in nonlinear dynamics: an introduction. Wiley-Interscience (1989)

work page 1989