pith. sign in

arxiv: 2307.07030 · v3 · submitted 2023-07-13 · 🧮 math.OC · cs.LG· cs.SY· eess.SY

Accelerated Gradient Methods for Nonconvex Optimization: Escape Trajectories From Strict Saddle Points and Convergence to Local Minima

Pith reviewed 2026-05-24 07:48 UTC · model grok-4.3

classification 🧮 math.OC cs.LGcs.SYeess.SY
keywords accelerated gradient methodsnonconvex optimizationstrict saddle pointsNesterov accelerated gradientescape trajectoriesconvergence to local minimamomentum methodsdynamical systems
0
0 comments X

The pith

Nesterov's accelerated gradient method with variable momentum avoids strict saddle points almost surely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes a broad class of Nesterov-type accelerated gradient methods for smooth nonconvex objective functions. It establishes that these methods escape strict saddle points and reach local minima through both asymptotic and non-asymptotic arguments. The work resolves an open question on almost-sure avoidance for the variable-momentum version of NAG while also supplying linear exit-time bounds from saddle neighborhoods. Readers care because nonconvex problems arise throughout machine learning and engineering, where saddle points can trap standard methods away from useful solutions.

Core claim

The paper claims that Nesterov's accelerated gradient method with a variable momentum parameter avoids strict saddle points almost surely, shown via asymptotic analysis of the associated discrete dynamical system. It further derives linear exit-time estimates for trajectories leaving strict saddle neighborhoods under necessary conditions on the parameters, develops two metrics for asymptotic rates of convergence and divergence near saddles, and identifies a sub-class of these methods that converges at a near-optimal rate to a local minimum inside convex neighborhoods while escaping saddles more effectively than standard NAG.

What carries the argument

The discrete dynamical system that encodes the accelerated gradient updates with momentum parameters, used to track trajectories near strict saddle points.

If this is right

  • Trajectories exit strict saddle neighborhoods after a number of steps that scales linearly with the inverse of the smallest negative eigenvalue.
  • A sub-class of the methods reaches a local minimum at a near-optimal rate once inside a convex neighborhood.
  • Standard methods such as NAG and constant-momentum NCM exhibit quantifiable rates of divergence away from saddles under the developed metrics.
  • Necessary conditions on momentum parameters must hold for escape trajectories to exist.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same dynamical-system approach could be used to compare escape behavior across a wider family of momentum-based first-order methods.
  • Parameter choices that improve saddle escape might be combined with adaptive schemes to maintain performance across mixed convex-nonconvex landscapes.
  • The linear exit-time result suggests that practical implementations could monitor gradient norms or function values to detect and accelerate passage through saddle regions.

Load-bearing premise

The objective function is twice continuously differentiable with Lipschitz continuous gradients, and the chosen momentum parameters produce well-defined trajectories near saddle points.

What would settle it

A simulation in which a trajectory generated by NAG with variable momentum remains inside a strict saddle neighborhood for arbitrarily long time with positive probability would falsify the almost-sure escape result.

Figures

Figures reproduced from arXiv: 2307.07030 by Mert Gurbuzbalaban, Rishabh Dixit, Waheed U. Bajwa.

Figure 1
Figure 1. Figure 1: Asymptotic divergence metric M⋆ (f) (Theorem 3.25) and the exit time Kexit bound (65) as a function of asymptotic momentum β and step size h. Clearly the asymptotic divergence metric M⋆ (f) and the exit time Kexit bound vary inversely with respect to each another. Remark 5.3. Since the exit time from (65) was computed for the case when inf i∈NUS |zi | = ∥Λ∥2 in the 34 [PITH_FULL_IMAGE:figures/full_fig_p03… view at source ↗
Figure 2
Figure 2. Figure 2: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p044_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p045_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p046_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p047_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Simulating various accelerated gradient trajectories of ( [PITH_FULL_IMAGE:figures/full_fig_p048_6.png] view at source ↗
read the original abstract

This paper considers the problem of understanding the behavior of a general class of accelerated gradient methods on smooth nonconvex functions. Motivated by some recent works that have proposed effective algorithms, based on Polyak's heavy ball method and the Nesterov accelerated gradient method, to achieve convergence to a local minimum of nonconvex functions, this work proposes a broad class of Nesterov-type accelerated methods and puts forth a rigorous study of these methods encompassing the escape from saddle points and convergence to local minima through both an asymptotic and a non-asymptotic analysis. In the asymptotic regime, this paper answers an open question of whether Nesterov's accelerated gradient method (NAG) with variable momentum parameter avoids strict saddle points almost surely. This work also develops two metrics of asymptotic rates of convergence and divergence, and evaluates these two metrics for several popular standard accelerated methods such as the NAG and Nesterov's accelerated gradient with constant momentum (NCM) near strict saddle points. In the non-asymptotic regime, this work provides an analysis that leads to the "linear" exit time estimates from strict saddle neighborhoods for trajectories of these accelerated methods as well the necessary conditions for the existence of such trajectories. Finally, this work studies a sub-class of accelerated methods that can converge in convex neighborhoods of nonconvex functions with a near optimal rate to a local minimum and at the same time this sub-class offers superior saddle-escape behavior compared to that of NAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes a broad class of Nesterov-type accelerated gradient methods for smooth nonconvex optimization. In the asymptotic regime it claims to resolve an open question by proving that NAG with variable momentum (e.g., the standard schedule) avoids strict saddle points almost surely; it introduces two metrics of asymptotic convergence/divergence rates and evaluates them for NAG and NCM near saddles. In the non-asymptotic regime it derives linear exit-time bounds from saddle neighborhoods together with necessary conditions for such trajectories to exist. Finally it identifies a subclass that converges at near-optimal rates inside convex neighborhoods while exhibiting improved saddle-escape behavior relative to standard NAG.

Significance. If the central claims hold, the work supplies the first rigorous almost-sure escape guarantee for variable-momentum NAG and supplies both asymptotic rate metrics and explicit linear exit-time estimates, which are load-bearing for any global-convergence theory of accelerated methods on nonconvex landscapes. The identification of a subclass that simultaneously enjoys near-optimal local convergence and superior escape properties is a concrete algorithmic contribution.

major comments (1)
  1. [Abstract (asymptotic regime)] The almost-sure escape claim for variable-momentum NAG rests on the local analysis of a non-autonomous linearization around a strict saddle. The provided abstract does not indicate whether the proof controls the infinite product of the time-dependent Jacobians or rules out the emergence of neutral/contracting modes that the skeptic note identifies; without such control the stable manifold may fail to have measure zero. This is load-bearing for the open-question resolution stated in the abstract.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and for identifying a point where the abstract could better reflect the technical content of the proof. We address the comment below.

read point-by-point responses
  1. Referee: [Abstract (asymptotic regime)] The almost-sure escape claim for variable-momentum NAG rests on the local analysis of a non-autonomous linearization around a strict saddle. The provided abstract does not indicate whether the proof controls the infinite product of the time-dependent Jacobians or rules out the emergence of neutral/contracting modes that the skeptic note identifies; without such control the stable manifold may fail to have measure zero. This is load-bearing for the open-question resolution stated in the abstract.

    Authors: We agree that the abstract should more explicitly signal the key technical controls. The manuscript's asymptotic analysis (Section 3) does control the infinite product of the time-dependent Jacobians by deriving uniform bounds on the operator norms along the trajectory and showing that the product of the linearized maps contracts to zero on the stable subspace while expanding on the unstable directions. The variable-momentum schedule is used to ensure that the time-varying eigenvalues remain bounded away from the unit circle, thereby excluding persistent neutral modes; the strict-saddle assumption then implies that the stable manifold has Lebesgue measure zero in a neighborhood of the saddle. We will revise the abstract to state that the proof establishes these controls on the infinite product and rules out neutral/contracting modes. revision: yes

Circularity Check

0 steps flagged

No circularity; independent analysis of existing accelerated methods

full rationale

The paper performs a self-contained asymptotic and non-asymptotic analysis of a broad class of Nesterov-type accelerated methods on twice continuously differentiable functions with Lipschitz gradients. It resolves an open question on almost-sure escape from strict saddles for variable-momentum NAG by direct study of the non-autonomous recurrence and linearization, without reducing any claimed prediction or rate to a fitted parameter, self-definition, or load-bearing self-citation. Standard external assumptions (smoothness, well-defined trajectories) are stated explicitly and do not embed the target escape or convergence claims. No equations or steps are shown to be equivalent to their inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claims rest on standard smoothness assumptions for the objective and on the existence of trajectories under chosen momentum schedules; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption The objective function is smooth (twice continuously differentiable with Lipschitz gradient).
    Required for gradient-based methods and saddle-point analysis to be well-defined.

pith-pipeline@v0.9.0 · 5815 in / 1218 out tokens · 31468 ms · 2026-05-24T07:48:31.611201+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Convergence of difference inclusions via a diameter criterion

    math.OC 2026-05 unverdicted novelty 7.0

    A diameter criterion tied to a potential function certifies convergence of difference inclusions, enabling discrete proofs for first-order optimization methods with diminishing steps.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · cited by 1 Pith paper · 9 internal anchors

  1. [1]

    In: Advances in Neural Information Processing Systems, pp

    Allen-Zhu, Z.: Natasha 2: Faster non-convex optimization than sgd. In: Advances in Neural Information Processing Systems, pp. 2675–2686 (2018)

  2. [2]

    In: Advances in Neural Information Processing Systems, pp

    Allen-Zhu, Z., Li, Y.: Neon2: Finding local minima via first-order oracles. In: Advances in Neural Information Processing Systems, pp. 3716–3726 (2018)

  3. [3]

    Mathematical Programming 180(1-2), 137–156 (2020)

    Apidopoulos, V., Aujol, J.F., Dossal, C.: Convergence rate of inertial forward–backward algorithm beyond nesterov’s rule. Mathematical Programming 180(1-2), 137–156 (2020)

  4. [4]

    Mathematical Programming 168, 123–175 (2018)

    Attouch, H., Chbani, Z., Peypouquet, J., Redont, P.: Fast convergence of inertial dynamics and algo- rithms with asymptotic vanishing viscosity. Mathematical Programming 168, 123–175 (2018)

  5. [5]

    ESAIM: Control, Optimisation and Calculus of Variations 25, 2 (2019)

    Attouch, H., Chbani, Z., Riahi, H.: Rate of convergence of the nesterov accelerated gradient method in the subcritical case α ≤ 3. ESAIM: Control, Optimisation and Calculus of Variations 25, 2 (2019)

  6. [6]

    Aujol, J., Dossal, C.: Optimal rate of convergence of an ode associated to the fast gradient descent schemes for b> 0 (2017)

  7. [7]

    arXiv preprint arXiv:1911.07596 (2019)

    Barakat, A., Bianchi, P.: Convergence analysis of a momentum algorithm with adaptive step size for non convex optimization. arXiv preprint arXiv:1911.07596 (2019)

  8. [8]

    Springer Nature (2021) 59The matrix C will depend on the value of β, n and the eigenvalues in the EU S subspace of the matrix I − h∇2f(x∗)

    Braun, P., Gr¨ une, L., Kellett, C.M.: (In-) Stability of Differential Inclusions: Notions, Equivalences, and Lyapunov-like Characterizations. Springer Nature (2021) 59The matrix C will depend on the value of β, n and the eigenvalues in the EU S subspace of the matrix I − h∇2f(x∗) . 103

  9. [9]

    Brezis, H., Br´ ezis, H.: Functional analysis, Sobolev spaces and partial differential equations, vol. 2. Springer (2011)

  10. [10]

    arXiv preprint arXiv:2204.11292 (2022)

    Can, B., Gurbuzbalaban, M.: Entropic risk-averse generalized momentum methods. arXiv preprint arXiv:2204.11292 (2022)

  11. [11]

    In: International Conference on Machine Learning, pp

    Can, B., Gurbuzbalaban, M., Zhu, L.: Accelerated linear convergence of stochastic momentum methods in wasserstein distances. In: International Conference on Machine Learning, pp. 891–901. PMLR (2019)

  12. [12]

    IEEE Transactions on Information Theory 61(4), 1985–2007 (2015)

    Candes, E.J., Li, X., Soltanolkotabi, M.: Phase retrieval via wirtinger flow: Theory and algorithms. IEEE Transactions on Information Theory 61(4), 1985–2007 (2015)

  13. [13]

    SIAM Journal on Optimization 28(2), 1751–1772 (2018)

    Carmon, Y., Duchi, J.C., Hinder, O., Sidford, A.: Accelerated methods for nonconvex optimization. SIAM Journal on Optimization 28(2), 1751–1772 (2018)

  14. [14]

    On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

    Chen, X., Liu, S., Sun, R., Hong, M.: On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941 (2018)

  15. [15]

    Mathematical Programming 176(1), 5–37 (2019)

    Chen, Y., Chi, Y., Fan, J., Ma, C.: Gradient descent with random initialization: Fast global convergence for nonconvex phase retrieval. Mathematical Programming 176(1), 5–37 (2019)

  16. [16]

    Springer (1999)

    Chicone, C.C.: Ordinary Differential Equations With Applications. Springer (1999)

  17. [17]

    Bulletin of the Belgian Mathematical Society-Simon Stevin 22(1), 71–75 (2015)

    Conejero, J.A., Mu˜ noz-Fern´ andez, G.A., Arcila, M.M., Seoane-Sep´ ulveda, J.B.: Smooth functions with uncountably many zeros. Bulletin of the Belgian Mathematical Society-Simon Stevin 22(1), 71–75 (2015)

  18. [18]

    Ad- vances in Computational mathematics 5(1), 329–359 (1996)

    Corless, R.M., Gonnet, G.H., Hare, D.E., Jeffrey, D.J., Knuth, D.E.: On the lambertw function. Ad- vances in Computational mathematics 5(1), 329–359 (1996)

  19. [19]

    arXiv preprint arXiv:2012.04061 (2020)

    Das, R., Acharya, A., Hashemi, A., Sanghavi, S., Dhillon, I.S., Topcu, U.: Faster non-convex federated learning via global and local momentum. arXiv preprint arXiv:2012.04061 (2020)

  20. [20]

    OUP Oxford (1994)

    Davidson, J.: Stochastic limit theory: An introduction for econometricians. OUP Oxford (1994)

  21. [21]

    arXiv preprint arXiv:2108.11832 (2021)

    Davis, D., Drusvyatskiy, D., Jiang, L.: Subgradient methods near active manifolds: saddle point avoid- ance, local convergence, and asymptotic normality. arXiv preprint arXiv:2108.11832 (2021)

  22. [22]

    Convergence guarantees for RMSProp and ADAM in non-convex optimization and an empirical comparison to Nesterov acceleration

    De, S., Mukherjee, A., Ullah, E.: Convergence guarantees for rmsprop and adam in non-convex opti- mization and an empirical comparison to nesterov acceleration. arXiv preprint arXiv:1807.06766 (2018)

  23. [23]

    Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes

    Dereich, S., Kassing, S.: Convergence of stochastic gradient descent schemes for lojasiewicz-landscapes. arXiv preprint arXiv:2102.09385 (2021)

  24. [24]

    IEEE Transactions on Information Theory pp

    Dixit, R., G¨ urb¨ uzbalaban, M., Bajwa, W.U.: Boundary conditions for linear exit time gradient trajec- tories around saddle points: Analysis and algorithm. IEEE Transactions on Information Theory pp. 1–1 (2022). DOI 10.1109/TIT.2022.3213607

  25. [25]

    Information and Inference: A Journal of the IMA (2022)

    Dixit, R., G¨ urb¨ uzbalaban, M., Bajwa, W.U.: Exit Time Analysis for Approximations of Gradient Descent Trajectories Around Saddle Points. Information and Inference: A Journal of the IMA (2022). DOI 10.1093/imaiai/iaac025. URL https://doi.org/10.1093/imaiai/iaac025. Iaac025

  26. [26]

    Dozat, T.: Incorporating nesterov momentum into adam (2016)

  27. [27]

    Dunford, N., Schwartz, J.T.: Linear operators, part 1: general theory, vol. 10. John Wiley & Sons (1988)

  28. [28]

    Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

    Fang, C., Lin, Z., Zhang, T.: Sharp analysis for nonconvex sgd escaping from saddle points. arXiv preprint arXiv:1902.00247 (2019) 104

  29. [29]

    SIAM Journal on Optimization 28(3), 2654–2689 (2018)

    Fazlyab, M., Ribeiro, A., Morari, M., Preciado, V.M.: Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM Journal on Optimization 28(3), 2654–2689 (2018)

  30. [30]

    Electronic Journal of Statistics 12(1), 461–529 (2018)

    Gadat, S., Panloup, F., Saadane, S.: Stochastic heavy ball. Electronic Journal of Statistics 12(1), 461–529 (2018)

  31. [31]

    Advances in Neural Information Processing Systems 33, 17850–17862 (2020)

    Gao, X., Gurbuzbalaban, M., Zhu, L.: Breaking reversibility accelerates langevin dynamics for non- convex optimization. Advances in Neural Information Processing Systems 33, 17850–17862 (2020)

  32. [32]

    Operations Research 70(5), 2931–2947 (2022)

    Gao, X., G¨ urb¨ uzbalaban, M., Zhu, L.: Global convergence of stochastic gradient hamiltonian monte carlo for nonconvex stochastic optimization: Nonasymptotic performance bounds and momentum-based acceleration. Operations Research 70(5), 2931–2947 (2022)

  33. [33]

    In: 2015 European control conference (ECC), pp

    Ghadimi, E., Feyzmahdavian, H.R., Johansson, M.: Global convergence of the heavy-ball method for convex optimization. In: 2015 European control conference (ECC), pp. 310–315. IEEE (2015)

  34. [34]

    Mathematical Programming 156(1), 59–99 (2016)

    Ghadimi, S., Lan, G.: Accelerated gradient methods for nonconvex nonlinear and stochastic program- ming. Mathematical Programming 156(1), 59–99 (2016)

  35. [35]

    Advances in Neural Information Processing Systems 32 (2019)

    Gitman, I., Lang, H., Zhang, P., Xiao, L.: Understanding the role of momentum in stochastic gradient methods. Advances in Neural Information Processing Systems 32 (2019)

  36. [36]

    Hahn, W., et al.: Stability of motion, vol. 138. Springer (1967)

  37. [37]

    Hirsch, M., Pugh, C., Shub, M.: Invariant manifolds (lecture notes in mathematics, 583) (1977)

  38. [38]

    http://math.huji.ac.il/~mhochman/courses/ fractals-2012/convergence-of-sets-and-measures.pdf

    Hochman, M.: Convergence of sets and measures. http://math.huji.ac.il/~mhochman/courses/ fractals-2012/convergence-of-sets-and-measures.pdf . Accessed: 2022-12-22

  39. [39]

    http://math.huji.ac.il/~mhochman/courses/ fractals-2012/

    Hochman, M.: Convergence of sets and measures. http://math.huji.ac.il/~mhochman/courses/ fractals-2012/. Accessed: 2022-12-22

  40. [40]

    IEEE Journal of selected topics in signal processing 10(4), 770–781 (2016)

    Jaganathan, K., Eldar, Y.C., Hassibi, B.: Stft phase retrieval: Uniqueness guarantees and recovery algorithms. IEEE Journal of selected topics in signal processing 10(4), 770–781 (2016)

  41. [41]

    Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

    Jin, C., Netrapalli, P., Jordan, M.I.: Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456 (2017)

  42. [42]

    Courier Dover Publications (2017)

    Kelley, J.L.: General topology. Courier Dover Publications (2017)

  43. [43]

    SIAM (2000)

    Kinderlehrer, D., Stampacchia, G.: An introduction to variational inequalities and their applications. SIAM (2000)

  44. [44]

    Springer Science & Business Media (2012)

    Kirillov, A.A., Gvishiani, A.D.: Theorems and problems in functional analysis. Springer Science & Business Media (2012)

  45. [45]

    Fundamenta Mathe- maticae 22(1), 77–108 (1934)

    Kirszbraun, M.: ¨Uber die zusammenziehende und lipschitzsche transformationen. Fundamenta Mathe- maticae 22(1), 77–108 (1934)

  46. [46]

    In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp

    Koppel, A., Mokhtari, A., Ribeiro, A.: Parallel stochastic successive convex approximation method for large-scale dictionary learning. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2771–2775. IEEE (2018)

  47. [47]

    Computational Mathematics and Mathematical Physics 61(7), 1162–1168 (2021)

    Kurochkin, S.V.: Neural network with smooth activation functions and without bottlenecks is almost surely a morse function. Computational Mathematics and Mathematical Physics 61(7), 1162–1168 (2021)

  48. [48]

    Mathematical programming 176(1), 311–337 (2019) 105

    Lee, J.D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M.I., Recht, B.: First-order methods almost always avoid strict saddle points. Mathematical programming 176(1), 311–337 (2019) 105

  49. [49]

    Gradient Descent Converges to Minimizers

    Lee, J.D., Simchowitz, M., Jordan, M.I., Recht, B.: Gradient descent converges to minimizers. arXiv preprint arXiv:1602.04915 (2016)

  50. [50]

    In: Introduction to Smooth Manifolds, pp

    Lee, J.M.: Smooth manifolds. In: Introduction to Smooth Manifolds, pp. 1–31. Springer (2013)

  51. [51]

    IEEE Control Systems Magazine 42(3), 58–72 (2022)

    Lessard, L.: The analysis of optimization algorithms: A dissipativity approach. IEEE Control Systems Magazine 42(3), 58–72 (2022)

  52. [52]

    Letov, A.: Stability of nonlinear control systems

  53. [53]

    Advances in Neural Information Processing Systems 33, 18261–18271 (2020)

    Liu, Y., Gao, Y., Yin, W.: An improved analysis of stochastic gradient descent with momentum. Advances in Neural Information Processing Systems 33, 18261–18271 (2020)

  54. [54]

    Aggregated Momentum: Stability Through Passive Damping

    Lucas, J., Sun, S., Zemel, R., Grosse, R.: Aggregated momentum: Stability through passive damping. arXiv preprint arXiv:1804.00325 (2018)

  55. [55]

    Luenberger, D.G., Ye, Y., et al.: Linear and nonlinear programming, vol. 2. Springer (1984)

  56. [56]

    International journal of control 55(3), 531–534 (1992)

    Lyapunov, A.M.: The general problem of the stability of motion. International journal of control 55(3), 531–534 (1992)

  57. [57]

    Foundations of Computational Mathematics 20(3) (2020)

    Ma, C., Wang, K., Chi, Y., Chen, Y.: Implicit regularization in nonconvex statistical estimation: Gradient descent converges linearly for phase retrieval, matrix completion, and blind deconvolution. Foundations of Computational Mathematics 20(3) (2020)

  58. [58]

    In: International Confer- ence on Learning Representations (2018)

    Ma, J., Yarats, D.: Quasi-hyperbolic momentum and adam for deep learning. In: International Confer- ence on Learning Representations (2018)

  59. [59]

    Matsumoto, Y.: An introduction to Morse theory, vol. 208. American Mathematical Soc. (2002)

  60. [60]

    Megginson, R.E.: An introduction to Banach space theory, vol. 183. Springer Science & Business Media (2012)

  61. [61]

    The Annals of Statistics 46(6A), 2747–2774 (2018)

    Mei, S., Bai, Y., Montanari, A.: The landscape of empirical risk for nonconvex losses. The Annals of Statistics 46(6A), 2747–2774 (2018)

  62. [62]

    In: The 22nd International Conference on Artificial Intelligence and Statistics, pp

    Mokhtari, A., Ozdaglar, A., Jadbabaie, A.: Efficient nonconvex empirical risk minimization via adaptive sample size methods. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2485–2494. PMLR (2019)

  63. [63]

    Nesterov, Y.: Introductory lectures on convex optimization: A basic course, vol. 87. Springer Science & Business Media (2003)

  64. [64]

    In: Dokl

    Nesterov, Y.E.: A method for solving the convex programming problem with convergence rate o (1/kˆ 2). In: Dokl. akad. nauk Sssr, vol. 269, pp. 543–547 (1983)

  65. [65]

    Cambridge university press (2002)

    Ott, E.: Chaos in dynamical systems. Cambridge university press (2002)

  66. [66]

    Mathematical Programming 176(1-2), 403–427 (2019)

    O’Neill, M., Wright, S.J.: Behavior of accelerated gradient methods near critical points of nonconvex functions. Mathematical Programming 176(1-2), 403–427 (2019)

  67. [67]

    Gradient Descent Only Converges to Minimizers: Non-Isolated Critical Points and Invariant Regions

    Panageas, I., Piliouras, G.: Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions. arXiv preprint arXiv:1605.00405 (2016)

  68. [68]

    Journal of Inequalities and Appli- cations 2005, 1–14 (2005)

    Papi, M.: On the domain of the implicit function and applications. Journal of Inequalities and Appli- cations 2005, 1–14 (2005)

  69. [69]

    IEEE Transactions on Signal Processing 66(4), 982–991 (2017)

    Pauwels, E.J.R., Beck, A., Eldar, Y.C., Sabach, S.: On fienup methods for sparse phase retrieval. IEEE Transactions on Signal Processing 66(4), 982–991 (2017)

  70. [70]

    USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964) 106

    Polyak, B.T.: Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics 4(5), 1–17 (1964) 106

  71. [71]

    In: Conference on Learning Theory, pp

    Raginsky, M., Rakhlin, A., Telgarsky, M.: Non-convex learning via stochastic gradient langevin dynam- ics: a nonasymptotic analysis. In: Conference on Learning Theory, pp. 1674–1703. PMLR (2017)

  72. [72]

    On the Convergence of Adam and Beyond

    Reddi, S.J., Kale, S., Kumar, S.: On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237 (2019)

  73. [73]

    A Generic Approach for Escaping Saddle points

    Reddi, S.J., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., Smola, A.J.: A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434 (2017)

  74. [74]

    Rudin, W., et al.: Principles of mathematical analysis, vol. 3. McGraw-hill New York (1976)

  75. [75]

    Schwartz, J.T.: Nonlinear functional analysis, vol. 4. CRC Press (1969)

  76. [76]

    Springer Science & Business Media (2013)

    Shub, M.: Global stability of dynamical systems. Springer Science & Business Media (2013)

  77. [77]

    Bulletin of the American mathematical Society 73(6), 747–817 (1967)

    Smale, S.: Differentiable dynamical systems. Bulletin of the American mathematical Society 73(6), 747–817 (1967)

  78. [78]

    Advances in neural information processing systems 27 (2014)

    Su, W., Boyd, S., Candes, E.: A differential equation for modeling nesterov’s accelerated gradient method: theory and insights. Advances in neural information processing systems 27 (2014)

  79. [79]

    In: International conference on machine learning, pp

    Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International conference on machine learning, pp. 1139–1147. PMLR (2013)

  80. [80]

    Wiley-Interscience (1989)

    Tabor, M.: Chaos and integrability in nonlinear dynamics: an introduction. Wiley-Interscience (1989)

Showing first 80 references.