pith. machine review for the scientific record.

arxiv: 2605.02127 · v1 · submitted 2026-05-04 · 🧮 math.OC · cs.LG

Recognition: 3 Lean theorem links

A Parameter-Free First-Order Algorithm for Non-Convex Optimization with Õ(ε^{-5/3}) Global Rate

Coralia Cartis, Sadok Jerad, Sichao Xiong

Pith reviewed 2026-05-08 18:50 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords non-convex optimization · first-order methods · accelerated gradient descent · parameter-free algorithms · convergence rates · smooth functions · backtracking line search

The pith

PF-AGD is the first parameter-free deterministic accelerated first-order method to reach the O(ε^{-5/3} log(1/ε)) rate on smooth non-convex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PF-AGD as a first-order algorithm for minimizing non-convex functions that are sufficiently smooth. It achieves the best-known oracle complexity bound among first-order methods without requiring advance knowledge of the smoothness constant. The method adapts via backtracking line searches and gradient-based restarts to estimate local curvature on the fly. A sympathetic reader would care because prior algorithms with this rate needed problem-specific parameters tuned in advance, limiting their use on real instances where those constants are unknown or vary. If the claim holds, it supplies a deterministic, implementable procedure that closes the gap between theoretical optimality and practical adaptability for non-convex first-order optimization.

Core claim

PF-AGD is the first parameter-free, deterministic, accelerated first-order method to achieve O(ε^{-5/3} log(1/ε)) oracle complexity when minimizing sufficiently smooth non-convex functions; this matches the best-known bound for any first-order method on such objectives while removing the requirement for a priori smoothness constants through an adaptive backtracking scheme and a gradient-based restart mechanism.

What carries the argument

The PF-AGD algorithm, equipped with an adaptive backtracking line search that estimates local smoothness constants, together with a gradient-norm restart trigger that preserves the global rate.
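The mechanism can be sketched generically. The following is a minimal illustration of a doubling backtracking line search for an unknown smoothness constant, not the paper's PF-AGD iteration (which additionally carries momentum and the restart rule); the names `L_est`, `grow`, and the default constants are illustrative choices, not taken from the paper.

```python
import numpy as np

def backtracking_step(f, grad_f, x, L_est, grow=2.0):
    """One gradient step with a doubling backtracking test for an unknown
    local smoothness constant. Illustrative only: the paper's PF-AGD
    iteration also carries momentum and a gradient-norm restart rule."""
    g = grad_f(x)
    fx = f(x)
    while True:
        x_new = x - g / L_est
        # Sufficient-decrease test implied by L-smoothness:
        # f(x - g/L) <= f(x) - ||g||^2 / (2L) whenever L_est dominates
        # the local Lipschitz constant of the gradient.
        if f(x_new) <= fx - np.dot(g, g) / (2.0 * L_est):
            return x_new, L_est
        L_est *= grow  # failed trial: double the estimate and retry

def adaptive_gradient_descent(f, grad_f, x0, L0=1.0, eps=1e-6, max_iter=10_000):
    """Plain (non-accelerated) descent built from the step above."""
    x, L_est = np.asarray(x0, dtype=float), L0
    for _ in range(max_iter):
        if np.linalg.norm(grad_f(x)) <= eps:
            break
        x, L_est = backtracking_step(f, grad_f, x, L_est)
    return x
```

On a quadratic the doubling loop grows the estimate until it dominates the true curvature, after which every trial succeeds; the accounting of those failed trials is exactly what the referee report below presses on.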

If this is right

  • The method becomes a viable deterministic alternative to parameter-dependent accelerated schemes for smooth non-convex minimization.
  • It empirically outperforms the practical version of AGD-Until-Guilty and other parameter-free first-order variants on test problems.
  • The same adaptive mechanism can be used on problems whose smoothness constant changes locally without breaking the global complexity guarantee.
  • It supplies a concrete first-order procedure whose rate matches the current theoretical optimum while remaining fully implementable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the adaptation logic generalizes, similar backtracking-restart ideas could be applied to stochastic or composite non-convex problems to obtain parameter-free rates.
  • The construction suggests that deterministic first-order methods can achieve optimal rates on non-convex landscapes once local curvature estimation is folded into the iteration.
  • In practice the algorithm offers a drop-in replacement for nonlinear conjugate-gradient methods when theoretical guarantees are desired alongside observed speed.

Load-bearing premise

The backtracking and restart rules correctly detect intervals where the function obeys a local smoothness bound without inflating the overall iteration count.

What would settle it

A smooth non-convex function on which PF-AGD requires more than O(ε^{-5/3} log(1/ε)) gradient evaluations to reach an ε-stationary point, either in theory or on a standard benchmark.

Figures

Figures reproduced from arXiv: 2605.02127 by Coralia Cartis, Sadok Jerad, Sichao Xiong.

Figure 1. Theoretical and practical AGD-UNTIL-GUILTY vs. PF-AGD with ε = 10^{-4}.
Figure 2. PF-AGD vs. state-of-the-art first-order methods. Top row: convergence on robust least squares, an MNIST neural network, and the Dixon–Price function. Bottom row: robustness to condition number on ill-conditioned quadratics under various spectral distributions. All experiments are run on an iMac M1 CPU. As all methods are first-order and dominated by gradient evaluations, oracle calls serve as a proxy for wall-clock…
Figure 3. f(x) = ½x² + log(cosh(x)).
Figure 4. Performance of neural network training on MNIST (Section 4).
Figure 5. Performance on regularized quadratics with indefinite Hessians.
Figure 6. Empirical CDFs of oracle calls for the Qing (Appendix B.2.2), Dixon–Price (Section 4), …
Figure 7. At d = 100, the performance of AGD-UNTIL-GUILTY and Armijo SD remains closely aligned, consistent with the aggregate CDF results. At d = 4, Armijo SD marginally outperforms the accelerated variants, suggesting that the overhead of acceleration may be counterproductive in low-dimensional regimes. As dimensionality increases, PF-AGD recovers it…
Figure 8. Performance on the Rosenbrock function.
Figure 9. Ackley function: the dots correspond to negative curvature detection and the diamonds …
Figure 10. Performance on the Powell function.
Figure 11. Performance on the SCOSINE function.
Original abstract

We introduce PF-AGD, the first parameter-free, deterministic, accelerated first-order method to achieve $O(\epsilon^{-5/3}\log(1/\epsilon))$ oracle complexity bound when minimizing sufficiently smooth, non-convex functions; this is the best-known bound for first-order methods on smooth non-convex objectives. Unlike existing methods possessing this rate that require a priori knowledge of smoothness constants, we use an adaptive backtracking scheme and a gradient-based restart mechanism to estimate local curvature. This yields a practical algorithm that matches best-known theoretical rates. Empirically, PF-AGD outperforms the practical variant of AGD-Until-Guilty (Carmon et al., 2017), as well as other parameter-free variants, and is a viable alternative to nonlinear conjugate gradient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PF-AGD, a deterministic parameter-free accelerated first-order algorithm for non-convex optimization. It uses adaptive backtracking line search and gradient-based restarts to estimate local smoothness constants on the fly, achieving an oracle complexity of O(ε^{-5/3} log(1/ε)) for sufficiently smooth functions. This is claimed to be the first such method attaining the best-known rate without a priori knowledge of the smoothness parameter L, with supporting empirical comparisons to AGD-Until-Guilty and other baselines.

Significance. If the analysis is complete, the result would be significant for non-convex first-order optimization: it removes a practical barrier (unknown L) while matching the optimal known rate among deterministic methods. The adaptive mechanisms and empirical performance make it relevant for applications where smoothness varies locally. Strengths include the explicit parameter-free construction and the focus on total oracle count including failed trials.

major comments (1)
  1. [Complexity analysis (proof of main theorem)] The central complexity claim depends on the adaptive backtracking and restart scheme keeping the total number of gradient evaluations (including failed trials) inside the Õ(ε^{-5/3}) bound. No explicit worst-case lemma bounds the aggregate cost of backtracks when local curvature changes on a scale finer than the restart frequency; the argument relies on the scheme “correctly identifying intervals where local smoothness holds” without a first-principles derivation that this cost remains O(log(1/ε)) per phase on average. This is load-bearing for the global rate.
minor comments (2)
  1. [Algorithm description] Notation for the local smoothness estimate and restart threshold should be introduced with a clear table or definition list to avoid ambiguity when reading the algorithm pseudocode.
  2. [Numerical experiments] The empirical section would benefit from reporting the total number of gradient evaluations (not just iterations) to directly illustrate that the adaptive overhead remains controlled in practice.
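The second minor point can be made concrete. A hedged sketch of the kind of instrumentation that would report total oracle cost, failed trials included; the class name and interface are illustrative, not code from the paper:

```python
class OracleCounter:
    """Wraps an objective/gradient pair and counts every evaluation,
    including those spent on failed backtracking trials. Illustrative
    instrumentation, not code from the paper."""

    def __init__(self, f, grad_f):
        self._f, self._g = f, grad_f
        self.f_calls = 0
        self.g_calls = 0

    def f(self, x):
        self.f_calls += 1
        return self._f(x)

    def grad(self, x):
        self.g_calls += 1
        return self._g(x)

    @property
    def total(self):
        """Total oracle calls, the quantity the complexity bound controls."""
        return self.f_calls + self.g_calls
```

Passing `counter.f` and `counter.grad` to any of the compared methods would make the adaptive overhead directly visible in the experiment tables.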

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive report. The positive assessment of the significance of PF-AGD is appreciated. We respond to the major comment as follows and will incorporate revisions to address the concern raised.

read point-by-point responses
  1. Referee: The central complexity claim depends on the adaptive backtracking and restart scheme keeping the total number of gradient evaluations (including failed trials) inside the Õ(ε^{-5/3}) bound. No explicit worst-case lemma bounds the aggregate cost of backtracks when local curvature changes on a scale finer than the restart frequency; the argument relies on the scheme “correctly identifying intervals where local smoothness holds” without a first-principles derivation that this cost remains O(log(1/ε)) per phase on average. This is load-bearing for the global rate.

    Authors: We thank the referee for pointing out this potential gap in the presentation of the analysis. While the proof of the main theorem (Theorem 4.1) does account for the total number of gradient evaluations by summing over phases and bounding the per-phase cost, we acknowledge that the bound on backtracking steps when curvature changes rapidly could be stated more explicitly. In the revised manuscript, we will introduce a new supporting lemma that provides a first-principles worst-case analysis of the backtracking line search combined with the restart mechanism. Specifically, the lemma will show that regardless of how finely the local smoothness varies, the number of failed trials per successful step is bounded by a constant (due to the doubling strategy), and the restart frequency ensures that the aggregate extra cost per phase is O(log(1/ε)). This will confirm that the overall oracle complexity remains Õ(ε^{-5/3} log(1/ε)). We believe this addition will fully address the concern without changing the main result. revision: yes
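The doubling argument in the response is the standard accounting for backtracking searches. Under the assumptions that the estimate only ever grows by a factor of two and that the test must succeed once the estimate dominates the largest local Lipschitz constant L_max encountered (this is a generic sketch, not the paper's lemma):

```latex
% Each failed trial doubles the estimate $L_k$, and the test must pass once
% $L_k \ge L_{\max}$, so if the estimate is never decreased the failures telescope:
\#\{\text{failed trials}\} \;\le\; \log_2\!\left(L_{\max}/L_0\right).
% If instead the estimate is halved after each successful step, every success
% can force at most one compensating failure, giving in total
\#\{\text{failed trials}\} \;\le\; \#\{\text{successful steps}\} + \log_2\!\left(L_{\max}/L_0\right).
```

Either bound keeps the overhead additive in the step count, which is what the promised lemma would need to establish for PF-AGD's specific restart schedule.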

Circularity Check

0 steps flagged

No circularity: complexity bound derived from standard non-convex analysis on adaptive scheme

full rationale

The paper constructs PF-AGD via explicit backtracking line search and gradient restarts, then bounds total oracle calls (including failed trials) by summing per-phase costs under local L-smoothness intervals. This bound is obtained from first-principles telescoping and restart lemmas that do not presuppose the final rate; the O(ε^{-5/3} log(1/ε)) result follows directly from the number of successful steps needed to drive the gradient norm below ε, with backtrack overhead controlled by a fixed geometric factor independent of the target ε. No step equates a fitted quantity to a prediction, invokes a self-citation as the sole justification for uniqueness, or renames an input as an output. The derivation remains self-contained against external benchmarks such as the Carmon et al. (2017) restart framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The result rests on standard smoothness assumptions for non-convex functions and the correctness of the adaptive estimation scheme; no free parameters are introduced because the method is parameter-free by design.

axioms (1)
  • domain assumption The objective function is continuously differentiable and locally smooth with gradient Lipschitz constant that can be estimated via backtracking line search.
    Invoked to justify the adaptive step-size selection and the global complexity bound.
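For reference, the axiom enters the analysis through the textbook descent lemma; the statement below is standard (e.g. Nesterov's lectures), not a reproduction of the paper's own lemma:

```latex
% If $\nabla f$ is $L$-Lipschitz on the segment between $x$ and $y$, then
f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\,\|y - x\|^2 .
% Taking $y = x - \tfrac{1}{L}\nabla f(x)$ yields the sufficient-decrease test
% that a backtracking search can verify without knowing $L$ in advance:
f\!\bigl(x - \tfrac{1}{L}\nabla f(x)\bigr) \;\le\; f(x) - \tfrac{1}{2L}\,\|\nabla f(x)\|^2 .
```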

pith-pipeline@v0.9.0 · 5450 in / 1279 out tokens · 45845 ms · 2026-05-08T18:50:15.062552+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

42 extracted references · 2 canonical work pages · 1 internal anchor

  1. Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 654–663. PMLR, 2017.
  2. Yurii Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.
  3. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, 2022.
  4. Alexandre d'Aspremont, Damien Scieur, and Adrien Taylor. Acceleration Methods, volume 5. Foundations and Trends in Optimization, 2021.
  5. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, 2004. ISBN 1402075537.
  6. Naoki Marumo and Akiko Takeda. Parameter-free accelerated gradient descent for nonconvex minimization. SIAM Journal on Optimization, 34(2):2093–2120, 2024.
  7. Ruichen Jiang, Aryan Mokhtari, and Francisco Patitucci. Improved complexity for smooth nonconvex optimization: A two-level online learning approach with quasi-Newton methods. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC '25, pages 2225–2236, New York, NY, USA, 2025. Association for Computing Machinery.
  8. Naoki Marumo. Parameter-free accelerated quasi-Newton method for nonconvex optimization. arXiv:2512.09439, 2025.
  9. Nikita Doikov and Geovani Nunes Grapiglia. First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians. Journal of Scientific Computing, 103(1):32, 2025.
  10. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22(1):66–86, 2012.
  11. Rémi Chan-Renous-Legoubin and Clément W. Royer. A nonlinear conjugate gradient method with complexity guarantees and its application to nonconvex regression. EURO Journal on Computational Optimization, 10:100044, 2022.
  12. Joao V. Cavalcanti, Laurent Lessard, and Ashia C. Wilson. Adaptive backtracking for faster optimization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=SrGP0RQbYH.
  13. E. Polak and G. Ribière. Note sur la convergence de méthodes de directions conjuguées. Revue Française d'Informatique et de Recherche Opérationnelle, 3:35–43, 1969.
  14. Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16:1–3, 1966.
  15. Anyong Qing. Dynamic differential evolution strategy and applications in electromagnetic inverse scattering problems. IEEE Transactions on Geoscience and Remote Sensing, 44(1):116–125, 2006.
  16. H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 3:175–184, 1960.
  17. David H. Ackley. A Connectionist Machine for Genetic Hillclimbing. Springer US, Boston, MA, 1987. ISBN 978-1-4613-1997-9.
  18. M. J. D. Powell. An iterative method for finding stationary values of a function of several variables. The Computer Journal, 5(2):147–151, 1962.
  19. Albert E. Beaton and John W. Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.
  20. Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998.
  21. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256. PMLR, 2010.
  22. L. C. W. Dixon and R. C. Price. Truncated Newton method for sparse unconstrained optimization using automatic differentiation. Journal of Optimization Theory and Applications, 60(2):261–275, 1989.
  23. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015. URL http://arxiv.org/abs/1412.6980.
  24. Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
  25. Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.
  26. Albert S. Berahas, Liyuan Cao, and Katya Scheinberg. Global convergence rate analysis of a generic line search algorithm with noise. SIAM Journal on Optimization, 31(2):1489–1518, 2021.
  27. Sharan Vaswani, Aaron Mishkin, Issam Hadj Laradji, Mark W. Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  28. Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
  29. Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1):71–120, 2020.
  30. Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference on Learning Theory, pages 1042–1085. PMLR, 2018.
  31. Huan Li and Zhouchen Lin. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the O(ε^{−7/4}) complexity. Journal of Machine Learning Research, 24(157):1–37, 2023.
  32. Naoki Marumo and Akiko Takeda. Universal heavy-ball method for nonconvex optimization under Hölder continuous Hessians. Mathematical Programming, 212(1):147–175, 2025.
  33. Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. Advances in Neural Information Processing Systems, 31, 2018.
  34. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
  35. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
  36. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.
  37. Ernesto G. Birgin, J. L. Gardenghi, José Mario Martínez, Sandra Augusta Santos, and Philippe L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1):359–368, 2017.
  38. Clément W. Royer, Michael O'Neill, and Stephen J. Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization. Mathematical Programming, 180(1):451–488, 2020.
  39. Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989.
  40. Sahar Karimi and Stephen A. Vavasis. Nonlinear conjugate gradient for smooth convex functions. Mathematical Programming Computation, 16(2):229–254, 2024.
  41. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147. PMLR, 2013.