pith. machine review for the scientific record.

arxiv: 2605.02127 · v1 · submitted 2026-05-04 · 🧮 math.OC · cs.LG

Recognition: 3 Lean theorem links

A Parameter-Free First-Order Algorithm for Non-Convex Optimization with Õ(ε^{-5/3}) Global Rate

Coralia Cartis, Sadok Jerad, Sichao Xiong

Pith reviewed 2026-05-08 18:50 UTC · model grok-4.3

classification 🧮 math.OC cs.LG
keywords non-convex optimization · first-order methods · accelerated gradient descent · parameter-free algorithms · convergence rates · smooth functions · backtracking line search

The pith

PF-AGD is the first parameter-free deterministic accelerated first-order method to reach the O(ε^{-5/3} log(1/ε)) rate on smooth non-convex problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PF-AGD as a first-order algorithm for minimizing non-convex functions that are sufficiently smooth. It achieves the best-known oracle complexity bound among first-order methods without requiring advance knowledge of the smoothness constant. The method adapts via backtracking line searches and gradient-based restarts to estimate local curvature on the fly. A sympathetic reader would care because prior algorithms with this rate needed problem-specific parameters tuned in advance, limiting their use on real instances where those constants are unknown or vary. If the claim holds, it supplies a deterministic, implementable procedure that closes the gap between theoretical optimality and practical adaptability for non-convex first-order optimization.

Core claim

PF-AGD is the first parameter-free, deterministic, accelerated first-order method to achieve O(ε^{-5/3} log(1/ε)) oracle complexity when minimizing sufficiently smooth non-convex functions; this matches the best-known bound for any first-order method on such objectives while removing the requirement for a priori smoothness constants through an adaptive backtracking scheme and a gradient-based restart mechanism.

What carries the argument

The PF-AGD algorithm, equipped with an adaptive backtracking line search that estimates local smoothness constants, together with a gradient-norm restart trigger that preserves the global rate.
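The mechanism can be sketched generically. The following is a minimal illustration of a doubling backtracking line search for an unknown smoothness constant, not the paper's PF-AGD iteration (which additionally carries momentum and the restart rule); the names `L_est`, `grow`, and the default constants are illustrative choices, not taken from the paper.

```python
import numpy as np

def backtracking_step(f, grad_f, x, L_est, grow=2.0):
    """One gradient step with a doubling backtracking test for an unknown
    local smoothness constant. Illustrative only: the paper's PF-AGD
    iteration also carries momentum and a gradient-norm restart rule."""
    g = grad_f(x)
    fx = f(x)
    while True:
        x_new = x - g / L_est
        # Sufficient-decrease test implied by L-smoothness:
        # f(x - g/L) <= f(x) - ||g||^2 / (2L) whenever L_est dominates
        # the local Lipschitz constant of the gradient.
        if f(x_new) <= fx - np.dot(g, g) / (2.0 * L_est):
            return x_new, L_est
        L_est *= grow  # failed trial: double the estimate and retry

def adaptive_gradient_descent(f, grad_f, x0, L0=1.0, eps=1e-6, max_iter=10_000):
    """Plain (non-accelerated) descent built from the step above."""
    x, L_est = np.asarray(x0, dtype=float), L0
    for _ in range(max_iter):
        if np.linalg.norm(grad_f(x)) <= eps:
            break
        x, L_est = backtracking_step(f, grad_f, x, L_est)
    return x
```

On a quadratic the doubling loop grows the estimate until it dominates the true curvature, after which every trial succeeds; the accounting of those failed trials is exactly what the referee report below presses on.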

If this is right

  • The method becomes a viable deterministic alternative to parameter-dependent accelerated schemes for smooth non-convex minimization.
  • It empirically outperforms the practical version of AGD-Until-Guilty and other parameter-free first-order variants on test problems.
  • The same adaptive mechanism can be used on problems whose smoothness constant changes locally without breaking the global complexity guarantee.
  • It supplies a concrete first-order procedure whose rate matches the current theoretical optimum while remaining fully implementable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the adaptation logic generalizes, similar backtracking-restart ideas could be applied to stochastic or composite non-convex problems to obtain parameter-free rates.
  • The construction suggests that deterministic first-order methods can achieve optimal rates on non-convex landscapes once local curvature estimation is folded into the iteration.
  • In practice the algorithm offers a drop-in replacement for nonlinear conjugate-gradient methods when theoretical guarantees are desired alongside observed speed.

Load-bearing premise

The backtracking and restart rules correctly detect intervals where the function obeys a local smoothness bound without inflating the overall iteration count.

What would settle it

A smooth non-convex function on which PF-AGD requires more than O(ε^{-5/3} log(1/ε)) gradient evaluations to reach an ε-stationary point, either in theory or on a standard benchmark.

Figures

Figures reproduced from arXiv: 2605.02127 by Coralia Cartis, Sadok Jerad, Sichao Xiong.

Figure 1. Theoretical and practical AGD-UNTIL-GUILTY vs. PF-AGD with ε = 10^{-4}.
Figure 2. PF-AGD vs. state-of-the-art first-order methods. Top row: convergence on robust least squares, an MNIST neural network, and the Dixon–Price function. Bottom row: robustness to condition number on ill-conditioned quadratics under various spectral distributions. All experiments are run on an iMac M1 CPU. As all methods are first-order and dominated by gradient evaluations, oracle calls serve as a proxy for wall-clock…
Figure 3. f(x) = ½x² + log(cosh(x)).
Figure 4. Performance of neural network training on MNIST (Section 4).
Figure 5. Performance on regularized quadratics with indefinite Hessians.
Figure 6. Empirical CDFs of oracle calls for the Qing (Appendix B.2.2), Dixon–Price (Section 4), …
Figure 7. At d = 100, the performance of AGD-UNTIL-GUILTY and Armijo SD remains closely aligned, consistent with the aggregate CDF results. At d = 4, Armijo SD marginally outperforms the accelerated variants, suggesting that the overhead of acceleration may be counterproductive in low-dimensional regimes. As dimensionality increases, PF-AGD recovers it…
Figure 8. Performance on the Rosenbrock function.
Figure 9. Ackley function: the dots correspond to negative curvature detection and the diamonds …
Figure 10. Performance on the Powell function.
Figure 11. Performance on the SCOSINE function.
Original abstract

We introduce PF-AGD, the first parameter-free, deterministic, accelerated first-order method to achieve $O(\epsilon^{-5/3}\log(1/\epsilon))$ oracle complexity bound when minimizing sufficiently smooth, non-convex functions; this is the best-known bound for first-order methods on smooth non-convex objectives. Unlike existing methods possessing this rate that require a priori knowledge of smoothness constants, we use an adaptive backtracking scheme and a gradient-based restart mechanism to estimate local curvature. This yields a practical algorithm that matches best-known theoretical rates. Empirically, PF-AGD outperforms the practical variant of AGD-Until-Guilty (Carmon et al., 2017), as well as other parameter-free variants, and is a viable alternative to nonlinear conjugate gradient methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PF-AGD, a deterministic parameter-free accelerated first-order algorithm for non-convex optimization. It uses adaptive backtracking line search and gradient-based restarts to estimate local smoothness constants on the fly, achieving an oracle complexity of O(ε^{-5/3} log(1/ε)) for sufficiently smooth functions. This is claimed to be the first such method attaining the best-known rate without a priori knowledge of the smoothness parameter L, with supporting empirical comparisons to AGD-Until-Guilty and other baselines.

Significance. If the analysis is complete, the result would be significant for non-convex first-order optimization: it removes a practical barrier (unknown L) while matching the optimal known rate among deterministic methods. The adaptive mechanisms and empirical performance make it relevant for applications where smoothness varies locally. Strengths include the explicit parameter-free construction and the focus on total oracle count including failed trials.

major comments (1)
  1. [Complexity analysis (proof of main theorem)] The central complexity claim depends on the adaptive backtracking and restart scheme keeping the total number of gradient evaluations (including failed trials) inside the Õ(ε^{-5/3}) bound. No explicit worst-case lemma bounds the aggregate cost of backtracks when local curvature changes on a scale finer than the restart frequency; the argument relies on the scheme “correctly identifying intervals where local smoothness holds” without a first-principles derivation that this cost remains O(log(1/ε)) per phase on average. This is load-bearing for the global rate.
minor comments (2)
  1. [Algorithm description] Notation for the local smoothness estimate and restart threshold should be introduced with a clear table or definition list to avoid ambiguity when reading the algorithm pseudocode.
  2. [Numerical experiments] The empirical section would benefit from reporting the total number of gradient evaluations (not just iterations) to directly illustrate that the adaptive overhead remains controlled in practice.
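The second minor point can be made concrete. A hedged sketch of the kind of instrumentation that would report total oracle cost, failed trials included; the class name and interface are illustrative, not code from the paper:

```python
class OracleCounter:
    """Wraps an objective/gradient pair and counts every evaluation,
    including those spent on failed backtracking trials. Illustrative
    instrumentation, not code from the paper."""

    def __init__(self, f, grad_f):
        self._f, self._g = f, grad_f
        self.f_calls = 0
        self.g_calls = 0

    def f(self, x):
        self.f_calls += 1
        return self._f(x)

    def grad(self, x):
        self.g_calls += 1
        return self._g(x)

    @property
    def total(self):
        """Total oracle calls, the quantity the complexity bound controls."""
        return self.f_calls + self.g_calls
```

Passing `counter.f` and `counter.grad` to any of the compared methods would make the adaptive overhead directly visible in the experiment tables.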

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive report. The positive assessment of the significance of PF-AGD is appreciated. We respond to the major comment as follows and will incorporate revisions to address the concern raised.

read point-by-point responses
  1. Referee: The central complexity claim depends on the adaptive backtracking and restart scheme keeping the total number of gradient evaluations (including failed trials) inside the Õ(ε^{-5/3}) bound. No explicit worst-case lemma bounds the aggregate cost of backtracks when local curvature changes on a scale finer than the restart frequency; the argument relies on the scheme “correctly identifying intervals where local smoothness holds” without a first-principles derivation that this cost remains O(log(1/ε)) per phase on average. This is load-bearing for the global rate.

    Authors: We thank the referee for pointing out this potential gap in the presentation of the analysis. While the proof of the main theorem (Theorem 4.1) does account for the total number of gradient evaluations by summing over phases and bounding the per-phase cost, we acknowledge that the bound on backtracking steps when curvature changes rapidly could be stated more explicitly. In the revised manuscript, we will introduce a new supporting lemma that provides a first-principles worst-case analysis of the backtracking line search combined with the restart mechanism. Specifically, the lemma will show that regardless of how finely the local smoothness varies, the number of failed trials per successful step is bounded by a constant (due to the doubling strategy), and the restart frequency ensures that the aggregate extra cost per phase is O(log(1/ε)). This will confirm that the overall oracle complexity remains Õ(ε^{-5/3} log(1/ε)). We believe this addition will fully address the concern without changing the main result. revision: yes
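The doubling argument in the response is the standard accounting for backtracking searches. Under the assumptions that the estimate only ever grows by a factor of two and that the test must succeed once the estimate dominates the largest local Lipschitz constant L_max encountered (this is a generic sketch, not the paper's lemma):

```latex
% Each failed trial doubles the estimate $L_k$, and the test must pass once
% $L_k \ge L_{\max}$, so if the estimate is never decreased the failures telescope:
\#\{\text{failed trials}\} \;\le\; \log_2\!\left(L_{\max}/L_0\right).
% If instead the estimate is halved after each successful step, every success
% can force at most one compensating failure, giving in total
\#\{\text{failed trials}\} \;\le\; \#\{\text{successful steps}\} + \log_2\!\left(L_{\max}/L_0\right).
```

Either bound keeps the overhead additive in the step count, which is what the promised lemma would need to establish for PF-AGD's specific restart schedule.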

Circularity Check

0 steps flagged

No circularity: complexity bound derived from standard non-convex analysis on adaptive scheme

full rationale

The paper constructs PF-AGD via explicit backtracking line search and gradient restarts, then bounds total oracle calls (including failed trials) by summing per-phase costs under local L-smoothness intervals. This bound is obtained from first-principles telescoping and restart lemmas that do not presuppose the final rate; the O(ε^{-5/3} log(1/ε)) result follows directly from the number of successful steps needed to drive the gradient norm below ε, with backtrack overhead controlled by a fixed geometric factor independent of the target ε. No step equates a fitted quantity to a prediction, invokes a self-citation as the sole justification for uniqueness, or renames an input as an output. The derivation remains self-contained against external benchmarks such as the Carmon et al. (2017) restart framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The result rests on standard smoothness assumptions for non-convex functions and the correctness of the adaptive estimation scheme; no free parameters are introduced because the method is parameter-free by design.

axioms (1)
  • domain assumption The objective function is continuously differentiable and locally smooth with gradient Lipschitz constant that can be estimated via backtracking line search.
    Invoked to justify the adaptive step-size selection and the global complexity bound.
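For reference, the axiom enters the analysis through the textbook descent lemma; the statement below is standard (e.g. Nesterov's lectures), not a reproduction of the paper's own lemma:

```latex
% If $\nabla f$ is $L$-Lipschitz on the segment between $x$ and $y$, then
f(y) \;\le\; f(x) + \langle \nabla f(x),\, y - x \rangle + \tfrac{L}{2}\,\|y - x\|^2 .
% Taking $y = x - \tfrac{1}{L}\nabla f(x)$ yields the sufficient-decrease test
% that a backtracking search can verify without knowing $L$ in advance:
f\!\bigl(x - \tfrac{1}{L}\nabla f(x)\bigr) \;\le\; f(x) - \tfrac{1}{2L}\,\|\nabla f(x)\|^2 .
```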

pith-pipeline@v0.9.0 · 5450 in / 1279 out tokens · 45845 ms · 2026-05-08T18:50:15.062552+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

42 extracted references · 2 canonical work pages · 1 internal anchor

  1. Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. "Convex until proven guilty": Dimension-free acceleration of gradient descent on non-convex functions. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 654–663. PMLR, 2017.
  2. Yurii Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.
  3. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. SIAM, 2022.
  4. Alexandre d'Aspremont, Damien Scieur, and Adrien Taylor. Acceleration Methods, volume 5. Foundations and Trends in Optimization, 2021.
  5. Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, 2004. ISBN 1402075537.
  6. Naoki Marumo and Akiko Takeda. Parameter-free accelerated gradient descent for nonconvex minimization. SIAM Journal on Optimization, 34(2):2093–2120, 2024.
  7. Ruichen Jiang, Aryan Mokhtari, and Francisco Patitucci. Improved complexity for smooth nonconvex optimization: A two-level online learning approach with quasi-Newton methods. In Proceedings of the 57th Annual ACM Symposium on Theory of Computing, STOC '25, pages 2225–2236, New York, NY, USA, 2025. Association for Computing Machinery.
  8. Naoki Marumo. Parameter-free accelerated quasi-Newton method for nonconvex optimization. arXiv:2512.09439, 2025.
  9. Nikita Doikov and Geovani Nunes Grapiglia. First and zeroth-order implementations of the regularized Newton method with lazy approximated Hessians. Journal of Scientific Computing, 103(1):32, 2025.
  10. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the oracle complexity of first-order and derivative-free algorithms for smooth nonconvex minimization. SIAM Journal on Optimization, 22(1):66–86, 2012.
  11. Rémi Chan-Renous-Legoubin and Clément W. Royer. A nonlinear conjugate gradient method with complexity guarantees and its application to nonconvex regression. EURO Journal on Computational Optimization, 10:100044, 2022.
  12. Joao V. Cavalcanti, Laurent Lessard, and Ashia C. Wilson. Adaptive backtracking for faster optimization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=SrGP0RQbYH.
  13. E. Polak and G. Ribière. Note sur la convergence de méthodes de directions conjuguées. Revue Française d'Informatique et de Recherche Opérationnelle, 3:35–43, 1969.
  14. Larry Armijo. Minimization of functions having Lipschitz continuous first partial derivatives. Pacific Journal of Mathematics, 16:1–3, 1966.
  15. Anyong Qing. Dynamic differential evolution strategy and applications in electromagnetic inverse scattering problems. IEEE Transactions on Geoscience and Remote Sensing, 44(1):116–125, 2006.
  16. H. H. Rosenbrock. An automatic method for finding the greatest or least value of a function. The Computer Journal, 3:175–184, 1960.
  17. David H. Ackley. A Connectionist Machine for Genetic Hillclimbing. Springer US, Boston, MA, 1987. ISBN 978-1-4613-1997-9.
  18. M. J. D. Powell. An iterative method for finding stationary values of a function of several variables. The Computer Journal, 5(2):147–151, 1962.
  19. Albert E. Beaton and John W. Tukey. The fitting of power series, meaning polynomials, illustrated on band-spectroscopic data. Technometrics, 16(2):147–185, 1974.
  20. Yann LeCun, Corinna Cortes, and Christopher J. C. Burges. The MNIST database of handwritten digits, 1998.
  21. Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pages 249–256. PMLR, 2010.
  22. L. C. W. Dixon and R. C. Price. Truncated Newton method for sparse unconstrained optimization using automatic differentiation. Journal of Optimization Theory and Applications, 60(2):261–275, 1989.
  23. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations (ICLR), 2015. URL http://arxiv.org/abs/1412.6980.
  24. Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22:400–407, 1951.
  25. Courtney Paquette and Katya Scheinberg. A stochastic line search method with expected complexity analysis. SIAM Journal on Optimization, 30(1):349–376, 2020.
  26. Albert S. Berahas, Liyuan Cao, and Katya Scheinberg. Global convergence rate analysis of a generic line search algorithm with noise. SIAM Journal on Optimization, 31(2):1489–1518, 2021.
  27. Sharan Vaswani, Aaron Mishkin, Issam Hadj Laradji, Mark W. Schmidt, Gauthier Gidel, and Simon Lacoste-Julien. Painless stochastic gradient: Interpolation, line-search, and convergence rates. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  28. Saeed Ghadimi and Guanghui Lan. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming, 156(1):59–99, 2016.
  29. Yair Carmon, John C. Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1):71–120, 2020.
  30. Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. In Conference on Learning Theory, pages 1042–1085. PMLR, 2018.
  31. Huan Li and Zhouchen Lin. Restarted nonconvex accelerated gradient descent: No more polylogarithmic factor in the O(ε^{−7/4}) complexity. Journal of Machine Learning Research, 24(157):1–37, 2023.
  32. Naoki Marumo and Akiko Takeda. Universal heavy-ball method for nonconvex optimization under Hölder continuous Hessians. Mathematical Programming, 212(1):147–175, 2025.
  33. Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. Advances in Neural Information Processing Systems, 31, 2018.
  34. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.
  35. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011.
  36. Coralia Cartis, Nicholas I. M. Gould, and Philippe L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.
  37. Ernesto G. Birgin, J. L. Gardenghi, José Mario Martínez, Sandra Augusta Santos, and Philippe L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models. Mathematical Programming, 163(1):359–368, 2017.
  38. Clément W. Royer, Michael O'Neill, and Stephen J. Wright. A Newton-CG algorithm with complexity guarantees for smooth unconstrained optimization. Mathematical Programming, 180(1):451–488, 2020.
  39. Dong C. Liu and Jorge Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989.
  40. Sahar Karimi and Stephen A. Vavasis. Nonlinear conjugate gradient for smooth convex functions. Mathematical Programming Computation, 16(2):229–254, 2024.
  41. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28 of Proceedings of Machine Learning Research, pages 1139–1147. PMLR, 2013.