pith. sign in

arxiv: 2605.30891 · v1 · pith:7UCFVJ4Tnew · submitted 2026-05-29 · 🧮 math.OC

A General Recipe for Parameter-Free Nonconvex Optimization via Higher-Order Regularization

Pith reviewed 2026-06-28 21:49 UTC · model grok-4.3

classification 🧮 math.OC
keywords parameter-free optimizationhigher-order regularizationnonconvex optimizationcomplexity boundsgradient methodsNewton methods
0
0 comments X

The pith

Higher-order regularization enables parameter-free nonconvex optimization algorithms with optimal complexity bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors present a general framework to create optimization algorithms that require no prior knowledge of problem parameters such as smoothness constants. The key is to regularize each local model with an exponent that is strictly larger than the order of the model's approximation error. This choice makes the algorithm robust even if the regularization strength is chosen poorly, allowing direct complexity guarantees without any backtracking or line search. The framework is demonstrated on gradient descent, Newton's method, Gauss-Newton, stochastic gradient descent, and PAGE, all achieving the best-known dependence on accuracy. Readers interested in reliable optimization would care because it removes the common practical hurdle of estimating or tuning these constants.

Core claim

By using higher-order regularization where the exponent exceeds the order of the local model error, one can construct parameter-free algorithms for smooth nonconvex optimization that achieve complexity bounds with optimal or best-known dependence on the target accuracy without any knowledge of problem-dependent parameters.

What carries the argument

Higher-order regularization with an exponent strictly larger than the order of the model error, which renders the method robust to misspecification of the regularization parameter.

If this is right

  • Gradient descent and Newton's method can be implemented without knowing Lipschitz constants yet still reach optimal rates.
  • The same holds for the Gauss-Newton method, stochastic gradient descent, and the PAGE algorithm.
  • Complexity bounds hold without backtracking or acceptance tests.
  • When parameters are approximately known, the framework recovers optimal dependence on those parameters through suitable tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The design principle could be applied to additional optimization methods beyond those analyzed.
  • It may allow for simpler software implementations by eliminating the need for adaptive parameter estimation routines.
  • Extensions to problems with additional structure like constraints could be explored using similar regularization ideas.

Load-bearing premise

Choosing a regularization exponent strictly larger than the order of the local model error makes the method robust to misspecification of the regularization parameter.

What would settle it

A counterexample or numerical test where using a regularization exponent not strictly larger than the model error order causes the algorithm to require knowledge of parameters or to lose the optimal complexity bound.

read the original abstract

We develop a systematic framework for constructing parameter-free algorithms for smooth nonconvex optimization. The framework is based on higher-order regularization: each step is computed from a regularized local model whose regularization exponent exceeds the order of the model error. This design makes the resulting method robust to misspecification of the regularization parameter and yields complexity bounds without backtracking or other acceptance tests. We apply the framework to gradient descent, Newton's method, the Gauss--Newton method, stochastic gradient descent, and PAGE. Without prior knowledge of problem-dependent parameters, the resulting algorithms achieve complexity bounds with optimal or best-known dependence on the target accuracy. When the problem-dependent parameters are known up to constant factors, suitable tuning also recovers the optimal or best-known dependence on those parameters.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript develops a systematic framework for constructing parameter-free algorithms for smooth nonconvex optimization. The framework relies on higher-order regularization in which each step solves a regularized local model whose regularization exponent strictly exceeds the order of the model error. This design is claimed to render the method robust to misspecification of the regularization parameter, eliminating the need for backtracking or line-search acceptance tests. The framework is instantiated for gradient descent, Newton’s method, the Gauss–Newton method, stochastic gradient descent, and PAGE. The resulting algorithms are stated to achieve complexity bounds with optimal or best-known dependence on target accuracy without any prior knowledge of problem-dependent parameters; when such parameters are known up to constant factors, suitable tuning recovers the optimal dependence on those parameters as well.

Significance. If the claimed complexity bounds are rigorously established, the work would constitute a meaningful contribution to nonconvex optimization by supplying a single, general recipe that removes the need for problem-specific parameter tuning while preserving optimal rates. The uniform treatment of both deterministic and stochastic first- and second-order methods is a strength. The absence of any machine-checked proofs or reproducible code is noted; the assessment therefore rests entirely on the correctness of the analytic arguments, which cannot be verified from the abstract alone.

major comments (1)
  1. [Abstract] Abstract (paragraph on higher-order regularization design): the central robustness claim rests on the assertion that choosing a regularization exponent strictly larger than the order of the local model error suffices to achieve parameter-free guarantees. No derivation or counter-example is supplied in the visible text, making it impossible to confirm that this exponent choice does not implicitly re-introduce problem-dependent constants or reduce the claimed bounds to fitted quantities.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review of our manuscript. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract (paragraph on higher-order regularization design): the central robustness claim rests on the assertion that choosing a regularization exponent strictly larger than the order of the local model error suffices to achieve parameter-free guarantees. No derivation or counter-example is supplied in the visible text, making it impossible to confirm that this exponent choice does not implicitly re-introduce problem-dependent constants or reduce the claimed bounds to fitted quantities.

    Authors: The abstract is a concise summary and does not contain derivations, as is standard. The full derivation appears in Section 3 of the manuscript. Theorem 3.1 proves that selecting the regularization exponent strictly larger than the model error order yields parameter-free complexity bounds. The analysis explicitly shows that this choice ensures the regularization term dominates without reintroducing unknown constants into the final bounds; all quantities are tracked and shown to depend only on the target accuracy. Remark 3.2 supplies the requested counter-example, demonstrating that equality of exponents causes the bounds to depend on unknown problem parameters. These results are derived rigorously rather than fitted. revision: no

Circularity Check

0 steps flagged

No significant circularity

full rationale

The abstract and provided text describe a regularization framework where the exponent is chosen strictly larger than the model error order to ensure robustness and parameter-free complexity bounds for several algorithms. No load-bearing step is shown to reduce by construction to a fitted input, self-definition, or self-citation chain; the claimed bounds are presented as consequences of the exponent choice and standard analysis rather than tautological renamings or forced predictions. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The abstract relies on the domain assumption of smooth nonconvex problems and the design choice that the regularization exponent exceeds the model error order; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption The objective functions are smooth and nonconvex.
    Stated in the abstract as the setting for the framework.
  • ad hoc to paper A regularization exponent strictly larger than the order of the model error suffices to achieve robustness to parameter misspecification.
    Central design choice invoked to obtain parameter-free guarantees.

pith-pipeline@v0.9.1-grok · 5650 in / 1262 out tokens · 22971 ms · 2026-06-28T21:49:33.449366+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

69 extracted references · 23 canonical work pages

  1. [1]

    Allen-Zhu and E

    Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization. In M. F. Balcan and K. Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 699–707, New York, New York, USA, 20–22 Jun 2016. PMLR. URL https://proceedings.mlr. press/v48/alle...

  2. [2]

    Arjevani, Y

    Y. Arjevani, Y. Carmon, J. C. Duchi, D. J. Foster, N. Srebro, and B. Woodworth. Lower bounds for non-convex stochastic optimization.Mathematical Programming, 199(1):165–214,

  3. [3]

    URLhttps://doi.org/10.1007/s10107-022-01822-7. 22

  4. [4]

    Attia and T

    A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high prob- ability to unknown parameters, unbounded gradients and affine variance. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceed- ings of Machine Learnin...

  5. [5]

    Bellavia and B

    S. Bellavia and B. Morini. Strong local convergence properties of adaptive regularized methods for nonlinear least squares.IMA Journal of Numerical Analysis, 35(2):947–968,

  6. [6]

    URLhttps://doi.org/10.1093/imanum/dru021

  7. [7]

    Bellavia, S

    S. Bellavia, S. Gratton, and E. Riccietti. A Levenberg–Marquardt method for large nonlinear least-squares problems with dynamic accuracy in functions and gradients.Numerische Mathematik, 140(3):791–825, 2018. URL https://doi.org/10.1007/s00211-018-0977-z

  8. [8]

    E. H. Bergou, Y. Diouane, and V. Kungurtsev. Convergence and complexity analy- sis of a Levenberg–Marquardt algorithm for inverse problems.Journal of Optimiza- tion Theory and Applications, 185(3):927–944, 2020. URL https://doi.org/10.1007/ s10957-020-01666-1

  9. [9]

    E. G. Birgin, J. L. Gardenghi, J. M. Mart´ ınez, S. A. Santos, and P. L. Toint. Worst-case evaluation complexity for unconstrained nonlinear optimization using high-order regularized models.Mathematical Programming, 163(1):359–368, 2017. URL https://doi.org/10. 1007/s10107-016-1065-8

  10. [10]

    Carmon, J

    Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I.Mathematical Programming, 184(1):71–120, 2020. URL https://doi.org/10. 1007/s10107-019-01406-y

  11. [11]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part I: Motivation, convergence and numerical results. Mathematical Programming, 127(2):245–295, 2011. URL https://doi.org/10.1007/ s10107-009-0286-5

  12. [12]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: Worst-case function- and derivative-evaluation complexity.Mathematical Programming, 130(2):295–319, 2011. URL https://doi.org/10. 1007/s10107-009-0337-y

  13. [13]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint. Universal regularization methods: Varying the power, the smoothness and the accuracy.SIAM Journal on Optimization, 29(1):595–615,

  14. [14]

    URLhttps://doi.org/10.1137/16M1106316

  15. [15]

    Cartis, N

    C. Cartis, N. I. M. Gould, and P. L. Toint.Evaluation Complexity of Algorithms for Nonconvex Optimization: Theory, Computation and Perspectives. Society for Industrial and Applied Mathematics, Philadelphia, PA, 2022. URL https://doi.org/10.1137/1. 9781611976991

  16. [16]

    Z. Chen, Y. Zhou, Y. Liang, and Z. Lu. Generalized-smooth nonconvex optimization is as efficient as smooth nonconvex optimization. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors,Proceedings of the 40th International Conference on Machine Learning, volume 202 ofProceedings of Machine Learning Research, pages 5396–542...

  17. [17]

    A. R. Conn, N. I. M. Gould, and P. L. Toint.Trust Region Methods. Society for Industrial and Applied Mathematics, 2000. URLhttps://doi.org/10.1137/1.9780898719857

  18. [18]

    F. E. Curtis and Q. Wang. Worst-case complexity of TRACE with inexact subproblem solutions for nonconvex smooth optimization.SIAM Journal on Optimization, 33(3): 2191–2221, 2023. URLhttps://doi.org/10.1137/22M1492428

  19. [19]

    F. E. Curtis, D. P. Robinson, and M. Samadi. An inexact regularized Newton framework with a worst-case iteration complexity of O(ε−3/2) for nonconvex optimization.IMA Journal of Numerical Analysis, 39(3):1296–1327, 2019. URL https://doi.org/10.1093/imanum/ dry022

  20. [20]

    Cutkosky and F

    A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex SGD. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e-Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32. Curran Asso- ciates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/ file/b8002139cdde66b87638f...

  21. [21]

    J. Fan. A modified Levenberg–Marquardt algorithm for singular system of nonlinear equations.Journal of Computational Mathematics, 21(5):625–636, 2003. ISSN 02549409, 19917139. URLhttp://www.jstor.org/stable/43693105

  22. [22]

    C. Fang, C. J. Li, Z. Lin, and T. Zhang. SPIDER: Near-optimal non-convex opti- mization via stochastic path-integrated differential estimator. In S. Bengio, H. Wal- lach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Ad- vances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings...

  23. [23]

    M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. In P.-L. Loh and M. Raginsky, editors,Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 ofProceedings of Machine Learning Research, pages 313–355. PMLR, 02–05 Ju...

  24. [24]

    Ghadimi and G

    S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013. URL https://doi. org/10.1137/120880811

  25. [25]

    N. I. M. Gould and V. Simoncini. Error estimates for iterative algorithms for minimizing regularized quadratic subproblems.Optimization Methods and Software, 35(2):304–328,

  26. [26]

    URLhttps://doi.org/10.1080/10556788.2019.1670177

  27. [27]

    N. I. M. Gould, D. P. Robinson, and H. S. Thorne. On solving trust-region and other regularised subproblems in optimization.Mathematical Programming Computation, 2(1): 21–57, 2010. URLhttps://doi.org/10.1007/s12532-010-0011-7

  28. [28]

    G. N. Grapiglia and Y. Nesterov. Regularized Newton methods for minimizing functions with H¨ older continuous Hessians.SIAM Journal on Optimization, 27(1):478–506, 2017. URLhttps://doi.org/10.1137/16M1087801

  29. [29]

    Gratton, S

    S. Gratton, S. Jerad, and P. L. Toint. Convergence properties of an objective-function-free op- timization regularization algorithm, including an O(ε−3/2) complexity bound.SIAM Journal on Optimization, 33(3):1621–1646, 2023. URLhttps://doi.org/10.1137/22M1499522. 24

  30. [30]

    Gratton, S

    S. Gratton, S. Jerad, and P. L. Toint. Complexity of a class of first-order objective-function- free optimization algorithms.Optimization Methods and Software, 41(2):478–508, 2026. URLhttps://doi.org/10.1080/10556788.2023.2296431

  31. [31]

    Hazan, K

    E. Hazan, K. Levy, and S. Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, edi- tors,Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper_files/paper/2015/file/ 934815ad542a4a7c5e8a2dfa04fe...

  32. [32]

    H¨ ubler, I

    F. H¨ ubler, I. Fatkhullin, and N. He. From gradient clipping to normalization for heavy tailed SGD. In Y. Li, S. Mandt, S. Agrawal, and E. Khan, editors,Proceedings of The 28th International Conference on Artificial Intelligence and Statistics, volume 258 ofProceedings of Machine Learning Research, pages 2413–2421. PMLR, 03–05 May 2025. URL https: //proc...

  33. [33]

    Jiang, S

    W. Jiang, S. Yang, Y. Wang, and L. Zhang. Adaptive variance reduction for stochas- tic optimization under weaker assumptions. In A. Globerson, L. Mackey, D. Bel- grave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 22047–22080. Curran Associates, Inc., 2024. URL https://proceedings...

  34. [34]

    Kavis, S

    A. Kavis, S. Skoulakis, K. Antonakopoulos, L. T. Dadi, and V. Cevher. Adaptive stochastic variance reduction for non-convex finite-sum minimization. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors,Advances in Neu- ral Information Processing Systems, volume 35, pages 23524–23538. Curran Associates, Inc., 2022. URL https://proc...

  35. [35]

    Khaled and C

    A. Khaled and C. Jin. Tuning-free stochastic optimization. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors,Proceedings of the 41st International Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 23622–23661. PMLR, 21–27 Jul 2024. URL https: //proceedings.ml...

  36. [36]

    K. C. Kiwiel. Convergence and efficiency of subgradient methods for quasiconvex mini- mization.Mathematical Programming, 90(1):1–25, 2001. URL https://doi.org/10.1007/ PL00011414

  37. [37]

    Levenberg

    K. Levenberg. A method for the solution of certain non-linear problems in least squares. Quarterly of Applied Mathematics, 2(2):164–168, 1944. URL https://doi.org/10.1090/ qam/10666

  38. [38]

    K. Levy, A. Kavis, and V. Cevher. STORM+: Fully adaptive SGD with recursive momentum for nonconvex optimization. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, editors,Advances in Neural Information Processing Systems, volume 34, pages 20571–20582. Curran Associates, Inc., 2021. URL https://proceedings.neurips. cc/paper_files/pape...

  39. [39]

    Li and F

    X. Li and F. Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. In K. Chaudhuri and M. Sugiyama, editors,Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 ofProceedings of Machine Learning Research, pages 983–992. PMLR, 16–18 Apr 2019. URL https: //proceedings.ml...

  40. [40]

    Z. Li. A short note of PAGE: Optimal convergence rates for nonconvex optimization.arXiv preprint arXiv:2106.09663, 2021

  41. [41]

    Z. Li, H. Bao, X. Zhang, and P. Richtarik. PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization. In M. Meila and T. Zhang, editors,Proceedings of the 38th International Conference on Machine Learning, volume 139 ofProceedings of Machine Learning Research, pages 6286–6295. PMLR, 18–24 Jul 2021. URL https: //proceedings.mlr...

  42. [42]

    Malitsky and K

    Y. Malitsky and K. Mishchenko. Adaptive gradient descent without descent. In H. D. III and A. Singh, editors,Proceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 6702–6712. PMLR, 13–18 Jul 2020. URLhttps://proceedings.mlr.press/v119/malitsky20a.html

  43. [43]

    D. W. Marquardt. An algorithm for least-squares estimation of nonlinear parameters. Journal of the Society for Industrial and Applied Mathematics, 11(2):431–441, 1963. URL https://doi.org/10.1137/0111030

  44. [44]

    N. Marumo. Parameter-free accelerated quasi-Newton method for nonconvex optimization. arXiv preprint arXiv:2512.09439, 2025

  45. [45]

    Marumo and A

    N. Marumo and A. Takeda. Parameter-free accelerated gradient descent for nonconvex minimization.SIAM Journal on Optimization, 34(2):2093–2120, 2024. URL https://doi. org/10.1137/22M1540934

  46. [46]

    Marumo and A

    N. Marumo and A. Takeda. Universal heavy-ball method for nonconvex optimization under H¨ older continuous Hessians.Mathematical Programming, 212(1):147–175, 2025. URL https://doi.org/10.1007/s10107-024-02100-4

  47. [47]

    Marumo, T

    N. Marumo, T. Okuno, and A. Takeda. Majorization-minimization-based Levenberg– Marquardt method for constrained nonlinear least squares.Computational Opti- mization and Applications, 84(3):833–874, 2023. URL https://doi.org/10.1007/ s10589-022-00447-y

  48. [48]

    Marumo, T

    N. Marumo, T. Okuno, and A. Takeda. Accelerated-gradient-based generalized Levenberg– Marquardt method with oracle complexity bound and local quadratic convergence. Mathematical Programming, 213(1):771–822, 2025. URL https://doi.org/10.1007/ s10107-024-02154-4

  49. [49]

    A. S. Nemirovsky and D. B. Yudin.Problem Complexity and Method Efficiency in Opti- mization. Wiley-Interscience Series in Discrete Mathematics. John Wiley & Sons, 1983. Translated from the Russian by E. R. Dawson

  50. [50]

    Nesterov

    Y. Nesterov. A method for solving a convex programming problem with convergence rate O(1/k2).Soviet Mathematics Doklady, 269(3):372–376, 1983

  51. [51]

    Nesterov.Introductory Lectures on Convex Optimization: A Basic Course

    Y. Nesterov.Introductory Lectures on Convex Optimization: A Basic Course. Springer, New York, 2004. URLhttps://doi.org/10.1007/978-1-4419-8853-9

  52. [52]

    Nesterov.Lectures on Convex Optimization, volume 137

    Y. Nesterov.Lectures on Convex Optimization, volume 137. Springer, Cham, 2018. URL https://doi.org/10.1007/978-3-319-91578-4

  53. [53]

    Nesterov

    Y. Nesterov. Inexact basic tensor methods for some classes of convex optimization problems. Optimization Methods and Software, 37(3):878–906, 2022. URL https://doi.org/10. 1080/10556788.2020.1854252. 26

  54. [54]

    Nesterov and B

    Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance.Mathematical Programming, 108(1):177–205, 2006. URL https://doi.org/ 10.1007/s10107-006-0706-8

  55. [55]

    Nocedal and S

    J. Nocedal and S. J. Wright.Numerical Optimization. Springer, 2nd edition, 2006. URL https://doi.org/10.1007/978-0-387-40065-5

  56. [56]

    N. H. Pham, L. M. Nguyen, D. T. Phan, and Q. Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization.Journal of Machine Learning Research, 21(110):1–48, 2020. URL http://jmlr.org/papers/v21/19-248.html

  57. [57]

    S. J. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. Stochastic variance reduction for nonconvex optimization. In M. F. Balcan and K. Q. Weinberger, editors,Proceedings of The 33rd International Conference on Machine Learning, volume 48 ofProceedings of Machine Learning Research, pages 314–323, New York, New York, USA, 20–22 Jun 2016. PMLR. URLhttps://...

  58. [58]

    On information and sufficiency,

    H. Robbins and S. Monro. A stochastic approximation method.The Annals of Mathematical Statistics, 22(3):400–407, 1951. URLhttps://doi.org/10.1214/aoms/1177729586

  59. [59]

    Roulet and A

    V. Roulet and A. d'Aspremont. Sharpness, restart and acceleration. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 30. Curran Asso- ciates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper/2017/ file/2ca65f58e35d9ad45bf7f3ae5cfd...

  60. [60]

    Shalev-Shwartz

    S. Shalev-Shwartz. Online learning and online convex optimization.Foundations and Trends in Machine Learning, 4(2):107–194, 03 2012. ISSN 1935-8237. URL https://doi.org/10. 1561/2200000018

  61. [61]

    D. B. Thomsen and N. Doikov. Complexity of minimizing regularized convex quadratic functions.arXiv preprint arXiv:2404.17543, 2024

  62. [62]

    Ueda and N

    K. Ueda and N. Yamashita. On a global complexity bound of the Levenberg–Marquardt method.Journal of Optimization Theory and Applications, 147(3):443–453, 2010. URL https://doi.org/10.1007/s10957-010-9731-0

  63. [63]

    Z. Wang, K. Ji, Y. Zhou, Y. Liang, and V. Tarokh. SpiderBoost and momentum: Faster variance reduction algorithms. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alch´ e- Buc, E. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 32, pages 2406–2416. Curran Associates, Inc., 2019. URL https://proceedings. neurips.cc/...

  64. [64]

    R. Ward, X. Wu, and L. Bottou. AdaGrad stepsizes: Sharp convergence over nonconvex landscapes.Journal of Machine Learning Research, 21(219):1–30, 2020. URL http: //jmlr.org/papers/v21/18-352.html

  65. [65]

    Xiong, S

    S. Xiong, S. Jerad, and C. Cartis. A parameter-free first-order algorithm for non-convex optimization with ˜O(ε−5/3) global rate.arXiv preprint arXiv:2605.02127, 2026

  66. [66]

    Yagishita and M

    S. Yagishita and M. Ito. Simple linesearch-free first-order methods for nonconvex optimiza- tion.arXiv preprint arXiv:2509.14670, 2025

  67. [67]

    J. Yang, X. Li, I. Fatkhullin, and N. He. Two sides of one coin: the limits of untuned SGD and the power of adaptive methods. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural 27 Information Processing Systems, volume 36, pages 74257–74288. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/p...

  68. [68]

    Z. Ye, S. Ma, J. Yang, and D. Zhou. A simple adaptive proximal gradient method for nonconvex optimization.arXiv preprint arXiv:2510.06079, 2025

  69. [69]

    Zhao and J

    R. Zhao and J. Fan. Global complexity bound of the Levenberg–Marquardt method. Optimization Methods and Software, 31(4):805–814, 2016. URL https://doi.org/10. 1080/10556788.2016.1179737. 28