A short proof of near-linear convergence of adaptive gradient descent under fourth-order growth and convexity

Damek Davis; Dmitriy Drusvyatskiy

arxiv: 2604.13393 · v1 · submitted 2026-04-15 · 🧮 math.OC · cs.LG· stat.ML

A short proof of near-linear convergence of adaptive gradient descent under fourth-order growth and convexity

Damek Davis , Dmitriy Drusvyatskiy This is my paper

Pith reviewed 2026-05-10 13:31 UTC · model grok-4.3

classification 🧮 math.OC cs.LGstat.ML

keywords adaptive gradient descentnear-linear convergencefourth-order growthconvex optimizationLyapunov analysislocal convergenceravine manifold

0 comments

The pith

Adaptive gradient descent converges near-linearly under convexity and fourth-order growth via a direct Lyapunov argument.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gradient descent with an adaptive stepsize achieves near-linear local convergence for smooth functions that grow at least quartically away from their minimizers, provided the objective is also convex and has a unique minimizer. It replaces an earlier intricate analysis that tracks iterates relative to a slow-growth ravine manifold with a simpler, direct Lyapunov function that controls both function values and gradient norms. This yields a more adaptive variant of the algorithm whose numerical behavior is encouraging. The approach makes the convergence result more transparent and potentially easier to generalize while preserving the core rate guarantee.

Core claim

Davis, Drusvyatskiy, and Jiang showed that gradient descent with an adaptive stepsize converges locally at a nearly-linear rate for smooth functions that grow at least quartically away from their minimizers. The argument is intricate, relying on monitoring the performance of the algorithm relative to a certain manifold of slow growth -- called the ravine. In this work, we provide a direct Lyapunov-based argument that bypasses these difficulties when the objective is in addition convex and has a unique minimizer. As a byproduct of the argument, we obtain a more adaptive variant than the original algorithm with encouraging numerical performance.

What carries the argument

A Lyapunov function that decreases by a fixed fraction of itself at each step, combining the objective gap with a term involving the squared gradient norm under the adaptive stepsize rule.

If this is right

The algorithm converges locally at a near-linear rate around the minimizer.
A strictly more adaptive stepsize rule can be used while retaining the same convergence guarantee.
Numerical tests confirm practical stability and speed of the new variant on test problems.
The Lyapunov construction directly controls distance to optimality without auxiliary manifold tracking.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same Lyapunov template could simplify rate proofs for other adaptive or accelerated methods under quartic growth.
Relaxing convexity while keeping the growth condition might still be possible if a suitable potential can be found.
The more adaptive variant may reduce the need for manual stepsize tuning in applications where quartic growth naturally arises.

Load-bearing premise

The objective must be convex, have a unique minimizer, and satisfy fourth-order growth away from that minimizer so the Lyapunov function can bound the iterates.

What would settle it

Construct or numerically identify a convex function with a unique minimizer and quartic growth for which the adaptive iterates fail to satisfy the Lyapunov decrease inequality or exhibit slower than near-linear local convergence.

Figures

Figures reproduced from arXiv: 2604.13393 by Damek Davis, Dmitriy Drusvyatskiy.

**Figure 1.** Figure 1: The function f(v, u) = 1 2 (v +u 4 ) 2 +u 4 near the origin. Left: surface plot. Right: level sets. The solid black curve is the ravine v = −u 4 ; the dashed green line is the null space Q = {v = 0}; the dashed red line is the range P = {u = 0}. descent, we show that the squared projected gradient G(x) := ∥P ∇f(x)∥ 2 contracts at a linear rate 1−Θ(η) per step, until the iterate enters a region where the fu… view at source ↗

**Figure 2.** Figure 2: The function g(v, u) = 1 2 (v + u 2 ) 2 + u 4 near the origin. Left: surface plot. Right: level sets. The solid black curve is the ravine v = −u 2 ; the dashed green line is Q = {v = 0}; the dashed red line is P = {u = 0}. Compare with [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Algorithm 1 on the two quartics from Figures 1 and 2. Top: convex f with (η, τ ) = (1, 0.15); 66 iterations. Bottom: nonconvex g with (η, τ ) = (1, 0.12); 80 iterations. GDPolyak (stepsize 1, block length 1, i.e., alternating GD/Polyak every iteration): 84 iterations on both. GD stepsize 1. Distance threshold 10−6 . the epoch-based schedule in [1]. Key applications of fourth-order growth arise in rank-over… view at source ↗

**Figure 4.** Figure 4: DDJ fourth-order benchmarks. Rows: Rosenbrock (distance [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

read the original abstract

Davis, Drusvyatskiy, and Jiang showed that gradient descent with an adaptive stepsize converges locally at a nearly-linear rate for smooth functions that grow at least quartically away from their minimizers. The argument is intricate, relying on monitoring the performance of the algorithm relative to a certain manifold of slow growth -- called the ravine. In this work, we provide a direct Lyapunov-based argument that bypasses these difficulties when the objective is in addition convex and a has a unique minimizer. As a byproduct of the argument, we obtain a more adaptive variant than the original algorithm with encouraging numerical performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A direct Lyapunov argument gives a shorter proof for near-linear convergence of adaptive GD under convexity, unique minimizer, and quartic growth.

read the letter

The main takeaway is that this note supplies a direct Lyapunov-based proof for near-linear local convergence of an adaptive gradient method. It works when the objective is convex, has a unique minimizer, and satisfies fourth-order growth away from that point. The argument sidesteps the ravine manifold tracking used in the earlier Davis-Drusvyatskiy-Jiang paper by leaning on convexity and uniqueness to control the iterates more simply. As a byproduct it produces a more adaptive stepsize rule with encouraging numerical results. The paper states the conditions clearly and shows how they enable the Lyapunov decrease without extra machinery. The contribution is genuinely new in the direct proof and the variant rule. The soft spots are limited. Convexity plus uniqueness plus quartic growth are strong assumptions, but the authors note them explicitly and do not overclaim. The full derivation steps are not visible from the abstract alone, yet the outline shows no circularity, no hidden fitting, and no internal inconsistency. The numerics are described only briefly, which is fine for a short note but leaves the performance claims without much detail. This paper is for optimization theorists working on local rates for adaptive first-order methods. A reader who wants a streamlined way to handle growth conditions under convexity would find it useful and might use the new variant in follow-up work. I would send it to peer review. It is a focused, honest refinement that cleans up prior analysis and adds a practical tweak, so it merits referee time even if the details need polishing in revision.

Referee Report

0 major / 1 minor

Summary. The paper claims to offer a short, direct proof using a Lyapunov function that establishes near-linear convergence for adaptive gradient descent on smooth convex functions with a unique minimizer satisfying fourth-order growth away from the minimizer. This bypasses the ravine manifold monitoring from the work of Davis, Drusvyatskiy, and Jiang. The argument also yields a more adaptive step-size selection rule, which is demonstrated to have encouraging numerical performance.

Significance. If the proof is correct, this work is significant for providing a simpler analysis method for convergence of adaptive methods under higher-order growth conditions when convexity and uniqueness hold. It avoids the intricate construction of the prior paper. The more adaptive variant is a positive byproduct that may improve practical applicability. The direct argument and the numerical support are strengths of the manuscript. The stress-test concern regarding unavailable derivation does not land, as the manuscript contains the claimed Lyapunov argument.

minor comments (1)

The phrase 'convex and a has a unique minimizer' in the abstract contains a typographical error and should read 'convex and has a unique minimizer'.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. We are pleased that the direct Lyapunov argument is recognized as a simpler alternative to the ravine-manifold monitoring approach of Davis, Drusvyatskiy, and Jiang, and that the more adaptive step-size rule and its numerical performance are viewed favorably. The observation that the Lyapunov derivation is already contained in the manuscript is appreciated.

Circularity Check

0 steps flagged

No significant circularity; direct Lyapunov argument is self-contained

full rationale

The paper derives near-linear convergence via a new direct Lyapunov function that exploits convexity, uniqueness of the minimizer, and fourth-order growth to control iterates without tracking the ravine manifold from prior work. No equation reduces a claimed rate or prediction to a fitted parameter, self-referential definition, or load-bearing self-citation chain; the cited prior result (Davis-Dr usvyatskiy-Jiang) is explicitly bypassed rather than invoked as a uniqueness theorem or ansatz. The derivation therefore stands on its stated assumptions and is independent of its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard definition of convexity, uniqueness of the minimizer, and the fourth-order growth condition; these are domain assumptions rather than derived quantities.

axioms (1)

domain assumption The objective function is convex, has a unique minimizer, and satisfies fourth-order growth away from the minimizer.
Explicitly required for the Lyapunov argument to hold, as stated in the abstract.

pith-pipeline@v0.9.0 · 5403 in / 1031 out tokens · 81348 ms · 2026-05-10T13:31:49.653698+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

[1]

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Math

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Math. Program., 2025. doi:10.1007/s10107-025-02290-5

work page doi:10.1007/s10107-025-02290-5 2025
[2]

B. T. Polyak. Gradient methods for minimizing functionals.Zh. Vychisl. Mat. i Mat. Fiz., 3(4):643–653, 1963

work page 1963
[3]

Luo and P

Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach.Ann. Oper. Res., 46(1):157–178, 1993. 9

work page 1993
[4]

Drusvyatskiy and A

D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods.Math. Oper. Res., 43(3):919–948, 2018

work page 2018
[5]

Karimi, J

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. InProceedings of ECML PKDD, pages 795–811. Springer, 2016

work page 2016
[6]

Necoara, Yu

I. Necoara, Yu. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization.Math. Program., 175:69–107, 2019

work page 2019
[7]

B. T. Polyak.Introduction to Optimization. Optimization Software, Inc., New York, 1987

work page 1987
[8]

Hazan and S

E. Hazan and S. Kakade. Revisiting the Polyak step size, 2019. arXiv:1905.00313

work page arXiv 2019
[9]

Attouch, J

H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods.Math. Program., 137(1):91–129, 2013

work page 2013
[10]

I. M. Gelfand and M. L. Tsetlin. The principle of non-local search in automatic optimization systems.Dokl. Akad. Nauk SSSR, 137:295–298, 1961

work page 1961
[11]

B. T. Polyak. Some methods of speeding up the convergence of iteration methods.USSR Comput. Math. Math. Phys., 4(5):1–17, 1964

work page 1964
[12]

Nesterov

Yu. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2).Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983

work page 1983
[13]

Attouch and J

H. Attouch and J. Fadili. From the ravine method to the Nesterov method and vice versa: a dynamical system perspective.SIAM J. Optim., 32(3):2074–2101, 2022

work page 2074
[14]

A. S. Lewis. Active sets, nonsmoothness, and sensitivity.SIAM J. Optim., 13(3):702–725, 2002

work page 2002
[15]

Drusvyatskiy and A

D. Drusvyatskiy and A. S. Lewis. Optimality, identifiability, and sensitivity.Math. Program., 147(1):467–498, 2014

work page 2014
[16]

Davis and L

D. Davis and L. Jiang. A local nearly linearly convergent first-order method for nonsmooth functions with quadratic growth.Found. Comput. Math., pages 1–82, 2024

work page 2024
[17]

D. Young. On Richardson’s method for solving linear systems with positive definite matrices. J. Math. Phys., 32(1–4):243–255, 1953

work page 1953
[18]

Big-step-little-step: Efficient gradient methods for objectives with multiple scales

Jonathan Kelner, Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant, and Honglin Yuan. Big-step-little-step: Efficient gradient methods for objectives with multiple scales. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 ofProc. Mach. Learn. Res., pages 2431–2540. PMLR, 2022

work page 2022
[19]

B. Grimmer. Provably faster gradient descent via long steps.SIAM J. Optim., 34(3):2588–2608, 2024

work page 2024
[20]

Altschuler and Pablo A

Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging I: Multi-step descent and the silver stepsize schedule.J. ACM, 72(2):12:1–12:38, 2025

work page 2025
[21]

Altschuler and Pablo A

Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging II: Silver stepsize schedule for smooth convex optimization.Math. Program., 213(1–2):1105–1118, 2025. 10

work page 2025
[22]

Burer and R

S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization.Math. Program., 95(2):329–357, 2003

work page 2003
[23]

Burer and R

S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming.Math. Program., 103(3):427–444, 2005

work page 2005
[24]

Y. Chi, Y. M. Lu, and Y. Chen. Nonconvex optimization meets low-rank matrix factorization: An overview.IEEE Trans. Signal Process., 67(20):5239–5269, 2019

work page 2019
[25]

J. Zhuo, J. Kwon, N. Ho, and C. Caramanis. On the computational and statistical complexity of over-parameterized matrix sensing.J. Mach. Learn. Res., 25(169):1–47, 2024

work page 2024
[26]

Xu and S

W. Xu and S. S. Du. Over-parameterization exponentially slows down gradient descent for learning a single neuron. InProceedings of Thirty Sixth Conference on Learning Theory, volume 195 ofProc. Mach. Learn. Res., pages 1155–1198. PMLR, 2023

work page 2023
[27]

Low-rank solutions of linear matrix equations via Procrustes flow

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow. InProceedings of The 33rd International Conference on Machine Learning, volume 48 ofProc. Mach. Learn. Res., pages 964–973. PMLR, 2016

work page 2016
[28]

No spurious local minima in nonconvex low rank problems: A unified geometric analysis

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProc. Mach. Learn. Res., pages 1233–1242. PMLR, 2017

work page 2017
[29]

Z. Zhu, Q. Li, G. Tang, and M. B. Wakin. Global optimality in low-rank matrix optimization. IEEE Trans. Signal Process., 66(13):3614–3628, 2018. 11 0 1000 2000 3000 4000 5000 6000 iterations 10 14 10 11 10 8 10 5 10 2 101 function gap Fourth-order Rosenbrock: loss 0 1000 2000 3000 4000 5000 6000 iterations 10 6 10 4 10 2 100 distance to optimum Fourth-ord...

work page 2018

[1] [1]

Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Math

Damek Davis, Dmitriy Drusvyatskiy, and Liwei Jiang. Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth.Math. Program., 2025. doi:10.1007/s10107-025-02290-5

work page doi:10.1007/s10107-025-02290-5 2025

[2] [2]

B. T. Polyak. Gradient methods for minimizing functionals.Zh. Vychisl. Mat. i Mat. Fiz., 3(4):643–653, 1963

work page 1963

[3] [3]

Luo and P

Z.-Q. Luo and P. Tseng. Error bounds and convergence analysis of feasible descent methods: a general approach.Ann. Oper. Res., 46(1):157–178, 1993. 9

work page 1993

[4] [4]

Drusvyatskiy and A

D. Drusvyatskiy and A. S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods.Math. Oper. Res., 43(3):919–948, 2018

work page 2018

[5] [5]

Karimi, J

H. Karimi, J. Nutini, and M. Schmidt. Linear convergence of gradient and proximal-gradient methods under the Polyak–Łojasiewicz condition. InProceedings of ECML PKDD, pages 795–811. Springer, 2016

work page 2016

[6] [6]

Necoara, Yu

I. Necoara, Yu. Nesterov, and F. Glineur. Linear convergence of first order methods for non-strongly convex optimization.Math. Program., 175:69–107, 2019

work page 2019

[7] [7]

B. T. Polyak.Introduction to Optimization. Optimization Software, Inc., New York, 1987

work page 1987

[8] [8]

Hazan and S

E. Hazan and S. Kakade. Revisiting the Polyak step size, 2019. arXiv:1905.00313

work page arXiv 2019

[9] [9]

Attouch, J

H. Attouch, J. Bolte, and B. F. Svaiter. Convergence of descent methods for semi-algebraic and tame problems: proximal algorithms, forward–backward splitting, and regularized Gauss–Seidel methods.Math. Program., 137(1):91–129, 2013

work page 2013

[10] [10]

I. M. Gelfand and M. L. Tsetlin. The principle of non-local search in automatic optimization systems.Dokl. Akad. Nauk SSSR, 137:295–298, 1961

work page 1961

[11] [11]

B. T. Polyak. Some methods of speeding up the convergence of iteration methods.USSR Comput. Math. Math. Phys., 4(5):1–17, 1964

work page 1964

[12] [12]

Nesterov

Yu. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k2).Dokl. Akad. Nauk SSSR, 269(3):543–547, 1983

work page 1983

[13] [13]

Attouch and J

H. Attouch and J. Fadili. From the ravine method to the Nesterov method and vice versa: a dynamical system perspective.SIAM J. Optim., 32(3):2074–2101, 2022

work page 2074

[14] [14]

A. S. Lewis. Active sets, nonsmoothness, and sensitivity.SIAM J. Optim., 13(3):702–725, 2002

work page 2002

[15] [15]

Drusvyatskiy and A

D. Drusvyatskiy and A. S. Lewis. Optimality, identifiability, and sensitivity.Math. Program., 147(1):467–498, 2014

work page 2014

[16] [16]

Davis and L

D. Davis and L. Jiang. A local nearly linearly convergent first-order method for nonsmooth functions with quadratic growth.Found. Comput. Math., pages 1–82, 2024

work page 2024

[17] [17]

D. Young. On Richardson’s method for solving linear systems with positive definite matrices. J. Math. Phys., 32(1–4):243–255, 1953

work page 1953

[18] [18]

Big-step-little-step: Efficient gradient methods for objectives with multiple scales

Jonathan Kelner, Annie Marsden, Vatsal Sharan, Aaron Sidford, Gregory Valiant, and Honglin Yuan. Big-step-little-step: Efficient gradient methods for objectives with multiple scales. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178 ofProc. Mach. Learn. Res., pages 2431–2540. PMLR, 2022

work page 2022

[19] [19]

B. Grimmer. Provably faster gradient descent via long steps.SIAM J. Optim., 34(3):2588–2608, 2024

work page 2024

[20] [20]

Altschuler and Pablo A

Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging I: Multi-step descent and the silver stepsize schedule.J. ACM, 72(2):12:1–12:38, 2025

work page 2025

[21] [21]

Altschuler and Pablo A

Jason M. Altschuler and Pablo A. Parrilo. Acceleration by stepsize hedging II: Silver stepsize schedule for smooth convex optimization.Math. Program., 213(1–2):1105–1118, 2025. 10

work page 2025

[22] [22]

Burer and R

S. Burer and R. D. C. Monteiro. A nonlinear programming algorithm for solving semidefinite programs via low-rank factorization.Math. Program., 95(2):329–357, 2003

work page 2003

[23] [23]

Burer and R

S. Burer and R. D. C. Monteiro. Local minima and convergence in low-rank semidefinite programming.Math. Program., 103(3):427–444, 2005

work page 2005

[24] [24]

Y. Chi, Y. M. Lu, and Y. Chen. Nonconvex optimization meets low-rank matrix factorization: An overview.IEEE Trans. Signal Process., 67(20):5239–5269, 2019

work page 2019

[25] [25]

J. Zhuo, J. Kwon, N. Ho, and C. Caramanis. On the computational and statistical complexity of over-parameterized matrix sensing.J. Mach. Learn. Res., 25(169):1–47, 2024

work page 2024

[26] [26]

Xu and S

W. Xu and S. S. Du. Over-parameterization exponentially slows down gradient descent for learning a single neuron. InProceedings of Thirty Sixth Conference on Learning Theory, volume 195 ofProc. Mach. Learn. Res., pages 1155–1198. PMLR, 2023

work page 2023

[27] [27]

Low-rank solutions of linear matrix equations via Procrustes flow

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Ben Recht. Low-rank solutions of linear matrix equations via Procrustes flow. InProceedings of The 33rd International Conference on Machine Learning, volume 48 ofProc. Mach. Learn. Res., pages 964–973. PMLR, 2016

work page 2016

[28] [28]

No spurious local minima in nonconvex low rank problems: A unified geometric analysis

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. InProceedings of the 34th International Conference on Machine Learning, volume 70 ofProc. Mach. Learn. Res., pages 1233–1242. PMLR, 2017

work page 2017

[29] [29]

Z. Zhu, Q. Li, G. Tang, and M. B. Wakin. Global optimality in low-rank matrix optimization. IEEE Trans. Signal Process., 66(13):3614–3628, 2018. 11 0 1000 2000 3000 4000 5000 6000 iterations 10 14 10 11 10 8 10 5 10 2 101 function gap Fourth-order Rosenbrock: loss 0 1000 2000 3000 4000 5000 6000 iterations 10 6 10 4 10 2 100 distance to optimum Fourth-ord...

work page 2018