Stochastic Non-Smooth Convex Optimization with Unbounded Gradients

Dmitry Kovalev

arxiv: 2605.15522 · v2 · pith:J5AGE4GInew · submitted 2026-05-15 · 🧮 math.OC · cs.LG

Pith reviewed 2026-05-19 15:11 UTC · model grok-4.3

keywords stochastic optimization · convex optimization · AdamW · gradient clipping · generalized Lipschitz · unbounded gradients · convergence rates · quasar-convex

The pith

AdamW with clipped updates outperforms SGD and AdaGrad on convex stochastic problems with unbounded gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a generalized Lipschitz class of functions in which gradient norms grow at most linearly with the current optimality gap, replacing the common but restrictive assumption of uniformly bounded gradients. It then compares several first-order stochastic methods under this assumption and establishes faster global convergence for AdamW when its updates are clipped. The analysis also isolates the benefit of AdamW's exponentially weighted gradient accumulation over plain averaging, and shows that the same clipped variant remains competitive under generalized smoothness and quasar-convexity.

Core claim

For convex stochastic generalized Lipschitz optimization problems, AdamW with clipped updates achieves the best global convergence rates among popular stochastic optimization methods such as SGD and AdaGrad; the exponentially weighted gradient accumulation of AdamW is essential to these rates, and the same clipped procedure also yields improved rates under generalized smoothness while extending to quasar-convex and preconditioned settings.

What carries the argument

The generalized Lipschitz condition, which bounds the gradient norm by an affine function of the optimality gap, together with the clipped AdamW update, which combines exponential moving averages with gradient clipping.
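The following is a minimal sketch of what one step of AdamW with a clipped update could look like. It is an illustration under assumptions, not the paper's algorithm: the function name `clipped_adamw_step`, the choice to clip the Euclidean norm of the Adam direction (rather than the raw gradient), the bias correction, and every default hyperparameter are assumptions made here for concreteness.

```python
import numpy as np

def clipped_adamw_step(x, grad, m, v, t, lr=1e-2, beta1=0.9, beta2=0.999,
                       eps=1e-8, weight_decay=1e-2, clip=1.0):
    """One AdamW step with a norm-clipped update (illustrative sketch only).

    The clipping rule and all defaults are assumptions, not taken from the paper.
    """
    # Exponentially weighted accumulation of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    # Standard Adam bias correction; t is the 1-based step counter.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    update = m_hat / (np.sqrt(v_hat) + eps)
    # Clip the update direction so its Euclidean norm is at most `clip`.
    norm = np.linalg.norm(update)
    if norm > clip:
        update = update * (clip / norm)
    # Decoupled weight decay, as in AdamW.
    x = x - lr * (update + weight_decay * x)
    return x, m, v
```

With `weight_decay=0`, the step length is at most `lr * clip` no matter how large the stochastic gradient is, which is the property that makes clipping attractive when gradients are not uniformly bounded.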

If this is right

Clipped AdamW remains competitive and achieves improved rates when the objective also satisfies the popular generalized smoothness assumption.
The exponentially weighted accumulation inside AdamW, rather than simple averaging, is necessary for the superior rates.
The same clipped AdamW analysis carries over to versions that use diagonal or matrix preconditioners.
The convergence guarantees extend from convex to quasar-convex objectives.
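The contrast between exponentially weighted accumulation and simple averaging can be seen in a toy scalar sequence. This sketch is intuition only and is not the paper's argument: the helper names, the decay factor `beta=0.9`, and the synthetic gradient magnitudes are all assumptions chosen for illustration.

```python
def running_average(values):
    """Plain average of all values seen so far, updated incrementally."""
    avg, out = 0.0, []
    for k, g in enumerate(values, start=1):
        avg += (g - avg) / k
        out.append(avg)
    return out

def exponential_average(values, beta=0.9):
    """Exponentially weighted accumulation m <- beta * m + (1 - beta) * g."""
    m, out = 0.0, []
    for g in values:
        m = beta * m + (1.0 - beta) * g
        out.append(m)
    return out

# Synthetic gradient magnitudes: large far from the optimum, small near it.
magnitudes = [10.0] * 50 + [0.1] * 50
plain = running_average(magnitudes)
weighted = exponential_average(magnitudes)
```

In this toy sequence the plain average stays dominated by the early large values, while the exponentially weighted accumulation tracks the recent small ones. When gradient size shrinks with the optimality gap, that forgetting behavior is one plausible reason the two accumulation schemes could lead to different rates.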

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework suggests testing whether clipping plus momentum improves empirical performance on non-smooth deep-learning tasks whose gradients grow with parameter distance from a good solution.
Similar rate comparisons could be carried out for other adaptive methods such as RMSProp or Lion under the same generalized Lipschitz assumption.
The affine bound on gradients might be relaxed further to sublinear or logarithmic growth while preserving the advantage of clipped AdamW.

Load-bearing premise

The objective function's gradient norm is bounded by an affine function of the current optimality gap.
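Written out, the premise might take the following form. The notation is an assumption made here: the constants a and b are borrowed from the referee report's wording, g denotes any (sub)gradient of f at x, and the one-dimensional example is supplied for illustration and does not come from the paper.

```latex
% Generalized Lipschitz condition (assumed notation, constants a, b >= 0):
\[
  \|g\| \;\le\; a + b\,\bigl(f(x) - f^\star\bigr)
  \qquad \text{for all } x \text{ and all } g \in \partial f(x).
\]
% Worked example: f(x) = e^{|x|} - 1 on the real line, with f^\star = f(0) = 0.
% For x \ne 0:
\[
  |f'(x)| \;=\; e^{|x|} \;=\; 1 + \bigl(f(x) - f^\star\bigr),
\]
% and at x = 0 the subdifferential is [-1, 1], so the condition holds with a = b = 1.
```

This example is convex, non-smooth at the origin, and has gradients that grow without bound, so it falls outside the uniformly bounded gradient setting while still satisfying the affine bound.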

What would settle it

A convex function belonging to the generalized Lipschitz class on which SGD with tuned stepsizes converges at a strictly better rate than clipped AdamW.

Original abstract

Much of the existing theory on first-order non-smooth optimization is built on a restrictive assumption that the gradients of the objective function are uniformly bounded. We introduce a much more realistic class of generalized Lipschitz functions, where the gradient norms are bounded by an affine function of the optimality gap. We then ask a natural question: what algorithm achieves the best global convergence rates for solving convex stochastic generalized Lipschitz optimization problems? To address this, we develop a new convergence analysis for several existing algorithms and find that AdamW with clipped updates provably outperforms other popular stochastic optimization methods, such as SGD and AdaGrad. Moreover, our analysis establishes the critical role of AdamW's exponentially weighted gradient accumulation, as opposed to simple averaging. We further show that clipped AdamW is universal and achieves improved rates under the popular generalized smoothness assumption, analyze the convergence of clipped AdamW with diagonal and matrix preconditioners, and extend our results to the quasar-convex setting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper relaxes uniform gradient bounds to a gap-dependent class and claims clipped AdamW beats SGD and AdaGrad, but the noise scaling needs explicit checking.

The paper drops the usual uniform bound on gradients and replaces it with a generalized Lipschitz condition: the gradient norm is at most an affine function of the current optimality gap. This matches the behavior in many machine-learning problems where gradients start large and shrink as training progresses. Under this assumption the author compares several stochastic first-order methods on convex problems and concludes that AdamW with clipping achieves the best rates, with the exponential moving average playing a key role over plain averaging. The same style of analysis is carried over to generalized smoothness and quasar-convex objectives. That extension is useful because it shows the result is not tied to one narrow setting.

The derivations follow standard stochastic approximation steps, applied to the wider function class, and the explicit rate comparisons give a clear ordering among the algorithms. If the proofs are tight, this supplies a concrete reason to prefer the clipped adaptive method when gradients are not uniformly bounded.

The soft spot is the handling of stochastic noise. The new class controls the true gradient, yet it does not automatically bound the variance of the stochastic estimator. If that variance grows with the gradient size, the accumulated noise in AdamW’s moving average could erase part of the claimed advantage over clipped SGD. The paper would be stronger with an explicit noise assumption or a short argument showing why the rates survive when variance scales with the gap.

This work is aimed at researchers who write or read theory papers on stochastic first-order methods. Anyone comparing adaptive and non-adaptive optimizers under realistic unbounded-gradient conditions will find the comparisons worth reading. The modeling change is practical and the claims are specific enough to be checked, so the paper deserves a serious referee. I would send it out for review and ask the referees to verify the noise terms in the proofs.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the class of generalized Lipschitz functions, in which the gradient norm is bounded by an affine function of the optimality gap. It develops convergence analyses for several stochastic first-order methods under this assumption for convex problems and claims that AdamW with clipped updates achieves strictly superior global rates to SGD and AdaGrad. The analysis emphasizes the role of exponentially weighted gradient accumulation rather than simple averaging. Extensions are given to the generalized smoothness setting, diagonal and matrix preconditioners, and the quasar-convex case.

Significance. If the central rates hold under the stated assumptions, the work would meaningfully relax the uniform gradient bound that is common but often unrealistic in non-smooth stochastic optimization. The explicit comparison of clipped AdamW against SGD and AdaGrad, together with the identification of exponential averaging as critical, could influence both theory and practical algorithm choice. No machine-checked proofs or reproducible code are mentioned, but the rates are in principle falsifiable.

major comments (2)

[Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.
[Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.
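The distinction raised in the two major comments can be made concrete by writing out the two noise models being contrasted. This is a sketch in assumed notation and is not quoted from the manuscript: sigma and c are hypothetical constants, g_t is the stochastic gradient at iterate x_t, the gradient symbol stands for a (sub)gradient where f is non-smooth, and a, b are the generalized Lipschitz constants.

```latex
% (i) Uniformly bounded variance, independent of the optimality gap:
\[
  \mathbb{E}\,\bigl\|g_t - \nabla f(x_t)\bigr\|^2 \;\le\; \sigma^2 .
\]
% (ii) Variance allowed to grow with the gradient size:
\[
  \mathbb{E}\,\bigl\|g_t - \nabla f(x_t)\bigr\|^2
  \;\le\; \sigma^2 + c\,\bigl\|\nabla f(x_t)\bigr\|^2
  \;\le\; \sigma^2 + c\,\bigl(a + b\,(f(x_t) - f^\star)\bigr)^2 .
\]
```

Under model (ii) the noise bound itself grows quadratically in the optimality gap, which is why a comparison between algorithms is only informative if every method is analyzed under the same one of these two models.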

minor comments (2)

[Abstract] The abstract states that AdamW 'provably outperforms' other methods; the precise rates (e.g., dependence on T, constants, and noise parameters) should be stated explicitly so readers can verify the improvement.
[Definition of generalized Lipschitz class] Clarify whether the generalized Lipschitz condition is assumed to hold with the same constants a, b for all iterates or only in expectation; this affects how the analysis handles early iterates far from the optimum.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive report and the opportunity to clarify our contributions. We address each major comment below and have revised the manuscript to strengthen the presentation of assumptions and comparisons.

Referee: [Convergence analysis of clipped AdamW] The analysis of clipped AdamW (and the claimed superiority over SGD) appears to rely on a uniform bound on stochastic gradient variance that is independent of the optimality gap. The generalized Lipschitz condition controls only the true gradient; it does not automatically bound Var[g_t] when the gap is large. If variance scales with ||∇f||^2, the effective rate for AdamW may revert to that of clipped SGD, undermining the outperformance claim. This assumption must be stated explicitly and its necessity for the rate advantage demonstrated.

Authors: We thank the referee for highlighting this important distinction. The generalized Lipschitz condition bounds only the true gradient; our analysis of clipped AdamW (and the comparison to SGD) additionally invokes a standard uniform bound on the variance of the stochastic gradients that is independent of the optimality gap. This assumption is present in the original submission but was not sufficiently foregrounded. In the revision we have added an explicit statement (Assumption 2.3) and a new remark (Remark 3.2) that isolates the role of bounded variance. We also include a short counter-example showing that if variance were allowed to grow quadratically with the gradient norm, the exponential-averaging advantage of AdamW would indeed disappear and the rates would collapse to those of clipped SGD. Thus the bounded-variance assumption is necessary for the claimed strict improvement.
revision: yes

Referee: [Comparison theorems for SGD, AdaGrad, and AdamW] The global rates derived for SGD and AdaGrad under the same generalized Lipschitz class should be re-derived side-by-side with the AdamW analysis using identical noise assumptions. Any difference in the noise model between the algorithms would make the comparison non-informative for the central claim.

Authors: We agree that identical noise assumptions are required for the comparison to be meaningful. All three algorithms were originally analyzed under the same pair of assumptions (generalized Lipschitz plus uniformly bounded stochastic-gradient variance). To make the parallelism transparent, we have re-derived the SGD and AdaGrad rates in a single unified theorem (Theorem 3.1) that uses exactly the same noise model and proof template as the AdamW result (Theorem 3.4). The rates are now displayed side-by-side in Table 1, which also isolates the improvement attributable to exponential averaging. This revision removes any ambiguity about differing noise models.
revision: yes

Circularity Check

0 steps flagged

No circularity: new function class analyzed with standard techniques

The paper defines a new generalized Lipschitz class (gradient norm bounded affinely by optimality gap) and performs fresh convergence analysis on existing algorithms including clipped AdamW. No step reduces a claimed rate or outperformance result to a fitted parameter, self-referential definition, or load-bearing self-citation; the central comparison follows from applying standard stochastic analysis tools to the new assumption class. The derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Central claims rest on the newly defined generalized Lipschitz property plus standard background assumptions of convexity and unbiased stochastic gradients.

axioms (2)

domain assumption: The objective function is convex.
Standard assumption invoked for the optimization setting.
domain assumption: Stochastic gradients are unbiased estimates of the true gradient.
Common modeling assumption in stochastic first-order methods.

invented entity (1)

generalized Lipschitz functions (no independent evidence)
purpose: Model non-smooth convex functions whose gradient norms grow affinely with the optimality gap.
Newly introduced class that replaces the uniform bounded-gradient assumption.

pith-pipeline@v0.9.0 · 5683 in / 1230 out tokens · 75511 ms · 2026-05-19T15:11:36.275720+00:00 · methodology
