DADA: Dual Averaging with Distance Adaptation

Anton Rodomanov; Daniil Vankov; Mohammad Moshtaghifar; Sebastian Stich

arxiv: 2501.10258 · v2 · submitted 2025-01-17 · 🧮 math.OC · cs.LG

DADA: Dual Averaging with Distance Adaptation

Mohammad Moshtaghifar , Anton Rodomanov , Daniil Vankov , Sebastian Stich This is my paper

Pith reviewed 2026-05-23 04:51 UTC · model grok-4.3

classification 🧮 math.OC cs.LG

keywords dual averaginguniversal gradient methodconvex optimizationdistance adaptationparameter-free optimizationlocal growth conditionnonsmooth optimizationsmooth optimization

0 comments

The pith

DADA is a dual averaging method that dynamically adjusts coefficients from gradients and distances to the start point, delivering universal guarantees for convex problems under a local growth bound around the minimizer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dual Averaging with Distance Adaptation (DADA), a gradient method that dynamically adjusts its coefficients using observed gradients and the distance of iterates from the starting point. This design allows the algorithm to work across diverse problem classes in convex optimization, including nonsmooth Lipschitz, smooth, Holder smooth, and others, as long as the objective function's growth near its minimum can be bounded. DADA applies to both constrained and unconstrained problems, even with unbounded domains, and requires no advance knowledge of the number of iterations or target accuracy. A reader would care if they want a single method that handles many optimization settings without custom tuning for each.

Core claim

DADA, based on the classical dual averaging scheme, dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point. This eliminates the need for problem-specific parameters and provides universal guarantees for a broad spectrum of convex problems where the local growth of the objective around its minimizer can be bounded, covering examples like nonsmooth Lipschitz functions, Lipschitz-smooth functions, Holder-smooth functions, functions with high-order Lipschitz derivatives, quasi-self-concordant functions, and (L0, L1)-smooth functions. The method works for unconstrained and constrained problems, including unbounded domains, without

What carries the argument

Dual averaging scheme with dynamic coefficient adjustment based on gradients and distance to the starting point.

If this is right

The algorithm simultaneously achieves good performance on nonsmooth Lipschitz, Lipschitz-smooth, Holder-smooth, high-order Lipschitz derivative, quasi-self-concordant, and (L0,L1)-smooth convex problems under the local growth bound.
It extends to constrained optimization over possibly unbounded domains.
No problem-specific parameters like step sizes tuned to iteration count or accuracy are needed.
Convergence holds as long as the local growth condition is satisfied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The distance-based adaptation may allow the method to remain effective when the overall scale of the problem is unknown in advance.
A single implementation could replace several specialized solvers in software libraries for convex problems.
Similar adaptation ideas might be tested on stochastic or composite problems where only noisy gradient information is available.

Load-bearing premise

The local growth of the objective function around its minimizer can be bounded.

What would settle it

Running DADA to convergence on a convex function known to satisfy the local growth bound and checking whether the observed rate matches the universal bound claimed for that growth class without any parameter tuning.

Figures

Figures reproduced from arXiv: 2501.10258 by Anton Rodomanov, Daniil Vankov, Mohammad Moshtaghifar, Sebastian Stich.

**Figure 4.1.** Figure 4.1: Comparison of different methods on the Softmax function. [PITH_FULL_IMAGE:figures/full_fig_p009_4_1.png] view at source ↗

**Figure 4.2.** Figure 4.2: The ratio D r¯t for the Softmax function with different optimal points x ∗ . 0 1000 2000 3000 4000 5000 Number of oracle calls 10 13 10 11 10 9 10 7 10 5 10 3 10 1 10 1 10 3 f(x *k ) f * q = 1.0, n = 10000, d = 1000 WDA UGM DoG Prodigy DADA 0 1000 2000 3000 4000 5000 Number of oracle calls 10 12 10 9 10 6 10 3 10 0 10 3 f(x *k ) f * q = 1.5, n = 10000, d = 1000 WDA UGM DoG Prodigy DADA 0 1000 2000 3000 4… view at source ↗

**Figure 4.3.** Figure 4.3: Comparison of different methods on the polyhedron feasibility problem. [PITH_FULL_IMAGE:figures/full_fig_p010_4_3.png] view at source ↗

**Figure 4.4.** Figure 4.4: Comparison of different methods on the worst-case function. [PITH_FULL_IMAGE:figures/full_fig_p011_4_4.png] view at source ↗

**Figure 4.5.** Figure 4.5: The ratio D r¯t for the worst-case function with different optimal points x ∗ . cantly declines. However, DADA, Prodigy, and UGM demonstrate similar performance regardless of the choice of q. Worst-case function. As an example of a function with Lipschitz high-order derivative, we consider the following worst-case problem from [7]: min x∈Rd n f(x) := 1 p X d−1 i=1 |x (i) − x (i+1)| p + 1 p |x (d) | p o … view at source ↗

read the original abstract

We present a novel universal gradient method for solving convex optimization problems. Our algorithm, Dual Averaging with Distance Adaptation (DADA), is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point, eliminating the need for problem-specific parameters. DADA is a universal algorithm that simultaneously works for a broad spectrum of problem classes, provided the local growth of the objective function around its minimizer can be bounded. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, H\"older-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and $(L_0,L_1)$-smooth functions. Crucially, DADA is applicable to both unconstrained and constrained problems, even when the domain is unbounded, without requiring prior knowledge of the number of iterations or desired accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DADA adds a distance-to-x0 adaptation to dual averaging for parameter-free rates across several convex classes under local growth, but the unbounded-domain claim needs proof verification.

read the letter

The core of this paper is a dual-averaging variant that updates its coefficients using both gradient observations and the Euclidean distance from the current point back to the fixed starting point x0. The authors claim this single rule delivers the right rates for Lipschitz, Hölder, quasi-self-concordant, and a few other classes, provided only a local growth condition holds near the minimizer, and that it works on constrained or unconstrained problems even when the domain is unbounded and without knowing T or target accuracy in advance. That is the new element: the specific distance-based adaptation rule layered on top of classical dual averaging. If the analysis closes, it removes the usual need to pick a smoothness parameter or horizon in advance, which is practically convenient inside first-order convex work. The abstract is clear on the scope and the local-growth hypothesis, and the authors do not hide the assumption. The main open question is whether the convergence argument really stays inside that local hypothesis for arbitrary x0. The stress-test note is on point here: if any step in the proof leans on global trajectory properties or an implicit bound involving x0, then the “arbitrary start, unbounded domain” part does not follow from the stated condition alone. Without the full proofs it is impossible to tell whether that gap exists. The paper is aimed at researchers who build or tune first-order methods for convex problems. It is worth sending to referees so they can check the derivations and any experiments; the idea is concrete enough to merit that step even if revisions are needed on the unbounded case.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Dual Averaging with Distance Adaptation (DADA), a parameter-free variant of dual averaging that adjusts coefficients using observed gradients and the Euclidean distance from current iterates to the initial point x0. It claims universal convergence rates (without knowledge of T or target accuracy) for convex problems belonging to multiple smoothness classes—nonsmooth Lipschitz, Lipschitz-smooth, Hölder-smooth, high-order Lipschitz derivatives, quasi-self-concordant, and (L0,L1)-smooth—provided only that a local growth condition f(x)−f* ≳ dist(x,x*)^p holds near the minimizer. The method is asserted to apply equally to constrained and unconstrained problems on unbounded domains.

Significance. A correct proof of the claimed universality would constitute a meaningful advance in parameter-free first-order methods, as it would supply a single algorithm whose rates adapt automatically across disparate function classes on unbounded domains using only local growth information.

major comments (2)

[Main convergence theorem] Main convergence theorem (location not specified in abstract but central to the universality claim): the argument must be shown to close exclusively under the local growth hypothesis. If the analysis of the distance-adapted step sizes invokes any global trajectory property (bounded sublevel sets, a priori bound on ||x0−x*||, or global growth away from x*), the “arbitrary x0, unbounded domain” part of the claim is not supported by the stated assumption alone.
[Constrained case] Section on constrained problems (presumably §4 or §5): the projection step onto a possibly unbounded feasible set must be shown not to re-introduce dependence on global constants; otherwise the distance-to-x0 adaptation alone does not suffice for the claimed universality.

minor comments (2)

[Abstract] Abstract: the phrases “high-order Lipschitz derivative” and “(L0,L1)-smooth functions” are used without definition or citation; a one-sentence clarification would improve accessibility.
[Algorithm 1] Notation: the precise functional form of the distance adaptation (how ||xk−x0|| enters the dual-averaging weights) should be stated explicitly in the algorithm box rather than only in the surrounding text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need to confirm that the universality claims rest exclusively on the local growth hypothesis. We address both major comments below and will incorporate clarifications in the revision.

read point-by-point responses

Referee: [Main convergence theorem] the argument must be shown to close exclusively under the local growth hypothesis. If the analysis of the distance-adapted step sizes invokes any global trajectory property (bounded sublevel sets, a priori bound on ||x0−x*||, or global growth away from x*), the “arbitrary x0, unbounded domain” part of the claim is not supported by the stated assumption alone.

Authors: The proof of the main convergence theorem (Theorem 3.1) is constructed so that all bounds are derived from the local growth condition f(x)−f* ≳ dist(x,x*)^p together with the observed gradient norms and the Euclidean distances ||x_t − x_0|| that appear in the dual-averaging coefficients. No a-priori bound on ||x_0 − x*||, no global growth away from x*, and no boundedness of sublevel sets are invoked; the induction is closed by relating the local growth directly to the accumulated dual variables. We will add an explicit paragraph immediately after the statement of Theorem 3.1 that lists the quantities used in the argument and confirms that each is controlled by the local hypothesis alone. revision: partial
Referee: [Constrained case] the projection step onto a possibly unbounded feasible set must be shown not to re-introduce dependence on global constants; otherwise the distance-to-x0 adaptation alone does not suffice for the claimed universality.

Authors: In Section 4 the projection is performed onto the (possibly unbounded) feasible set C, but the distance-adaptation rule continues to use only ||x_t − x_0|| and the observed subgradient norms; the projection does not alter the form of the dual update or introduce any diameter or global Lipschitz constant. Because x_0 ∈ C by assumption and the local growth condition is stated with respect to the minimizer inside C, the same local argument applies verbatim. We will insert a short lemma (Lemma 4.2) that verifies the projection step preserves the local-growth-based bounds used in the unconstrained case. revision: partial

Circularity Check

0 steps flagged

No circularity: adaptation uses observed data under independent local growth assumption

full rationale

The abstract and description present DADA as a dual-averaging scheme whose coefficients are updated from observed gradients and ||x_k - x_0||, with convergence rates conditioned on an external local growth hypothesis f(x)-f* ≳ dist(x,x*)^p near the minimizer. This hypothesis is stated as the premise under which universality holds for multiple smoothness classes; it is not derived from the algorithm or fitted to its outputs. No equations, self-citations, or parameter-fitting steps are exhibited that would reduce any claimed rate to a tautology or to a quantity already supplied by the input data. The claim of operating without prior knowledge of T or ε is realized by the online adaptation rule itself and does not collapse into a post-hoc fit. The derivation chain therefore remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The only explicit premise visible in the abstract is the bounded local growth condition; no free parameters or invented entities are mentioned.

axioms (1)

domain assumption The local growth of the objective function around its minimizer can be bounded.
Explicitly stated in the abstract as the prerequisite for the universal guarantees.

pith-pipeline@v0.9.0 · 5689 in / 1311 out tokens · 32839 ms · 2026-05-23T04:51:50.905249+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

[1]

F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statis- tics, 4:384 –414, 2010. doi: 10.1214/09-EJS521. url: https://doi.org/10.1214/ 09-EJS521. 11

work page doi:10.1214/09-ejs521 2010
[2]

Bottou, F

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale ma- chine learning. SIAM Review, 60(2):223–311, 2018. url: https://doi.org/10. 1137/16M1080173

work page 2018
[3]

Carmon and O

Y. Carmon and O. Hinder. Making SGD parameter-free. In Proceedings of Thirty Fifth Conference on Learning Theory , volume 178, pages 2360–2389, 2022. url: https://proceedings.mlr.press/v178/carmon22a.html

work page 2022
[4]

Cutkosky and F

A. Cutkosky and F. Orabona. Black-box reductions for parameter-free online learn- ing in Banach spaces. In Annual Conference Computational Learning Theory, 2018. url: https://api.semanticscholar.org/CorpusID:3346292

work page 2018
[5]

Defazio and K

A. Defazio and K. Mishchenko. Learning-rate-free learning by D-adaptation. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 7449–7479, 2023. url: https://proceedings.mlr.press/v202/defazio23a. html

work page 2023
[6]

N. Doikov. Minimizing quasi-self-concordant functions by gradient regularization of Newton method, 2023. eprint: 2308.14742. url: https://arxiv.org/abs/2308. 14742

work page arXiv 2023
[7]

Doikov, K

N. Doikov, K. Mishchenko, and Y. Nesterov. Super-universal regularized New- ton method. SIAM Journal on Optimization , 34(1):27–56, 2024. doi: 10 . 1137 / 22M1519444. eprint: https://doi.org/10.1137/22M1519444. url: https://doi. org/10.1137/22M1519444

work page doi:10.1137/22m1519444 2024
[8]

Duchi, E

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research , 12(61):2121– 2159, 2011. url: http://jmlr.org/papers/v12/duchi11a.html

work page 2011
[9]

M. Ivgi, O. Hinder, and Y. Carmon. DoG is SGD’s best friend: a parameter-free dynamic step size schedule. In Proceedings of the 40th International Conference on Machine Learning, pages 14465–14499, 2023. url: https : / / proceedings . mlr . press/v202/ivgi23a.html

work page 2023
[10]

Khaled, K

A. Khaled, K. Mishchenko, and C. Jin. DoWG unleashed: an efficient universal parameter-free gradient descent method. In Advances in Neural Information Pro- cessing Systems, volume 36, pages 6748–6769, 2023. url: https://proceedings. neurips.cc/paper_files/paper/2023/file/15ce36d35622f126f38e90167de1a350- Paper-Conference.pdf

work page 2023
[11]

Liu and Z

Z. Liu and Z. Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises. ArXiv, abs/2303.12277, 2023. url: https://api.semanticscholar.org/ CorpusID:257663403

work page arXiv 2023
[12]

Mishchenko and A

K. Mishchenko and A. Defazio. Prodigy: an expeditiously adaptive parameter-free learner. In Proceedings of the 41st International Conference on Machine Learning , volume 235, pages 35779–35804, 2024. url: https://proceedings.mlr.press/ v235/mishchenko24a.html

work page 2024
[13]

Nesterov

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical program- ming, 103:127–152, 2005. 12

work page 2005
[14]

Nesterov

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120:221–259, 2005. url: https : / / api . semanticscholar . org / CorpusID:14935076

work page 2005
[15]

Nesterov

Y. Nesterov. Universal gradient methods for convex optimization problems. Math- ematical Programming, 152:381–404, 2015. url: https://api.semanticscholar. org/CorpusID:18062781

work page 2015
[16]

Nesterov

Y. Nesterov. Lectures on Convex Optimization . Springer Publishing Company, In- corporated, 2nd edition, 2018, pages 221–259. isbn: 3319915770. url: https:// api.semanticscholar.org/CorpusID:14935076

work page 2018
[17]

Nesterov

Y. Nesterov. Primal subgradient methods with predefined step sizes. Journal of Optimization Theory and Applications , 2024. doi: 10.1007/s10957-024-02456-9. url: https://arxiv.org/abs/2308.14742

work page doi:10.1007/s10957-024-02456-9 2024
[18]

Orabona and T

F. Orabona and T. Tommasi. Training deep networks without learning rates through coin betting. In Neural Information Processing Systems, 2017. url: https://api. semanticscholar.org/CorpusID:6762437

work page 2017
[19]

Carbon Emissions and Large Neural Network Training

D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean. Carbon emissions and large neural network training, 2021. eprint: 2104.10350. url: https://arxiv.org/abs/2104.10350

work page internal anchor Pith review Pith/arXiv arXiv 2021
[20]

Rodomanov, X

A. Rodomanov, X. Jiang, and S. U. Stich. Universality of AdaGrad stepsizes for stochastic optimization: inexact oracle, acceleration and variance reduction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. url: https://openreview.net/forum?id=rniiAVjHi5

work page 2024
[21]

Rodomanov and Y

A. Rodomanov and Y. Nesterov. Smoothness parameter of power of Euclidean norm. Journal of Optimization Theory and Applications , 185:303–326, 2019. url: https: //api.semanticscholar.org/CorpusID:198968030

work page 2019
[22]

Sharir, B

O. Sharir, B. Peleg, and Y. Shoham. The cost of training NLP models: a concise overview, 2020. eprint: 2004.08900. url: https://arxiv.org/abs/2004.08900

work page arXiv 2020
[23]

Vankov, A

D. Vankov, A. Rodomanov, A. Nedich, L. Sankar, and S. U. Stich. Optimizing (L0, L1)-smooth functions by gradient methods, 2024. eprint: 2410 . 10800. url: https://arxiv.org/abs/2410.10800

work page arXiv 2024
[24]

Zhang, T

J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates train- ing: a theoretical justification for adaptivity. In International Conference on Learn- ing Representations, 2020. url: https://openreview.net/forum?id=BJgnXpVYwS. 13 A Auxiliary Results The following result has been established in prior works such as [11, Lemma 30]. We inc...

work page 2020
[25]

Then, we have the following inequalities for all 0 ≤ k ≤ T : ¯rk ≤ ¯D, D k ≤ D0 + √ 2 c ¯D, where ¯D := max ¯r, 2c c− √ 2 D0 and Dk := ∥xk − x∗∥. Proof. Both bounds are clearly valid for k = 0, so it suffices to consider only the case when 1 ≤ k ≤ T . Applying Lemma B.2, dropping the nonnegative ¯ rivi from the left-hand side and rearranging, we obtain D2...

work page

[1] [1]

F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statis- tics, 4:384 –414, 2010. doi: 10.1214/09-EJS521. url: https://doi.org/10.1214/ 09-EJS521. 11

work page doi:10.1214/09-ejs521 2010

[2] [2]

Bottou, F

L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale ma- chine learning. SIAM Review, 60(2):223–311, 2018. url: https://doi.org/10. 1137/16M1080173

work page 2018

[3] [3]

Carmon and O

Y. Carmon and O. Hinder. Making SGD parameter-free. In Proceedings of Thirty Fifth Conference on Learning Theory , volume 178, pages 2360–2389, 2022. url: https://proceedings.mlr.press/v178/carmon22a.html

work page 2022

[4] [4]

Cutkosky and F

A. Cutkosky and F. Orabona. Black-box reductions for parameter-free online learn- ing in Banach spaces. In Annual Conference Computational Learning Theory, 2018. url: https://api.semanticscholar.org/CorpusID:3346292

work page 2018

[5] [5]

Defazio and K

A. Defazio and K. Mishchenko. Learning-rate-free learning by D-adaptation. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 7449–7479, 2023. url: https://proceedings.mlr.press/v202/defazio23a. html

work page 2023

[6] [6]

N. Doikov. Minimizing quasi-self-concordant functions by gradient regularization of Newton method, 2023. eprint: 2308.14742. url: https://arxiv.org/abs/2308. 14742

work page arXiv 2023

[7] [7]

Doikov, K

N. Doikov, K. Mishchenko, and Y. Nesterov. Super-universal regularized New- ton method. SIAM Journal on Optimization , 34(1):27–56, 2024. doi: 10 . 1137 / 22M1519444. eprint: https://doi.org/10.1137/22M1519444. url: https://doi. org/10.1137/22M1519444

work page doi:10.1137/22m1519444 2024

[8] [8]

Duchi, E

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research , 12(61):2121– 2159, 2011. url: http://jmlr.org/papers/v12/duchi11a.html

work page 2011

[9] [9]

M. Ivgi, O. Hinder, and Y. Carmon. DoG is SGD’s best friend: a parameter-free dynamic step size schedule. In Proceedings of the 40th International Conference on Machine Learning, pages 14465–14499, 2023. url: https : / / proceedings . mlr . press/v202/ivgi23a.html

work page 2023

[10] [10]

Khaled, K

A. Khaled, K. Mishchenko, and C. Jin. DoWG unleashed: an efficient universal parameter-free gradient descent method. In Advances in Neural Information Pro- cessing Systems, volume 36, pages 6748–6769, 2023. url: https://proceedings. neurips.cc/paper_files/paper/2023/file/15ce36d35622f126f38e90167de1a350- Paper-Conference.pdf

work page 2023

[11] [11]

Liu and Z

Z. Liu and Z. Zhou. Stochastic nonsmooth convex optimization with heavy-tailed noises. ArXiv, abs/2303.12277, 2023. url: https://api.semanticscholar.org/ CorpusID:257663403

work page arXiv 2023

[12] [12]

Mishchenko and A

K. Mishchenko and A. Defazio. Prodigy: an expeditiously adaptive parameter-free learner. In Proceedings of the 41st International Conference on Machine Learning , volume 235, pages 35779–35804, 2024. url: https://proceedings.mlr.press/ v235/mishchenko24a.html

work page 2024

[13] [13]

Nesterov

Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical program- ming, 103:127–152, 2005. 12

work page 2005

[14] [14]

Nesterov

Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, 120:221–259, 2005. url: https : / / api . semanticscholar . org / CorpusID:14935076

work page 2005

[15] [15]

Nesterov

Y. Nesterov. Universal gradient methods for convex optimization problems. Math- ematical Programming, 152:381–404, 2015. url: https://api.semanticscholar. org/CorpusID:18062781

work page 2015

[16] [16]

Nesterov

Y. Nesterov. Lectures on Convex Optimization . Springer Publishing Company, In- corporated, 2nd edition, 2018, pages 221–259. isbn: 3319915770. url: https:// api.semanticscholar.org/CorpusID:14935076

work page 2018

[17] [17]

Nesterov

Y. Nesterov. Primal subgradient methods with predefined step sizes. Journal of Optimization Theory and Applications , 2024. doi: 10.1007/s10957-024-02456-9. url: https://arxiv.org/abs/2308.14742

work page doi:10.1007/s10957-024-02456-9 2024

[18] [18]

Orabona and T

F. Orabona and T. Tommasi. Training deep networks without learning rates through coin betting. In Neural Information Processing Systems, 2017. url: https://api. semanticscholar.org/CorpusID:6762437

work page 2017

[19] [19]

Carbon Emissions and Large Neural Network Training

D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean. Carbon emissions and large neural network training, 2021. eprint: 2104.10350. url: https://arxiv.org/abs/2104.10350

work page internal anchor Pith review Pith/arXiv arXiv 2021

[20] [20]

Rodomanov, X

A. Rodomanov, X. Jiang, and S. U. Stich. Universality of AdaGrad stepsizes for stochastic optimization: inexact oracle, acceleration and variance reduction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. url: https://openreview.net/forum?id=rniiAVjHi5

work page 2024

[21] [21]

Rodomanov and Y

A. Rodomanov and Y. Nesterov. Smoothness parameter of power of Euclidean norm. Journal of Optimization Theory and Applications , 185:303–326, 2019. url: https: //api.semanticscholar.org/CorpusID:198968030

work page 2019

[22] [22]

Sharir, B

O. Sharir, B. Peleg, and Y. Shoham. The cost of training NLP models: a concise overview, 2020. eprint: 2004.08900. url: https://arxiv.org/abs/2004.08900

work page arXiv 2020

[23] [23]

Vankov, A

D. Vankov, A. Rodomanov, A. Nedich, L. Sankar, and S. U. Stich. Optimizing (L0, L1)-smooth functions by gradient methods, 2024. eprint: 2410 . 10800. url: https://arxiv.org/abs/2410.10800

work page arXiv 2024

[24] [24]

Zhang, T

J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates train- ing: a theoretical justification for adaptivity. In International Conference on Learn- ing Representations, 2020. url: https://openreview.net/forum?id=BJgnXpVYwS. 13 A Auxiliary Results The following result has been established in prior works such as [11, Lemma 30]. We inc...

work page 2020

[25] [25]

Then, we have the following inequalities for all 0 ≤ k ≤ T : ¯rk ≤ ¯D, D k ≤ D0 + √ 2 c ¯D, where ¯D := max ¯r, 2c c− √ 2 D0 and Dk := ∥xk − x∗∥. Proof. Both bounds are clearly valid for k = 0, so it suffices to consider only the case when 1 ≤ k ≤ T . Applying Lemma B.2, dropping the nonnegative ¯ rivi from the left-hand side and rearranging, we obtain D2...

work page