DADA: Dual Averaging with Distance Adaptation
Pith reviewed 2026-05-23 04:51 UTC · model grok-4.3
The pith
DADA is a dual averaging method that dynamically adjusts coefficients from gradients and distances to the start point, delivering universal guarantees for convex problems under a local growth bound around the minimizer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DADA, based on the classical dual averaging scheme, dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point. This eliminates the need for problem-specific parameters and provides universal guarantees for a broad spectrum of convex problems where the local growth of the objective around its minimizer can be bounded, covering examples like nonsmooth Lipschitz functions, Lipschitz-smooth functions, Holder-smooth functions, functions with high-order Lipschitz derivatives, quasi-self-concordant functions, and (L0, L1)-smooth functions. The method works for unconstrained and constrained problems, including unbounded domains, without
What carries the argument
Dual averaging scheme with dynamic coefficient adjustment based on gradients and distance to the starting point.
If this is right
- The algorithm simultaneously achieves good performance on nonsmooth Lipschitz, Lipschitz-smooth, Holder-smooth, high-order Lipschitz derivative, quasi-self-concordant, and (L0,L1)-smooth convex problems under the local growth bound.
- It extends to constrained optimization over possibly unbounded domains.
- No problem-specific parameters like step sizes tuned to iteration count or accuracy are needed.
- Convergence holds as long as the local growth condition is satisfied.
Where Pith is reading between the lines
- The distance-based adaptation may allow the method to remain effective when the overall scale of the problem is unknown in advance.
- A single implementation could replace several specialized solvers in software libraries for convex problems.
- Similar adaptation ideas might be tested on stochastic or composite problems where only noisy gradient information is available.
Load-bearing premise
The local growth of the objective function around its minimizer can be bounded.
What would settle it
Running DADA to convergence on a convex function known to satisfy the local growth bound and checking whether the observed rate matches the universal bound claimed for that growth class without any parameter tuning.
Figures
read the original abstract
We present a novel universal gradient method for solving convex optimization problems. Our algorithm, Dual Averaging with Distance Adaptation (DADA), is based on the classical scheme of dual averaging and dynamically adjusts its coefficients based on observed gradients and the distance between iterates and the starting point, eliminating the need for problem-specific parameters. DADA is a universal algorithm that simultaneously works for a broad spectrum of problem classes, provided the local growth of the objective function around its minimizer can be bounded. Particular examples of such problem classes are nonsmooth Lipschitz functions, Lipschitz-smooth functions, H\"older-smooth functions, functions with high-order Lipschitz derivative, quasi-self-concordant functions, and $(L_0,L_1)$-smooth functions. Crucially, DADA is applicable to both unconstrained and constrained problems, even when the domain is unbounded, without requiring prior knowledge of the number of iterations or desired accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Dual Averaging with Distance Adaptation (DADA), a parameter-free variant of dual averaging that adjusts coefficients using observed gradients and the Euclidean distance from current iterates to the initial point x0. It claims universal convergence rates (without knowledge of T or target accuracy) for convex problems belonging to multiple smoothness classes—nonsmooth Lipschitz, Lipschitz-smooth, Hölder-smooth, high-order Lipschitz derivatives, quasi-self-concordant, and (L0,L1)-smooth—provided only that a local growth condition f(x)−f* ≳ dist(x,x*)^p holds near the minimizer. The method is asserted to apply equally to constrained and unconstrained problems on unbounded domains.
Significance. A correct proof of the claimed universality would constitute a meaningful advance in parameter-free first-order methods, as it would supply a single algorithm whose rates adapt automatically across disparate function classes on unbounded domains using only local growth information.
major comments (2)
- [Main convergence theorem] Main convergence theorem (location not specified in abstract but central to the universality claim): the argument must be shown to close exclusively under the local growth hypothesis. If the analysis of the distance-adapted step sizes invokes any global trajectory property (bounded sublevel sets, a priori bound on ||x0−x*||, or global growth away from x*), the “arbitrary x0, unbounded domain” part of the claim is not supported by the stated assumption alone.
- [Constrained case] Section on constrained problems (presumably §4 or §5): the projection step onto a possibly unbounded feasible set must be shown not to re-introduce dependence on global constants; otherwise the distance-to-x0 adaptation alone does not suffice for the claimed universality.
minor comments (2)
- [Abstract] Abstract: the phrases “high-order Lipschitz derivative” and “(L0,L1)-smooth functions” are used without definition or citation; a one-sentence clarification would improve accessibility.
- [Algorithm 1] Notation: the precise functional form of the distance adaptation (how ||xk−x0|| enters the dual-averaging weights) should be stated explicitly in the algorithm box rather than only in the surrounding text.
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the need to confirm that the universality claims rest exclusively on the local growth hypothesis. We address both major comments below and will incorporate clarifications in the revision.
read point-by-point responses
-
Referee: [Main convergence theorem] the argument must be shown to close exclusively under the local growth hypothesis. If the analysis of the distance-adapted step sizes invokes any global trajectory property (bounded sublevel sets, a priori bound on ||x0−x*||, or global growth away from x*), the “arbitrary x0, unbounded domain” part of the claim is not supported by the stated assumption alone.
Authors: The proof of the main convergence theorem (Theorem 3.1) is constructed so that all bounds are derived from the local growth condition f(x)−f* ≳ dist(x,x*)^p together with the observed gradient norms and the Euclidean distances ||x_t − x_0|| that appear in the dual-averaging coefficients. No a-priori bound on ||x_0 − x*||, no global growth away from x*, and no boundedness of sublevel sets are invoked; the induction is closed by relating the local growth directly to the accumulated dual variables. We will add an explicit paragraph immediately after the statement of Theorem 3.1 that lists the quantities used in the argument and confirms that each is controlled by the local hypothesis alone. revision: partial
-
Referee: [Constrained case] the projection step onto a possibly unbounded feasible set must be shown not to re-introduce dependence on global constants; otherwise the distance-to-x0 adaptation alone does not suffice for the claimed universality.
Authors: In Section 4 the projection is performed onto the (possibly unbounded) feasible set C, but the distance-adaptation rule continues to use only ||x_t − x_0|| and the observed subgradient norms; the projection does not alter the form of the dual update or introduce any diameter or global Lipschitz constant. Because x_0 ∈ C by assumption and the local growth condition is stated with respect to the minimizer inside C, the same local argument applies verbatim. We will insert a short lemma (Lemma 4.2) that verifies the projection step preserves the local-growth-based bounds used in the unconstrained case. revision: partial
Circularity Check
No circularity: adaptation uses observed data under independent local growth assumption
full rationale
The abstract and description present DADA as a dual-averaging scheme whose coefficients are updated from observed gradients and ||x_k - x_0||, with convergence rates conditioned on an external local growth hypothesis f(x)-f* ≳ dist(x,x*)^p near the minimizer. This hypothesis is stated as the premise under which universality holds for multiple smoothness classes; it is not derived from the algorithm or fitted to its outputs. No equations, self-citations, or parameter-fitting steps are exhibited that would reduce any claimed rate to a tautology or to a quantity already supplied by the input data. The claim of operating without prior knowledge of T or ε is realized by the online adaptation rule itself and does not collapse into a post-hoc fit. The derivation chain therefore remains self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The local growth of the objective function around its minimizer can be bounded.
Reference graph
Works this paper leans on
-
[1]
F. Bach. Self-concordant analysis for logistic regression. Electronic Journal of Statis- tics, 4:384 –414, 2010. doi: 10.1214/09-EJS521. url: https://doi.org/10.1214/ 09-EJS521. 11
- [2]
-
[3]
Y. Carmon and O. Hinder. Making SGD parameter-free. In Proceedings of Thirty Fifth Conference on Learning Theory , volume 178, pages 2360–2389, 2022. url: https://proceedings.mlr.press/v178/carmon22a.html
work page 2022
-
[4]
A. Cutkosky and F. Orabona. Black-box reductions for parameter-free online learn- ing in Banach spaces. In Annual Conference Computational Learning Theory, 2018. url: https://api.semanticscholar.org/CorpusID:3346292
work page 2018
-
[5]
A. Defazio and K. Mishchenko. Learning-rate-free learning by D-adaptation. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 7449–7479, 2023. url: https://proceedings.mlr.press/v202/defazio23a. html
work page 2023
- [6]
-
[7]
N. Doikov, K. Mishchenko, and Y. Nesterov. Super-universal regularized New- ton method. SIAM Journal on Optimization , 34(1):27–56, 2024. doi: 10 . 1137 / 22M1519444. eprint: https://doi.org/10.1137/22M1519444. url: https://doi. org/10.1137/22M1519444
- [8]
-
[9]
M. Ivgi, O. Hinder, and Y. Carmon. DoG is SGD’s best friend: a parameter-free dynamic step size schedule. In Proceedings of the 40th International Conference on Machine Learning, pages 14465–14499, 2023. url: https : / / proceedings . mlr . press/v202/ivgi23a.html
work page 2023
-
[10]
A. Khaled, K. Mishchenko, and C. Jin. DoWG unleashed: an efficient universal parameter-free gradient descent method. In Advances in Neural Information Pro- cessing Systems, volume 36, pages 6748–6769, 2023. url: https://proceedings. neurips.cc/paper_files/paper/2023/file/15ce36d35622f126f38e90167de1a350- Paper-Conference.pdf
work page 2023
- [11]
-
[12]
K. Mishchenko and A. Defazio. Prodigy: an expeditiously adaptive parameter-free learner. In Proceedings of the 41st International Conference on Machine Learning , volume 235, pages 35779–35804, 2024. url: https://proceedings.mlr.press/ v235/mishchenko24a.html
work page 2024
- [13]
- [14]
- [15]
- [16]
-
[17]
Y. Nesterov. Primal subgradient methods with predefined step sizes. Journal of Optimization Theory and Applications , 2024. doi: 10.1007/s10957-024-02456-9. url: https://arxiv.org/abs/2308.14742
-
[18]
F. Orabona and T. Tommasi. Training deep networks without learning rates through coin betting. In Neural Information Processing Systems, 2017. url: https://api. semanticscholar.org/CorpusID:6762437
work page 2017
-
[19]
Carbon Emissions and Large Neural Network Training
D. Patterson, J. Gonzalez, Q. Le, C. Liang, L.-M. Munguia, D. Rothchild, D. So, M. Texier, and J. Dean. Carbon emissions and large neural network training, 2021. eprint: 2104.10350. url: https://arxiv.org/abs/2104.10350
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[20]
A. Rodomanov, X. Jiang, and S. U. Stich. Universality of AdaGrad stepsizes for stochastic optimization: inexact oracle, acceleration and variance reduction. In The Thirty-eighth Annual Conference on Neural Information Processing Systems , 2024. url: https://openreview.net/forum?id=rniiAVjHi5
work page 2024
-
[21]
A. Rodomanov and Y. Nesterov. Smoothness parameter of power of Euclidean norm. Journal of Optimization Theory and Applications , 185:303–326, 2019. url: https: //api.semanticscholar.org/CorpusID:198968030
work page 2019
- [22]
- [23]
-
[24]
J. Zhang, T. He, S. Sra, and A. Jadbabaie. Why gradient clipping accelerates train- ing: a theoretical justification for adaptivity. In International Conference on Learn- ing Representations, 2020. url: https://openreview.net/forum?id=BJgnXpVYwS. 13 A Auxiliary Results The following result has been established in prior works such as [11, Lemma 30]. We inc...
work page 2020
-
[25]
Then, we have the following inequalities for all 0 ≤ k ≤ T : ¯rk ≤ ¯D, D k ≤ D0 + √ 2 c ¯D, where ¯D := max ¯r, 2c c− √ 2 D0 and Dk := ∥xk − x∗∥. Proof. Both bounds are clearly valid for k = 0, so it suffices to consider only the case when 1 ≤ k ≤ T . Applying Lemma B.2, dropping the nonnegative ¯ rivi from the left-hand side and rearranging, we obtain D2...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.