Stochastic convergence of parallel asynchronous adaptive first-order methods

Philippe L. Toint; Serge Gratton

arxiv: 2606.01787 · v1 · pith:EXZ4HOBUnew · submitted 2026-06-01 · 💻 cs.AI · math.OC

Stochastic convergence of parallel asynchronous adaptive first-order methods

Serge Gratton , Philippe L. Toint This is my paper

Pith reviewed 2026-06-28 14:26 UTC · model grok-4.3

classification 💻 cs.AI math.OC

keywords asynchronous optimizationadaptive first-order methodsstochastic convergencenon-convex optimizationparallel algorithmsmachine learningmomentum methods

0 comments

The pith

A new class of asynchronous adaptive first-order methods converges at rate O(1/sqrt t) on non-convex stochastic problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a class of asynchronous adaptive first-order optimization methods that includes parallel variants of several popular algorithms along with options for momentum and inexact normalization. It provides a convergence analysis for these methods applied to non-convex objective functions in a fully stochastic setting that accounts for gradient noise and delayed updates. The analysis shows the methods reach a convergence rate of order O(1/sqrt t) up to logarithmic factors when reasonable assumptions hold on the problem data and asynchrony model. Numerical tests indicate the approach is relevant for large-scale machine learning systems that run on heterogeneous hardware.

Core claim

The paper defines a new class of asynchronous adaptive first-order methods and proves that, under reasonable assumptions, their iterates converge to a stationary point of a non-convex stochastic objective at rate O(1/sqrt t) up to logarithmic factors in a fully stochastic parallel setting.

What carries the argument

The class of asynchronous adaptive first-order methods, whose stochastic convergence is controlled by bounding the combined effects of gradient noise and update delays.

If this is right

Variants that incorporate momentum or inexact normalization inherit the same O(1/sqrt t) stochastic rate.
The methods remain theoretically justified for use in parallel asynchronous environments without requiring synchronization.
The rate result applies directly to non-convex stochastic optimization problems that arise in machine learning.
Numerical evidence supports practical deployment in heterogeneous large-scale systems where synchronization costs are high.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework can be specialized to produce convergence guarantees for concrete asynchronous versions of Adam, RMSprop, or Adagrad.
The same bounding technique on delays and noise may extend to other first-order methods that use adaptive step sizes.
In distributed training the result suggests communication savings are possible while preserving the canonical stochastic rate.
Testing the rate on problems with heavier-tailed noise or more extreme delay distributions would probe the boundary of the reasonable assumptions.

Load-bearing premise

The analysis depends on unspecified reasonable assumptions about the objective functions, gradient noise statistics, and asynchrony model that keep delayed updates from destroying the rate.

What would settle it

A concrete counter-example or numerical run on a non-convex stochastic problem satisfying the paper's reasonable assumptions yet exhibiting a convergence rate strictly slower than O(1/sqrt t) up to logs would falsify the claim.

read the original abstract

A new class of asynchronous adaptive first-order optimization methods is introduced, comprising asynchronous variants of several popular algorithms. Versions of these methods using momentum and/or inexact normalization are also considered. The convergence of methods in the class on non-convex functions is analyzed in a fully stochastic setting, and is shown to be (up to logarithmic factors) of order O(1/sqrt{t}) under reasonable assumptions. Numerical experiments suggest that such asynchronous adaptive algorithms are very relevant in heterogeneous large-scale machine learning systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper defines a broad class of async adaptive first-order methods and derives an O(1/sqrt t) rate (up to logs) for stochastic non-convex problems, but the result rests on unspecified assumptions whose reasonableness is hard to judge from the abstract alone.

read the letter

The main contribution is a unified treatment of asynchronous variants of adaptive methods (including momentum and inexact normalization) together with a stochastic convergence analysis on non-convex objectives. That combination has not appeared in exactly this form before, and the claimed rate matches the usual minimax rate for stochastic non-convex optimization. The numerical experiments are presented only as motivation that these methods matter for heterogeneous large-scale ML, which is a reasonable use of space.

The soft spot is the reliance on “reasonable assumptions” about the objective, noise, and delay model. Without seeing the precise statements or the proof steps that control the effect of stale gradients, it is difficult to tell how restrictive those assumptions are or whether they cover the practical regimes the experiments aim at. The abstract gives no indication of circularity or internal inconsistency, but the lack of explicit conditions is the obvious place a referee would press.

This paper is aimed at researchers working on distributed or asynchronous optimization who want convergence guarantees that cover the adaptive methods already in use. A reader already familiar with the separate literatures on asynchrony and on adaptive methods will get the most out of it. The work is coherent on its own terms and shows clear engagement with the relevant literature, so it deserves a serious referee even if the assumptions turn out to need tightening.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces a new class of asynchronous adaptive first-order optimization methods, including variants that incorporate momentum and/or inexact normalization. It analyzes the convergence of methods in this class for non-convex objectives in a fully stochastic setting and establishes a rate of O(1/sqrt{t}) (up to logarithmic factors) under reasonable assumptions on the objective, noise, and asynchrony model. Numerical experiments are provided to illustrate relevance in heterogeneous large-scale machine learning systems.

Significance. If the stated convergence result holds, the work is significant because it extends standard stochastic non-convex rates to the asynchronous adaptive setting while matching the known minimax rate (up to logs). This provides theoretical grounding for practical algorithms used in distributed and heterogeneous environments. The analysis is presented as a direct derivation from assumptions to rate with no indicated circularity or reduction to fitted quantities.

minor comments (2)

[Abstract] Abstract: the phrase 'reasonable assumptions' is used without a parenthetical summary or forward reference to the precise conditions (e.g., bounded variance, delay bounds, or smoothness parameters) that appear in the main analysis; this affects immediate readability even if the conditions are stated later.
The numerical experiments section would benefit from an explicit statement of the hardware heterogeneity model and delay distribution used, to allow direct comparison with the theoretical asynchrony assumptions.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. The report accurately captures the paper's contributions on asynchronous adaptive first-order methods and their convergence rates.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces a class of asynchronous adaptive first-order methods and derives their O(1/sqrt t) convergence rate (up to logs) on non-convex objectives in a stochastic setting directly from stated assumptions on the objective function, gradient noise, and delay model. This is a standard mathematical analysis with no quoted steps in which a claimed prediction or uniqueness result reduces by construction to a fitted parameter, self-citation chain, or renamed input. The central claim remains independent of any internal redefinition or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The O(1/sqrt t) claim rests on domain assumptions about the objective and noise that are not enumerated in the abstract; no free parameters or invented entities are visible from the given text.

axioms (1)

domain assumption Reasonable assumptions on objective functions and stochastic gradients that enable the rate
Invoked to bound the effect of asynchrony and noise in the convergence analysis.

pith-pipeline@v0.9.1-grok · 5597 in / 1110 out tokens · 28514 ms · 2026-06-28T14:26:09.067219+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

bAdag: an adaptive block coordinate gradient method for smooth nonconvex functions
math.OC 2026-06 unverdicted novelty 6.0

Introduces bAdag, an AdaGrad-based block coordinate gradient method with ergodic sublinear convergence proofs for smooth nonconvex objectives under block Lipschitz gradient assumptions, covering cyclic, uniform random...

Reference graph

Works this paper leans on

58 extracted references · 16 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1]

Attia and T

A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv:2302.08783, 2023

work page arXiv 2023
[2]

Bernstein and L

J. Bernstein and L. Newhouse. Modular duality in deep learning. arXiv:2410.21265v2, 2024

work page arXiv 2024
[3]

Old Optimizer, New Norm: An Anthology

J. Bernstein and L. Newhouse. Old optimizer, new norm: an anthology. arXiv:2409.20325v2, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Bertsekas and J

D.P. Bertsekas and J. N. Tsitsiklis.Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989

1989
[5]

Bj¨ orck and C

˚A. Bj¨ orck and C. Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971

1971
[6]

J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables.Comput. Electron. Agric., 24(3):131–151, 1999

1999
[7]

Boyd and L

S. Boyd and L. Vandenberghe.Convex Optimization. Cambridge University Press, Cambridge, England, 2004

2004
[8]

Cannelli, F

L. Cannelli, F. Facchinei, V. Kungurstev, and G. Scutari. Asynchronous parallel algorithms for nonconvex optimization.Mathematical Programming, Series A, 184:121–154, 2020

2020
[9]

S. C. Chaturapruek, J. C. Duchi, and Ch. R´ e. Asynchronous stochastic convex optimization. InNIPS’15: Proceedings of the 29th International Conference on Neural Information Processing Systems, volume 1, pages 1531–1539, 2015

2015
[10]

Chazan and W

D. Chazan and W. Miranker. Chaotic relaxation.Linear Algebra and its Applications, 2(2):199–222, 1969

1969
[11]

A. R. Conn, N. I. M. Gould, and Ph. L. Toint.LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A). Number 17 in Springer Series in Computational Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1992

1992
[12]

Duchi, E

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimiza- tion.Journal of Machine Learning Research, 12, July 2011

2011
[13]

M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. InProceedings of 35th Conference on Learning Theory, volume 178, pages 313–355, 2022

2022
[14]

H. R. Feyzmahdavian and M. Johansson. Asynchronous iterations in optimization: New sequence results and sharper algorithmic guarantees.Journal of Machine Learning Research, 24:1–75, 2023

2023
[15]

Frommer and D

A. Frommer and D. B. Szyld. On asynchronous iterations.Journal of Computational and Applied Mathematics, 123(1–2):201–216, 2000

2000
[16]

Ginsburg, P

B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, R. Leary, V. Lavrukhin, J. Li, H. Nguyen, Y. Zhang, and J. M. Cohen. Training deep networks with stochastic gradient normalized by layerwise adaptive second moments. arXiv:1905.11286v3, 2019. 9Among other possibilities, we think here of structured networks such as mix of experts [23], DeepONets [35] or...

work page arXiv 1905
[17]

Gratton, V

S. Gratton, V. Mercier, E. Riccietti, and Ph. L. Toint. A block-coordinate approach of multi-level optimization with an application to physics-informed neural networks.Computational Optimization and Applications, 89(2):385–417, 2024

2024
[18]

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

S. Gratton and Ph. L. Toint. A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon. arXiv:2604.17423, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[19]

Griewank and Ph

A. Griewank and Ph. L. Toint. Local convergence analysis for partitioned quasi-Newton updates.Numerische Mathematik, 39:429–448, 1982

1982
[20]

Griewank and Ph

A. Griewank and Ph. L. Toint. On the existence of convex decomposition of partially separable functions. Mathematical Programming, 28:25–49, 1984

1984
[21]

Z. Guo, W. Yin, R. Jin, and T. Yang. A novel convergence analysis for algorithms of the Adam family. In Proceedings of OPT2021: 13th Annual Workshop on Optimization for Machine Learning, 2021

2021
[22]

Gupta, T

V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. InProceedings of the International Conference on Machine Learning, pages 1842–1850, 2018

2018
[23]

J. B. Hampshire and A. Waibel. The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):751––769, 1992

1992
[24]

F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context.ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), 2025

2025
[25]

, volume =

Y. Hong and J. Lin. Revisting convergence of Adagrad with relaxed assumptions. arXiv:2403.13794v2, 2024

work page arXiv 2024
[26]

Jordan, Y

K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cecista, L. Newhouse, and J. Bernstein. URL:https://kellerjordan.github.io.posts/muon, 2024

2024
[27]

Kingma and J

D. Kingma and J. Ba. Adam: A method for stochastic optimization. InProceedings in the International Conference on Learning Representations (ICLR), 2015

2015
[28]

Z. Kovarik. Some iterative methods for improving orthonormality.SIAM Journal on Numerical Analysis, 7(3):386–389, 1970

1970
[29]

Kungurtsev, M

V. Kungurtsev, M. Egan, B. Chatterjee, and D Alistarh. Asynchronous optimization methods for efficient training of deep neural networks with guarantees. arXiv;1905.11845v2, 2019

work page arXiv 1905
[30]

Kungurtsev, M

V. Kungurtsev, M. Egan, B. Chatterjee, and D Alistarh. Asynchronous optimization methods for efficient training of deep neural networks with guarantees. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8209–8216, 2021

2021
[31]

Display advertising challenge

Criteo Labs. Display advertising challenge. Kaggle, 2014

2014
[32]

Leblond, F

R. Leblond, F. Pedregosa, and S. Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods.Journal of Machine Learning Research, 19(81):1–68, 2018

2018
[33]

X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

2015
[34]

J. Liu, J. Lin, X.Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen and. H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, W. Zhou, and Z. Yang. MUON is scalable for LLM training. arXiv:2502.16982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[35]

L. Lu, P. Jin, and G. Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equa- tions based on the universal approximation theorem of operators
[36]

Mania, X

H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization.SIAM Journal on Optimization, 27(4):2202––2229, 2017

2017
[37]

McMahan and M

B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. InConference on Learning Theory, page 244sq, 2010

2010
[38]

Nabli and E

A. Nabli and E. Oyallon. DADA0: Decoupled accelerated decentralized asynchronous optimization. arXiv:2208.00779v3, 2022

work page arXiv 2022
[39]

Nadiradze, I

G. Nadiradze, I. Markov, B. Chatterjee, and V. Kungurstev. Elastic consistency: a practical consistency model for distributed stochastic gradient descent.ACM SIGACT News, 53(2):64–82, 2022

2022
[40]

Netzer, T

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A.Y. Ng. Reading digits in natural images with unsupervised feature learning, 2011

2011
[41]

L. M. Nguyen, P. H. Nguyen, P. Richtarik, K. Scheinberg, M. Tak´ aˇ c, and M. van Dijk. New convergence aspects of stochastic gradient algorithms.Journal of Machine Learning Research, 20(176):1–49, 2019

2019
[42]

L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtarik, K. Scheinberg, and M. Tak´ aˇ c. SGD and Hogwild! convergence without the bounded gradients assumption.Proceedings of Machine Learning Research, 80:3750– 3758, 2018

2018
[43]

F. Niu, B. Recht, Ch. R´ e, and S. J. Wright. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent.Advances in Neural Information Processing Systems, 4(1):693—-701, 2011. Gratton & Toint: GT35A (draft of 29 V 2026— not for distribution)21

2011
[44]

Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: An algorithmic framework for asynchronous parallel coordinate updates.SIAM Journal on Scientific Computing, 38(5):2851–2879, 2016

2016
[45]

Training Deep Learning Models with Norm-Constrained LMOs

Th. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv:2502.07529v2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

C. Si, D. Zhang, and W. Shen. AdaMuon: Adaptive Muon optimizer. arXiv2507.11005, 2025

work page arXiv 2025
[47]

Ubaru, J

S. Ubaru, J. Chen, and Y. Saad. Fast estimation of$tr(f(a))$via stochastic Lanczos quadrature.SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017

2017
[48]

N. Vyas, D. Morwani, R. Zhao, M. Kwun, I. Shapira, B. Brandfonbrener, L. Janson, and S. Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv:2409.11321, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[49]

B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of Adagrad for non-convex objectives: simple proofs and relaxed assumptions. InProceedings of the Thirty-Sixth Conference on Learning Theory, pages 161–190, 2023

2023
[50]

R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In Proceedings in the International Conference on Machine Learning (ICML2019), 2019

2019
[51]

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST, a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[52]

N. Xiao, X. Hu, X. Lin, and K.-C. Toh. Adam family methods for nonsmooth optimization with convergence guarantees.Journal of Machine Learning Research, 25, 2024

2024
[53]

A. W. Yu, L. Huang, Q. Lin, R. Salakhutdinov, and J. Carbonell. Block-normalized gradient method; an empirical study for training deep neural network. arXiv:1707.04822v2, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[54]

Zhang, Y

M. Zhang, Y. Liu, and H. Schaeffer. AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates. arXiv:2509.02981v2, 2025

work page arXiv 2025
[55]

LX ℓ=1 tr(log Πk,ℓ)− LX ℓ=1 tr(log Π−1,ℓ) # .(A.4) Then, for everyk≥0, ηE

H. Zhuang, Y. Wang, Q. Liu, S. Zhang, and Z. Lin. Fully decoupled neural network learning using delayed gradients. arXiv:1906.09108v3, 2019. Appendix 1: Detailed analysis for momentum-lessDADPREC The proof of Theorem 3.1 follows the broad lines of [18]. In order to keep our argument as compact as possible, we refer, in what follows, to that reference for ...

work page arXiv 1906
[56]

Computeu=Gv,u=u/∥u∥,v=G T u,σ max =∥v∥,v=v/∥v∥(repeat twice)
[57]

Set est(∥G∥ ∗) = 1σmax mX i=1 mX j=1 G2 i,j,
[58]

Savevfor the next call. The cost of usingWSPOWERfor anm×nmatrix amounts toO(4mn) flops for the four matrix-vector products, plusO(mn) flops for∥G∥ 2 F , giving a total ofO(5mn) flops, compared to O(mnmin(m, n)) flops for the SVD. In our experiments with 256×2048 matrices, usingWSPOWER is min(m, n)/5≈50 times faster. Although a mere proxy of∥G∥ ∗,WSPOWERwo...

2048

[1] [1]

Attia and T

A. Attia and T. Koren. SGD with AdaGrad stepsizes: Full adaptivity with high probability to unknown parameters, unbounded gradients and affine variance. arXiv:2302.08783, 2023

work page arXiv 2023

[2] [2]

Bernstein and L

J. Bernstein and L. Newhouse. Modular duality in deep learning. arXiv:2410.21265v2, 2024

work page arXiv 2024

[3] [3]

Old Optimizer, New Norm: An Anthology

J. Bernstein and L. Newhouse. Old optimizer, new norm: an anthology. arXiv:2409.20325v2, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[4] [4]

Bertsekas and J

D.P. Bertsekas and J. N. Tsitsiklis.Parallel and Distributed Computation: Numerical Methods. Prentice Hall, 1989

1989

[5] [5]

Bj¨ orck and C

˚A. Bj¨ orck and C. Bowie. An iterative algorithm for computing the best estimate of an orthogonal matrix. SIAM Journal on Numerical Analysis, 8(2):358–364, 1971

1971

[6] [6]

J. A. Blackard and D. J. Dean. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables.Comput. Electron. Agric., 24(3):131–151, 1999

1999

[7] [7]

Boyd and L

S. Boyd and L. Vandenberghe.Convex Optimization. Cambridge University Press, Cambridge, England, 2004

2004

[8] [8]

Cannelli, F

L. Cannelli, F. Facchinei, V. Kungurstev, and G. Scutari. Asynchronous parallel algorithms for nonconvex optimization.Mathematical Programming, Series A, 184:121–154, 2020

2020

[9] [9]

S. C. Chaturapruek, J. C. Duchi, and Ch. R´ e. Asynchronous stochastic convex optimization. InNIPS’15: Proceedings of the 29th International Conference on Neural Information Processing Systems, volume 1, pages 1531–1539, 2015

2015

[10] [10]

Chazan and W

D. Chazan and W. Miranker. Chaotic relaxation.Linear Algebra and its Applications, 2(2):199–222, 1969

1969

[11] [11]

A. R. Conn, N. I. M. Gould, and Ph. L. Toint.LANCELOT: a Fortran package for large-scale nonlinear optimization (Release A). Number 17 in Springer Series in Computational Mathematics. Springer Verlag, Heidelberg, Berlin, New York, 1992

1992

[12] [12]

Duchi, E

J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimiza- tion.Journal of Machine Learning Research, 12, July 2011

2011

[13] [13]

M. Faw, I. Tziotis, C. Caramanis, A. Mokhtari, S. Shakkottai, and R. Ward. The power of adaptivity in SGD: Self-tuning step sizes with unbounded gradients and affine variance. InProceedings of 35th Conference on Learning Theory, volume 178, pages 313–355, 2022

2022

[14] [14]

H. R. Feyzmahdavian and M. Johansson. Asynchronous iterations in optimization: New sequence results and sharper algorithmic guarantees.Journal of Machine Learning Research, 24:1–75, 2023

2023

[15] [15]

Frommer and D

A. Frommer and D. B. Szyld. On asynchronous iterations.Journal of Computational and Applied Mathematics, 123(1–2):201–216, 2000

2000

[16] [16]

Ginsburg, P

B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, R. Leary, V. Lavrukhin, J. Li, H. Nguyen, Y. Zhang, and J. M. Cohen. Training deep networks with stochastic gradient normalized by layerwise adaptive second moments. arXiv:1905.11286v3, 2019. 9Among other possibilities, we think here of structured networks such as mix of experts [23], DeepONets [35] or...

work page arXiv 1905

[17] [17]

Gratton, V

S. Gratton, V. Mercier, E. Riccietti, and Ph. L. Toint. A block-coordinate approach of multi-level optimization with an application to physics-informed neural networks.Computational Optimization and Applications, 89(2):385–417, 2024

2024

[18] [18]

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

S. Gratton and Ph. L. Toint. A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muon. arXiv:2604.17423, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[19] [19]

Griewank and Ph

A. Griewank and Ph. L. Toint. Local convergence analysis for partitioned quasi-Newton updates.Numerische Mathematik, 39:429–448, 1982

1982

[20] [20]

Griewank and Ph

A. Griewank and Ph. L. Toint. On the existence of convex decomposition of partially separable functions. Mathematical Programming, 28:25–49, 1984

1984

[21] [21]

Z. Guo, W. Yin, R. Jin, and T. Yang. A novel convergence analysis for algorithms of the Adam family. In Proceedings of OPT2021: 13th Annual Workshop on Optimization for Machine Learning, 2021

2021

[22] [22]

Gupta, T

V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. InProceedings of the International Conference on Machine Learning, pages 1842–1850, 2018

2018

[23] [23]

J. B. Hampshire and A. Waibel. The Meta-Pi network: building distributed knowledge representations for robust multisource pattern recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 14(7):751––769, 1992

1992

[24] [24]

F. M. Harper and J. A. Konstan. The MovieLens datasets: History and context.ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4), 2025

2025

[25] [25]

, volume =

Y. Hong and J. Lin. Revisting convergence of Adagrad with relaxed assumptions. arXiv:2403.13794v2, 2024

work page arXiv 2024

[26] [26]

Jordan, Y

K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cecista, L. Newhouse, and J. Bernstein. URL:https://kellerjordan.github.io.posts/muon, 2024

2024

[27] [27]

Kingma and J

D. Kingma and J. Ba. Adam: A method for stochastic optimization. InProceedings in the International Conference on Learning Representations (ICLR), 2015

2015

[28] [28]

Z. Kovarik. Some iterative methods for improving orthonormality.SIAM Journal on Numerical Analysis, 7(3):386–389, 1970

1970

[29] [29]

Kungurtsev, M

V. Kungurtsev, M. Egan, B. Chatterjee, and D Alistarh. Asynchronous optimization methods for efficient training of deep neural networks with guarantees. arXiv;1905.11845v2, 2019

work page arXiv 1905

[30] [30]

Kungurtsev, M

V. Kungurtsev, M. Egan, B. Chatterjee, and D Alistarh. Asynchronous optimization methods for efficient training of deep neural networks with guarantees. InProceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8209–8216, 2021

2021

[31] [31]

Display advertising challenge

Criteo Labs. Display advertising challenge. Kaggle, 2014

2014

[32] [32]

Leblond, F

R. Leblond, F. Pedregosa, and S. Lacoste-Julien. Improved asynchronous parallel optimization analysis for stochastic incremental methods.Journal of Machine Learning Research, 19(81):1–68, 2018

2018

[33] [33]

X. Lian, Y. Huang, Y. Li, and J. Liu. Asynchronous parallel stochastic gradient for nonconvex optimization. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors,Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015

2015

[34] [34]

J. Liu, J. Lin, X.Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen and. H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, W. Zhou, and Z. Yang. MUON is scalable for LLM training. arXiv:2502.16982, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[35] [35]

L. Lu, P. Jin, and G. Karniadakis. DeepONet: Learning nonlinear operators for identifying differential equa- tions based on the universal approximation theorem of operators

[36] [36]

Mania, X

H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization.SIAM Journal on Optimization, 27(4):2202––2229, 2017

2017

[37] [37]

McMahan and M

B. McMahan and M. Streeter. Adaptive bound optimization for online convex optimization. InConference on Learning Theory, page 244sq, 2010

2010

[38] [38]

Nabli and E

A. Nabli and E. Oyallon. DADA0: Decoupled accelerated decentralized asynchronous optimization. arXiv:2208.00779v3, 2022

work page arXiv 2022

[39] [39]

Nadiradze, I

G. Nadiradze, I. Markov, B. Chatterjee, and V. Kungurstev. Elastic consistency: a practical consistency model for distributed stochastic gradient descent.ACM SIGACT News, 53(2):64–82, 2022

2022

[40] [40]

Netzer, T

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A.Y. Ng. Reading digits in natural images with unsupervised feature learning, 2011

2011

[41] [41]

L. M. Nguyen, P. H. Nguyen, P. Richtarik, K. Scheinberg, M. Tak´ aˇ c, and M. van Dijk. New convergence aspects of stochastic gradient algorithms.Journal of Machine Learning Research, 20(176):1–49, 2019

2019

[42] [42]

L. M. Nguyen, P. H. Nguyen, M. van Dijk, P. Richtarik, K. Scheinberg, and M. Tak´ aˇ c. SGD and Hogwild! convergence without the bounded gradients assumption.Proceedings of Machine Learning Research, 80:3750– 3758, 2018

2018

[43] [43]

F. Niu, B. Recht, Ch. R´ e, and S. J. Wright. HOGWILD!: A lock-free approach to parallelizing stochastic gradient descent.Advances in Neural Information Processing Systems, 4(1):693—-701, 2011. Gratton & Toint: GT35A (draft of 29 V 2026— not for distribution)21

2011

[44] [44]

Z. Peng, Y. Xu, M. Yan, and W. Yin. ARock: An algorithmic framework for asynchronous parallel coordinate updates.SIAM Journal on Scientific Computing, 38(5):2851–2879, 2016

2016

[45] [45]

Training Deep Learning Models with Norm-Constrained LMOs

Th. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. arXiv:2502.07529v2, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

C. Si, D. Zhang, and W. Shen. AdaMuon: Adaptive Muon optimizer. arXiv2507.11005, 2025

work page arXiv 2025

[47] [47]

Ubaru, J

S. Ubaru, J. Chen, and Y. Saad. Fast estimation of$tr(f(a))$via stochastic Lanczos quadrature.SIAM Journal on Matrix Analysis and Applications, 38(4):1075–1099, 2017

2017

[48] [48]

N. Vyas, D. Morwani, R. Zhao, M. Kwun, I. Shapira, B. Brandfonbrener, L. Janson, and S. Kakade. SOAP: Improving and stabilizing Shampoo using Adam. arXiv:2409.11321, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

B. Wang, H. Zhang, Z. Ma, and W. Chen. Convergence of Adagrad for non-convex objectives: simple proofs and relaxed assumptions. InProceedings of the Thirty-Sixth Conference on Learning Theory, pages 161–190, 2023

2023

[50] [50]

R. Ward, X. Wu, and L. Bottou. Adagrad stepsizes: sharp convergence over nonconvex landscapes. In Proceedings in the International Conference on Machine Learning (ICML2019), 2019

2019

[51] [51]

H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST, a novel image dataset for benchmarking machine learning algorithms. arXiv:1708.07747, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[52] [52]

N. Xiao, X. Hu, X. Lin, and K.-C. Toh. Adam family methods for nonsmooth optimization with convergence guarantees.Journal of Machine Learning Research, 25, 2024

2024

[53] [53]

A. W. Yu, L. Huang, Q. Lin, R. Salakhutdinov, and J. Carbonell. Block-normalized gradient method; an empirical study for training deep neural network. arXiv:1707.04822v2, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[54] [54]

Zhang, Y

M. Zhang, Y. Liu, and H. Schaeffer. AdaGrad meets Muon: Adaptive stepsizes for orthogonal updates. arXiv:2509.02981v2, 2025

work page arXiv 2025

[55] [55]

LX ℓ=1 tr(log Πk,ℓ)− LX ℓ=1 tr(log Π−1,ℓ) # .(A.4) Then, for everyk≥0, ηE

H. Zhuang, Y. Wang, Q. Liu, S. Zhang, and Z. Lin. Fully decoupled neural network learning using delayed gradients. arXiv:1906.09108v3, 2019. Appendix 1: Detailed analysis for momentum-lessDADPREC The proof of Theorem 3.1 follows the broad lines of [18]. In order to keep our argument as compact as possible, we refer, in what follows, to that reference for ...

work page arXiv 1906

[56] [56]

Computeu=Gv,u=u/∥u∥,v=G T u,σ max =∥v∥,v=v/∥v∥(repeat twice)

[57] [57]

Set est(∥G∥ ∗) = 1σmax mX i=1 mX j=1 G2 i,j,

[58] [58]

Savevfor the next call. The cost of usingWSPOWERfor anm×nmatrix amounts toO(4mn) flops for the four matrix-vector products, plusO(mn) flops for∥G∥ 2 F , giving a total ofO(5mn) flops, compared to O(mnmin(m, n)) flops for the SVD. In our experiments with 256×2048 matrices, usingWSPOWER is min(m, n)/5≈50 times faster. Although a mere proxy of∥G∥ ∗,WSPOWERwo...

2048