pith. sign in

arxiv: 2510.06141 · v6 · pith:MDJRXQ3Vnew · submitted 2025-10-07 · 💻 cs.LG · cs.MA· math.OC

High-Probability Convergence Guarantees of Decentralized SGD

Pith reviewed 2026-05-22 13:10 UTC · model grok-4.3

classification 💻 cs.LG cs.MAmath.OC
keywords decentralized SGDhigh-probability convergencestochastic gradient descentdistributed optimizationconvergence rateslinear speed-upnon-convex optimization
0
0 comments X

The pith

Decentralized SGD converges in high probability under the same conditions on the cost function as required for mean-squared error convergence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes high-probability convergence guarantees for decentralized stochastic gradient descent. It shows that DSGD achieves this convergence under the same smoothness and convexity assumptions used in mean-squared error analyses, without the extra restrictions like uniformly bounded gradients or asymptotically vanishing noise that earlier decentralized studies imposed. A sympathetic reader would care because high-probability bounds deliver exponentially decaying tail probabilities and reliable performance for individual algorithm runs, aligning decentralized methods with what is already known for centralized SGD. The analysis further delivers order-optimal rates for non-convex and strongly convex costs together with a linear speed-up in the number of participating users.

Core claim

Decentralized SGD converges in high probability under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. The sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. It also establishes a linear speed-up in the number of users, leading to matching or strictly better transient times than those obtained from MSE results. These outcomes rest on technical results including a variance-reduction effect of decentralized methods in the high-probability sense and a novel bound on the moment-generating function of strongly convex costs.

What carries the argument

The variance-reduction effect of decentralized methods in the high-probability sense together with a novel bound on the moment-generating function of strongly convex costs.

If this is right

  • DSGD achieves order-optimal convergence rates in high probability for both non-convex and strongly convex costs.
  • Linear speed-up with the number of users holds in the high-probability setting.
  • Transient times match or improve upon those derived from mean-squared error analyses.
  • The same light-tailed noise and standard smoothness assumptions suffice for both high-probability and mean-squared error convergence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The moment-generating function bound may also tighten high-probability analyses for centralized SGD.
  • The variance-reduction observation could apply to other distributed first-order methods that average information across nodes.
  • These guarantees suggest decentralized training could be used in settings where single-run reliability matters as much as average performance.
  • Extensions to time-varying communication graphs or asynchronous updates would test how robust the linear speed-up remains.

Load-bearing premise

The stochastic noise must be light-tailed so that moment-generating functions exist and yield exponential tail bounds.

What would settle it

An experiment or calculation showing that the claimed high-probability bounds fail when the noise has heavier tails for which the moment-generating function does not exist.

Figures

Figures reproduced from arXiv: 2510.06141 by Aleksandar Armacki, Ali H. Sayed.

Figure 1
Figure 1. Figure 1: Performance of DSGD, in the MSE sense (left) and HP sense (right). We can see that DSGD achieves an exponential tail decay for all values of threshold ε. For the threshold ε = 10−4 , the tail probability starts decaying exponentially after approximately t = 6000 iterations, which is consistent with the MSE behaviour, where we can see that DSGD takes around the same number of iterations to reach the average… view at source ↗
Figure 2
Figure 2. Figure 2: Linear speed-up of DSGD, in the MSE and HP sense. Left to right and top to bottom: MSE performance and tail decay with threshold ε =  10−2 , 10−3 , 10−4 [PITH_FULL_IMAGE:figures/full_fig_p043_2.png] view at source ↗
read the original abstract

Convergence in high-probability (HP) has attracted increasing interest, due to implying exponentially decaying tail bounds and strong guarantees for individual runs of an algorithm. While many works study HP guarantees in centralized settings, much less is understood in the decentralized setup, where existing works require strong assumptions, like uniformly bounded gradients, or asymptotically vanishing noise. This results in a significant gap between the assumptions used to establish convergence in the HP and the mean-squared error (MSE) sense, and is also contrary to centralized settings, where it is known that $\mathtt{SGD}$ converges in HP under the same conditions on the cost function as needed for MSE convergence. Motivated by these observations, we study the HP convergence of Decentralized $\mathtt{SGD}$ ($\mathtt{DSGD}$) in the presence of light-tailed noise, providing several strong results. First, we show that $\mathtt{DSGD}$ converges in HP under the same conditions on the cost as in the MSE sense, removing the restrictive assumptions used in prior works. Second, our sharp analysis yields order-optimal rates for both non-convex and strongly convex costs. Third, we establish a linear speed-up in the number of users, leading to matching or strictly better transient times than those obtained from MSE results, further underlining the tightness of our analysis. To the best of our knowledge, this is the first work that shows $\mathtt{DSGD}$ achieves a linear speed-up in the HP sense. Our relaxed assumptions and sharp rates stem from several technical results of independent interest, including a result on the variance-reduction effect of decentralized methods in the HP sense, as well as a novel bound on the moment-generating function of strongly convex costs, of interest even in centralized settings. Numerical experiments validate our theory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes high-probability convergence of Decentralized SGD (DSGD) under light-tailed noise. It claims that DSGD converges in high probability under the same smoothness/strong-convexity and light-tailed noise conditions required for mean-squared error convergence, without needing uniformly bounded gradients or asymptotically vanishing noise. The sharp analysis yields order-optimal rates for both non-convex and strongly convex objectives, establishes linear speedup in the number of users, and shows matching or strictly better transient times than MSE-based results. The results rely on new technical tools including a variance-reduction effect of decentralized averaging in the high-probability sense and a novel moment-generating function bound for strongly convex costs.

Significance. If the central claims hold, the work closes a notable gap between high-probability and MSE analyses in the decentralized setting, aligning it with known centralized SGD results. The order-optimal rates, linear speedup, and improved transient times would represent a tight analysis. The novel MGF bound and variance-reduction result are of independent interest and could apply even in centralized settings. Numerical experiments are cited as validation.

major comments (2)
  1. [MGF bound and Theorem on linear speedup (around the variance-reduction lemma)] The linear speedup and order-optimality claims hinge on the high-probability bounds not introducing network-dependent multiplicative factors from the consensus step. The analysis must explicitly show that the MGF bound and variance-reduction effect remain free of terms scaling with the mixing time or 1/(1-λ₂) of the gossip matrix; otherwise the rates would carry graph-dependent constants absent from the corresponding MSE analyses and could fail to deliver linear speedup on poorly connected graphs.
  2. [Main convergence theorems for strongly convex and non-convex cases] The central claim that DSGD converges in HP under exactly the same conditions as MSE requires that the light-tailed noise assumption suffices without additional restrictions on the network topology. If the consensus averaging inflates the effective sub-Gaussian parameter or MGF constant by a factor depending on the spectral gap, the high-probability deviation bounds would not match the MSE rates uniformly.
minor comments (2)
  1. [Abstract] Clarify in the abstract and introduction the precise regimes in which the transient time is strictly better than the MSE counterpart.
  2. [Preliminaries] Ensure all notation for the mixing matrix, its eigenvalues, and the consensus error is introduced consistently before the first technical lemma.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below, providing clarifications on how our analysis ensures the high-probability bounds remain free of network-dependent factors in the leading terms.

read point-by-point responses
  1. Referee: [MGF bound and Theorem on linear speedup (around the variance-reduction lemma)] The linear speedup and order-optimality claims hinge on the high-probability bounds not introducing network-dependent multiplicative factors from the consensus step. The analysis must explicitly show that the MGF bound and variance-reduction effect remain free of terms scaling with the mixing time or 1/(1-λ₂) of the gossip matrix; otherwise the rates would carry graph-dependent constants absent from the corresponding MSE analyses and could fail to deliver linear speedup on poorly connected graphs.

    Authors: We thank the referee for this important observation. Our variance-reduction result (Lemma 3.3) shows that decentralized averaging produces a high-probability contraction on the consensus error whose leading term is independent of the spectral gap; the mixing time appears only in a lower-order transient that is absorbed into the overall iteration complexity. The novel MGF bound (Lemma 4.2) for strongly convex objectives is obtained by exploiting strong convexity to dominate the deviation after averaging, again without a multiplicative 1/(1-λ₂) factor in the exponent. Consequently, the high-probability rates in Theorems 4.1 and 5.1 match the centralized rates up to constants independent of the graph, delivering linear speedup in the number of nodes for any fixed connected graph. We will add a short remark immediately after Lemma 3.3 to make this independence explicit. revision: partial

  2. Referee: [Main convergence theorems for strongly convex and non-convex cases] The central claim that DSGD converges in HP under exactly the same conditions as MSE requires that the light-tailed noise assumption suffices without additional restrictions on the network topology. If the consensus averaging inflates the effective sub-Gaussian parameter or MGF constant by a factor depending on the spectral gap, the high-probability deviation bounds would not match the MSE rates uniformly.

    Authors: We appreciate the referee raising this point. The proofs of the main theorems (Theorems 4.1 and 5.1) explicitly track the moment-generating function through the consensus step. Because the variance-reduction effect bounds the deviation of the averaged gradient noise by a quantity whose sub-Gaussian parameter is the same as the local noise (up to a universal constant independent of λ₂), the light-tailed assumption is preserved without inflation by the spectral gap. The network topology affects only the number of steps needed to reach consensus, which is already accounted for in the transient-time comparison with MSE results. Thus the high-probability convergence holds under precisely the same smoothness/strong-convexity and light-tailed noise conditions used for MSE convergence, for any connected gossip matrix satisfying the standard assumptions. revision: no

Circularity Check

0 steps flagged

Derivation chain self-contained; novel HP variance-reduction and MGF bounds are independent contributions

full rationale

The paper derives high-probability convergence for DSGD under the same smoothness/strong-convexity and light-tailed noise assumptions used for MSE analyses. It explicitly introduces two technical results of independent interest—a variance-reduction effect of decentralized methods in the HP sense and a novel MGF bound for strongly convex costs—that do not reduce to fitted parameters, self-definitions, or prior self-citations by construction. No equations or steps are shown to rename known results or import uniqueness via overlapping-author citations in a load-bearing manner. The linear speedup and order-optimal rates follow from these new bounds rather than from any circular reduction to inputs. This is the expected honest non-finding for a proof paper whose central claims rest on fresh analysis rather than reparameterization.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on light-tailed noise and the same smoothness/strong-convexity assumptions used in MSE analyses for DSGD; no free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption The cost function satisfies standard smoothness and (strong) convexity assumptions identical to those sufficient for MSE convergence of DSGD.
    Invoked to remove the need for uniformly bounded gradients or asymptotically vanishing noise.
  • domain assumption The stochastic noise is light-tailed, allowing moment-generating function bounds.
    Stated as the setting in which the HP results are derived.

pith-pipeline@v0.9.0 · 5859 in / 1347 out tokens · 34443 ms · 2026-05-22T13:10:58.131616+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · 3 internal anchors

  1. [1]

    Adaptive Networks,

    A. H. Sayed, “Adaptive Networks,”Proceedings of the IEEE, vol. 102, no. 4, pp. 460–497, 2014

  2. [2]

    Communication- Efficient Learning of Deep Networks from Decentralized Data,

    B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y. Arcas, “Communication- Efficient Learning of Deep Networks from Decentralized Data,” inProceedings of the 20th International Conference on Artificial Intelligence and Statistics(A. Singh and J. Zhu, eds.), vol. 54 ofProceedings of Machine Learning Research, pp. 1273–1282, PMLR, 20–22 Apr 2017

  3. [3]

    Networked Signal and Information Processing: Learning by multiagent systems,

    S. Vlaski, S. Kar, A. H. Sayed, and J. M. Moura, “Networked Signal and Information Processing: Learning by multiagent systems,”IEEE Signal Processing Magazine, vol. 40, no. 5, pp. 92–105, 2023

  4. [4]

    Practical secure aggregation for privacy-preserving machine learning,

    K. Bonawitz, V. Ivanov, B. Kreuter, A. Marcedone, H. B. McMahan, S. Patel, D. Ram- age, A. Segal, and K. Seth, “Practical secure aggregation for privacy-preserving machine learning,” inproceedings of the 2017 ACM SIGSAC Conference on Computer and Com- munications Security, pp. 1175–1191, 2017

  5. [5]

    Secure Distributed Optimization Under Gradient Attacks,

    S. Yu and S. Kar, “Secure Distributed Optimization Under Gradient Attacks,”IEEE Transactions on Signal Processing, vol. 71, pp. 1802–1816, 2023

  6. [6]

    Federated Learning: Strategies for Improving Communication Efficiency

    J. Koneˇ cn` y, H. B. McMahan, F. X. Yu, P. Richt´ arik, A. T. Suresh, and D. Bacon, “Fed- erated learning: Strategies for improving communication efficiency,”arXiv:1610.05492, 2016

  7. [7]

    Bullo, J

    F. Bullo, J. Cort´ es, and S. Martinez,Distributed Control of Robotic Networks: A Mathe- matical Approach to Motion Coordination Algorithms. Princeton University Press, 2009

  8. [8]

    Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,

    T.-H. Chang, M. Hong, H.-T. Wai, X. Zhang, and S. Lu, “Distributed Learning in the Nonconvex World: From batch data to streaming and beyond,”IEEE Signal Processing Magazine, vol. 37, no. 3, pp. 26–38, 2020

  9. [9]

    Local SGD Converges Fast and Communicates Little,

    S. U. Stich, “Local SGD Converges Fast and Communicates Little,” inInternational Conference on Learning Representations, 2019

  10. [10]

    Towards Understanding Biased Client Selection in Federated Learning ,

    Y. J. Cho, J. Wang, and G. Joshi, “ Towards Understanding Biased Client Selection in Federated Learning ,” inProceedings of The 25th International Conference on Artificial Intelligence and Statistics(G. Camps-Valls, F. J. R. Ruiz, and I. Valera, eds.), vol. 151 ofProceedings of Machine Learning Research, pp. 10351–10375, PMLR, 28–30 Mar 2022. 14

  11. [11]

    A Unified Theory of Decen- tralized SGD with Changing Topology and Local Updates,

    A. Koloskova, N. Loizou, S. Boreiri, M. Jaggi, and S. Stich, “A Unified Theory of Decen- tralized SGD with Changing Topology and Local Updates,” inProceedings of the 37th International Conference on Machine Learning(H. D. III and A. Singh, eds.), vol. 119 ofProceedings of Machine Learning Research, pp. 5381–5393, PMLR, 13–18 Jul 2020

  12. [12]

    Asynchronous parallel algo- rithms for nonconvex optimization,

    L. Cannelli, F. Facchinei, V. Kungurtsev, and G. Scutari, “Asynchronous parallel algo- rithms for nonconvex optimization,”Mathematical Programming, vol. 184, pp. 121–154, Nov 2020

  13. [13]

    Distributed Subgradient Methods for Multi-Agent Opti- mization,

    A. Nedi´ c and A. Ozdaglar, “Distributed Subgradient Methods for Multi-Agent Opti- mization,”IEEE Transactions on Automatic Control, vol. 54, no. 1, pp. 48–61, 2009

  14. [14]

    Fast Distributed Gradient Methods,

    D. Jakoveti´ c, J. Xavier, and J. M. F. Moura, “Fast Distributed Gradient Methods,” IEEE Transactions on Automatic Control, vol. 59, no. 5, pp. 1131–1146, 2014

  15. [15]

    EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,

    W. Shi, Q. Ling, G. Wu, and W. Yin, “EXTRA: An Exact First-Order Algorithm for Decentralized Consensus Optimization,”SIAM Journal on Optimization, vol. 25, no. 2, pp. 944–966, 2015

  16. [16]

    NEXT: In-Network Nonconvex Optimization,

    P. D. Lorenzo and G. Scutari, “NEXT: In-Network Nonconvex Optimization,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120– 136, 2016

  17. [17]

    On the Convergence of Decentralized Gradient Descent,

    K. Yuan, Q. Ling, and W. Yin, “On the Convergence of Decentralized Gradient Descent,” SIAM Journal on Optimization, vol. 26, no. 3, pp. 1835–1854, 2016

  18. [18]

    A Unification and Generalization of Exact Distributed First-Order Meth- ods,

    D. Jakoveti´ c, “A Unification and Generalization of Exact Distributed First-Order Meth- ods,”IEEE Transactions on Signal and Information Processing over Networks, vol. 5, no. 1, pp. 31–46, 2019

  19. [19]

    Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima,

    B. Swenson, R. Murray, H. V. Poor, and S. Kar, “Distributed Stochastic Gradient Descent: Nonconvexity, Nonsmoothness, and Convergence to Local Minima,”JMLR, vol. 23, jan 2022

  20. [20]

    Convergence Rates for Distributed Stochastic Optimization Over Random Networks,

    D. Jakoveti´ c, D. Bajovic, A. K. Sahu, and S. Kar, “Convergence Rates for Distributed Stochastic Optimization Over Random Networks,” in2018 IEEE Conference on Decision and Control (CDC), pp. 4238–4245, 2018

  21. [21]

    Distributed stochastic gradient tracking methods,

    S. Pu and A. Nedi´ c, “Distributed stochastic gradient tracking methods,”Mathematical Programming, vol. 187, pp. 409–457, May 2021

  22. [22]

    Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,

    S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments—Part I: Agreement at a Linear Rate,”IEEE Transactions on Signal Processing, vol. 69, pp. 1242– 1256, 2021

  23. [23]

    Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,

    S. Vlaski and A. H. Sayed, “Distributed Learning in Non-Convex Environments— Part II: Polynomial Escape From Saddle-Points,”IEEE Transactions on Signal Processing, vol. 69, pp. 1257–1270, 2021

  24. [24]

    Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,

    R. Xin, U. A. Khan, and S. Kar, “Variance-Reduced Decentralized Stochastic Optimiza- tion With Accelerated Convergence,”IEEE Transactions on Signal Processing, vol. 68, pp. 6255–6271, 2020. 15

  25. [25]

    An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,

    R. Xin, U. A. Khan, and S. Kar, “An Improved Convergence Analysis for Decentralized Online Stochastic Non-Convex Optimization,”IEEE Transactions on Signal Processing, vol. 69, pp. 1842–1858, 2021

  26. [26]

    Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,

    J. Wang and G. Joshi, “Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms,”Journal of Machine Learning Research, vol. 22, no. 213, pp. 1–50, 2021

  27. [27]

    Adaptation, Learning, and Optimization over Networks,

    A. H. Sayed, “Adaptation, Learning, and Optimization over Networks,”Foundations and Trends®in Machine Learning, vol. 7, no. 4-5, pp. 311–801, 2014

  28. [28]

    Robust stochastic approximation approach to stochastic programming,

    A. Nemirovski˘ ı, A. Juditsky, G. Lan, and A. Shapiro, “Robust stochastic approximation approach to stochastic programming,”SIAM Journal on optimization, vol. 19, no. 4, pp. 1574–1609, 2009

  29. [29]

    Stochastic first-and zeroth-order methods for nonconvex stochastic programming,

    S. Ghadimi and G. Lan, “Stochastic first-and zeroth-order methods for nonconvex stochastic programming,”SIAM Journal on Optimization, vol. 23, no. 4, pp. 2341–2368, 2013

  30. [30]

    On the Convergence of Stochastic Gradient Descent with Adap- tive Stepsizes,

    X. Li and F. Orabona, “On the Convergence of Stochastic Gradient Descent with Adap- tive Stepsizes,” inProceedings of the Twenty-Second International Conference on Ar- tificial Intelligence and Statistics(K. Chaudhuri and M. Sugiyama, eds.), vol. 89 of Proceedings of Machine Learning Research, pp. 983–992, PMLR, 16–18 Apr 2019

  31. [31]

    Tight analyses for non-smooth stochastic gradient descent,

    N. J. A. Harvey, C. Liaw, Y. Plan, and S. Randhawa, “Tight analyses for non-smooth stochastic gradient descent,” inProceedings of the Thirty-Second Conference on Learning Theory(A. Beygelzimer and D. Hsu, eds.), vol. 99 ofProceedings of Machine Learning Research, pp. 1579–1613, PMLR, 25–28 Jun 2019

  32. [32]

    Large deviations rates for stochastic gradient descent with strongly convex functions,

    D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large deviations rates for stochastic gradient descent with strongly convex functions,” inProceedings of The 26th International Con- ference on Artificial Intelligence and Statistics(F. Ruiz, J. Dy, and J.-W. van de Meent, eds.), vol. 206 ofProceedings of Machine Learning Research, pp. 10095–10111, PMLR, 25–27 Apr 2023

  33. [33]

    High Probability Conver- gence of Stochastic Gradient Methods,

    Z. Liu, T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “High Probability Conver- gence of Stochastic Gradient Methods,” inProceedings of the 40th International Confer- ence on Machine Learning(A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, eds.), vol. 202 ofProceedings of Machine Learning Research, pp. 21884– 21914, PMLR, ...

  34. [34]

    Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,

    T. D. Nguyen, T. H. Nguyen, A. Ene, and H. Nguyen, “Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise,” inAdvances in Neural Information Processing Systems(A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, eds.), vol. 36, pp. 24191–24222, Curran Associates, Inc., 2023

  35. [35]

    Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,

    Z. Liu, J. Zhang, and Z. Zhou, “Breaking the Lower Bound with (Little) Structure: Acceleration in Non-Convex Stochastic Optimization with Heavy-Tailed Noise,” inPro- ceedings of Thirty Sixth Conference on Learning Theory(G. Neu and L. Rosasco, eds.), vol. 195 ofProceedings of Machine Learning Research, pp. 2266–2290, PMLR, 12–15 Jul 2023. 16

  36. [36]

    From Gradient Clipping to Normalization for Heavy Tailed SGD,

    F. H¨ ubler, I. Fatkhullin, and N. He, “From Gradient Clipping to Normalization for Heavy Tailed SGD,” inThe 28th International Conference on Artificial Intelligence and Statistics, vol. 258, 2025

  37. [37]

    Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Exten- sions to Distributed Optimization and Comparison Oracle,

    N. Kornilov, P. Zmushko, A. Semenov, A. Gasnikov, and A. Beznosikov, “Sign Operator for Coping with Heavy-Tailed Noise: High Probability Convergence Bounds with Exten- sions to Distributed Optimization and Comparison Oracle,”arXiv:2502.07923, 2025

  38. [38]

    High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent under Heavy-tailed Noise,

    A. Armacki, S. Yu, P. Sharma, G. Joshi, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “High- probability Convergence Bounds for Online Nonlinear Stochastic Gradient Descent under Heavy-tailed Noise,” inThe 28th International Conference on Artificial Intelligence and Statistics(Y. Li, S. Mandt, S. Agrawal, and E. Khan, eds.), vol. 258 ofProceedings of Machine ...

  39. [39]

    A high probability analysis of adaptive sgd with momentum,

    X. Li and F. Orabona, “A high probability analysis of adaptive sgd with momentum,” Workshop on “Beyond first-order methods in ML systems”, 37th International Confer- ence on Machine Learning, 2020

  40. [40]

    Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,

    Z. Liu and Z. Zhou, “Revisiting the Last-Iterate Convergence of Stochastic Gradient Methods,” inThe Twelfth International Conference on Learning Representations, 2024

  41. [41]

    High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,

    L. Madden, E. Dall’Anese, and S. Becker, “High Probability Convergence Bounds for Non-convex Stochastic Gradient Descent with Sub-Weibull Noise,”Journal of Machine Learning Research, vol. 25, no. 241, pp. 1–36, 2024

  42. [42]

    J. Nair, A. Wierman, and B. Zwart,The Fundamentals of Heavy Tails: Properties, Emer- gence, and Estimation. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press, 2022

  43. [43]

    High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,

    A. Sadiev, M. Danilova, E. Gorbunov, S. Horv´ ath, G. Gidel, P. Dvurechensky, A. Gas- nikov, and P. Richt´ arik, “High-Probability Bounds for Stochastic Optimization and Vari- ational Inequalities: the Case of Unbounded Variance,” inInternational Conference on Machine Learning, pp. 29563–29648, PMLR, 2023

  44. [44]

    Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,

    N. Puchkin, E. Gorbunov, N. Kutuzov, and A. Gasnikov, “Breaking the Heavy-Tailed Noise Barrier in Stochastic Optimization Problems,”arXiv:2311.04161, 2023

  45. [45]

    Large deviations and improved mean-squared error rates of nonlinear sgd: Heavy-tailed noise and power of symmetry,

    A. Armacki, S. Yu, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Large Deviations and Im- proved Mean-squared Error Rates of Nonlinear SGD: Heavy-tailed Noise and Power of Symmetry,”arXiv:2410.15637, 2024

  46. [46]

    Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization,

    A. Armacki, D. Bajovi´ c, D. Jakoveti´ c, and S. Kar, “Optimal High-probability Convergence of Nonlinear SGD under Heavy-tailed Noise via Symmetrization,” arXiv:2507.09093, 2025

  47. [47]

    Better Theory for SGD in the Nonconvex World,

    A. Khaled and P. Richt´ arik, “Better Theory for SGD in the Nonconvex World,”Trans- actions on Machine Learning Research, 2023

  48. [48]

    Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,

    A. Nedi´ c, A. Olshevsky, and W. Shi, “Achieving Geometric Convergence for Distributed Optimization Over Time-Varying Graphs,”SIAM Journal on Optimization, vol. 27, no. 4, pp. 2597–2633, 2017. 17

  49. [49]

    Harnessing Smoothness to Accelerate Distributed Optimization,

    G. Qu and N. Li, “Harnessing Smoothness to Accelerate Distributed Optimization,” IEEE Transactions on Control of Network Systems, vol. 5, no. 3, pp. 1245–1260, 2018

  50. [50]

    Second-Order Guarantees of Stochastic Gradient Descent in Nonconvex Optimization,

    S. Vlaski and A. H. Sayed, “Second-Order Guarantees of Stochastic Gradient Descent in Nonconvex Optimization,”IEEE Transactions on Automatic Control, vol. 67, no. 12, pp. 6489–6504, 2022

  51. [51]

    Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,

    A. Koloskova, H. Hendrikx, and S. U. Stich, “Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees,”arXiv:2305.01588, 2023

  52. [52]

    Linear convergence of gradient and proximal- gradient methods under the Polyak- Lojasiewicz condition,

    H. Karimi, J. Nutini, and M. Schmidt, “Linear convergence of gradient and proximal- gradient methods under the Polyak- Lojasiewicz condition,” inMachine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16, pp. 795–811, Springer, 2016

  53. [53]

    Decentralized Nonconvex Optimization under Heavy-Tailed Noise: Normalization and Optimal Convergence

    S. Yu, D. Jakoveti´ c, and S. Kar, “Decentralized Nonconvex Optimization under Heavy- Tailed Noise: Normalization and Optimal Convergence,”arXiv:2505.03736, 2025

  54. [54]

    Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,

    S. Kar, J. M. F. Moura, and K. Ramanan, “Distributed Parameter Estimation in Sensor Networks: Nonlinear Observation Models and Imperfect Communication,”IEEE Trans- actions on Information Theory, vol. 58, no. 6, pp. 3575–3605, 2012

  55. [55]

    Distributed Pareto Optimization via Diffusion Strategies,

    J. Chen and A. H. Sayed, “Distributed Pareto Optimization via Diffusion Strategies,” IEEE Journal of Selected Topics in Signal Processing, vol. 7, no. 2, pp. 205–220, 2013

  56. [56]

    Multitask learning over graphs: An approach for distributed, streaming machine learning,

    R. Nassif, S. Vlaski, C. Richard, J. Chen, and A. H. Sayed, “Multitask learning over graphs: An approach for distributed, streaming machine learning,”IEEE Signal Pro- cessing Magazine, vol. 37, no. 3, pp. 14–25, 2020

  57. [57]

    Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,

    K. Lu, H. Wang, H. Zhang, and L. Wang, “Convergence in High Probability of Dis- tributed Stochastic Gradient Descent Algorithms,”IEEE Transactions on Automatic Control, vol. 69, no. 4, pp. 2189–2204, 2024

  58. [58]

    Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,

    K. Lu, “Online Distributed Algorithms for Online Noncooperative Games With Stochas- tic Cost Functions: High Probability Bound of Regrets,”IEEE Transactions on Auto- matic Control, vol. 69, no. 12, pp. 8860–8867, 2024

  59. [59]

    Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,

    H. Xu, K. Lu, and Y.-L. Wang, “Online distributed nonconvex optimization with stochas- tic objective functions: High probability bound analysis of dynamic regrets,”Automatica, vol. 170, p. 111863, 2024

  60. [60]

    High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,

    Y. Qin, K. Lu, H. Xu, and X. Chen, “High Probability Convergence of Clipped Dis- tributed Dual Averaging With Heavy-Tailed Noises,”IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 55, no. 4, pp. 2624–2632, 2025

  61. [61]

    Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,

    Y. Yang, K. Lu, and L. Wang, “Online distributed optimization with clipped stochastic gradients: High probability bound of regrets,”Automatica, vol. 182, p. 112525, 2025

  62. [62]

    High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,

    Y. Yang, K. Lu, and L. Wang, “High Probability Convergence of Distributed Clipped Stochastic Gradient Descent with Heavy-tailed Noise,”arXiv:2506.11647, 2025. 18

  63. [63]

    Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,

    D. Bajovi´ c, D. Jakoveti´ c, J. Xavier, B. Sinopoli, and J. M. F. Moura, “Distributed Detection via Gaussian Running Consensus: Large Deviations Asymptotic Analysis,” IEEE Transactions on Signal Processing, vol. 59, no. 9, pp. 4381–4396, 2011

  64. [64]

    Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,

    D. Bajovi´ c, D. Jakoveti´ c, J. M. F. Moura, J. Xavier, and B. Sinopoli, “Large Devia- tions Performance of Consensus+Innovations Distributed Detection With Non-Gaussian Observations,”IEEE Transactions on Signal Processing, vol. 60, no. 11, pp. 5987–6002, 2012

  65. [65]

    Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,

    V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Diffusion-Based Adaptive Distributed Detection: Steady-State Performance in the Slow Adaptation Regime,”IEEE Transac- tions on Information Theory, vol. 62, no. 8, pp. 4710–4732, 2016

  66. [66]

    Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,

    V. Matta, P. Braca, S. Marano, and A. H. Sayed, “Distributed Detection Over Adaptive Networks: Refined Asymptotics and the Role of Connectivity,”IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 4, pp. 442–460, 2016

  67. [67]

    Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,

    D. Bajovi´ c, “Inaccuracy Rates for Distributed Inference Over Random Networks With Applications to Social Learning,”IEEE Transactions on Information Theory, vol. 70, no. 1, pp. 415–435, 2024

  68. [68]

    A unified and refined convergence analysis for non-convex decentralized learning,

    S. A. Alghunaim and K. Yuan, “A unified and refined convergence analysis for non-convex decentralized learning,”IEEE Transactions on Signal Processing, vol. 70, pp. 3264–3279, 2022

  69. [69]

    Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,

    C. G. Lopes and A. H. Sayed, “Diffusion Least-Mean Squares Over Adaptive Net- works: Formulation and Performance Analysis,”IEEE Transactions on Signal Process- ing, vol. 56, no. 7, pp. 3122–3136, 2008

  70. [70]

    Diffusion LMS Strategies for Distributed Estimation,

    F. S. Cattivelli and A. H. Sayed, “Diffusion LMS Strategies for Distributed Estimation,” IEEE Transactions on Signal Processing, vol. 58, no. 3, pp. 1035–1048, 2010

  71. [71]

    Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,

    J. Chen and A. H. Sayed, “Diffusion Adaptation Strategies for Distributed Optimization and Learning Over Networks,”IEEE Transactions on Signal Processing, vol. 60, no. 8, pp. 4289–4305, 2012

  72. [72]

    Consensus + innovations distributed inference over net- works: cooperation and sensing in networked systems,

    S. Kar and J. M. Moura, “Consensus + innovations distributed inference over net- works: cooperation and sensing in networked systems,”IEEE Signal Processing Mag- azine, vol. 30, no. 3, pp. 99–109, 2013

  73. [73]

    R. A. Horn and C. R. Johnson,Matrix Analysis. Cambridge University Press, 2 ed., 2012

  74. [74]

    Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,

    X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh, W. Zhang, and J. Liu, “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,” inAdvances in Neural Information Processing Systems (I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, eds.), vol. 3...

  75. [75]

    Nesterov,Lectures on Convex Optimization

    Y. Nesterov,Lectures on Convex Optimization. Springer Publishing Company, Incorpo- rated, 2nd ed., 2018. 19

  76. [76]

    A Short Note on Concentration Inequalities for Random Vectors with SubGaussian Norm

    C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan, “A Short Note on Concen- tration Inequalities for Random Vectors with SubGaussian Norm,”arXiv:1902.03736, 2019

  77. [77]

    Gradient Convergence in Gradient methods with Errors,

    D. P. Bertsekas and J. N. Tsitsiklis, “Gradient Convergence in Gradient methods with Errors,”SIAM Journal on Optimization, vol. 10, no. 3, pp. 627–642, 2000

  78. [78]

    Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science

    R. Vershynin,High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge Uni- versity Press, 2018

  79. [79]

    Fast linear iterations for distributed averaging,

    L. Xiao and S. Boyd, “Fast linear iterations for distributed averaging,”Systems & Control Letters, vol. 53, no. 1, pp. 65–78, 2004

  80. [80]

    Cvetkovic, P

    D. Cvetkovic, P. Rowlinson, and S. Simic,Eigenspaces of Graphs. Encyclopedia of Math- ematics and its Applications, Cambridge University Press, 1997

Showing first 80 references.