pith. machine review for the scientific record.

arxiv: 2605.13434 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.DC · math.OC · stat.ML

Recognition: unknown

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · math.OC · stat.ML
keywords asynchronous SGD · distributed optimization · data heterogeneity · system heterogeneity · non-convex optimization · stochastic gradient descent

The pith

Rescaling worker stepsizes by computation time fixes bias in asynchronous SGD so it converges to the true global objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard asynchronous SGD applies arriving gradients with equal weight, but faster workers then dominate when local data distributions differ, producing a biased frequency-weighted average rather than the desired global objective. The paper keeps the plain ASGD template and corrects this by rescaling each worker's stepsize in direct proportion to its observed computation time, so that every worker contributes the same total learning rate over any full cycle. Under standard smoothness and bounded-heterogeneity assumptions, the resulting Rescaled ASGD is shown to converge to stationary points of the correct global objective in the fixed-computation model. Its leading time-complexity term matches the known lower bound, while staleness and data-heterogeneity effects appear only in lower-order terms.
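
To make the aggregate-learning-rate claim concrete (γ below is an assumed proportionality constant; the paper's exact normalization is not shown in this summary): over a window of length T, worker i with per-gradient computation time t_i delivers roughly T / t_i gradients, so with a rescaled stepsize η_i = γ ⋅ t_i its total contribution is

    (T / t_i) ⋅ η_i = (T / t_i) ⋅ γ ⋅ t_i = γ ⋅ T,

the same for every worker. Vanilla ASGD with a common stepsize η instead contributes η ⋅ T / t_i, a weight inversely proportional to computation time, which is exactly the frequency-weighted bias described above.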

Core claim

Rescaled ASGD recovers convergence to stationary points of the correct global objective by rescaling each worker's stepsize proportionally to its computation time inside the standard asynchronous update rule. In the non-convex setting the method matches the optimal leading time-complexity term, with the influence of staleness and heterogeneity confined to lower-order terms.

What carries the argument

Rescaling of per-worker stepsizes in proportion to measured computation times, equalizing aggregate learning rates across heterogeneous workers inside the vanilla ASGD mechanism.
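
A minimal sketch of what this mechanism can look like in an event-driven simulation, assuming fixed per-worker computation times and a mean-normalized rescaling (the function name, the normalization constant, and the loop structure are illustrative choices, not the authors' implementation):

    import numpy as np

    def rescaled_asgd(grad_fns, times, x0, eta, n_steps):
        """Asynchronous SGD with per-worker stepsizes rescaled in proportion
        to computation time, simulated event by event.

        grad_fns : list of callables; grad_fns[i](x) returns a stochastic
                   gradient of worker i's local objective at x
        times    : fixed per-gradient computation times t_i
        eta      : global stepsize (assumed small enough for the smoothness bound)
        """
        times = np.asarray(times, dtype=float)
        # Rescaled stepsizes: eta_i proportional to t_i, normalized by the mean
        # time so the average per-update stepsize stays at eta (the constant is
        # an assumption here). Vanilla ASGD would use eta for every worker.
        etas = eta * times / times.mean()

        x = np.asarray(x0, dtype=float)
        held = [x.copy() for _ in times]   # model copy each worker is computing on
        arrive = times.copy()              # wall-clock time of each worker's next delivery

        for _ in range(n_steps):
            i = int(np.argmin(arrive))     # next gradient to reach the server
            g = grad_fns[i](held[i])       # gradient computed on a stale model
            x = x - etas[i] * g            # server applies the rescaled update
            held[i] = x.copy()             # worker i restarts from the fresh model
            arrive[i] += times[i]          # and will deliver again after t_i
        return x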

If this is right

  • The algorithm converges to stationary points of the true global objective rather than a frequency-weighted average.
  • Leading time complexity matches the known lower bound for distributed non-convex optimization.
  • Staleness and data heterogeneity affect only lower-order terms in the complexity bound.
  • The method remains competitive with state-of-the-art baselines while using the unmodified ASGD communication pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rescaling idea could be tested on other first-order asynchronous methods to restore unbiasedness without extra synchronization phases.
  • In practice the approach may allow heterogeneous clusters to be used at full speed without explicit load balancing.
  • Extensions to strongly convex or federated settings would be natural next checks of whether the lower-order terms remain benign.

Load-bearing premise

The objective is smooth and the local data distributions satisfy a bounded-heterogeneity condition.

What would settle it

Run Rescaled ASGD and plain ASGD side-by-side on a heterogeneous synthetic dataset whose global minimum is known; check whether the final loss of Rescaled ASGD matches the known global value while plain ASGD does not.
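
A hedged sketch of that check, using the same two-worker toy problem as the paper's Figure 4, where the global answer is known: F1(x) = (x − 1)², F2(x) = (x + 1)², so the equal-weighted average is minimized at x* = 0. The speed ratio, stepsize, noise level, and horizon below are illustrative choices, not values taken from the paper:

    import numpy as np

    def run_asgd(times, eta, rescale, n_steps=20_000, noise=0.1, seed=0):
        """Two-worker asynchronous SGD on F1(x) = (x-1)^2 and F2(x) = (x+1)^2;
        the equal-weighted average has its minimum at x* = 0."""
        rng = np.random.default_rng(seed)
        targets = np.array([1.0, -1.0])      # minimizers of F1 and F2
        times = np.asarray(times, dtype=float)
        etas = eta * times / times.mean() if rescale else np.full(2, eta)

        x = 3.0                              # start away from both minima
        held = np.array([x, x])              # stale models held by the two workers
        arrive = times.copy()                # next delivery time per worker
        for _ in range(n_steps):
            i = int(np.argmin(arrive))
            g = 2.0 * (held[i] - targets[i]) + noise * rng.standard_normal()
            x -= etas[i] * g
            held[i] = x
            arrive[i] += times[i]
        return x

    times = [1.0, 5.0]                       # worker 1 is five times faster than worker 2
    print("vanilla ASGD :", run_asgd(times, eta=0.01, rescale=False))
    print("rescaled ASGD:", run_asgd(times, eta=0.01, rescale=True))
    # Expected qualitatively: vanilla ASGD settles near the frequency-weighted
    # average (pulled toward worker 1's minimizer at x = 1), while Rescaled ASGD
    # settles near x* = 0, the known global value.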

Figures

Figures reproduced from arXiv: 2605.13434 by Ammar Mahran, Artavazd Maranjyan, Peter Richtárik.

Figure 1
Figure 1: Solid lines denote the median loss across five random seeds, with shaded regions indicating … [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2: Cumulative stepsize taken over wall-clock time. The gathering phases dilate under … [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3: Rescaled ASGD targets the equal-weighted average F, Vanilla ASGD the frequency-weighted average F̃. F.2 Delay-Adaptive SGD Exhibits Objective Inconsistency: Delay-Adaptive ASGD [Mishchenko et al., 2022a] scales stepsizes down in proportion to the gradient staleness, i.e., the number of iterations that have passed between the worker receiving the model from the server and delivering the gradient back. … view at source ↗
Figure 4
Figure 4: Simulation for F1(x) = (x − 1)², F2(x) = (x + 1)². As expected, Delay-Adaptive ASGD converges to a point close to x_1* = 1, the minimizer of worker 1's objective function, whereas Rescaled ASGD converges to a small neighborhood around the minimizer, x* = 0, of the equal-weighted average. (Plot: x versus simulated wall-clock time for Rescaled ASGD and Delay-Adaptive ASGD.) view at source ↗
read the original abstract

Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so that each worker contributes the same aggregate learning rate over a cycle. In the non-convex setting, under smoothness and bounded heterogeneity assumptions, we prove that the resulting method, Rescaled ASGD, converges to stationary points of the correct global objective in the fixed-computation model. Its time complexity matches the known lower bound in the leading term, while the effects of staleness and data heterogeneity appear only in lower-order terms. Experiments confirm that the method converges to the correct objective and is competitive with state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Rescaled ASGD, a simple modification to asynchronous SGD that rescales each worker's stepsize η_i proportionally to its computation time t_i. This ensures equal aggregate learning rates across workers over a cycle, correcting the bias of vanilla ASGD toward a frequency-weighted average of local objectives under data heterogeneity. Under standard L-smoothness and bounded heterogeneity assumptions, the paper proves convergence to stationary points of the true global objective in the fixed-computation model. The leading term of the time complexity matches known lower bounds, while staleness and heterogeneity appear only in lower-order terms. Experiments confirm convergence to the correct objective and competitiveness with baselines.

Significance. If the analysis holds, this is a significant contribution: it achieves optimal rates for heterogeneous distributed optimization with a minimal change to the standard ASGD template, avoiding extra memory, buffering, or synchronization phases. The parameter-free rescaling and the clean separation of complexity terms (leading term optimal, others lower-order) would be valuable for both theory and practice in large-scale training.

major comments (2)
  1. [Abstract and convergence analysis] The rescaling η_i ∝ t_i (stated in the abstract) may violate the uniform stepsize bound required by L-smoothness analyses. Standard non-convex SGD theorems impose η ≤ O(1/L) (or similar) on the effective stepsize; when max(t_i)/min(t_i) is unbounded, the largest rescaled η_i can exceed this bound even if the nominal η is set for the fastest worker. This would invalidate the claim that staleness and heterogeneity effects are confined to lower-order terms, as the leading-term complexity relies on the stepsize condition holding uniformly. Bounded heterogeneity addresses data distributions but does not constrain computation-time ratios. The analysis section should explicitly state how the global stepsize is chosen or add a bounded-ratio assumption on t_i.
  2. [Model and theorem statement] The fixed-computation model is invoked for the complexity result but is not defined in the provided abstract or high-level description. The proof that Rescaled ASGD converges to the correct (unweighted) global objective rather than the frequency-weighted one depends on the precise modeling of updates and delays in this setting; without the definition and the exact error terms, the support for the central claim cannot be verified.
minor comments (2)
  1. [Abstract] The abstract introduces the 'fixed-computation model' without a one-sentence definition; adding this would improve accessibility for readers.
  2. [Introduction/Notation] Notation for per-worker stepsizes η_i and times t_i should be introduced with a brief equation or table in the main text to avoid ambiguity when discussing the rescaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract and convergence analysis] The rescaling η_i ∝ t_i (stated in the abstract) may violate the uniform stepsize bound required by L-smoothness analyses. Standard non-convex SGD theorems impose η ≤ O(1/L) (or similar) on the effective stepsize; when max(t_i)/min(t_i) is unbounded, the largest rescaled η_i can exceed this bound even if the nominal η is set for the fastest worker. This would invalidate the claim that staleness and heterogeneity effects are confined to lower-order terms, as the leading-term complexity relies on the stepsize condition holding uniformly. Bounded heterogeneity addresses data distributions but does not constrain computation-time ratios. The analysis section should explicitly state how the global stepsize is chosen or add a bounded-ratio assumption on t_i.

    Authors: We agree that the rescaling must be accompanied by an explicit global stepsize choice to maintain the uniform bound required by L-smoothness. In the revised manuscript we will state in Section 4 that the global stepsize is set to η = Θ(1/(L ⋅ max_i t_i)), ensuring every worker-specific stepsize η_i = η ⋅ (t_i / t̄) satisfies η_i ≤ O(1/L). Under this choice the leading term of the time complexity remains optimal up to constants that depend on the maximum computation time (inherent to any fixed-computation model), while staleness and heterogeneity remain lower-order. No additional bounded-ratio assumption on the t_i is required (the one-line arithmetic is sketched below, after these responses). revision: yes

  2. Referee: [Model and theorem statement] The fixed-computation model is invoked for the complexity result but is not defined in the provided abstract or high-level description. The proof that Rescaled ASGD converges to the correct (unweighted) global objective rather than the frequency-weighted one depends on the precise modeling of updates and delays in this setting; without the definition and the exact error terms, the support for the central claim cannot be verified.

    Authors: The fixed-computation model is formally defined in Section 3, where each worker i is assigned a deterministic computation time t_i per gradient and updates arrive asynchronously with delays bounded by the t_i values. To address the concern we will add a concise definition to the revised abstract and introduction: “In the fixed-computation model each worker i requires a fixed time t_i to compute a gradient, producing asynchronous updates whose delays are proportional to t_i.” The theorem statements will explicitly reference this model, and the main text will highlight the staleness error terms (full derivations remain in the appendix). revision: yes
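
The one-line arithmetic behind the stepsize choice in response 1, under the normalization η_i = η ⋅ (t_i / t̄) stated there, with t̄ the mean computation time (the Θ(⋅) in the authors' choice absorbs the constant factor t̄):

    max_i η_i = η ⋅ (max_i t_i / t̄) ≤ 1/L   ⟺   η ≤ t̄ / (L ⋅ max_i t_i),

so a global stepsize on the order of 1/(L ⋅ max_i t_i) keeps every worker-specific stepsize inside the O(1/L) regime the smoothness analysis requires, with no bounded-ratio assumption on the t_i.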

Circularity Check

0 steps flagged

Rescaling is a direct design choice to equalize aggregate rates; convergence proof remains independent

full rationale

The paper defines Rescaled ASGD explicitly by setting per-worker stepsizes proportional to computation times so each contributes the same aggregate learning rate over a cycle, thereby targeting the global objective instead of a frequency-weighted one. This is presented as a motivated design fix rather than a derived prediction or fitted parameter. The subsequent non-convex convergence analysis under smoothness and bounded heterogeneity then shows the method reaches stationary points of the correct objective with leading-term time complexity matching the lower bound. No load-bearing self-citations, uniqueness theorems, or reductions by construction appear in the provided text; the central claim rests on standard analysis rather than tautology. This is a minor definitional element (score 2) with no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard smoothness and bounded-heterogeneity assumptions common to non-convex optimization analyses; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption smoothness assumption
    Invoked for the non-convex convergence analysis in the fixed-computation model.
  • domain assumption bounded heterogeneity
    Used to control the deviation between local objectives and the global objective.
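
For concreteness, the standard forms these two assumptions usually take in non-convex analyses; the paper's exact constants and quantifiers are not visible from this summary, so read these as representative rather than verbatim:

    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L ⋅ ‖x − y‖                        (each local objective f_i is L-smooth)
    (1/n) Σ_i ‖∇f_i(x) − ∇f(x)‖² ≤ ζ²,  f = (1/n) Σ_i f_i    (bounded heterogeneity)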

pith-pipeline@v0.9.0 · 5545 in / 1212 out tokens · 34317 ms · 2026-05-14T19:27:27.645909+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 32 internal anchors
