pith. machine review for the scientific record.

arxiv: 2605.13434 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.DC · math.OC · stat.ML

Recognition: unknown

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

Authors on Pith no claims yet

Pith reviewed 2026-05-14 19:27 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · math.OC · stat.ML
keywords asynchronous SGD · distributed optimization · data heterogeneity · system heterogeneity · non-convex optimization · stochastic gradient descent

The pith

Rescaling worker stepsizes by computation time fixes bias in asynchronous SGD so it converges to the true global objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard asynchronous SGD applies arriving gradients with equal weight, but faster workers then dominate when local data distributions differ, producing a biased frequency-weighted average rather than the desired global objective. The paper keeps the plain ASGD template and corrects this by rescaling each worker's stepsize in direct proportion to its observed computation time, so that every worker contributes the same total learning rate over any full cycle. Under standard smoothness and bounded-heterogeneity assumptions, the resulting Rescaled ASGD is shown to converge to stationary points of the correct global objective in the fixed-computation model. Its leading time-complexity term matches the known lower bound, while staleness and data-heterogeneity effects appear only in lower-order terms.
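
To make the aggregate-learning-rate claim concrete (γ below is an assumed proportionality constant; the paper's exact normalization is not shown in this summary): over a window of length T, worker i with per-gradient computation time t_i delivers roughly T / t_i gradients, so with a rescaled stepsize η_i = γ ⋅ t_i its total contribution is

    (T / t_i) ⋅ η_i = (T / t_i) ⋅ γ ⋅ t_i = γ ⋅ T,

the same for every worker. Vanilla ASGD with a common stepsize η instead contributes η ⋅ T / t_i, a weight inversely proportional to computation time, which is exactly the frequency-weighted bias described above.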

Core claim

Rescaled ASGD recovers convergence to stationary points of the correct global objective by rescaling each worker's stepsize proportionally to its computation time inside the standard asynchronous update rule. In the non-convex setting the method matches the optimal leading time-complexity term, with the influence of staleness and heterogeneity confined to lower-order terms.

What carries the argument

Rescaling of per-worker stepsizes in proportion to measured computation times, equalizing aggregate learning rates across heterogeneous workers inside the vanilla ASGD mechanism.
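
A minimal sketch of what this mechanism can look like in an event-driven simulation, assuming fixed per-worker computation times and a mean-normalized rescaling (the function name, the normalization constant, and the loop structure are illustrative choices, not the authors' implementation):

    import numpy as np

    def rescaled_asgd(grad_fns, times, x0, eta, n_steps):
        """Asynchronous SGD with per-worker stepsizes rescaled in proportion
        to computation time, simulated event by event.

        grad_fns : list of callables; grad_fns[i](x) returns a stochastic
                   gradient of worker i's local objective at x
        times    : fixed per-gradient computation times t_i
        eta      : global stepsize (assumed small enough for the smoothness bound)
        """
        times = np.asarray(times, dtype=float)
        # Rescaled stepsizes: eta_i proportional to t_i, normalized by the mean
        # time so the average per-update stepsize stays at eta (the constant is
        # an assumption here). Vanilla ASGD would use eta for every worker.
        etas = eta * times / times.mean()

        x = np.asarray(x0, dtype=float)
        held = [x.copy() for _ in times]   # model copy each worker is computing on
        arrive = times.copy()              # wall-clock time of each worker's next delivery

        for _ in range(n_steps):
            i = int(np.argmin(arrive))     # next gradient to reach the server
            g = grad_fns[i](held[i])       # gradient computed on a stale model
            x = x - etas[i] * g            # server applies the rescaled update
            held[i] = x.copy()             # worker i restarts from the fresh model
            arrive[i] += times[i]          # and will deliver again after t_i
        return x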

If this is right

  • The algorithm converges to stationary points of the true global objective rather than a frequency-weighted average.
  • Leading time complexity matches the known lower bound for distributed non-convex optimization.
  • Staleness and data heterogeneity affect only lower-order terms in the complexity bound.
  • The method remains competitive with state-of-the-art baselines while using the unmodified ASGD communication pattern.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rescaling idea could be tested on other first-order asynchronous methods to restore unbiasedness without extra synchronization phases.
  • In practice the approach may allow heterogeneous clusters to be used at full speed without explicit load balancing.
  • Extensions to strongly convex or federated settings would be natural next checks of whether the lower-order terms remain benign.

Load-bearing premise

The objective is smooth and the local data distributions satisfy a bounded-heterogeneity condition.

What would settle it

Run Rescaled ASGD and plain ASGD side-by-side on a heterogeneous synthetic dataset whose global minimum is known; check whether the final loss of Rescaled ASGD matches the known global value while plain ASGD does not.
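
A hedged sketch of that check, using the same two-worker toy problem as the paper's Figure 4, where the global answer is known: F1(x) = (x − 1)², F2(x) = (x + 1)², so the equal-weighted average is minimized at x* = 0. The speed ratio, stepsize, noise level, and horizon below are illustrative choices, not values taken from the paper:

    import numpy as np

    def run_asgd(times, eta, rescale, n_steps=20_000, noise=0.1, seed=0):
        """Two-worker asynchronous SGD on F1(x) = (x-1)^2 and F2(x) = (x+1)^2;
        the equal-weighted average has its minimum at x* = 0."""
        rng = np.random.default_rng(seed)
        targets = np.array([1.0, -1.0])      # minimizers of F1 and F2
        times = np.asarray(times, dtype=float)
        etas = eta * times / times.mean() if rescale else np.full(2, eta)

        x = 3.0                              # start away from both minima
        held = np.array([x, x])              # stale models held by the two workers
        arrive = times.copy()                # next delivery time per worker
        for _ in range(n_steps):
            i = int(np.argmin(arrive))
            g = 2.0 * (held[i] - targets[i]) + noise * rng.standard_normal()
            x -= etas[i] * g
            held[i] = x
            arrive[i] += times[i]
        return x

    times = [1.0, 5.0]                       # worker 1 is five times faster than worker 2
    print("vanilla ASGD :", run_asgd(times, eta=0.01, rescale=False))
    print("rescaled ASGD:", run_asgd(times, eta=0.01, rescale=True))
    # Expected qualitatively: vanilla ASGD settles near the frequency-weighted
    # average (pulled toward worker 1's minimizer at x = 1), while Rescaled ASGD
    # settles near x* = 0, the known global value.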

Figures

Figures reproduced from arXiv: 2605.13434 by Ammar Mahran, Artavazd Maranjyan, Peter Richtárik.

Figure 1
Figure 1: Solid lines denote the median loss across five random seeds, with shaded regions indicating … [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2: Cumulative stepsize taken over wall-clock time. The gathering phases dilate under … [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3: Rescaled ASGD targets the equal-weighted average F, Vanilla ASGD the frequency-weighted average F̃. F.2 Delay-Adaptive SGD Exhibits Objective Inconsistency: Delay-Adaptive ASGD [Mishchenko et al., 2022a] scales stepsizes down in proportion to the gradient staleness, i.e., the number of iterations that have passed between the worker receiving the model from the server and delivering the gradient back. … view at source ↗
Figure 4
Figure 4: Simulation for F1(x) = (x − 1)², F2(x) = (x + 1)². As expected, Delay-Adaptive ASGD converges to a point close to x_1* = 1, the minimizer of worker 1's objective function, whereas Rescaled ASGD converges to a small neighborhood around the minimizer, x* = 0, of the equal-weighted average. (Plot: x versus simulated wall-clock time for Rescaled ASGD and Delay-Adaptive ASGD.) view at source ↗
read the original abstract

Asynchronous stochastic gradient descent (ASGD) is a standard way to exploit heterogeneous compute resources in distributed learning: instead of forcing fast workers to wait for slow ones, the server updates the model whenever a gradient arrives. Vanilla ASGD applies each arriving gradient with the same weight. When local data distributions are heterogeneous, this becomes problematic: faster workers contribute more updates, and we show theoretically that the method is biased toward a frequency-weighted average of the local objectives rather than the desired global objective. Existing remedies typically move away from the simple ASGD template by introducing gathering phases, buffering, or extra memory. We show that this is unnecessary. Keeping the standard ASGD mechanism, we recover the correct objective by rescaling worker-specific stepsizes in proportion to their computation times, so that each worker contributes the same aggregate learning rate over a cycle. In the non-convex setting, under smoothness and bounded heterogeneity assumptions, we prove that the resulting method, Rescaled ASGD, converges to stationary points of the correct global objective in the fixed-computation model. Its time complexity matches the known lower bound in the leading term, while the effects of staleness and data heterogeneity appear only in lower-order terms. Experiments confirm that the method converges to the correct objective and is competitive with state-of-the-art baselines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Rescaled ASGD, a simple modification to asynchronous SGD that rescales each worker's stepsize η_i proportionally to its computation time t_i. This ensures equal aggregate learning rates across workers over a cycle, correcting the bias of vanilla ASGD toward a frequency-weighted average of local objectives under data heterogeneity. Under standard L-smoothness and bounded heterogeneity assumptions, the paper proves convergence to stationary points of the true global objective in the fixed-computation model. The leading term of the time complexity matches known lower bounds, while staleness and heterogeneity appear only in lower-order terms. Experiments confirm convergence to the correct objective and competitiveness with baselines.

Significance. If the analysis holds, this is a significant contribution: it achieves optimal rates for heterogeneous distributed optimization with a minimal change to the standard ASGD template, avoiding extra memory, buffering, or synchronization phases. The parameter-free rescaling and the clean separation of complexity terms (leading term optimal, others lower-order) would be valuable for both theory and practice in large-scale training.

major comments (2)
  1. [Abstract and convergence analysis] The rescaling η_i ∝ t_i (stated in the abstract) may violate the uniform stepsize bound required by L-smoothness analyses. Standard non-convex SGD theorems impose η ≤ O(1/L) (or similar) on the effective stepsize; when max(t_i)/min(t_i) is unbounded, the largest rescaled η_i can exceed this bound even if the nominal η is set for the fastest worker. This would invalidate the claim that staleness and heterogeneity effects are confined to lower-order terms, as the leading-term complexity relies on the stepsize condition holding uniformly. Bounded heterogeneity addresses data distributions but does not constrain computation-time ratios. The analysis section should explicitly state how the global stepsize is chosen or add a bounded-ratio assumption on t_i.
  2. [Model and theorem statement] The fixed-computation model is invoked for the complexity result but is not defined in the provided abstract or high-level description. The proof that Rescaled ASGD converges to the correct (unweighted) global objective rather than the frequency-weighted one depends on the precise modeling of updates and delays in this setting; without the definition and the exact error terms, the support for the central claim cannot be verified.
minor comments (2)
  1. [Abstract] The abstract introduces the 'fixed-computation model' without a one-sentence definition; adding this would improve accessibility for readers.
  2. [Introduction/Notation] Notation for per-worker stepsizes η_i and times t_i should be introduced with a brief equation or table in the main text to avoid ambiguity when discussing the rescaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract and convergence analysis] The rescaling η_i ∝ t_i (stated in the abstract) may violate the uniform stepsize bound required by L-smoothness analyses. Standard non-convex SGD theorems impose η ≤ O(1/L) (or similar) on the effective stepsize; when max(t_i)/min(t_i) is unbounded, the largest rescaled η_i can exceed this bound even if the nominal η is set for the fastest worker. This would invalidate the claim that staleness and heterogeneity effects are confined to lower-order terms, as the leading-term complexity relies on the stepsize condition holding uniformly. Bounded heterogeneity addresses data distributions but does not constrain computation-time ratios. The analysis section should explicitly state how the global stepsize is chosen or add a bounded-ratio assumption on t_i.

    Authors: We agree that the rescaling must be accompanied by an explicit global stepsize choice to maintain the uniform bound required by L-smoothness. In the revised manuscript we will state in Section 4 that the global stepsize is set to η = Θ(1/(L ⋅ max_i t_i)), ensuring every worker-specific stepsize η_i = η ⋅ (t_i / t̄) satisfies η_i ≤ O(1/L). Under this choice the leading term of the time complexity remains optimal up to constants that depend on the maximum computation time (inherent to any fixed-computation model), while staleness and heterogeneity remain lower-order. No additional bounded-ratio assumption on the t_i is required (the one-line arithmetic is sketched below, after these responses). revision: yes

  2. Referee: [Model and theorem statement] The fixed-computation model is invoked for the complexity result but is not defined in the provided abstract or high-level description. The proof that Rescaled ASGD converges to the correct (unweighted) global objective rather than the frequency-weighted one depends on the precise modeling of updates and delays in this setting; without the definition and the exact error terms, the support for the central claim cannot be verified.

    Authors: The fixed-computation model is formally defined in Section 3, where each worker i is assigned a deterministic computation time t_i per gradient and updates arrive asynchronously with delays bounded by the t_i values. To address the concern we will add a concise definition to the revised abstract and introduction: “In the fixed-computation model each worker i requires a fixed time t_i to compute a gradient, producing asynchronous updates whose delays are proportional to t_i.” The theorem statements will explicitly reference this model, and the main text will highlight the staleness error terms (full derivations remain in the appendix). revision: yes
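
The one-line arithmetic behind the stepsize choice in response 1, under the normalization η_i = η ⋅ (t_i / t̄) stated there, with t̄ the mean computation time (the Θ(⋅) in the authors' choice absorbs the constant factor t̄):

    max_i η_i = η ⋅ (max_i t_i / t̄) ≤ 1/L   ⟺   η ≤ t̄ / (L ⋅ max_i t_i),

so a global stepsize on the order of 1/(L ⋅ max_i t_i) keeps every worker-specific stepsize inside the O(1/L) regime the smoothness analysis requires, with no bounded-ratio assumption on the t_i.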

Circularity Check

0 steps flagged

Rescaling is a direct design choice to equalize aggregate rates; convergence proof remains independent

full rationale

The paper defines Rescaled ASGD explicitly by setting per-worker stepsizes proportional to computation times so each contributes the same aggregate learning rate over a cycle, thereby targeting the global objective instead of a frequency-weighted one. This is presented as a motivated design fix rather than a derived prediction or fitted parameter. The subsequent non-convex convergence analysis under smoothness and bounded heterogeneity then shows the method reaches stationary points of the correct objective with leading-term time complexity matching the lower bound. No load-bearing self-citations, uniqueness theorems, or reductions by construction appear in the provided text; the central claim rests on standard analysis rather than tautology. This is a minor definitional element (score 2) with no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard smoothness and bounded-heterogeneity assumptions common to non-convex optimization analyses; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption smoothness assumption
    Invoked for the non-convex convergence analysis in the fixed-computation model.
  • domain assumption bounded heterogeneity
    Used to control the deviation between local objectives and the global objective.
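
For concreteness, the standard forms these two assumptions usually take in non-convex analyses; the paper's exact constants and quantifiers are not visible from this summary, so read these as representative rather than verbatim:

    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L ⋅ ‖x − y‖                        (each local objective f_i is L-smooth)
    (1/n) Σ_i ‖∇f_i(x) − ∇f(x)‖² ≤ ζ²,  f = (1/n) Σ_i f_i    (bounded heterogeneity)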

pith-pipeline@v0.9.0 · 5545 in / 1212 out tokens · 34317 ms · 2026-05-14T19:27:27.645909+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · 32 internal anchors
