signSGD with Majority Vote is Communication Efficient And Fault Tolerant

Anima Anandkumar; Jeremy Bernstein; Jiawei Zhao; Kamyar Azizzadenesheli

arXiv: 1810.05291 · v3 · pith:HW4JON2Wnew · submitted 2018-10-11 · cs.DC · cs.AI · cs.LG

Jeremy Bernstein, Jiawei Zhao, Kamyar Azizzadenesheli, Anima Anandkumar

classification cs.DC cs.AI cs.LG

keywords majority, vote, large, machines, training, algorithm, communication, datasets

Training neural networks on large datasets can be accelerated by distributing the workload over a network of machines. As datasets grow ever larger, networks of hundreds or thousands of machines become economically viable. The time cost of communicating gradients limits the effectiveness of using such large machine counts, as may the increased chance of network faults. We explore a particularly simple algorithm for robust, communication-efficient learning: signSGD. Workers transmit only the sign of their gradient vector to a server, and the overall update is decided by a majority vote. This algorithm uses $32\times$ less communication per iteration than full-precision, distributed SGD. Under natural conditions verified by experiment, we prove that signSGD converges in the large and mini-batch settings, establishing convergence for a parameter regime of Adam as a byproduct. Aggregating sign gradients by majority vote means that no individual worker has too much power. We prove that unlike SGD, majority vote is robust when up to 50% of workers behave adversarially. The class of adversaries we consider includes as special cases those that invert or randomise their gradient estimate. On the practical side, we built our distributed training system in PyTorch. Benchmarking against the state-of-the-art collective communications library (NCCL), our framework, with the parameter server housed entirely on one machine, led to a 25% reduction in time for training ResNet-50 on ImageNet when using 15 AWS p3.2xlarge machines.
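The update rule described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the authors' PyTorch system: the function names (`worker_message`, `majority_vote`, `signsgd_step`, `compression_ratio`), the `adversarial` argument, and the toy gradients are all hypothetical, and the sketch assumes that "full precision" means 32-bit floats, which would account for the $32\times$ figure when each sign costs one bit.

```python
def sign(x):
    """Return +1, -1, or 0 for a scalar."""
    return (x > 0) - (x < 0)


def worker_message(gradient):
    """A worker sends only the sign of each gradient component."""
    return [sign(g) for g in gradient]


def majority_vote(messages):
    """The server sums the sign vectors component-wise and keeps the sign of each sum."""
    return [sign(sum(column)) for column in zip(*messages)]


def signsgd_step(params, worker_gradients, learning_rate, adversarial=()):
    """Apply one majority-vote update.

    Workers whose index is in `adversarial` invert their signs, which is
    one of the adversary types named in the abstract (hypothetical modelling).
    """
    messages = []
    for index, gradient in enumerate(worker_gradients):
        message = worker_message(gradient)
        if index in adversarial:
            message = [-s for s in message]
        messages.append(message)
    vote = majority_vote(messages)
    return [p - learning_rate * v for p, v in zip(params, vote)]


def compression_ratio(bits_per_float=32, bits_per_sign=1):
    """Per-component communication saving, assuming 32-bit floats (an assumption)."""
    return bits_per_float // bits_per_sign


if __name__ == "__main__":
    gradients = [[0.3, -0.2], [0.1, -0.9], [0.5, -0.4]]
    # All three workers agree on the signs, so one sign-inverting worker is outvoted.
    print(signsgd_step([1.0, -1.0], gradients, 0.1, adversarial={2}))
    print(compression_ratio())
```

In this toy run the two honest workers outvote the single sign-inverting worker, so the update direction is unchanged. The sketch only illustrates the voting mechanism; the 50% robustness result rests on the conditions stated in the paper.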

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

XFED: Non-Collusive Model Poisoning Attack Against Byzantine-Robust Federated Classifiers
cs.CR 2026-04 unverdicted novelty 7.0

XFED is the first aggregation-agnostic non-collusive model poisoning attack that bypasses eight state-of-the-art defenses on six benchmark datasets without attacker coordination.
Decentralized Stochastic Subgradient-type Methods with Communication Compression for Nonsmooth Nonconvex Optimization
math.OC 2026-07 unverdicted novelty 6.0

A unified framework for decentralized stochastic subgradient methods with compressed communication is proposed, proving global convergence for nonsmooth nonconvex objectives via differential inclusions and developing ...
Quantum ring all-reduce: communication and privacy advantages for distributed learning
quant-ph 2026-06 unverdicted novelty 6.0

Quantum ring all-reduce halves per-link communication via superdense coding and enables composable ε-secure aggregation at 2x GHZ overhead, plus quantum advantages in gradient conflict detection.
LionMuon: Alternating Spectral and Sign Descent for Efficient Training
cs.LG 2026-05 unverdicted novelty 6.0

LionMuon alternates Lion sign steps and Muon spectral steps with shared dual-EMA momentum to match Lion memory while outperforming both at P=2 on 124M-720M models, backed by heavy-tailed complexity bounds that predict...
LionMuon: Alternating Spectral and Sign Descent for Efficient Training
cs.LG 2026-05 unverdicted novelty 6.0

LionMuon alternates Lion and Muon steps with shared dual-EMA buffer to Pareto-dominate existing optimizers in loss and compute on models up to 720M parameters.
Robust stochastic first order methods in heavy-tailed noise via medoid mini-batch gradient sampling
math.OC 2026-05 unverdicted novelty 6.0

R-SGD-Mini achieves O(1/T) convergence of expected squared gradient norm to a noise-dependent neighborhood in heavy-tailed settings by selecting the medoid gradient from M data chunks.
SignMuon: Communication-Efficient Distributed Muon Optimization
cs.LG 2026-05 unverdicted novelty 6.0

SignMuon merges majority-vote sign aggregation from signSGD with Muon's polar-factor steps to create a communication-efficient distributed optimizer that matches signSGD rates under symmetric noise and shows strong em...