Understanding outer optimizers in local SGD: learning rates, momentum, and acceleration

Khaled, A · 2025 · arXiv 2509.10439

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

cs.LG · 2026-06-17 · unverdicted · novelty 6.0

FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.

Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Local MixVR achieves communication complexity that scales only with the number of workers M, independent of the total number of samples N, and outperforms Minibatch Accelerated SGD when M is smaller than order N^(1/4).

Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization

cs.LG · 2026-05-27 · unverdicted · novelty 5.0

Periodic outer-momentum restarts in two-phase optimizers exploit phase cancellation in a linearized NTK model to widen stable learning-rate and momentum ranges in language-model pretraining.

citing papers explorer

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs · cs.LG · 2026-06-17 · unverdicted · none · ref 114
Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning · cs.LG · 2026-05-31 · unverdicted · none · ref 8
Outer-Momentum Restarting in High-Dimensional Two-Phase Optimization · cs.LG · 2026-05-27 · unverdicted · none · ref 17
