One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
As shown in Table 4, the performance of the Async + EF model closely matches the synchronous baseline across these diverse tasks
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.