Nesterov method for asynchronous pipeline parallel optimization. arXiv preprint arXiv:2505.01099, 2025
2 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (2)
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Citing papers explorer
- One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
  One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
- Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency
  PACI enables bubble-free asynchronous pipeline training by bounding version drift via local gradient accumulation, matching synchronous stability with higher throughput and no extra memory.