ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkpoint-restart on up to 512 GPUs with 256 failures.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.DC (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload