FastPersist: Accelerating model checkpointing in deep learning
2 Pith papers cite this work.
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
LiveR enables live reconfiguration for elastic LLM training by asynchronously preparing new parallel worlds and streaming reshaped model state over interconnects, reducing downtime to seconds and achieving 14-23x faster reconfiguration than checkpoint/restart.
citing papers explorer
- DeadPool: Resilient LLM Training with Hot-Swapping via Zero-Overhead Checkpoint
  DeadPool achieves zero-overhead checkpointing during error-free LLM training and hot-swapping recovery in under 40 seconds by replacing failed nodes without terminating the job.
- LiveR: Fine-Grained Elasticity via Live Reconfiguration for Model Training
  LiveR enables live reconfiguration for elastic LLM training by asynchronously preparing new parallel worlds and streaming reshaped model state over interconnects, reducing downtime to seconds and achieving 14-23x faster reconfiguration than checkpoint/restart.