arxiv: 2605.06374 · 2 revisions
ResiHP: Taming LLM Training Failures with Dynamic Hybrid Parallelism