Title resolution pending

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He

12 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability

cs.DC · 2026-05-18 · unverdicted · novelty 6.0

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal workloads.

EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet

cs.DC · 2026-05-18 · unverdicted · novelty 6.0

EPIC defines a unified abstraction for in-network collectives on Ethernet with polymorphic implementations and modular design to support incremental hardware evolution.

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM

cs.DC · 2026-05-15 · conditional · novelty 6.0

PrismLLM constructs a sliced execution graph and uses hybrid emulation to faithfully reproduce performance and memory behavior of up to 8192-GPU LLM training runs on fewer than 1% of the original GPUs.

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training

cs.DC · 2026-04-29 · unverdicted · novelty 6.0

COPUS uses goodput to co-adapt batch size and parallelism during LLM training, converging 3.9-8% faster on average than fixing one while tuning the other.

Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

cs.NI · 2026-04-18 · unverdicted · novelty 6.0

Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in Astra-Sim simulations and a Tofino2 prototype.

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication

cs.DC · 2025-11-10 · unverdicted · novelty 6.0

DMA offloads on AMD MI300X GPUs are extended to latency-bound ML communication using untapped hardware features, closing a performance gap of up to 4.5x versus RCCL in collectives and delivering up to 1.5x lower latency and 1.9x higher throughput in LLM inference over vLLM.

HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters

cs.DC · 2025-09-29 · unverdicted · novelty 6.0

HARP provides a fine-grained inter-operator parallel planner and a heterogeneity-aware 1F1B scheduler that together improve training throughput by 1.3x-1.6x on mixed GPU clusters compared with current homogeneous-oriented frameworks.

HybridFlow: A Flexible and Efficient RLHF Framework

cs.LG · 2024-09-28 · unverdicted · novelty 6.0

HybridFlow combines single- and multi-controller paradigms with a 3D-HybridEngine to deliver 1.53x to 20.57x higher throughput for various RLHF algorithms compared to prior systems.

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

cs.DC · 2026-05-18 · unverdicted · novelty 5.0

TierCheck is a cluster-aware tiered checkpointing system that uses local memory for fast differential recovery and remote persistent storage for base checkpoints to reduce overhead and enable high-frequency checkpointing in LLM training.

DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS

cs.DC · 2026-02-21 · unverdicted · novelty 5.0

DualScale reduces energy by up to 39% in prefill and 48% in decode for disaggregated LLM serving while meeting TTFT and TPOT SLOs on a 16x H100 cluster.

HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware

cs.DC · 2024-09-02 · unverdicted · novelty 5.0

HexiScale enables LLM training on heterogeneous GPUs via asymmetric parallelism and graph partitioning, matching homogeneous performance at equal FLOPS and delivering 1.5-2.4x higher throughput than prior heterogeneous systems.

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

cs.DC · 2026-05-06 · unverdicted · novelty 4.0

CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.

citing papers explorer

Showing 12 of 12 citing papers.

A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
cs.DC · 2026-05-18 · unverdicted · none · ref 44

EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet
cs.DC · 2026-05-18 · unverdicted · none · ref 68

A Few GPUs, A Whole Lotta Scale: Faithful LLM Training Emulation with PrismLLM
cs.DC · 2026-05-15 · conditional · none · ref 22

COPUS: Co-adaptive Parallelism and Batch Size Selection in Large Language Model Training
cs.DC · 2026-04-29 · unverdicted · none · ref 41

Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
cs.NI · 2026-04-18 · unverdicted · none · ref 66

DMA-Latte: Expanding the Reach of DMA Offloads to Latency-bound ML Communication
cs.DC · 2025-11-10 · unverdicted · none · ref 24

HARP: Orchestrating Automated Parallel Training on Heterogeneous GPU Clusters
cs.DC · 2025-09-29 · unverdicted · none · ref 26

HybridFlow: A Flexible and Efficient RLHF Framework
cs.LG · 2024-09-28 · unverdicted · none · ref 65

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training
cs.DC · 2026-05-18 · unverdicted · none · ref 31

DualScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
cs.DC · 2026-02-21 · unverdicted · none · ref 39

HexiScale: Facilitating Large Language Model Training over Heterogeneous Hardware
cs.DC · 2024-09-02 · unverdicted · none · ref 43

CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
cs.DC · 2026-05-06 · unverdicted · none · ref 45
