LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
hub Canonical reference
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumonia detection.
MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
Gradient descent optimization reconstructs POVMs for phase-insensitive quantum detectors with higher or comparable fidelity to constrained convex optimization but in much less time.
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
Analog-SGD-AP converges with iteration complexity O(ε^{-2} + ε^{-1}) for multi-layer DNNs on AIMC hardware despite analog weight-update imperfections and asynchronous stale gradients.
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at massive scale.
PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
Apollo uses temporal-spatial multiplexing and a performance model to let multiple multimodal model modules share GPUs, delivering up to 1.31x training speedup in testbed experiments.
ProTrain automates memory management for LLM training via cost models from profiling to deliver 1.43x-2.71x throughput gains over state-of-the-art systems without accuracy loss.
citing papers explorer
-
LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging
LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
-
JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials
JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
-
Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
-
MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
-
ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
-
BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumonia detection.
-
Exploiting Multicast for Accelerating Collective Communication
MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
-
Gradient-descent methods for scalable quantum detector tomography
Gradient descent optimization reconstructs POVMs for phase-insensitive quantum detectors with higher or comparable fidelity to constrained convex optimization but in much less time.
-
MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
-
Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading
Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
-
On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training
Analog-SGD-AP converges with iteration complexity O(ε^{-2} + ε^{-1}) for multi-layer DNNs on AIMC hardware despite analog weight-update imperfections and asynchronous stale gradients.
-
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
-
ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
-
Accelerating Compound LLM Training Workloads with Maestro
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
-
MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
-
CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
-
Training Time Prediction for Mixed Precision-based Distributed Training
A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.
-
Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
-
veScale-FSDP: Flexible and High-Performance FSDP at Scale
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at massive scale.
-
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
DynaFlow: Transparent and Flexible Intra-Device Parallelism via Programmable Operator Scheduling
DynaFlow enables transparent intra-device parallelism in ML systems by separating model definition from execution scheduling, integrating into 6 frameworks with up to 1.29x throughput gains and minimal code changes.
-
Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing
Apollo uses temporal-spatial multiplexing and a performance model to let multiple multimodal model modules share GPUs, delivering up to 1.31x training speedup in testbed experiments.
-
ProTrain: Efficient LLM Training via Memory-Aware Techniques
ProTrain automates memory management for LLM training via cost models from profiling to deliver 1.43x-2.71x throughput gains over state-of-the-art systems without accuracy loss.
-
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
PyTorch distributed data parallel attains near-linear scalability on 256 GPUs through gradient bucketing, computation-communication overlap, and selective synchronization skipping.
-
Adaptive DNN Partitioning and Offloading in Heterogeneous Edge-Cloud Continuum
An adaptive DNN partitioning framework for heterogeneous edge-cloud systems reduces energy consumption by 27-36% and end-to-end latency by 6-23% versus static baselines on real hardware with VGG16, AlexNet, and MobileNetV2.
-
Rennala MVR: Improved Time Complexity for Parallel Stochastic Optimization via Momentum-Based Variance Reduction
Rennala MVR improves time complexity over Rennala SGD for smooth nonconvex stochastic optimization in heterogeneous parallel systems under a mean-squared smoothness assumption.
-
Lit Silicon: A Case Where Thermal Imbalance Couples Concurrent Execution in Multiple GPUs
Thermal imbalance in multi-GPU nodes creates hotter straggler GPUs that slow down cooler leader GPUs during overlapped computation and communication in LLM training.
-
CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over one year.
-
Modeling the Impact of Fiber Latency on Compute-Communication Overlap in Geo-Distributed Multi-Datacenter AI Training
Discrete-event simulation finds optimal 10-100 km separation between AI clusters where hollow-core fiber provides 25% higher compute-communication overlap in geo-distributed data-parallel training.