LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.
hub Canonical reference
PyTorch Distributed: Experiences on Accelerating Data Parallel Training
Canonical reference. 86% of citing Pith papers cite this work as background.
abstract
This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.
Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.
MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.
BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumonia detection.
FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.
Splaxel achieves up to 7.6x speedup in distributed 3DGS training on scenes with up to 120M Gaussians by using pixel-level communication and visibility prediction while preserving reconstruction quality.
Standard LLM inference benchmarks introduce systematic bias via GIL-induced queuing in single-process asyncio setups; a multi-process framework and NTPOT metric isolate true serving engine performance at high query rates.
MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.
PAT adaptively reconfigures tensor parallelism in RLHF generation using predictor-guided decisions and lightweight state updates, cutting generation latency by up to 34.6%.
Gradient descent optimization reconstructs POVMs for phase-insensitive quantum detectors with higher or comparable fidelity to constrained convex optimization but in much less time.
MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.
Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.
Analog-SGD-AP converges with iteration complexity O(ε^{-2} + ε^{-1}) for multi-layer DNNs on AIMC hardware despite analog weight-update imperfections and asynchronous stale gradients.
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.
A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.
veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at massive scale.
PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.
citing papers explorer
-
Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
Standard LLM inference benchmarks introduce systematic bias via GIL-induced queuing in single-process asyncio setups; a multi-process framework and NTPOT metric isolate true serving engine performance at high query rates.
-
Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism
PAT adaptively reconfigures tensor parallelism in RLHF generation using predictor-guided decisions and lightweight state updates, cutting generation latency by up to 34.6%.