hub Canonical reference

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

Shen Li, Yanli Zhao, Rohan Varma, Omkar Salpekar, Pieter Noordhuis, Teng Li · 2020 · cs.DC · arXiv 2006.15704

Canonical reference. 86% of citing Pith papers cite this work as background.

36 Pith papers citing it

Background 86% of classified citations

open full Pith review browse 36 citing papers arXiv PDF

abstract

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning research and applications. Recent advances in deep learning argue for the value of large datasets and large models, which necessitates the ability to scale out model training to more computational resources. Data parallelism has emerged as a popular solution for distributed training thanks to its straightforward principle and broad applicability. In general, the technique of distributed data parallelism replicates the model on every computational resource to generate gradients independently and then communicates those gradients at each iteration to keep model replicas consistent. Despite the conceptual simplicity of the technique, the subtle dependencies between computation and communication make it non-trivial to optimize the distributed training efficiency. As of v1.5, PyTorch natively provides several techniques to accelerate distributed data parallel, including bucketing gradients, overlapping computation with communication, and skipping gradient synchronization. Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 6 other 1

citation-polarity summary

background 6 unclear 1

representative citing papers

LOSCAR-SGD: Local SGD with Communication-Computation Overlap and Delay-Corrected Sparse Model Averaging

cs.LG · 2026-05-20 · unverdicted · novelty 7.0

LOSCAR-SGD combines local updates, sparse model averaging, and communication-computation overlap with a delay-corrected merge rule, providing convergence rates for smooth non-convex objectives under worker heterogeneity.

JanusPipe: Efficient Pipeline Parallel Training for Machine Learning Interatomic Potentials

cs.DC · 2026-05-18 · unverdicted · novelty 7.0 · 2 refs

JanusPipe introduces SymFold and WaveK to enable efficient 3D-parallel training for conservative MLIPs, reporting 1.51x and 1.45x average throughput gains over 1F1B and Hanayo baselines on 32 GPUs.

Ringmaster LMO: Asynchronous Linear Minimization Oracle Momentum Method

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Ringmaster LMO extends delay-thresholding from ASGD to LMO-based momentum updates, providing convergence guarantees under (L0, L1)-smoothness and time-complexity bounds that recover optimal rates in the Euclidean case.

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving

cs.LG · 2026-05-03 · unverdicted · novelty 7.0 · 2 refs

MoE-Prefill achieves 1.35-1.59x higher throughput for prefill-only MoE serving by using asynchronous expert parallelism to overlap weight AllGather with computation and prefix-aware routing with true-FLOPs tracking.

ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads

cs.LG · 2026-04-07 · unverdicted · novelty 7.0

ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneous workloads without quality loss.

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

cs.CV · 2023-03-02 · conditional · novelty 7.0

BiomedCLIP, pretrained on the new 15-million-pair PMC-15M dataset, achieves state-of-the-art performance on diverse biomedical vision-language tasks and even outperforms radiology-specific models on chest X-ray pneumonia detection.

FoMoE: Breaking the Full-Replica Barrier with a Federation of MoEs

cs.LG · 2026-06-17 · unverdicted · novelty 6.0

FoMoE partitions expert layers across workers in MoE LLMs, skips non-resident experts, and reports up to 1.42x lower communication than baselines plus 1.4x throughput gains while maintaining stable routing.

Splaxel: Efficient Distributed Training of 3D Gaussian Splatting for Large-scale Scene Reconstruction via Pixel-level Communication

cs.DC · 2026-06-17 · unverdicted · novelty 6.0

Splaxel achieves up to 7.6x speedup in distributed 3DGS training on scenes with up to 120M Gaussians by using pixel-level communication and visibility prediction while preserving reconstruction quality.

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

cs.AI · 2026-05-22 · unverdicted · novelty 6.0

Standard LLM inference benchmarks introduce systematic bias via GIL-induced queuing in single-process asyncio setups; a multi-process framework and NTPOT metric isolate true serving engine performance at high query rates.

Exploiting Multicast for Accelerating Collective Communication

cs.DC · 2026-05-21 · unverdicted · novelty 6.0

MultiWrite is a new many-to-many transmission semantic that uses multicast principles to eliminate redundant packets in collective operations, delivering up to 33% lower latency for AllGather and AlltoAll on Ascend NPUs.

Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism

cs.AI · 2026-05-03 · unverdicted · novelty 6.0

PAT adaptively reconfigures tensor parallelism in RLHF generation using predictor-guided decisions and lightweight state updates, cutting generation latency by up to 34.6%.

Gradient-descent methods for scalable quantum detector tomography

quant-ph · 2025-11-18 · conditional · novelty 6.0

Gradient descent optimization reconstructs POVMs for phase-insensitive quantum detectors with higher or comparable fidelity to constrained convex optimization but in much less time.

MegaScale-Data: Scaling Dataloader for Multisource Large Foundation Model Training

cs.DC · 2025-04-14 · unverdicted · novelty 6.0

MegaScale-Data is a distributed data loading system that disaggregates preprocessing and applies auto-partitioning to deliver 4.5x higher end-to-end training throughput and 13.5x lower CPU memory usage for multisource large foundation models.

Deep Optimizer States: Towards Scalable Training of Transformer Models Using Interleaved Offloading

cs.LG · 2024-10-26 · unverdicted · novelty 6.0

Deep Optimizer States splits LLMs into subgroups and uses a performance model to schedule optimizer updates on CPU or GPU, achieving 2.5x faster iterations than prior offloading methods when integrated with DeepSpeed.

On the Convergence Theory of Pipeline Gradient-based Analog In-memory Training

cs.LG · 2024-10-19 · unverdicted · novelty 6.0

Analog-SGD-AP converges with iteration complexity O(ε^{-2} + ε^{-1}) for multi-layer DNNs on AIMC hardware despite analog weight-update imperfections and asynchronous stale gradients.

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.

ShardTensor: Domain Parallelism for Scientific Machine Learning

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

Accelerating Compound LLM Training Workloads with Maestro

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

Maestro accelerates compound LLM training via section graphs for per-component configuration and wavefront scheduling for dynamic execution, reducing GPU consumption by ~40% in real deployments.

MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production

cs.DC · 2026-05-09 · unverdicted · novelty 6.0

MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.

CommFuse: Hiding Tail Latency via Communication Decomposition and Fusion for Distributed LLM Training

cs.LG · 2026-04-27 · unverdicted · novelty 6.0

CommFuse eliminates tail latency in communication-computation overlap for distributed LLM training by decomposing collective operations into P2P communications and fusing them with fine-grained computation scheduling.

Training Time Prediction for Mixed Precision-based Distributed Training

cs.LG · 2026-04-17 · unverdicted · novelty 6.0

A precision-aware predictor for distributed training time achieves 9.8% MAPE across precision settings, compared to errors up to 147.85% when precision is ignored.

Continuous Adversarial Flow Models

cs.LG · 2026-04-13 · unverdicted · novelty 6.0

Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-image benchmarks.

veScale-FSDP: Flexible and High-Performance FSDP at Scale

cs.DC · 2026-02-25 · unverdicted · novelty 6.0

veScale-FSDP uses RaggedShard and structure-aware planning to support block-wise quantization and non-element-wise optimizers while delivering 5-66% higher throughput and 16-30% lower memory than prior FSDP systems at massive scale.

PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel

cs.DC · 2023-04-21 · unverdicted · novelty 6.0

PyTorch Fully Sharded Data Parallel enables training of significantly larger models than Distributed Data Parallel with comparable speed and near-linear TFLOPS scaling.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks cs.AI · 2026-05-22 · unverdicted · none · ref 15 · internal anchor
Standard LLM inference benchmarks introduce systematic bias via GIL-induced queuing in single-process asyncio setups; a multi-process framework and NTPOT metric isolate true serving engine performance at high query rates.
Accelerating Long-Tail Generation in Synchronous RLHF Training via Adaptive Tensor Parallelism cs.AI · 2026-05-03 · unverdicted · none · ref 8 · internal anchor
PAT adaptively reconfigures tensor parallelism in RLHF generation using predictor-guided decisions and lightweight state updates, cutting generation latency by up to 34.6%.

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer