PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 04:10 UTC · model grok-4.3
The pith
PyTorch FSDP enables training of significantly larger models than standard Distributed Data Parallel while delivering comparable performance and near-linear TFLOPS scalability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
What carries the argument
Fully Sharded Data Parallel (FSDP), which shards parameters, gradients, and optimizer states across data-parallel processes and is co-designed with PyTorch's tensor, dispatcher, and CUDA allocator layers.
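To make the sharding machinery concrete, here is a minimal sketch of wrapping a model with PyTorch's public FSDP API; the Transformer model, optimizer, and launch details are illustrative assumptions, not configurations from the paper.

```python
# Minimal FSDP wrap; assumes one process per GPU, launched with torchrun.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer(d_model=512).cuda()  # placeholder model
model = FSDP(model)  # shards parameters, gradients, and optimizer state

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

src = torch.rand(10, 8, 512, device="cuda")  # (seq_len, batch, d_model)
tgt = torch.rand(20, 8, 512, device="cuda")
loss = model(src, tgt).sum()
loss.backward()  # gradients are reduce-scattered across ranks
optim.step()     # each rank updates only the shard it owns
```

Because the optimizer is constructed after wrapping, it sees only the sharded flat parameters, which is where the memory savings over DDP-style full replication come from.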
Load-bearing premise
Close co-design with PyTorch internals will deliver non-intrusive usage and high efficiency across diverse hardware and model architectures without hidden performance cliffs.
What would settle it
A direct head-to-head benchmark on identical hardware and model size where FSDP throughput falls substantially below Distributed Data Parallel, or where scaling efficiency drops below linear beyond a modest number of GPUs.
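One conventional way to make that near-linear criterion testable (our formalization, not the paper's) is to define scaling efficiency from measured aggregate throughput:

```latex
% T(N): aggregate TFLOPS measured on N GPUs.
% Near-linear scaling means E(N) stays close to 1 as N grows;
% the falsifier above is E(N) falling well below 1 at modest N.
E(N) = \frac{T(N)}{N \cdot T(1)}
```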
read the original abstract
It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for training large models. It describes the close co-design with PyTorch core components (Tensor implementation, dispatcher, and CUDA memory caching allocator) to enable non-intrusive usage and high efficiency, along with native optimizations for resource utilization across hardware. Experimental results are reported to demonstrate performance comparable to Distributed Data Parallel (DDP), support for significantly larger models, and near-linear TFLOPS scalability.
Significance. If the empirical claims are robust, this is a significant practical contribution: it lowers the technical barrier for large-model training by delivering an efficient, integrated PyTorch primitive that supports model sizes beyond DDP while maintaining competitive throughput. The emphasis on co-design for efficiency and the reported scalability results would be valuable to the distributed systems and ML systems communities.
major comments (2)
- [§5] Experimental Evaluation: The central claim of DDP-comparable performance and near-linear TFLOPS scaling is load-bearing, yet the reported results provide no concrete details on model sizes (parameter counts), hardware specifications (GPU count, interconnect, CUDA version), run counts, or error bars. This directly undermines any assessment of whether the co-design assumptions hold without hidden overheads on untested configurations.
- [§3] FSDP Design and Co-design: The assumption that integration with the CUDA allocator and dispatcher yields high efficiency without performance cliffs is presented as a key enabler, but no ablation studies isolate the contribution of each co-design element versus baseline sharding or communication optimizations. This leaves the robustness claim under-supported for diverse model architectures or interconnects.
minor comments (2)
- [Abstract] The abstract states 'near-linear scalability' without quantifying the observed scaling slope or the range of GPU counts over which it was measured; adding this would improve precision.
- [§3] Notation for sharding strategies and memory savings could be introduced earlier with a small table for clarity before the experimental section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate planned revisions to strengthen the presentation of our experimental results and design rationale.
read point-by-point responses
- Referee: [§5] Experimental Evaluation: The central claim of DDP-comparable performance and near-linear TFLOPS scaling is load-bearing, yet the reported results provide no concrete details on model sizes (parameter counts), hardware specifications (GPU count, interconnect, CUDA version), run counts, or error bars. This directly undermines any assessment of whether the co-design assumptions hold without hidden overheads on untested configurations.
Authors: We agree that additional concrete details are needed to support reproducibility and evaluation of the claims. In the revised manuscript we will expand §5 to report the specific model parameter counts, hardware configurations (GPU count, interconnect type, CUDA version), number of runs, and error bars or variance measures for the key metrics. This will allow readers to better judge the robustness of the observed DDP-comparable performance and scaling behavior. revision: yes
- Referee: [§3] FSDP Design and Co-design: The assumption that integration with the CUDA allocator and dispatcher yields high efficiency without performance cliffs is presented as a key enabler, but no ablation studies isolate the contribution of each co-design element versus baseline sharding or communication optimizations. This leaves the robustness claim under-supported for diverse model architectures or interconnects.
Authors: The paper describes FSDP as a tightly integrated system whose value is demonstrated through end-to-end scaling results rather than isolated component studies. We will partially revise §3 to provide additional rationale for each co-design decision and to reference any internal validation data available from our development process. Comprehensive ablations across every architecture and interconnect are outside the scope of this experience-focused paper, but we will clarify the configurations in which the claims have been validated. revision: partial
Circularity Check
No circularity: empirical engineering report with direct measurements, no derivations or fitted predictions.
full rationale
The paper describes the FSDP implementation, its co-design with PyTorch internals, and reports empirical performance results on specific hardware and models. No mathematical derivation chain, equations, or parameter-fitting steps exist that could reduce claims to inputs by construction. Central claims rest on experimental benchmarks rather than self-referential definitions, self-citations as load-bearing premises, or renamed known results. This is a standard self-contained systems paper whose evidence is externally falsifiable via reproduction on the reported setups.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. (See the FlatParameter sketch after this list.)
FSDP decomposes the model instance into smaller units and handles each unit independently... FlatParameter is a 1D tensor constructed by concatenating p flattened original parameters and padding...
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. (See the sharding-factor sketch after this list.)
FSDP offers a variety of sharding strategies... sharding factor F... hybrid sharding
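For the FlatParameter passage quoted in the first link above, a minimal sketch of the construction under the stated assumptions (flatten, concatenate, and pad so the tensor divides evenly across ranks); the function and variable names are ours, not FSDP internals.

```python
# Flatten a unit's parameters into one 1D FlatParameter, padded to a
# multiple of world_size so each rank owns an equal shard.
import torch
import torch.nn.functional as F

def make_flat_param(params, world_size):
    flat = torch.cat([p.detach().reshape(-1) for p in params])
    pad = (-flat.numel()) % world_size   # elements needed to divide evenly
    flat = F.pad(flat, (0, pad))         # zero-pad the tail
    return flat, flat.chunk(world_size)  # rank r keeps chunk[r]

params = [torch.randn(4, 3), torch.randn(5)]          # 12 + 5 = 17 elements
flat, shards = make_flat_param(params, world_size=4)
assert flat.numel() == 20                             # 17 padded up to 20
assert all(s.numel() == 5 for s in shards)            # equal shard per rank
```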
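For the sharding-factor passage in the second link, a back-of-envelope sketch of per-rank training-state memory under sharding factor F, assuming fp32 throughout and Adam's two moment tensors; the numbers are our own arithmetic, not measurements from the paper.

```python
# Parameters, gradients, and Adam moments are each sharded F ways.
# F = 1 is DDP-style full replication; F = world size is full sharding;
# hybrid sharding picks an intermediate F, e.g. the GPUs in one node.
def per_rank_gb(n_params: float, f: int, bytes_per_el: int = 4) -> float:
    per_element = bytes_per_el * 4  # param + grad + Adam m and v
    return n_params * per_element / f / 1e9

print(per_rank_gb(175e9, f=1))    # ~2800 GB/rank: beyond any single GPU
print(per_rank_gb(175e9, f=8))    # ~350 GB/rank: hybrid, shard within a node
print(per_rank_gb(175e9, f=128))  # ~21.9 GB/rank: fully sharded on 128 GPUs
```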
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 45 Pith papers
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
- A satellite foundation model for improved wealth monitoring
Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...
- ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
- Training Agents Inside of Scalable World Models
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
- ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...
- ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
- DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
- MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
- FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
- JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
- MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.
- Nucleus-Image: Sparse MoE for Image Generation
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
- OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
- Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
- Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
- DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.
- YaRN: Efficient Context Window Extension of Large Language Models
YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...
- Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
MegaTrain enables reliable full-precision training of up to 120B parameter LLMs on one H200 GPU with 1.5TB host memory via host-memory streaming, pipelined double-buffered execution, and stateless layer templates, ach...
- Sampling Parallelism for Fast and Efficient Bayesian Learning
Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
- Qwen-Image Technical Report
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
- Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over ...
- StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
- OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
- Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.
Reference graph
Works this paper leans on
- [1] 2023. torch.amp Gradient Scaling. https://pytorch.org/docs/2.0/amp.html#gradient-scaling
- [2] Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, and Yinlong Xu. 2021. Gradient compression supercharged high-performance data parallel DNN training. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 359–375.
- [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
- [4] Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, and Yang You. 2022. Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management. IEEE Transactions on Parallel and Distributed Systems 34, 1 (2022), 304–315.
- [5] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377 (2018).
- [6]
- [7] Xin He, Jianhua Sun, Hao Chen, and Dong Li. 2022. Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 505–518. https://www.usenix.org/conference/atc22/presentation/he
- [8] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019).
- [9] Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. https://doi.org/10.48550/ARXIV.1807.05358
- [10] Andrej Karpathy. 2020. MinGPT Transformer model. https://github.com/karpathy/minGPT
- [11]
- [12] Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2020. Dynamic Tensor Rematerialization. https://doi.org/10.48550/ARXIV.2006.09616
- [13]
- [14]
- [15] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. 2021. TeraPipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning. PMLR, 6543–6552.
- [16] Ming Liu, Liang Luo, Jacob Nelson, Luis Ceze, Arvind Krishnamurthy, and Kishore Atreya. 2017. IncBricks: Toward in-network computation with an in-network cache. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. 795–809.
- [17] Liang Luo, Peter West, Jacob Nelson, Arvind Krishnamurthy, and Luis Ceze. 2020. PLink: Discovering and exploiting locality for accelerated distributed training on the public cloud. Proceedings of Machine Learning and Systems 2 (2020), 82–97.
- [18] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2017. Mixed Precision Training. https://doi.org/10.48550/ARXIV.1710.03740
- [19] Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, et al.
- [20] High-performance, distributed training of large-scale deep learning recommendation models. arXiv preprint arXiv:2104.05158 (2021).
- [21] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
- [22] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Network...
- [23] NVIDIA. 2023. The NVIDIA Collective Communication Library (NCCL). https://developer.nvidia.com/nccl
- [24] OpenAI. 2023. ChatGPT. https://chat.openai.com/
- [25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, Hi...
- [26] Team PyTorch. 2023. Distributed RPC Framework. https://pytorch.org/docs/stable/rpc.html
- [27] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
- [28] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16.
- [29] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX Annual Technical Conference. 551–564.
- [30] Nick Schneider, Florian Piewak, Christoph Stiller, and Uwe Franke. 2017. RegNet: Multimodal sensor registration using deep neural networks. In 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 1803–1810.
- [31]
- [32] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al.
- [33] GSPMD: general and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663 (2021).
- [34]
- [35] Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael Tsang, Wenjun Wang, Yang Liu, Huayu Li, Yasmine Badr, Jongsoo Park, Jiyan Yang, Dheevatsa Mudigere, and Ellie Wen. 2022. DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction. https://doi.org/10.48550/ARXIV.2203.11014
- [36]
- [37] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al.
- [38] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.
- [39]