PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-12 04:10 UTC · model grok-4.3
The pith
PyTorch FSDP enables training of significantly larger models than standard Distributed Data Parallel while delivering comparable performance and near-linear TFLOPS scalability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
What carries the argument
Fully Sharded Data Parallel (FSDP), which shards parameters, gradients, and optimizer states across data-parallel processes and is co-designed with PyTorch's tensor, dispatcher, and CUDA allocator layers.
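To make the sharding machinery concrete, here is a minimal sketch of wrapping a model with PyTorch's public FSDP API; the Transformer model, optimizer, and launch details are illustrative assumptions, not configurations from the paper.

```python
# Minimal FSDP wrap; assumes one process per GPU, launched with torchrun.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Transformer(d_model=512).cuda()  # placeholder model
model = FSDP(model)  # shards parameters, gradients, and optimizer state

optim = torch.optim.AdamW(model.parameters(), lr=1e-4)

src = torch.rand(10, 8, 512, device="cuda")  # (seq_len, batch, d_model)
tgt = torch.rand(20, 8, 512, device="cuda")
loss = model(src, tgt).sum()
loss.backward()  # gradients are reduce-scattered across ranks
optim.step()     # each rank updates only the shard it owns
```

Because the optimizer is constructed after wrapping, it sees only the sharded flat parameters, which is where the memory savings over DDP-style full replication come from.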
Load-bearing premise
Close co-design with PyTorch internals will deliver non-intrusive usage and high efficiency across diverse hardware and model architectures without hidden performance cliffs.
What would settle it
A direct head-to-head benchmark on identical hardware and model size where FSDP throughput falls substantially below Distributed Data Parallel, or where scaling efficiency drops below linear beyond a modest number of GPUs.
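One conventional way to make that near-linear criterion testable (our formalization, not the paper's) is to define scaling efficiency from measured aggregate throughput:

```latex
% T(N): aggregate TFLOPS measured on N GPUs.
% Near-linear scaling means E(N) stays close to 1 as N grows;
% the falsifier above is E(N) falling well below 1 at modest N.
E(N) = \frac{T(N)}{N \cdot T(1)}
```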
read the original abstract
It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for training large models. It describes the close co-design with PyTorch core components (Tensor implementation, dispatcher, and CUDA memory caching allocator) to enable non-intrusive usage and high efficiency, along with native optimizations for resource utilization across hardware. Experimental results are reported to demonstrate performance comparable to Distributed Data Parallel (DDP), support for significantly larger models, and near-linear TFLOPS scalability.
Significance. If the empirical claims are robust, this is a significant practical contribution: it lowers the technical barrier for large-model training by delivering an efficient, integrated PyTorch primitive that supports model sizes beyond DDP while maintaining competitive throughput. The emphasis on co-design for efficiency and the reported scalability results would be valuable to the distributed systems and ML systems communities.
major comments (2)
- [§5] Experimental Evaluation: The central claim of DDP-comparable performance and near-linear TFLOPS scaling is load-bearing, yet the reported results provide no concrete details on model sizes (parameter counts), hardware specifications (GPU count, interconnect, CUDA version), run counts, or error bars. This directly undermines any assessment of whether the co-design assumptions hold without hidden overheads on untested configurations.
- [§3] FSDP Design and Co-design: The assumption that integration with the CUDA allocator and dispatcher yields high efficiency without performance cliffs is presented as a key enabler, but no ablation studies isolate the contribution of each co-design element versus baseline sharding or communication optimizations. This leaves the robustness claim under-supported for diverse model architectures or interconnects.
minor comments (2)
- [Abstract] The abstract states 'near-linear scalability' without quantifying the observed scaling slope or the range of GPU counts over which it was measured; adding this would improve precision.
- [§3] Notation for sharding strategies and memory savings could be introduced earlier with a small table for clarity before the experimental section.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major point below and indicate planned revisions to strengthen the presentation of our experimental results and design rationale.
read point-by-point responses
- Referee: [§5] Experimental Evaluation: The central claim of DDP-comparable performance and near-linear TFLOPS scaling is load-bearing, yet the reported results provide no concrete details on model sizes (parameter counts), hardware specifications (GPU count, interconnect, CUDA version), run counts, or error bars. This directly undermines any assessment of whether the co-design assumptions hold without hidden overheads on untested configurations.
Authors: We agree that additional concrete details are needed to support reproducibility and evaluation of the claims. In the revised manuscript we will expand §5 to report the specific model parameter counts, hardware configurations (GPU count, interconnect type, CUDA version), number of runs, and error bars or variance measures for the key metrics. This will allow readers to better judge the robustness of the observed DDP-comparable performance and scaling behavior. revision: yes
- Referee: [§3] FSDP Design and Co-design: The assumption that integration with the CUDA allocator and dispatcher yields high efficiency without performance cliffs is presented as a key enabler, but no ablation studies isolate the contribution of each co-design element versus baseline sharding or communication optimizations. This leaves the robustness claim under-supported for diverse model architectures or interconnects.
Authors: The paper describes FSDP as a tightly integrated system whose value is demonstrated through end-to-end scaling results rather than isolated component studies. We will partially revise §3 to provide additional rationale for each co-design decision and to reference any internal validation data available from our development process. Comprehensive ablations across every architecture and interconnect are outside the scope of this experience-focused paper, but we will clarify the configurations in which the claims have been validated. revision: partial
Circularity Check
No circularity: empirical engineering report with direct measurements, no derivations or fitted predictions.
full rationale
The paper describes the FSDP implementation, its co-design with PyTorch internals, and reports empirical performance results on specific hardware and models. No mathematical derivation chain, equations, or parameter-fitting steps exist that could reduce claims to inputs by construction. Central claims rest on experimental benchmarks rather than self-referential definitions, self-citations as load-bearing premises, or renamed known results. This is a standard self-contained systems paper whose evidence is externally falsifiable via reproduction on the reported setups.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. (See the FlatParameter sketch after this list.)
FSDP decomposes the model instance into smaller units and handles each unit independently... FlatParameter is a 1D tensor constructed by concatenating p flattened original parameters and padding...
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
  Relation between the paper passage and the cited Recognition theorem is unclear. (See the sharding-factor sketch after this list.)
FSDP offers a variety of sharding strategies... sharding factor F... hybrid sharding
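For the FlatParameter passage quoted in the first link above, a minimal sketch of the construction under the stated assumptions (flatten, concatenate, and pad so the tensor divides evenly across ranks); the function and variable names are ours, not FSDP internals.

```python
# Flatten a unit's parameters into one 1D FlatParameter, padded to a
# multiple of world_size so each rank owns an equal shard.
import torch
import torch.nn.functional as F

def make_flat_param(params, world_size):
    flat = torch.cat([p.detach().reshape(-1) for p in params])
    pad = (-flat.numel()) % world_size   # elements needed to divide evenly
    flat = F.pad(flat, (0, pad))         # zero-pad the tail
    return flat, flat.chunk(world_size)  # rank r keeps chunk[r]

params = [torch.randn(4, 3), torch.randn(5)]          # 12 + 5 = 17 elements
flat, shards = make_flat_param(params, world_size=4)
assert flat.numel() == 20                             # 17 padded up to 20
assert all(s.numel() == 5 for s in shards)            # equal shard per rank
```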
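For the sharding-factor passage in the second link, a back-of-envelope sketch of per-rank training-state memory under sharding factor F, assuming fp32 throughout and Adam's two moment tensors; the numbers are our own arithmetic, not measurements from the paper.

```python
# Parameters, gradients, and Adam moments are each sharded F ways.
# F = 1 is DDP-style full replication; F = world size is full sharding;
# hybrid sharding picks an intermediate F, e.g. the GPUs in one node.
def per_rank_gb(n_params: float, f: int, bytes_per_el: int = 4) -> float:
    per_element = bytes_per_el * 4  # param + grad + Adam m and v
    return n_params * per_element / f / 1e9

print(per_rank_gb(175e9, f=1))    # ~2800 GB/rank: beyond any single GPU
print(per_rank_gb(175e9, f=8))    # ~350 GB/rank: hybrid, shard within a node
print(per_rank_gb(175e9, f=128))  # ~21.9 GB/rank: fully sharded on 128 GPUs
```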
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 45 Pith papers
- OP-GRPO: Efficient Off-Policy GRPO for Flow-Matching Models
OP-GRPO is the first off-policy GRPO method for flow-matching models that reuses trajectories via replay buffer and importance sampling corrections, matching on-policy performance with 34.2% of the training steps.
- Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
- CausalCine: Real-Time Autoregressive Generation for Multi-Shot Video Narratives
CausalCine enables real-time causal autoregressive multi-shot video generation via multi-shot training, content-aware memory routing for coherence, and distillation to few-step inference.
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker maintains identity consistency in long-term diffusion talking-head videos by encoding temporal references from a static image and training a student model under inference-like conditions via asymmetric dist...
- A satellite foundation model for improved wealth monitoring
Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...
- ALTO: Adaptive LoRA Tuning and Orchestration for Heterogeneous LoRA Training Workloads
ALTO accelerates LoRA tuning up to 13.8x by monitoring loss trajectories for early stopping, using fused grouped GEMM with rank-local adapter parallelism, and combining intra- and inter-task scheduling for heterogeneo...
- Training Agents Inside of Scalable World Models
Dreamer 4 is the first agent to obtain diamonds in Minecraft from only offline data by reinforcement learning inside a scalable world model that accurately predicts game mechanics.
- Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
- Scaling and evaluating sparse autoencoders
K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.
- ReCoVer: Resilient LLM Pre-Training System via Fault-Tolerant Collective and Versatile Workload
ReCoVer uses fault-tolerant collectives, in-step recovery, and dynamic microbatch redistribution to maintain training trajectory equivalence under GPU failures, delivering 2.23x higher effective throughput than checkp...
- ShardTensor: Domain Parallelism for Scientific Machine Learning
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via profiling, model adaptations, and runtime kernel orchestration.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
- DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
DisagMoE achieves up to 1.8x faster MoE training by disaggregating attention and FFN layers into disjoint GPU groups with a multi-stage uni-directional pipeline and roofline-based bandwidth balancing.
- MegaScale-Omni: A Hyper-Scale, Workload-Resilient System for MultiModal LLM Training in Production
MegaScale-Omni delivers 1.27x-7.57x higher throughput for dynamic multimodal LLM training by decoupling encoder and LLM parallelism, using unified colocation, and applying adaptive workload balancing.
- FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymTalker uses temporal reference encoding and asymmetric knowledge distillation to produce identity-consistent talking head videos up to 600 seconds long at 66 FPS.
- AsymTalker: Identity-Consistent Long-Term Talking Head Generation via Asymmetric Distillation
AsymK-Talker introduces kernel-conditioned loop generation, temporal reference encoding, and asymmetric kernel distillation to achieve real-time, drift-resistant talking head synthesis from audio using diffusion models.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 reaches 99.8% average success on the LIBERO benchmark using one-shot warm-up plus LAPO reinforcement learning on latent physical reasoning, with up to 44% real-world gains on complex single- and dual-arm tasks.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
LaST-R1 introduces a RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
- Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
- JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
- MARS²: Scaling Multi-Agent Tree Search via Reinforcement Learning for Code Generation
MARS² integrates multi-agent collaboration with tree-structured search in RL to boost code generation by increasing exploratory diversity and using path-level group advantages for credit assignment.
- Nucleus-Image: Sparse MoE for Image Generation
A 17B-parameter sparse MoE diffusion transformer activates 2B parameters per pass and reaches competitive quality on image generation benchmarks without post-training.
- OmniShow: Unifying Multimodal Conditions for Human-Object Interaction Video Generation
OmniShow unifies text, image, audio, and pose conditions into an end-to-end model for high-quality human-object interaction video generation and introduces the HOIVG-Bench benchmark, claiming state-of-the-art results.
- Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
- Continuous Adversarial Flow Models
Continuous adversarial flow models replace MSE in flow matching with adversarial training via a discriminator, improving guidance-free FID on ImageNet from 8.26 to 3.63 for SiT and similar gains for JiT and text-to-im...
- DeepStack: Scalable and Accurate Design Space Exploration for Distributed 3D-Stacked AI Accelerators
DeepStack introduces a fast performance model and hierarchical search method for co-optimizing 3D DRAM stacking, interconnects, and distributed scheduling in AI accelerators, delivering up to 9.5x throughput gains ove...
- InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
- MAGI-1: Autoregressive Video Generation at Scale
MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
- OpenVLA: An Open-Source Vision-Language-Action Model
OpenVLA achieves 16.5% higher task success than the 55B RT-2-X model across 29 tasks with 7x fewer parameters while enabling effective fine-tuning and quantization without performance loss.
- OpenRLHF: An Easy-to-use, Scalable and High-performance RLHF Framework
OpenRLHF is a new open-source RLHF framework reporting 1.22x to 1.68x speedups and fewer lines of code than prior systems.
- YaRN: Efficient Context Window Extension of Large Language Models
YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation b...
- Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
- On Training Large Language Models for Long-Horizon Tasks: An Empirical Study of Horizon Length
Longer action horizons bottleneck LLM agent training through instability, but training with reduced horizons stabilizes learning and enables better generalization to longer horizons.
- MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU
MegaTrain enables reliable full-precision training of up to 120B parameter LLMs on one H200 GPU with 1.5TB host memory via host-memory streaming, pipelined double-buffered execution, and stateless layer templates, ach...
- Sampling Parallelism for Fast and Efficient Bayesian Learning
Sampling parallelism distributes Bayesian sample evaluations across GPUs for near-perfect scaling, lower memory use, and faster convergence via per-GPU data augmentations, outperforming pure data parallelism in diversity.
- Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
Z-Image is an efficient 6B-parameter foundation model for image generation that rivals larger commercial systems in photorealism and bilingual text rendering through a new single-stream diffusion transformer and strea...
- Qwen-Image Technical Report
Qwen-Image is a foundation model that reaches state-of-the-art results in image generation and editing by combining a large-scale text-focused data pipeline with curriculum learning and dual semantic-reconstructive en...
- Movie Gen: A Cast of Media Foundation Models
A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
- CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training
CCL-D detects slow/hang anomalies in CCL for distributed training via lightweight tracing probes and an intelligent analyzer, achieving near-complete coverage and 6-minute rank localization on a 4000-GPU cluster over ...
- StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.
- Seedance 1.0: Exploring the Boundaries of Video Generation Models
Seedance 1.0 generates 5-second 1080p videos in about 41 seconds with claimed superior motion quality, prompt adherence, and multi-shot consistency compared to prior models.
- OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models
OpenFlamingo provides open-source autoregressive vision-language models that achieve 80-89% of Flamingo performance on seven vision-language datasets.
- Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.
Reference graph
Works this paper leans on
- [1] 2023. torch.amp Gradient Scaling. https://pytorch.org/docs/2.0/amp.html#gradient-scaling
- [2] Youhui Bai, Cheng Li, Quan Zhou, Jun Yi, Ping Gong, Feng Yan, Ruichuan Chen, and Yinlong Xu. 2021. Gradient compression supercharged high-performance data parallel DNN training. In Proceedings of the ACM SIGOPS 28th Symposium on Operating Systems Principles. 359–375.
- [3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33 (2020), 1877–1901.
- [4] Jiarui Fang, Zilin Zhu, Shenggui Li, Hui Su, Yang Yu, Jie Zhou, and Yang You. 2022. Parallel Training of Pre-Trained Models via Chunk-Based Dynamic Memory Management. IEEE Transactions on Parallel and Distributed Systems 34, 1 (2022), 304–315.
- [5] Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, and Phil Gibbons. 2018. PipeDream: Fast and efficient pipeline parallel DNN training. arXiv preprint arXiv:1806.03377 (2018).
- [6]
- [7] Xin He, Jianhua Sun, Hao Chen, and Dong Li. 2022. Campo: Cost-Aware Performance Optimization for Mixed-Precision Neural Network Training. In 2022 USENIX Annual Technical Conference (USENIX ATC 22). USENIX Association, Carlsbad, CA, 505–518. https://www.usenix.org/conference/atc22/presentation/he
- [8] Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. GPipe: Efficient training of giant neural networks using pipeline parallelism. Advances in Neural Information Processing Systems 32 (2019).
- [9] Zhihao Jia, Matei Zaharia, and Alex Aiken. 2018. Beyond Data and Model Parallelism for Deep Neural Networks. https://doi.org/10.48550/ARXIV.1807.05358
- [10] Andrej Karpathy. 2020. MinGPT Transformer model. https://github.com/karpathy/minGPT
- [11]
- [12] Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, and Zachary Tatlock. 2020. Dynamic Tensor Rematerialization. https://doi.org/10.48550/ARXIV.2006.09616
- [13]
- [14]
- [15] Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica. 2021. TeraPipe: Token-level pipeline parallelism for training large-scale language models. In International Conference on Machine Learning. PMLR, 6543–6552.
- [16] Ming Liu, Liang Luo, Jacob Nelson, Luis Ceze, Arvind Krishnamurthy, and Kishore Atreya. 2017. IncBricks: Toward in-network computation with an in-network cache. In Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems. 795–809.
- [17] Liang Luo, Peter West, Jacob Nelson, Arvind Krishnamurthy, and Luis Ceze. 2020. PLink: Discovering and exploiting locality for accelerated distributed training on the public cloud. Proceedings of Machine Learning and Systems 2 (2020), 82–97.
- [18] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2017. Mixed Precision Training. https://doi.org/10.48550/ARXIV.1710.03740
- [19] Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, et al.
- [20] High-performance, distributed training of large-scale deep learning recommendation models. arXiv preprint arXiv:2104.05158 (2021).
- [21] Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil R Devanur, Gregory R Ganger, Phillip B Gibbons, and Matei Zaharia. 2019. PipeDream: Generalized pipeline parallelism for DNN training. In Proceedings of the 27th ACM Symposium on Operating Systems Principles. 1–15.
- [22] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, et al. 2021. Efficient large-scale language model training on GPU clusters using Megatron-LM. In Proceedings of the International Conference for High Performance Computing, Network...
- [23] NVIDIA. 2023. The NVIDIA Collective Communication Library (NCCL). https://developer.nvidia.com/nccl
- [24] OpenAI. 2023. ChatGPT. https://chat.openai.com/
- [25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, Hi...
- [26] Team PyTorch. 2023. Distributed RPC Framework. https://pytorch.org/docs/stable/rpc.html
- [27] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21, 1 (2020), 5485–5551.
- [28] Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–16.
- [29] Jie Ren, Samyam Rajbhandari, Reza Yazdani Aminabadi, Olatunji Ruwase, Shuangyan Yang, Minjia Zhang, Dong Li, and Yuxiong He. 2021. ZeRO-Offload: Democratizing Billion-Scale Model Training. In USENIX Annual Technical Conference. 551–564.
- [30] Nick Schneider, Florian Piewak, Christoph Stiller, and Uwe Franke. 2017. RegNet: Multimodal sensor registration using deep neural networks. In 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 1803–1810.
- [31]
- [32] Yuanzhong Xu, HyoukJoong Lee, Dehao Chen, Blake Hechtman, Yanping Huang, Rahul Joshi, Maxim Krikun, Dmitry Lepikhin, Andy Ly, Marcello Maggioni, et al.
- [33] GSPMD: general and scalable parallelization for ML computation graphs. arXiv preprint arXiv:2105.04663 (2021).
- [34]
- [35] Buyun Zhang, Liang Luo, Xi Liu, Jay Li, Zeliang Chen, Weilin Zhang, Xiaohan Wei, Yuchen Hao, Michael Tsang, Wenjun Wang, Yang Liu, Huayu Li, Yasmine Badr, Jongsoo Park, Jiyan Yang, Dheevatsa Mudigere, and Ellie Wen. 2022. DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction. https://doi.org/10.48550/ARXIV.2203.11014
- [36]
- [37] Lianmin Zheng, Zhuohan Li, Hao Zhang, Yonghao Zhuang, Zhifeng Chen, Yanping Huang, Yida Wang, Yuanzhong Xu, Danyang Zhuo, Eric P Xing, et al.
- [38] Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). 559–578.
- [39]