arXiv: 1910.02054 · v3 · submitted 2019-10-04 · cs.LG · cs.DC · stat.ML

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Samyam Rajbhandari , Jeff Rasley , Olatunji Ruwase , Yuxiong He

Pith reviewed 2026-05-16 09:20 UTC · model grok-4.3

keywords: memory optimization, distributed training, data parallelism, model scaling, optimizer states, large language models, trillion parameters, ZeRO

The pith

ZeRO partitions optimizer states, gradients, and parameters across devices to remove memory redundancy in data-parallel training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training models with billions or trillions of parameters hits hard limits from the memory on each GPU. ZeRO solves this by splitting the optimizer states, gradients, and parameters so each device holds only a portion instead of full redundant copies. The split keeps the volume of data exchanged between devices low and preserves fine-grained computation on each GPU. Model size can therefore increase in direct proportion to the number of devices without efficiency loss. Analysis shows the method supports training beyond one trillion parameters on hardware available today.

Core claim

ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing the model size to scale proportional to the number of devices with sustained high efficiency. The approach has the potential to scale beyond one trillion parameters using today's hardware.

What carries the argument

The Zero Redundancy Optimizer (ZeRO) that partitions optimizer states, gradients, and parameters across data-parallel devices.
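
A minimal sketch of the memory arithmetic behind this partitioning. It assumes the mixed-precision Adam accounting used in the paper's analysis (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and K = 12 for fp32 optimizer states); those constants are not stated on this page, and the function name `model_state_bytes` and the integer stage labels are illustrative only. Activations, temporary buffers, and fragmentation are ignored.

```python
def model_state_bytes(num_params, num_devices, stage, k=12):
    """Approximate per-device bytes of model state (weights, gradients, optimizer).

    stage 0: plain data parallelism, everything replicated
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    k is the assumed optimizer-state multiplier (12 for mixed-precision Adam).
    """
    params = 2 * num_params      # fp16 weights (assumed)
    grads = 2 * num_params       # fp16 gradients (assumed)
    optim = k * num_params       # fp32 optimizer states (assumed)
    if stage == 0:
        return params + grads + optim
    if stage == 1:
        return params + grads + optim / num_devices
    if stage == 2:
        return params + (grads + optim) / num_devices
    if stage == 3:
        return (params + grads + optim) / num_devices
    raise ValueError("stage must be 0, 1, 2, or 3")


if __name__ == "__main__":
    # Illustrative configuration: 7.5B parameters on 64 devices.
    for stage in range(4):
        gb = model_state_bytes(7_500_000_000, 64, stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, a 7.5B-parameter model on 64 devices needs about 120 GB of model state per device with plain data parallelism and under 2 GB once all three states are partitioned.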

If this is right

Model size scales linearly with the number of devices.
Models up to 13 billion parameters can be trained without model parallelism.
Models with over 100 billion parameters train with super-linear speedup on 400 GPUs, reaching 15 petaflops of throughput.
Trainable model size grows eightfold and achievable performance tenfold compared with prior state-of-the-art systems.
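
The linear-scaling consequence can be checked with back-of-envelope arithmetic. The sketch below assumes 16 bytes of model state per parameter (mixed-precision Adam, a figure not stated on this page), full partitioning of all three states, and a hypothetical 32 GB of memory per device. It counts model states only, so the result is an upper bound rather than a prediction.

```python
# Assumed bytes of model state per parameter:
# fp16 weights (2) + fp16 gradients (2) + fp32 optimizer states (12).
BYTES_PER_PARAM = 2 + 2 + 12


def max_params_fully_partitioned(num_devices, device_bytes):
    """Largest parameter count whose model states fit when every state is
    split evenly across num_devices (activations and buffers ignored)."""
    return (num_devices * device_bytes) // BYTES_PER_PARAM


if __name__ == "__main__":
    GB = 10**9
    for n in (64, 256, 1024):
        limit = max_params_fully_partitioned(n, 32 * GB)  # 32 GB is hypothetical
        print(f"{n:5d} devices: up to {limit / 1e9:.0f}B parameters")
```

Doubling the device count doubles the bound, which is the proportional scaling the list above describes; with 1024 such devices the bound lands at roughly two trillion parameters.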

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same state-partitioning idea could be tested on inference workloads to lower memory needs for serving large models.
Combining ZeRO with newer interconnect hardware might push the practical limit well past current trillion-parameter targets.
Widespread use would let smaller research groups train models that previously required specialized large clusters.

Load-bearing premise

Splitting optimizer states and gradients across devices will not create communication or synchronization costs that grow faster than the memory savings.

What would settle it

A measurement on thousands of GPUs showing that extra communication time per step cancels the benefit of reduced per-device memory usage.

Original abstract

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create the world's largest language model (Turing-NLG, 17B parameters) with record breaking accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ZeRO shows a workable partitioning scheme that cuts memory redundancy enough to train 100B models on 400 GPUs with measured speedups, though the trillion-parameter projection rests on untested communication assumptions at larger scale.

ZeRO's core idea is to partition optimizer states, gradients, and parameters across data-parallel ranks instead of replicating them. This removes the main memory duplication that limits model size in standard data parallelism while keeping the communication pattern close to what data-parallel training already does. The paper backs this with both a volume analysis and real runs that hit 15 petaflops on 400 GPUs for models above 100B parameters, plus an 8x model-size increase and 10x performance lift over prior state of the art. It also notes that the same system was used to train the 17B Turing-NLG model, and that models of up to 13B parameters can be trained without model parallelism.

Those measured results are the strongest part of the work; they are concrete and show the technique works in practice on existing hardware. The analysis of memory savings and communication volume is clear and supports the claim that model size can grow roughly with device count.

The soft spot is the jump to trillion parameters. The largest reported runs are of models over 100B parameters on 400 GPUs. At 1T parameters on thousands of devices, the parameter all-gather and gradient reduce-scatter steps of full ZeRO-3 partitioning move roughly 3Ψ elements per device per iteration for a model of Ψ parameters, about 1.5x the volume of standard data parallelism by the paper's own analysis. Any deviation from ideal bandwidth or overlap with compute could make communication the limiter, and the paper's 400-GPU data does not reach that regime. The assumption of sustained high efficiency therefore remains a projection rather than a demonstrated result.

This paper is for systems researchers and practitioners who train large models on GPU clusters. The empirical evidence is solid enough that it deserves a serious referee even if the extreme-scale claim needs additional validation.
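
The communication side of the letter can be made concrete with a small sketch. The per-stage volumes below follow the paper's communication analysis as summarized here (they are not stated elsewhere on this page): a gradient reduce-scatter and an all-gather cost Ψ elements each, and partitioning the parameters adds one more all-gather. The function name `comm_volume_elements` and the integer stage labels are illustrative only.

```python
def comm_volume_elements(num_params, stage):
    """Approximate elements each device communicates per training step.

    stage 0: plain data parallelism (all-reduce of gradients)
    stage 1/2: optimizer states (and gradients) partitioned
    stage 3: parameters partitioned as well
    """
    if stage in (0, 1, 2):
        # Gradient reduce-scatter (psi) plus one all-gather (psi).
        return 2 * num_params
    if stage == 3:
        # Parameter all-gather in forward (psi) and backward (psi),
        # plus gradient reduce-scatter (psi).
        return 3 * num_params
    raise ValueError("stage must be 0, 1, 2, or 3")


if __name__ == "__main__":
    psi = 10**12  # a one-trillion-parameter model
    baseline = comm_volume_elements(psi, 0)
    for stage in range(4):
        ratio = comm_volume_elements(psi, stage) / baseline
        print(f"stage {stage}: {ratio:.1f}x baseline volume")
```

The ratio stays at 1.0x until parameters are partitioned and rises to 1.5x at stage 3, which is why the trillion-parameter projection hinges on how well that extra traffic overlaps with compute.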

Referee Report

2 major / 3 minor

Summary. The manuscript introduces ZeRO (Zero Redundancy Optimizer), a memory optimization framework that partitions optimizer states, gradients, and parameters across data-parallel processes to eliminate redundancies while preserving low communication volume and high computational granularity. It provides analytical bounds on memory and communication costs, projects scalability beyond 1 trillion parameters on current hardware, and reports an implementation that trains 100B-parameter models on 400 GPUs at 15 Petaflops throughput (8x larger models and 10x performance versus prior state-of-the-art), with additional usability results for models up to 13B parameters without model parallelism.

Significance. If the communication-volume analysis and overlap assumptions hold, ZeRO materially expands the feasible model size on existing GPU clusters by removing per-device redundancy without introducing super-linear communication costs, directly enabling the reported 100B-scale results and the subsequent training of the 17B Turing-NLG model. The combination of concrete throughput numbers, hardware scaling data, and an open implementation constitutes a practical contribution to distributed training systems.

major comments (2)

[§4] Communication Volume Analysis: the projection that ZeRO-3 sustains high efficiency at 1T parameters with D=1024 devices rests on the assumption of near-ideal bandwidth utilization and perfect compute-communication overlap for the all-gather of the full parameter set (~2P bytes per step). The 400-GPU, 100B-parameter measurements do not reach this regime, so the analysis should include a sensitivity study for reduced effective bandwidth or increased latency.
[§5.1, Table 3] The reported super-linear speedup for the 100B model is presented without an explicit baseline comparison (e.g., versus data-parallel only or versus Megatron at identical batch size), making it difficult to isolate the contribution of memory partitioning from other factors such as batch-size scaling.

minor comments (3)

[Abstract] The abstract states 'super-linear speedup' but the main text should clarify the exact baseline configuration and whether larger batch sizes enabled by reduced memory footprint are included in the comparison.
[§3] Notation for the three ZeRO stages (ZeRO-1/2/3) is introduced without a compact summary table relating each stage to the partitioned quantities (optimizer states, gradients, parameters); adding such a table would improve readability.
[Figure 4] Figure 4 caption should explicitly state the GPU count and model size used for the throughput curve so readers can map it directly to the 400-GPU, 100B result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive review and constructive comments on our ZeRO manuscript. We appreciate the recognition of the practical contributions and will revise the paper to address the major comments as detailed below.

Referee: [§4] Communication Volume Analysis: the projection that ZeRO-3 sustains high efficiency at 1T parameters with D=1024 devices rests on the assumption of near-ideal bandwidth utilization and perfect compute-communication overlap for the all-gather of the full parameter set (~2P bytes per step). The 400-GPU, 100B-parameter measurements do not reach this regime, so the analysis should include a sensitivity study for reduced effective bandwidth or increased latency.

Authors: We agree that the current measurements at 400 GPUs do not fully cover the 1024-device, 1T-parameter regime and that a sensitivity study would strengthen the analysis. In the revised §4 we will add a sensitivity study that varies effective bandwidth utilization (50%, 75%, and 100% of peak) and introduces additional latency factors for the all-gather operations, reporting the resulting efficiency bounds. This will make the scalability projections more robust.
Revision: yes

Referee: [§5.1, Table 3] The reported super-linear speedup for the 100B model is presented without an explicit baseline comparison (e.g., versus data-parallel only or versus Megatron at identical batch size), making it difficult to isolate the contribution of memory partitioning from other factors such as batch-size scaling.

Authors: The super-linear speedup arises because ZeRO enables a significantly larger per-GPU batch size than standard data parallelism, which is memory-constrained for the 100B model. We will revise §5.1 to explicitly state the baseline configuration (standard data-parallel training with the largest feasible batch size) and clarify the speedup calculation. We will also add a textual comparison to published Megatron results for comparable model sizes, noting the differences in parallelism approach. This revision will better isolate the contribution of memory partitioning.
Revision: yes
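
The sensitivity study discussed in the first response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' method: it treats per-step communication time as volume divided by effective bandwidth, ignores latency and compute overlap, assumes a per-device volume of 3Ψ fp16 elements for full partitioning, and uses a hypothetical peak bandwidth of 100 GB/s per device. The function name `comm_seconds` is illustrative.

```python
def comm_seconds(volume_bytes, peak_bytes_per_s, utilization):
    """Time to move volume_bytes at a given fraction of peak bandwidth.

    Ignores latency and any overlap with compute, so it estimates total
    communication time rather than the exposed (non-overlapped) part.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return volume_bytes / (peak_bytes_per_s * utilization)


if __name__ == "__main__":
    psi = 10**12                 # one trillion parameters
    volume = 3 * psi * 2         # assumed: 3*psi fp16 elements, 2 bytes each
    peak = 100 * 10**9           # hypothetical 100 GB/s per device
    for utilization in (1.0, 0.75, 0.5):
        seconds = comm_seconds(volume, peak, utilization)
        print(f"{utilization:.0%} of peak: {seconds:.0f} s per step")
```

With these illustrative numbers the communication time per step grows from 60 s at full utilization to 80 s at 75% and 120 s at 50%; whether that matters depends on how much of it overlaps with compute, which is exactly what the requested study would need to measure.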

Circularity Check

0 steps flagged

No significant circularity in ZeRO derivation chain

The paper's core claims rest on explicit analytical formulas for per-GPU memory (partitioned optimizer states, gradients, and parameters) and communication volume (all-gather/reduce-scatter costs) that are derived directly from the definitions of data-parallel and model-parallel partitioning. These expressions are not obtained by fitting parameters to the target 1T regime or by re-using self-citations as load-bearing premises; they are first-principles reductions from the ZeRO stage descriptions. Empirical results on 400 GPUs for 100B models serve as validation rather than inputs to the scaling projection. The single self-mention of Turing-NLG is post-hoc and does not support any equation or uniqueness argument inside the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard distributed-systems assumptions about bandwidth and latency plus the new partitioning method itself; no fitted constants or new physical entities are introduced.

axioms (1)

domain assumption: Network bandwidth and latency remain sufficient for the reduced communication volume after partitioning
Invoked in the communication-volume analysis and scaling claims.

invented entities (1)

ZeRO partitioning of optimizer states, gradients, and parameters (no independent evidence)
purpose: Eliminate memory redundancy while keeping communication low
The method is the primary contribution; no independent falsifiable prediction for a new physical quantity is given.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Training on Multiple Consumer GPUs with RoundPipe
cs.DC 2026-04 conditional novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation
cs.CV 2026-05 unverdicted novelty 7.0

VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.
The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models
cs.SD 2026-01 unverdicted novelty 7.0

TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
MinT: Managed Infrastructure for Training and Serving Millions of LLMs
cs.LG 2026-05 unverdicted novelty 6.0

MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.
Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations
cs.NI 2026-04 unverdicted novelty 6.0

Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...
Switching Efficiency: A Novel Framework for Dissecting AI Data Center Network Efficiency
cs.NI 2026-04 unverdicted novelty 6.0

Introduces Switching Efficiency (η) decomposed into data, routing efficiency, and port utilization factors to analyze and improve communication bottlenecks in AI data center networks for LLM training.
SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference
cs.AI 2026-02 unverdicted novelty 6.0

SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across ...
MAGI-1: Autoregressive Video Generation at Scale
cs.CV 2025-05 unverdicted novelty 6.0

MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.
Steering Llama 2 via Contrastive Activation Addition
cs.CL 2023-12 unverdicted novelty 6.0

Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
cs.CL 2022-02 unverdicted novelty 6.0

ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
cs.CL 2020-06 unverdicted novelty 6.0

GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
UserGPT Technical Report
cs.IR 2026-05 unverdicted novelty 5.0

UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while...
Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP
cs.DC 2026-05 unverdicted novelty 5.0

FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.
TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training
cs.DC 2026-04 unverdicted novelty 5.0

TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.
Movie Gen: A Cast of Media Foundation Models
cs.CV 2024-10 unverdicted novelty 5.0

A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.
Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers
cs.LG 2026-05 unverdicted novelty 3.0

This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.
A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models
cs.DC 2026-05 unverdicted novelty 3.0

A combined parallelism recipe on SuperMUC-NG Phase 2 delivers 10% of theoretical peak throughput for 175B models plus 93% weak and 82% strong scaling efficiency on 128 nodes using unmodified public software.
