pith. machine review for the scientific record.

arxiv: 1910.02054 · v3 · submitted 2019-10-04 · 💻 cs.LG · cs.DC · stat.ML

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models

Pith reviewed 2026-05-16 09:20 UTC · model grok-4.3

classification 💻 cs.LG · cs.DC · stat.ML
keywords memory optimization · distributed training · data parallelism · model scaling · optimizer states · large language models · trillion parameters · ZeRO

The pith

ZeRO partitions optimizer states and gradients across devices to remove memory redundancy in parallel training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training models with billions or trillions of parameters hits hard limits from the memory on each GPU. ZeRO addresses this by splitting the optimizer states, gradients, and parameters so each device holds only a portion instead of full redundant copies. The split keeps the volume of data exchanged between devices low and preserves fine-grained computation on each GPU. Model size can therefore increase in proportion to the number of devices while sustaining high efficiency. The paper's analysis indicates the method has the potential to scale beyond one trillion parameters on the hardware available at the time of writing.

Core claim

ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing the model size to scale proportional to the number of devices with sustained high efficiency. The approach has the potential to scale beyond one trillion parameters using today's hardware.

What carries the argument

The Zero Redundancy Optimizer (ZeRO) that partitions optimizer states, gradients, and parameters across data-parallel devices.
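The effect of partitioning each quantity can be sketched with a small calculator. It assumes the usual mixed-precision Adam accounting (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state), which is not spelled out on this page. The function name `zero_memory_bytes` and the stage numbering 0 to 3 are illustrative, and activations and temporary buffers are ignored.

```python
def zero_memory_bytes(num_params, num_devices, stage, k=12):
    """Per-device bytes for model states (a sketch, not the paper's code).

    Assumed accounting: 2 bytes/param fp16 weights, 2 bytes/param fp16
    gradients, k bytes/param optimizer state (k=12 for Adam with fp32
    weights, momentum, and variance).
    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states partitioned
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    params = 2 * num_params
    grads = 2 * num_params
    optim = k * num_params
    if stage >= 1:
        optim /= num_devices
    if stage >= 2:
        grads /= num_devices
    if stage >= 3:
        params /= num_devices
    return params + grads + optim


# Illustrative inputs: a 7.5B-parameter model on 64 devices.
for stage in range(4):
    gb = zero_memory_bytes(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per device")
```

Under these assumptions, only the stage that also partitions parameters makes per-device memory shrink fully in proportion to the device count.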

If this is right

  • Model size scales linearly with the number of devices.
  • Models up to 13 billion parameters can be trained without model parallelism.
  • Models with over 100 billion parameters train with super-linear speedup on 400 GPUs, reaching 15 petaflops of throughput.
  • This corresponds to an eightfold increase in model size and a tenfold increase in performance compared with prior state-of-the-art systems.
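The linear-scaling consequence can be illustrated with a back-of-the-envelope bound. This is a sketch under stated assumptions: 16 bytes of model state per parameter (mixed-precision Adam), all three quantities partitioned, and no allowance for activations, buffers, or fragmentation. The 32 GB device size and the name `max_params_fully_partitioned` are hypothetical choices for illustration.

```python
def max_params_fully_partitioned(device_memory_bytes, num_devices,
                                 bytes_per_param=16):
    """Upper bound on parameter count when model states are fully
    partitioned across devices (assumed 16 bytes of state per parameter).
    Ignores activations, temporary buffers, and fragmentation."""
    return device_memory_bytes * num_devices // bytes_per_param


device_memory = 32 * 10**9  # hypothetical 32 GB device
for n in (64, 256, 1024):
    limit = max_params_fully_partitioned(device_memory, n)
    print(f"{n:5d} devices: up to {limit / 1e9:.0f}B parameters")
```

Doubling the device count doubles the bound, which is the sense in which model size scales linearly with the number of devices.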

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same state-partitioning idea could be tested on inference workloads to lower memory needs for serving large models.
  • Combining ZeRO with newer interconnect hardware might push the practical limit well past current trillion-parameter targets.
  • Widespread use would let smaller research groups train models that previously required specialized large clusters.

Load-bearing premise

Splitting optimizer states and gradients across devices will not create communication or synchronization costs that grow faster than the memory savings.

What would settle it

A measurement on thousands of GPUs showing that extra communication time per step cancels the benefit of reduced per-device memory usage.
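The premise can be made concrete by comparing per-step communication volume with and without partitioning. The sketch below assumes the standard accounting in which a gradient all-reduce is a reduce-scatter plus an all-gather (2 elements moved per parameter) and full parameter partitioning adds one more all-gather (3 per parameter). These counts are assumptions not stated on this page, and `comm_volume_elements` is an illustrative name.

```python
def comm_volume_elements(num_params, stage):
    """Elements each device moves per training step (assumed accounting).

    stages 0-2: reduce-scatter + all-gather          -> 2 * num_params
    stage 3:    reduce-scatter + two parameter
                all-gathers (forward and backward)   -> 3 * num_params
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    factor = 3 if stage == 3 else 2
    return factor * num_params


baseline = comm_volume_elements(10**9, 0)
for stage in (1, 2, 3):
    ratio = comm_volume_elements(10**9, stage) / baseline
    print(f"stage {stage}: {ratio:.1f}x baseline communication volume")
```

Under this accounting the volume does not depend on the device count, so the open question is whether achieved bandwidth and latency hold up at scale, not whether the volume itself grows.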

Original abstract

Large deep learning models offer significant accuracy gains, but training billions to trillions of parameters is challenging. Existing solutions such as data and model parallelisms exhibit fundamental limitations to fit these models into limited device memory, while obtaining computation, communication and development efficiency. We develop a novel solution, Zero Redundancy Optimizer (ZeRO), to optimize memory, vastly improving training speed while increasing the model size that can be efficiently trained. ZeRO eliminates memory redundancies in data- and model-parallel training while retaining low communication volume and high computational granularity, allowing us to scale the model size proportional to the number of devices with sustained high efficiency. Our analysis on memory requirements and communication volume demonstrates: ZeRO has the potential to scale beyond 1 Trillion parameters using today's hardware. We implement and evaluate ZeRO: it trains large models of over 100B parameter with super-linear speedup on 400 GPUs, achieving throughput of 15 Petaflops. This represents an 8x increase in model size and 10x increase in achievable performance over state-of-the-art. In terms of usability, ZeRO can train large models of up to 13B parameters (e.g., larger than Megatron GPT 8.3B and T5 11B) without requiring model parallelism which is harder for scientists to apply. Last but not the least, researchers have used the system breakthroughs of ZeRO to create the world's largest language model (Turing-NLG, 17B parameters) with record breaking accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces ZeRO (Zero Redundancy Optimizer), a memory optimization framework that partitions optimizer states, gradients, and parameters across data-parallel processes to eliminate redundancies while preserving low communication volume and high computational granularity. It provides analytical bounds on memory and communication costs, projects scalability beyond 1 trillion parameters on current hardware, and reports an implementation that trains 100B-parameter models on 400 GPUs at 15 Petaflops throughput (8x larger models and 10x performance versus prior state-of-the-art), with additional usability results for models up to 13B parameters without model parallelism.

Significance. If the communication-volume analysis and overlap assumptions hold, ZeRO materially expands the feasible model size on existing GPU clusters by removing per-device redundancy without introducing super-linear communication costs, directly enabling the reported 100B-scale results and the subsequent training of the 17B Turing-NLG model. The combination of concrete throughput numbers, hardware scaling data, and an open implementation constitutes a practical contribution to distributed training systems.

major comments (2)
  1. [§4] §4 (Communication Volume Analysis): the projection that ZeRO-3 sustains high efficiency at 1T parameters with D=1024 devices rests on the assumption of near-ideal bandwidth utilization and perfect compute-communication overlap for the all-gather of the full parameter set (~2P bytes per step). The 400-GPU, 100B-parameter measurements do not reach this regime, so the analysis should include a sensitivity study for reduced effective bandwidth or increased latency.
  2. [§5.1, Table 3] §5.1, Table 3: the reported super-linear speedup for the 100B model is presented without an explicit baseline comparison (e.g., versus data-parallel only or versus Megatron at identical batch size), making it difficult to isolate the contribution of memory partitioning from other factors such as batch-size scaling.
minor comments (3)
  1. [Abstract] The abstract states 'super-linear speedup' but the main text should clarify the exact baseline configuration and whether larger batch sizes enabled by reduced memory footprint are included in the comparison.
  2. [§3] Notation for the three ZeRO stages (ZeRO-1/2/3) is introduced without a compact summary table relating each stage to the partitioned quantities (optimizer states, gradients, parameters); adding such a table would improve readability.
  3. [Figure 4] Figure 4 caption should explicitly state the GPU count and model size used for the throughput curve so readers can map it directly to the 400-GPU, 100B result.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive review and constructive comments on our ZeRO manuscript. We appreciate the recognition of the practical contributions and will revise the paper to address the major comments as detailed below.

Point-by-point responses
  1. Referee: [§4] §4 (Communication Volume Analysis): the projection that ZeRO-3 sustains high efficiency at 1T parameters with D=1024 devices rests on the assumption of near-ideal bandwidth utilization and perfect compute-communication overlap for the all-gather of the full parameter set (~2P bytes per step). The 400-GPU, 100B-parameter measurements do not reach this regime, so the analysis should include a sensitivity study for reduced effective bandwidth or increased latency.

    Authors: We agree that the current measurements at 400 GPUs do not fully cover the 1024-device, 1T-parameter regime and that a sensitivity study would strengthen the analysis. In the revised §4 we will add a sensitivity study that varies effective bandwidth utilization (50%, 75%, and 100% of peak) and introduces additional latency factors for the all-gather operations, reporting the resulting efficiency bounds. This will make the scalability projections more robust.
    revision: yes

  2. Referee: [§5.1, Table 3] §5.1, Table 3: the reported super-linear speedup for the 100B model is presented without an explicit baseline comparison (e.g., versus data-parallel only or versus Megatron at identical batch size), making it difficult to isolate the contribution of memory partitioning from other factors such as batch-size scaling.

    Authors: The super-linear speedup arises because ZeRO enables a significantly larger per-GPU batch size than standard data parallelism, which is memory-constrained for the 100B model. We will revise §5.1 to explicitly state the baseline configuration (standard data-parallel training with the largest feasible batch size) and clarify the speedup calculation. We will also add a textual comparison to published Megatron results for comparable model sizes, noting the differences in parallelism approach. This revision will better isolate the contribution of memory partitioning.
    revision: yes
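The sensitivity study proposed in the first response could take roughly the following shape. This is a minimal sketch, not the authors' method: it models unoverlapped communication time as payload divided by effective bandwidth plus a fixed latency term. The peak bandwidth, latency, model size, and the factor of 3 elements per parameter are all hypothetical inputs chosen for illustration.

```python
def comm_time_seconds(num_params, peak_bandwidth_bytes_per_s, utilization,
                      latency_s=0.0, bytes_per_element=2, volume_factor=3):
    """Unoverlapped per-step communication time under a simple
    bandwidth-plus-latency model (an assumption, not a measurement)."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    payload = volume_factor * num_params * bytes_per_element
    return payload / (peak_bandwidth_bytes_per_s * utilization) + latency_s


# Hypothetical inputs: 100B parameters, 100 GB/s peak, 1 ms total latency.
for utilization in (0.5, 0.75, 1.0):
    t = comm_time_seconds(100e9, 100e9, utilization, latency_s=1e-3)
    print(f"utilization {utilization:.0%}: {t:.3f} s per step")
```

Comparing these times against the measured compute time per step would show how much overlap is needed before communication becomes the bottleneck.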

Circularity Check

0 steps flagged

No significant circularity in ZeRO derivation chain

Full rationale

The paper's core claims rest on explicit analytical formulas for per-GPU memory (partitioned optimizer states, gradients, and parameters) and communication volume (all-gather/reduce-scatter costs) that are derived directly from the definitions of data-parallel and model-parallel partitioning. These expressions are not obtained by fitting parameters to the target 1T regime or by re-using self-citations as load-bearing premises; they are first-principles reductions from the ZeRO stage descriptions. Empirical results on 400 GPUs for 100B models serve as validation rather than inputs to the scaling projection. The single self-mention of Turing-NLG is post-hoc and does not support any equation or uniqueness argument inside the derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on standard distributed-systems assumptions about bandwidth and latency plus the new partitioning method itself; no fitted constants or new physical entities are introduced.

axioms (1)
  • domain assumption: Network bandwidth and latency remain sufficient for the reduced communication volume after partitioning
    Invoked in the communication-volume analysis and scaling claims.
invented entities (1)
  • ZeRO partitioning of optimizer states, gradients, and parameters (no independent evidence)
    purpose: Eliminate memory redundancy while keeping communication low
    The method is the primary contribution; no independent falsifiable prediction for a new physical quantity is given.

pith-pipeline@v0.9.0 · 5591 in / 1217 out tokens · 33836 ms · 2026-05-16T09:20:59.131371+00:00 · methodology

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Efficient Training on Multiple Consumer GPUs with RoundPipe

    cs.DC 2026-04 conditional novelty 8.0

    RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...

  2. VAnim: Rendering-Aware Sparse State Modeling for Structure-Preserving Vector Animation

    cs.CV 2026-05 unverdicted novelty 7.0

    VAnim creates open-domain text-to-SVG animations via sparse state updates on a persistent DOM tree, identification-first planning, and rendering-aware RL with a new 134k-example benchmark.

  3. The World is Not Mono: Enabling Spatial Understanding in Large Audio-Language Models

    cs.SD 2026-01 unverdicted novelty 7.0

    TWNM framework equips audio-language models with spatial scene analysis via FOA simulation and metadata-grounded training, reaching 70.8% accuracy on a new ASA benchmark.

  4. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  5. MinT: Managed Infrastructure for Training and Serving Millions of LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    MinT enables efficient management of million-scale LoRA-adapted LLM policies over shared 1T-parameter base models by moving only small adapters through training and serving pipelines.

  6. Symphony: Taming Step Misalignments in the Network for Ring-based Collective Operations

    cs.NI 2026-04 unverdicted novelty 6.0

    Symphony detects step misalignments in ring collectives via lightweight in-network tracking and mitigates them by throttling outpacing flows with congestion signals, yielding up to 54% better communication times in As...

  7. Switching Efficiency: A Novel Framework for Dissecting AI Data Center Network Efficiency

    cs.NI 2026-04 unverdicted novelty 6.0

    Introduces Switching Efficiency (η) decomposed into data, routing efficiency, and port utilization factors to analyze and improve communication bottlenecks in AI data center networks for LLM training.

  8. SweetSpot: An Analytical Model for Predicting Energy Efficiency of LLM Inference

    cs.AI 2026-02 unverdicted novelty 6.0

    SweetSpot is an analytical model from Transformer computational and memory complexity that identifies energy minima at short-to-moderate inputs and medium outputs, achieving 1.79% MAPE on H100 GPU measurements across ...

  9. MAGI-1: Autoregressive Video Generation at Scale

    cs.CV 2025-05 unverdicted novelty 6.0

    MAGI-1 is a 24B-parameter autoregressive video world model that predicts denoised frame chunks sequentially with increasing noise to enable causal, scalable, streaming generation up to 4M token contexts.

  10. Steering Llama 2 via Contrastive Activation Addition

    cs.CL 2023-12 unverdicted novelty 6.0

    Contrastive Activation Addition steers Llama 2 Chat by adding averaged residual-stream activation differences from contrastive example pairs to control targeted behaviors at inference time.

  11. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  12. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  13. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    cs.CL 2020-06 unverdicted novelty 6.0

    GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.

  14. UserGPT Technical Report

    cs.IR 2026-05 unverdicted novelty 5.0

    UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while...

  15. Unleashing Scalable Context Parallelism for Foundation Models Pre-Training via FCP

    cs.DC 2026-05 unverdicted novelty 5.0

    FCP shards sequences at block level with flexible P2P communication and bin-packing to achieve near-linear scaling up to 256 GPUs and 1.13x-2.21x higher attention MFU in foundation model pre-training.

  16. TACO: Efficient Communication Compression of Intermediate Tensors for Scalable Tensor-Parallel LLM Training

    cs.DC 2026-04 unverdicted novelty 5.0

    TACO compresses tensor-parallel intermediate tensors with an adaptive FP8 scheme and fused kernels, yielding up to 1.87X throughput gains on GPT and Qwen models with near-lossless accuracy.

  17. Movie Gen: A Cast of Media Foundation Models

    cs.CV 2024-10 unverdicted novelty 5.0

    A 30B-parameter transformer and related models generate high-quality videos and audio, claiming state-of-the-art results on text-to-video, video editing, personalization, and audio generation tasks.

  18. Navigating LLM Valley: From AdamW to Memory-Efficient and Matrix-Based Optimizers

    cs.LG 2026-05 unverdicted novelty 3.0

    This survey organizes LLM optimizer literature into categories and argues the field is shifting toward rigorous, multi-factor comparisons of convergence, memory, stability, and complexity.

  19. A Scalable Recipe on SuperMUC-NG Phase 2: Efficient Large-Scale Training of Language Models

    cs.DC 2026-05 unverdicted novelty 3.0

    A combined parallelism recipe on SuperMUC-NG Phase 2 delivers 10% of theoretical peak throughput for 175B models plus 93% weak and 82% strong scaling efficiency on 128 nodes using unmodified public software.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 19 Pith papers · 9 internal anchors

  1. [1]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018

  2. [2]

    Language models are unsupervised multitask learners

    Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. 2019

  3. [3]

    Megatron-lm: Training multi-billion parameter language models using model parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2019

  4. [4]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer, 2019

  5. [5]

    Mesh-TensorFlow: Deep Learning for Supercomputers

    Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, Ryan Sepassi, and Blake A. Hechtman. Mesh-tensorflow: Deep learning for supercomputers. CoRR, abs/1811.02084, 2018

  6. [6]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015

  7. [7]

    Training Deep Nets with Sublinear Memory Cost

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training deep nets with sublinear memory cost. CoRR, abs/1604.06174, 2016

  8. [8]

    An Empirical Model of Large-Batch Training

    Sam McCandlish, Jared Kaplan, Dario Amodei, and OpenAI Dota Team. An empirical model of large-batch training. CoRR, abs/1812.06162, 2018

  9. [9]

    Turing-nlg: A 17-billion-parameter language model by microsoft

    Microsoft. Turing-nlg: A 17-billion-parameter language model by microsoft. https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/, 2020

  10. [10]

    GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

    Yanping Huang, Youlong Cheng, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, and Zhifeng Chen. Gpipe: Efficient training of giant neural networks using pipeline parallelism. ArXiv, abs/1811.06965, 2018

  11. [11]

    PipeDream: Fast and Efficient Pipeline Parallel DNN Training

    Aaron Harlap, Deepak Narayanan, Amar Phanishayee, Vivek Seshadri, Nikhil R. Devanur, Gregory R. Ganger, and Phillip B. Gibbons. Pipedream: Fast and efficient pipeline parallel DNN training. CoRR, abs/1806.03377, 2018.

  12. [12]

    Pipedream: Generalized pipeline parallelism for dnn training

    Deepak Narayanan, Aaron Harlap, Amar Phanishayee, Vivek Seshadri, Nikhil Devanur, Greg Ganger, Phil Gibbons, and Matei Zaharia. Pipedream: Generalized pipeline parallelism for dnn training. In ACM Symposium on Operating Systems Principles (SOSP 2019), October 2019

  13. [13]

    Gist: Efficient data encoding for deep neural network training

    Animesh Jain, Amar Phanishayee, Jason Mars, Lingjia Tang, and Gennady Pekhimenko. Gist: Efficient data encoding for deep neural network training. In International Symposium on Computer Architecture (ISCA 2018), 2018

  14. [14]

    Checkmate: Breaking the memory wall with optimal tensor rematerialization

    Paras Jain, Ajay Jain, Aniruddha Nrusimha, Amir Gholami, Pieter Abbeel, Kurt Keutzer, Ion Stoica, and Joseph E. Gonzalez. Checkmate: Breaking the memory wall with optimal tensor rematerialization. ArXiv, abs/1910.02653, 2019

  15. [15]

    SuperNeurons: Dynamic GPU Memory Management for Training Deep Neural Networks

    Linnan Wang, Jinmian Ye, Yiyang Zhao, Wei Wu, Ang Li, Shuaiwen Leon Song, Zenglin Xu, and Tim Kraska. Superneurons: Dynamic GPU memory management for training deep neural networks. CoRR, abs/1801.04380, 2018

  16. [16]

    Training large neural networks with constant memory using a new execution algorithm

    Bharadwaj Pudipeddi, Maral Mesmakhosroshahi, Jinwen Xi, and Sujeeth Bharadwaj. Training large neural networks with constant memory using a new execution algorithm. ArXiv, abs/2002.05645, 2020

  17. [17]

    M. Rhu, N. Gimelshein, J. Clemons, A. Zulfiqar, and S. W. Keckler. vdnn: Virtualized deep neural networks for scalable, memory-efficient neural network design. In 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), pages 1–13, 2016

  18. [18]

    Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. CoRR, abs/1804.04235, 2018

  19. [19]

    Memory-efficient adaptive optimization for large-scale learning

    Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning. ArXiv, abs/1901.11150, 2019

  20. [20]

    Adaptive subgradient methods for online learning and stochastic optimization

    John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 12:2121–2159, July 2011

  21. [21]

    Large Batch Training of Convolutional Networks

    Yang You, Igor Gitman, and Boris Ginsburg. Scaling SGD batch size to 32k for imagenet training. CoRR, abs/1708.03888, 2017

  22. [22]

    Reducing BERT pre-training time from 3 days to 76 minutes

    Yang You, Jing Li, Jonathan Hseu, Xiaodan Song, James Demmel, and Cho-Jui Hsieh. Reducing BERT pre-training time from 3 days to 76 minutes. CoRR, abs/1904.00962, 2019

  23. [23]

    Mixed precision training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training, 2017

  24. [24]

    NVIDIA Tesla V100 GPU architecture

    NVIDIA Tesla V100 GPU architecture. http://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf, 2017. [Online, accessed 22-April-2020]

  25. [25]

    Automatic mixed-precision

    NVIDIA. Automatic mixed-precision. https://developer.nvidia.com/automatic-mixed-precision, 2019.

  26. [26]

    NVIDIA Clocks Worlds Fastest BERT Training Time

    Shar Narasimhan. NVIDIA Clocks Worlds Fastest BERT Training Time ... https://devblogs.nvidia.com/training-bert-with-gpus/, 2019. [Online; accessed 25-September-2019].

    Figure 2 configurations (columns: model size, ZeRO/Baseline, number of GPUs, MP, layers, hidden size, attention heads, batch size, total batch size):
      1.5B, ZeRO, 400, 1, 48, 1600, 16, 24, 9600
      1.5B, Baseline, 400, 2, 48, 1600, 16, 16, 3...