GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Ankur Bapna; Dehao Chen; HyoukJoong Lee; Jiquan Ngiam; Mia Xu Chen; Orhan Firat; Quoc V. Le; Yanping Huang; Yonghui Wu; Youlong Cheng

arxiv: 1811.06965 · v5 · pith:GNKSV2C6new · submitted 2018-11-16 · 💻 cs.CV

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Mia Xu Chen, Dehao Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V. Le, Yonghui Wu, Zhifeng Chen

keywords: model, gpipe, different, neural, network, networks, parallelism, scaling

Scaling up deep neural network capacity has been known as an effective approach to improving model quality for several different machine learning tasks. In many cases, increasing model capacity beyond the memory limit of a single accelerator has required developing special algorithms or infrastructure. These solutions are often architecture-specific and do not transfer to other tasks. To address the need for efficient and task-independent model parallelism, we introduce GPipe, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers. By pipelining different sub-sequences of layers on separate accelerators, GPipe provides the flexibility of scaling a variety of different networks to gigantic sizes efficiently. Moreover, GPipe utilizes a novel batch-splitting pipelining algorithm, resulting in almost linear speedup when a model is partitioned across multiple accelerators. We demonstrate the advantages of GPipe by training large-scale neural networks on two different tasks with distinct network architectures: (i) Image Classification: We train a 557-million-parameter AmoebaNet model and attain a top-1 accuracy of 84.4% on ImageNet-2012, (ii) Multilingual Neural Machine Translation: We train a single 6-billion-parameter, 128-layer Transformer model on a corpus spanning over 100 languages and achieve better quality than all bilingual models.
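
The batch-splitting idea in the abstract can be illustrated with a small scheduling sketch. It is not the GPipe library's API: the function names `partition_layers`, `forward_schedule`, and `idle_fraction` are hypothetical, and the sketch assumes every stage takes one equal time slot per micro-batch and models the forward pass only. Under those assumptions, a mini-batch split into M micro-batches flows through K stages in M + K - 1 steps, so the idle share of the pipeline shrinks as M grows.

```python
from typing import List, Tuple


def partition_layers(num_layers: int, num_stages: int) -> List[range]:
    """Split layer indices into contiguous, near-equal stages (one per accelerator)."""
    base, extra = divmod(num_layers, num_stages)
    stages, start = [], 0
    for k in range(num_stages):
        size = base + (1 if k < extra else 0)  # first `extra` stages get one more layer
        stages.append(range(start, start + size))
        start += size
    return stages


def forward_schedule(num_stages: int, num_micro_batches: int) -> List[List[Tuple[int, int]]]:
    """For each time step, list the (stage, micro_batch) pairs that run concurrently.

    Assumption: stage k handles micro-batch m at step t = k + m.
    """
    steps = []
    for t in range(num_stages + num_micro_batches - 1):
        active = [(k, t - k) for k in range(num_stages)
                  if 0 <= t - k < num_micro_batches]
        steps.append(active)
    return steps


def idle_fraction(num_stages: int, num_micro_batches: int) -> float:
    """Share of (stage, step) slots left idle under the equal-cost assumption."""
    total_slots = num_stages * (num_stages + num_micro_batches - 1)
    busy_slots = num_stages * num_micro_batches
    return 1.0 - busy_slots / total_slots


if __name__ == "__main__":
    print(partition_layers(10, 4))
    for t, active in enumerate(forward_schedule(2, 3)):
        print(t, active)
    print(idle_fraction(4, 8), idle_fraction(4, 32))
```

With a single micro-batch only one stage is busy at a time; raising the number of micro-batches relative to the number of stages drives the idle fraction toward zero, which is the intuition behind the near-linear speedup the abstract reports.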

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Efficient Training on Multiple Consumer GPUs with RoundPipe
cs.DC 2026-04 conditional novelty 8.0

RoundPipe achieves near-zero-bubble pipeline parallelism for LLM training on consumer GPUs by dynamically dispatching computation stages round-robin, yielding 1.48-2.16x speedups and enabling 235B model fine-tuning on...
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC 2026-04 unverdicted novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models
cs.LG 2019-10 accept novelty 7.0

ZeRO removes memory redundancies in parallel training to scale deep learning models to over a trillion parameters with high throughput on current hardware.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL 2019-09 unverdicted novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
A Readiness-Driven Runtime for Pipeline-Parallel Training under Runtime Variability
cs.DC 2026-05 unverdicted novelty 6.0

RRFP introduces a readiness-driven runtime for pipeline parallelism that uses schedules as hints and ready-set arbitration to improve utilization under runtime variability, reporting up to 2.77x speedup on multimodal ...
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
cs.LG 2026-05 unverdicted novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
ShardTensor: Domain Parallelism for Scientific Machine Learning
cs.DC 2026-05 unverdicted novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
Piper: Efficient Large-Scale MoE Training via Resource Modeling and Pipelined Hybrid Parallelism
cs.DC 2026-05 unverdicted novelty 6.0

Piper introduces resource modeling and pipelined hybrid parallelism for MoE training, delivering 2-3.5X higher MFU than prior frameworks and 1.2-9X better all-to-all bandwidth.
Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection
cs.DC 2025-08 unverdicted novelty 6.0

Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG 2023-09 accept novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
GSPMD: General and Scalable Parallelization for ML Computation Graphs
cs.DC 2021-05 unverdicted novelty 6.0

GSPMD automatically infers tensor partitioning from limited user annotations to parallelize single-device ML programs across thousands of TPUs, reporting 50-62% utilization for up to trillion-parameter models.
Scaling Laws for Transfer
cs.LG 2021-02 unverdicted novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
Sharpness-Aware Minimization for Efficiently Improving Generalization
cs.LG 2020-10 conditional novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark
cs.CV 2019-10 accept novelty 6.0

VTAB is a 19-task benchmark that measures representation quality by few-shot adaptation performance across diverse vision domains, with a controlled large-scale comparison of popular pretraining methods.
Mosaic: Towards Efficient Training of Multimodal Models with Spatial Resource Multiplexing
cs.DC 2026-05 unverdicted novelty 5.0

Apollo uses temporal-spatial multiplexing and a performance model to let multiple multimodal model modules share GPUs, delivering up to 1.31x training speedup in testbed experiments.
Kimi K2.5: Visual Agentic Intelligence
cs.CL 2026-02 unverdicted novelty 5.0

Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
PRISM: Probabilistic Runtime Insights and Scalable Performance Modeling for Large-Scale Distributed Training
cs.DC 2025-10 unverdicted novelty 5.0

PRISM introduces a probabilistic performance modeling framework that quantifies guarantees on training time for large-scale distributed systems under runtime variability.