
arxiv: 1807.11205 · v1 · pith:LBT725URnew · submitted 2018-07-30 · cs.LG · cs.DC · stat.ML

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

classification cs.LG · cs.DC · stat.ML
keywords training · system · accuracy · minutes · achieve · deep · gpus · highly

Synchronized stochastic gradient descent (SGD) optimizers with data parallelism are widely used in training large-scale deep neural networks. Although larger mini-batch sizes can improve system scalability by reducing the communication-to-computation ratio, they may hurt the generalization ability of the models. To this end, we build a highly scalable deep learning training system for dense GPU clusters with three main contributions: (1) We propose a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy. (2) We propose an optimization approach for extremely large mini-batch sizes (up to 64k) that can train CNN models on the ImageNet dataset without losing accuracy. (3) We propose highly optimized all-reduce algorithms that achieve up to 3x and 11x speedup on AlexNet and ResNet-50, respectively, over NCCL-based training on a cluster with 1024 Tesla P40 GPUs. For ResNet-50 trained for 90 epochs, the state-of-the-art GPU-based system with 1024 Tesla P100 GPUs took 15 minutes and achieved 74.9% top-1 test accuracy, and another KNL-based system with 2048 Intel KNLs took 20 minutes and achieved 75.4% accuracy. Our training system achieves 75.8% top-1 test accuracy in only 6.6 minutes using 2048 Tesla P40 GPUs. When training AlexNet for 95 epochs, our system achieves 58.7% top-1 test accuracy within 4 minutes, which also outperforms all other existing systems.
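For readers unfamiliar with the setting, one step of synchronized data-parallel SGD can be sketched as follows. This is a minimal pure-Python illustration, not the paper's system: the function names, the toy one-parameter model, and the in-process `all_reduce_mean` stand-in for a real all-reduce are all assumptions made for the example. It shows that averaging per-worker gradients over equal-size shards gives the same update as a single worker processing the whole mini-batch, which is why adding workers effectively enlarges the mini-batch.

```python
# Sketch of one synchronized data-parallel SGD step (pure Python, no GPUs).
# All names are hypothetical; this is not the paper's implementation.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y ~ w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def all_reduce_mean(values):
    """Stand-in for an all-reduce: every worker receives the mean value."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def sync_sgd_step(w, shards, lr):
    """Workers compute shard gradients, average them, and apply one update."""
    grads = [local_gradient(w, shard) for shard in shards]
    reduced = all_reduce_mean(grads)
    # Every worker holds the same averaged gradient, so replicas stay in sync.
    return w - lr * reduced[0]

# Toy data generated from y = 3x, split into 4 equal-size shards.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]

w_parallel = sync_sgd_step(0.0, shards, lr=0.01)
w_single = 0.0 - 0.01 * local_gradient(0.0, data)
```

The equivalence relies on the shards being the same size; with unequal shards the average would need to be weighted by shard size.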

This paper has not been read by Pith yet.


Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  2. A General Language Assistant as a Laboratory for Alignment

    cs.CL 2021-12 conditional novelty 6.0

    Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

  3. Scaling Laws for Transfer

    cs.LG 2021-02 unverdicted novelty 6.0

    Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

  4. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes

    cs.LG 2019-04 conditional novelty 6.0

    LAMB optimizer trains BERT with batch size 32868, reducing training time to 76 minutes on TPUv3 Pod without performance loss.

  5. Fast Training of Sparse Graph Neural Networks on Dense Hardware

    stat.ML 2019-06 unverdicted novelty 5.0

    Techniques enable training the sparse GNN from Allamanis et al. [2018] on dense TPU hardware in 13 minutes versus a full day originally.

  6. Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

    cs.LG 2019-06 unverdicted novelty 5.0

    GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.