ImageNet Training in Minutes

· 2017 · cs.CV · arXiv 1709.05011

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

open full Pith review browse 6 citing papers arXiv PDF

abstract

Finishing 90-epoch ImageNet-1k training with ResNet-50 on a NVIDIA M40 GPU takes 14 days. This training requires 10^18 single precision operations in total. On the other hand, the world's current fastest supercomputer can finish 2 * 10^17 single precision operations per second (Dongarra et al 2017, https://www.top500.org/lists/2017/06/). If we can make full use of the supercomputer for DNN training, we should be able to finish the 90-epoch ResNet-50 training in one minute. However, the current bottleneck for fast DNN training is in the algorithm level. Specifically, the current batch size (e.g. 512) is too small to make efficient use of many processors. For large-scale DNN training, we focus on using large-batch data-parallelism synchronous SGD without losing accuracy in the fixed epochs. The LARS algorithm (You, Gitman, Ginsburg, 2017, arXiv:1708.03888) enables us to scale the batch size to extremely large case (e.g. 32K). We finish the 100-epoch ImageNet training with AlexNet in 11 minutes on 1024 CPUs. About three times faster than Facebook's result (Goyal et al 2017, arXiv:1706.02677), we finish the 90-epoch ImageNet training with ResNet-50 in 20 minutes on 2048 KNLs without losing accuracy. State-of-the-art ImageNet training speed with ResNet-50 is 74.9% top-1 test accuracy in 15 minutes. We got 74.9% top-1 test accuracy in 64 epochs, which only needs 14 minutes. Furthermore, when we increase the batch size to above 16K, our accuracy is much higher than Facebook's on corresponding batch sizes. Our source code is available upon request.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection

cs.DC · 2025-08-29 · unverdicted · novelty 6.0

Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.

Scaling Laws for Transfer

cs.LG · 2021-02-02 · unverdicted · novelty 6.0

Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.

ShardTensor: Domain Parallelism for Scientific Machine Learning

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

cs.LG · 2019-06-26 · unverdicted · novelty 5.0

GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.

citing papers explorer

Showing 6 of 6 citing papers.

Chameleon: Adaptive Fault Tolerance for Distributed Training via Real-time Policy Selection cs.DC · 2025-08-29 · unverdicted · none · ref 15 · internal anchor
Chameleon provides adaptive fault tolerance for distributed training by real-time selection of optimal recovery policies via a unified performance model, demonstrated with low overhead on a 32-card cluster.
Scaling Laws for Transfer cs.LG · 2021-02-02 · unverdicted · none · ref 140 · internal anchor
Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.
ShardTensor: Domain Parallelism for Scientific Machine Learning cs.DC · 2026-05-11 · unverdicted · none · ref 55
ShardTensor is a domain-parallelism system for SciML that enables flexible scaling of extreme-resolution spatial datasets by removing the constraint of batch size one per device.
Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 259
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 182
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD cs.LG · 2019-06-26 · unverdicted · none · ref 27 · internal anchor
GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.

ImageNet Training in Minutes

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer