pith. machine review for the scientific record.


Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

40 Pith papers cite this work. Polarity classification is still indexing.

abstract

Deep learning thrives with large neural networks and large datasets. However, larger networks and larger datasets result in longer training times that impede research and development progress. Distributed synchronous SGD offers a potential solution to this problem by dividing SGD minibatches over a pool of parallel workers. Yet to make this scheme efficient, the per-worker workload must be large, which implies nontrivial growth in the SGD minibatch size. In this paper, we empirically show that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization. Specifically, we show no loss of accuracy when training with large minibatch sizes up to 8192 images. To achieve this result, we adopt a hyper-parameter-free linear scaling rule for adjusting learning rates as a function of minibatch size and develop a new warmup scheme that overcomes optimization challenges early in training. With these simple techniques, our Caffe2-based system trains ResNet-50 with a minibatch size of 8192 on 256 GPUs in one hour, while matching small minibatch accuracy. Using commodity hardware, our implementation achieves ~90% scaling efficiency when moving from 8 to 256 GPUs. Our findings enable training visual recognition models on internet-scale data with high efficiency.
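
The two techniques named above are small enough to sketch. Below is a minimal, illustrative Python sketch of the linear scaling rule and gradual warmup; the function name and signature are ours (the authors' actual system is Caffe2-based), but the constants follow the paper's ResNet-50 recipe: a base learning rate of 0.1 at a reference minibatch size of 256, warmed up over the first 5 epochs.

    def scaled_learning_rate(epoch, batch_size,
                             base_lr=0.1, base_batch=256, warmup_epochs=5):
        # Linear scaling rule: when the minibatch size is multiplied by k,
        # multiply the learning rate by k.
        target_lr = base_lr * batch_size / base_batch
        if epoch < warmup_epochs:
            # Gradual warmup: ramp linearly from the base rate to the scaled
            # rate; pass a fractional epoch to ramp per iteration.
            return base_lr + (target_lr - base_lr) * (epoch / warmup_epochs)
        return target_lr

At the paper's largest setting (batch size 8192, i.e. k = 32), this gives a post-warmup learning rate of 0.1 × 32 = 3.2.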

hub tools

citation-role summary: background 1

citation-polarity summary: still indexing


representative citing papers

Segment Anything

cs.CV · 2023-04-05 · unverdicted · novelty 7.0

A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

Scalable Diffusion Models with Transformers

cs.CV · 2022-12-19 · unverdicted · novelty 7.0

DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Hypernetworks for Dynamic Feature Selection

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Hyper-DFS uses hypernetworks and Set Transformers to generate on-demand parameters for feature subsets in dynamic selection, outperforming prior methods on tabular data and showing stronger zero-shot generalization.

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon's orthogonalized updates, calibrated per layer for language models, yielding higher CIFAR-10 accuracy and lower language-model pre-training loss than Muon+Moonlight and AdamW (a generic sketch of this scaling pattern appears after this list).

Back to Basics: Let Denoising Generative Models Denoise

cs.CV · 2025-11-17 · unverdicted · novelty 6.0

Directly predicting clean data with large-patch pixel Transformers enables strong generative performance in diffusion models where noise prediction fails at high dimensions.

Mean Flows for One-step Generative Modeling

cs.LG · 2025-05-19 · unverdicted · novelty 6.0

MeanFlow uses a derived identity between average and instantaneous velocities to train one-step flow models, achieving FID 3.43 on ImageNet 256x256 with 1-NFE from scratch (the identity is written out after this list).

Vision Transformers Need Registers

cs.CV · 2023-09-28 · unverdicted · novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
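
The "Frobenius-norm trust-ratio layer-wise scaler" in the OrScale entry follows the same general pattern as LARS-style trust ratios. Below is a generic, hypothetical Python sketch of that pattern, not OrScale's actual algorithm; the function name is ours, and the orthogonalized update is assumed to be computed elsewhere.

    import torch

    def trust_ratio_step(weight, ortho_update, lr, eps=1e-8):
        # Layer-wise trust ratio: scale the orthogonalized update so its
        # magnitude is proportional to the layer's own Frobenius norm.
        ratio = weight.norm(p="fro") / (ortho_update.norm(p="fro") + eps)
        weight.add_(ortho_update, alpha=-(lr * ratio.item()))

The scale adapts per layer, so layers with large weight norms take proportionally larger steps than a single global learning rate would give them.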
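
For reference, the "derived identity between average and instantaneous velocities" in the MeanFlow entry can be written out; this is a reconstruction in our own notation, not a quotation from the paper. Let v(z_t, t) be the instantaneous velocity of the flow at z_t, and define the average velocity over [r, t] as

    u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_\tau, \tau) \, \mathrm{d}\tau .

Differentiating (t - r)\, u(z_t, r, t) with respect to t gives the identity

    u(z_t, r, t) = v(z_t, t) - (t - r) \, \frac{\mathrm{d}}{\mathrm{d}t}\, u(z_t, r, t) ,

which serves as a training target for a network that predicts u directly, so a sample is produced in one function evaluation (1-NFE).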
