Revisiting Small Batch Training for Deep Neural Networks
Modern deep neural network training is typically based on mini-batch stochastic gradient optimization. While the use of large mini-batches increases the available computational parallelism, small batch training has been shown to provide improved generalization performance and allows a significantly smaller memory footprint, which might also be exploited to improve machine throughput. In this paper, we review common assumptions on learning rate scaling and training duration, as a basis for an experimental comparison of test performance for different mini-batch sizes. We adopt a learning rate that corresponds to a constant average weight update per gradient calculation (i.e., per unit cost of computation), and point out that this results in a variance of the weight updates that increases linearly with the mini-batch size $m$. The collected experimental results for the CIFAR-10, CIFAR-100 and ImageNet datasets show that increasing the mini-batch size progressively reduces the range of learning rates that provide stable convergence and acceptable test performance. On the other hand, small mini-batch sizes provide more up-to-date gradient calculations, which yields more stable and reliable training. The best performance has been consistently obtained for mini-batch sizes between $m = 2$ and $m = 32$, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.
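The abstract's claim about learning rate scaling can be checked numerically. The sketch below is an illustration, not the paper's experimental code: it assumes per-example gradients are i.i.d. Gaussian scalars, and the names `simulate_update`, `base_lr`, `grad_mean`, and `grad_std` are hypothetical. With the learning rate set to `base_lr * m`, the mean weight update per gradient calculation should stay roughly constant, while the variance of the weight update should grow roughly linearly with the mini-batch size $m$.

```python
import random
import statistics


def simulate_update(m, base_lr, grad_mean=1.0, grad_std=2.0, trials=5000, seed=0):
    """Simulate single SGD weight updates for mini-batch size m.

    Assumption (illustrative only): per-example gradients are i.i.d.
    Gaussian scalars with mean grad_mean and standard deviation grad_std.
    """
    rng = random.Random(seed)
    lr = base_lr * m  # linear scaling: learning rate proportional to m
    updates = []
    for _ in range(trials):
        grads = [rng.gauss(grad_mean, grad_std) for _ in range(m)]
        # Standard mini-batch SGD step: -lr * (average gradient over the batch)
        updates.append(-lr * sum(grads) / m)
    # Mean update divided by m gradient calculations (per unit of computation)
    mean_per_grad = statistics.fmean(updates) / m
    # Variance of the whole weight update across simulated batches
    variance = statistics.pvariance(updates)
    return mean_per_grad, variance


results = {m: simulate_update(m, base_lr=0.01) for m in (1, 2, 8, 32)}
for m, (mean_per_grad, variance) in results.items():
    print(f"m={m:2d}  mean update per gradient={mean_per_grad:+.5f}  "
          f"update variance={variance:.5f}")
```

Under these assumptions the expected update variance is `base_lr**2 * m * grad_std**2`, so the printed variance should scale approximately in proportion to `m` while the per-gradient mean stays near `-base_lr * grad_mean`.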
Forward citations
Cited by 3 Pith papers
- Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
  Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
- Behavior Score Prediction in Resting-State Functional MRI by Deep State Space Modeling
  A deep state space model on rs-fMRI time series predicts Alzheimer's behavior scores better than functional connectivity approaches and identifies key predictive brain regions.
- Algorithmic Advantage on a Gate-Based Photonic Quantum Neural Network
  A two-parameter photonic QNN achieves 100% accuracy on nonlinear tasks where a matched classical ANN saturates at random guessing, suggesting algorithmic advantage on current photonic hardware.