Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD

Kosuke Haruki; Masahiro Ozawa; Mitsuhiro Kimura; Ryuji Sakai; Taiji Suzuki; Takeshi Toda; Yohei Hamakawa

arxiv: 1906.10822 · v1 · pith:JUV2VY6Fnew · submitted 2019-06-26 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 15:39 UTC · model grok-4.3

keywords: gradient noise convolution · large-batch SGD · distributed training · sharp minima · generalization · loss smoothing · data-parallel optimization

The pith

Gradient noise from stochastic gradients smooths sharp loss minima in large-batch distributed SGD.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large-batch SGD trains models faster in distributed settings but tends to converge to sharp minima that generalize poorly. The paper proposes gradient noise convolution (GNC), which treats the variation among stochastic gradients computed on parallel workers as a noise distribution and convolves it with the loss surface. Because this noise is claimed to spread preferentially along directions of high curvature, the convolution is expected to flatten sharp regions more effectively than isotropic random perturbations. The method requires no new hyperparameters and is realized by computing the stochastic gradient on each worker and merging the results. The paper reports state-of-the-art generalization for large-scale deep neural network training.

Core claim

GNC utilizes so-called gradient noise, which is induced by stochastic gradient variation and convolved to the loss function as a smoothing effect. Due to convolving with the gradient noise, which tends to spread along a sharper direction of the loss function, GNC can effectively smooth sharp minima and achieve better generalization, whereas isotropic random noise cannot.

What carries the argument

Gradient noise convolution (GNC): the operation that convolves the loss with the empirical distribution of gradient differences across parallel workers.
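As a worked sketch of what convolving with an empirical distribution means, assume $M$ workers with per-worker gradients $g_i$ and noise vectors $\xi_i$ defined as scaled deviations from the merged gradient. The scale $\eta$ and the symbols $\xi_i$, $\hat p$, and $\tilde f$ are notation introduced here for illustration, not taken from the paper.

```latex
\[
\bar g = \frac{1}{M}\sum_{i=1}^{M} g_i, \qquad
\xi_i = \eta\,(g_i - \bar g), \qquad
\hat p = \frac{1}{M}\sum_{i=1}^{M} \delta_{\xi_i},
\]
\[
\tilde f(x) = (f * \hat p)(x) = \frac{1}{M}\sum_{i=1}^{M} f(x - \xi_i), \qquad
\nabla \tilde f(x) = \frac{1}{M}\sum_{i=1}^{M} \nabla f(x - \xi_i).
\]
```

Under this reading, the gradient of the smoothed loss is an average of ordinary gradients evaluated at shifted points, which is consistent with the description that no separate noise sampling is needed.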

If this is right

Large-batch training can reach flatter minima without changing batch size or adding explicit regularizers.
Implementation reduces to computing per-worker stochastic gradients and merging them, adding negligible overhead.
The same mechanism explains why isotropic noise injection fails to match GNC performance.
State-of-the-art test accuracy is reported for large-scale deep-network training under data-parallel SGD.
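The flattening effect described in this list can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's algorithm: the loss, the per-worker gradient values, and the helper names (`loss`, `gradient_noise`, `smoothed_loss`, `curvature`) are all hypothetical, and the noise scale of 1.0 is an assumption. It only shows that averaging a loss over points shifted by the workers' gradient deviations lowers the curvature at a sharp minimum.

```python
import math

def loss(x, width=0.1):
    """Toy 1-D loss with a sharp minimum at x = 0 (hypothetical, not from the paper)."""
    return -math.exp(-x * x / (2.0 * width * width))

def gradient_noise(worker_grads, scale=1.0):
    """Deviation of each worker's gradient from the merged (mean) gradient."""
    mean = sum(worker_grads) / len(worker_grads)
    return [scale * (g - mean) for g in worker_grads]

def smoothed_loss(x, noise):
    """Convolution of the loss with the empirical noise distribution."""
    return sum(loss(x - xi) for xi in noise) / len(noise)

def curvature(f, x, h=1e-3):
    """Central finite-difference estimate of the second derivative."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / (h * h)

# Hypothetical per-worker gradients (illustrative values only).
worker_grads = [0.85, 0.95, 1.05, 1.15]
noise = gradient_noise(worker_grads)  # deviations sum to zero by construction

sharp = curvature(loss, 0.0)
flat = curvature(lambda x: smoothed_loss(x, noise), 0.0)
print(f"curvature at the minimum: original={sharp:.1f}, smoothed={flat:.1f}")
```

For a purely quadratic loss the same convolution would only shift the loss by a constant, so the toy uses a narrow well, where the shape actually changes.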

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The directional bias of gradient noise may generalize to other first-order methods that already compute multiple gradient samples.
If the smoothing effect is truly curvature-dependent, similar gains might appear in non-distributed settings by deliberately sampling gradient directions along estimated Hessian eigenvectors.
Theoretical work could quantify how the covariance of per-worker gradients relates to the local Hessian eigenvalues.
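The relation between gradient covariance and curvature mentioned in the last item can be checked by hand in a toy least-squares model. The sketch below assumes synthetic two-feature data with independent, zero-mean features and residuals independent of the inputs; in that setting the gradient-noise variance along each axis is roughly proportional to the curvature along it. It uses per-sample gradients, whereas a per-worker gradient averages a mini-batch, which shrinks the covariance but keeps its shape. All names and values are illustrative, and nothing here establishes the claim for deep networks.

```python
import random

rng = random.Random(0)
n = 4000
w_true = (1.0, -2.0)

# Hypothetical data: feature 0 has a large scale (sharp direction of the loss),
# feature 1 a small scale (flat direction). Features are independent and zero-mean,
# so the coordinate axes approximate the Hessian eigenvectors.
xs = [(rng.gauss(0.0, 3.0), rng.gauss(0.0, 0.3)) for _ in range(n)]
ys = [x[0] * w_true[0] + x[1] * w_true[1] + rng.gauss(0.0, 0.5) for x in xs]

def per_sample_gradients(w):
    """Gradient of 0.5 * (x.w - y)^2 with respect to w, one vector per sample."""
    grads = []
    for x, y in zip(xs, ys):
        residual = x[0] * w[0] + x[1] * w[1] - y
        grads.append((residual * x[0], residual * x[1]))
    return grads

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

# Diagonal of the Hessian of the mean loss: (1/n) * sum_i x_ik^2.
curvature = [sum(x[k] ** 2 for x in xs) / n for k in range(2)]

# Per-coordinate variance of the per-sample gradients at w_true,
# which is close to the minimizer of the empirical loss.
grads = per_sample_gradients(w_true)
noise_var = [variance([g[k] for g in grads]) for k in range(2)]

for k in range(2):
    print(f"axis {k}: curvature={curvature[k]:.3f}, gradient-noise variance={noise_var[k]:.4f}")
```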

Load-bearing premise

Stochastic gradient variation produces noise whose directional statistics align with and preferentially smooth the sharper directions of the loss surface without extra tuning.

What would settle it

An ablation in which replacing the observed gradient noise with isotropic random noise of matched magnitude yields equal or better generalization and sharpness metrics.
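A minimal sketch of how the matched-magnitude control in such an ablation might be built is given below. It assumes "matched magnitude" means equal mean squared norm, which is one possible reading; the function names and the synthetic "observed" noise are hypothetical. The isotropic control is drawn from a standard Gaussian and rescaled so its mean squared norm equals that of the observed set.

```python
import math
import random

def mean_sq_norm(vectors):
    """Average squared Euclidean norm of a list of vectors."""
    return sum(sum(c * c for c in v) for v in vectors) / len(vectors)

def matched_isotropic_noise(reference, rng):
    """Isotropic Gaussian vectors rescaled to the reference set's mean squared norm."""
    dim = len(reference[0])
    raw = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in reference]
    scale = math.sqrt(mean_sq_norm(reference) / mean_sq_norm(raw))
    return [[scale * c for c in v] for v in raw]

def axis_variances(vectors):
    """Variance of each coordinate across the set of vectors."""
    dim, n = len(vectors[0]), len(vectors)
    means = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return [sum((v[k] - means[k]) ** 2 for v in vectors) / n for k in range(dim)]

rng = random.Random(0)
# Hypothetical anisotropic "gradient noise": large spread on axis 0, small on axis 1.
observed = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 0.1)] for _ in range(5000)]
control = matched_isotropic_noise(observed, rng)

obs_var = axis_variances(observed)
ctl_var = axis_variances(control)
print("observed axis-variance ratio:", obs_var[0] / obs_var[1])
print("control axis-variance ratio:", ctl_var[0] / ctl_var[1])
```

In this axis-aligned toy the ratio of axis variances plays the role of a condition number: large for the observed noise, close to one for the control, while the overall magnitude is identical.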

Figures

Figures reproduced from arXiv: 1906.10822 by Kosuke Haruki, Masahiro Ozawa, Mitsuhiro Kimura, Ryuji Sakai, Taiji Suzuki, Takeshi Toda, Yohei Hamakawa.

**Figure 2.** (a) Convolution through iterations: the SGD update rule can be viewed as the GD update rule with noise convolution through training iterations. (b) Convolution among workers: the GNC update rule can be viewed as the DP-SGD update rule with noise convolution among parallel workers.

**Figure 3.** (a) Validation accuracy of training with and without GNC. Applying GNC improved validation accuracy. (b) Cosine similarity between the full gradient (FG) and large-batch gradient. The learning rate was step-decayed at the 80th and 120th epochs. Concurrently, cosine similarity significantly dropped. Note that to eliminate side-effects of noise convolution, the full gradient was calculated without GNC. In co…

**Figure 4.** (a) Upper: Condition numbers for GNC and RNC noise covariance through the training process. We calculated condition numbers for each ResNet32 layer and plot lines layer-by-layer, so different lines indicate different layers. Condition numbers for GNC are larger than those for RNC, so it is obvious that noise in GNC is highly anisotropic and noise in RNC is isotropic. (a) Lower: Eigenvalues of GNC noise cov…

**Figure 5.** Smoothness of the loss function with and without GNC. We calculated variation (shaded …

**Figure 6.** (a) Learning rate scheduling for CIFAR-10/CIFAR-100. We adopted gradual warmup over …

**Figure 7.** (a) Validation accuracies of training with and without GNC for the varieties in Sec. 3.4; results for CIFAR-100 with batch size 8,192. (b) Cosine similarity between full gradient (FG) and large-batch gradient as calculated by the models in (a). Predictiveness of the full gradient dropped at step decay in all test cases.

**Figure 8.** We measured losses of models x^i_t in …

**Figure 9.** Smoothness of the loss function with and without GNC for the varieties in Sec. 3.4. Calculations …

**Figure 10.** ImageNet training loss curve with batch sizes of 32,768 and 131,072. In both cases, GNC …

**Figure 11.** CIFAR-100 training loss curve with batch sizes of 4,096 and 8,192. Besides the smoothing …

Original abstract

Large-batch stochastic gradient descent (SGD) is widely used for training in distributed deep learning because of its training-time efficiency; however, extremely large-batch SGD leads to poor generalization and easily converges to sharp minima, which prevents naive large-scale data-parallel SGD (DP-SGD) from converging to good minima. To overcome this difficulty, we propose gradient noise convolution (GNC), which effectively smooths sharper minima of the loss function. For DP-SGD, GNC utilizes so-called gradient noise, which is induced by stochastic gradient variation and convolved to the loss function as a smoothing effect. GNC computation can be performed by simply computing the stochastic gradient on each parallel worker and merging them, and is therefore extremely easy to implement. Due to convolving with the gradient noise, which tends to spread along a sharper direction of the loss function, GNC can effectively smooth sharp minima and achieve better generalization, whereas isotropic random noise cannot. We empirically show this effect by comparing GNC with isotropic random noise, and show that it achieves state-of-the-art generalization performance for large-scale deep neural network optimization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GNC is a straightforward way to repurpose existing stochastic gradient variation as a directional smoother in distributed large-batch training, but the key claim about noise aligning with sharp curvature directions has no derivation or isolating experiment in the provided text.

The paper's core contribution is Gradient Noise Convolution, which takes the natural differences among stochastic gradients computed on parallel workers and uses them to smooth the loss surface during large-batch DP-SGD. The implementation is genuinely simple: each worker computes its local gradient, they are merged in the usual way, and the variation among those gradients supplies the convolution effect without extra hyperparameters or separate noise sampling. That part is practical and low-cost, which matters for people already running data-parallel training at scale. The authors also report that this beats explicit isotropic noise on generalization, which is the main empirical claim. If the experiments hold up with proper controls, that would be useful for anyone stuck with the generalization penalty of very large batches.

The soft spot is the central mechanism. The abstract states that gradient noise tends to spread along sharper directions of the loss and therefore smooths them preferentially, but supplies no derivation from the mini-batch sampling process and no controlled test that separates this anisotropy from other large-batch effects such as effective learning-rate scaling. Without that, the advantage over isotropic noise could be coming from something else entirely. The stress-test concern about unproven directional alignment therefore lands directly on the argument as written.

The work is aimed at practitioners scaling deep networks across many GPUs who need a drop-in change rather than a new optimizer. It deserves a serious referee because the implementation cost is near zero and the empirical comparison is at least framed clearly, even if the theory needs tightening. I would send it out for review rather than desk-reject, with the expectation that the authors will have to supply the missing justification or stronger ablations.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Gradient Noise Convolution (GNC) for distributed large-batch SGD. GNC convolves the loss function with gradient noise induced by stochastic gradient variation across parallel workers, claiming this preferentially smooths sharp minima (because the noise covariance aligns with high-curvature directions) and yields better generalization than isotropic random noise. The method requires no extra hyperparameters and is implemented by simply computing and merging per-worker stochastic gradients; the abstract states that this achieves state-of-the-art generalization for large-scale DNN training.

Significance. If the directional alignment between mini-batch gradient noise and loss curvature is rigorously shown, GNC would supply a parameter-free mechanism that exploits existing stochasticity to close the generalization gap of large-batch training, with direct relevance to distributed deep-learning systems.

major comments (2)

[Abstract] The central claim that gradient noise 'tends to spread along a sharper direction of the loss function' (thereby smoothing sharp minima more effectively than isotropic noise) is load-bearing yet unsupported by any derivation from the mini-batch sampling process or by a controlled experiment that isolates noise-covariance anisotropy from other large-batch effects such as effective step-size scaling.
[Abstract] The assertion of 'state-of-the-art generalization performance' and empirical superiority over isotropic random noise is stated without any reported metrics, baselines, datasets, model architectures, or ablation details, preventing assessment of whether the claimed directional smoothing actually occurs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to strengthen the presentation of the central claims.

Referee: [Abstract] The central claim that gradient noise 'tends to spread along a sharper direction of the loss function' (thereby smoothing sharp minima more effectively than isotropic noise) is load-bearing yet unsupported by any derivation from the mini-batch sampling process or by a controlled experiment that isolates noise-covariance anisotropy from other large-batch effects such as effective step-size scaling.

Authors: We agree that the abstract would benefit from explicit support for this claim. The manuscript provides an empirical demonstration via direct comparison of GNC against isotropic random noise under matched conditions; this comparison is intended to isolate the benefit of the noise covariance structure. To make the argument more rigorous, we will add a short derivation in the revised manuscript showing how the per-worker gradient covariance aligns with the Hessian eigenvectors of high-curvature directions. We will also clarify that the experimental controls already hold effective step size and batch size fixed across the GNC and isotropic-noise arms.

Revision: yes

Referee: [Abstract] The assertion of 'state-of-the-art generalization performance' and empirical superiority over isotropic random noise is stated without any reported metrics, baselines, datasets, model architectures, or ablation details, preventing assessment of whether the claimed directional smoothing actually occurs.

Authors: We acknowledge that the abstract is too terse on these points. The full manuscript reports concrete results on standard large-scale image-classification benchmarks (ImageNet, CIFAR-10/100) with ResNet and VGG architectures, comparing against both naive large-batch DP-SGD and isotropic-noise baselines. We will expand the abstract to include the key quantitative improvements (e.g., top-1 accuracy deltas) and the primary experimental settings so that the claims can be assessed without reading the full text.

Revision: yes

Circularity Check

0 steps flagged

No circularity: method is direct computation on existing gradients; directional claim is heuristic, not derived by construction.

The paper defines GNC explicitly as merging stochastic gradients across workers and presents the smoothing benefit as an empirical observation from the noise's directional spread. No equations, fitted parameters, or self-citations are shown reducing the claimed advantage to an input quantity by definition. The abstract and description treat the anisotropy as an observed tendency rather than a self-referential prediction, leaving the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven directional alignment of stochastic gradient noise with sharper loss directions and on standard assumptions of SGD convergence behavior; no free parameters or new entities with independent evidence are introduced in the abstract.

axioms (1)

Domain assumption: Stochastic gradient noise tends to spread along sharper directions of the loss function.
Invoked to explain why convolution smooths sharp minima preferentially; appears in the abstract description of the mechanism.

invented entities (1)

Gradient Noise Convolution (GNC): no independent evidence
Purpose: Smoothing operator that uses per-worker stochastic gradients to convolve noise into the loss
New named technique introduced by the paper; no independent evidence outside the claimed empirical results is supplied.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages · 20 internal anchors

[1]

ChainerMN: Scalable Distributed Deep Learning Framework

Takuya Akiba, Keisuke Fukuda, and Shuji Suzuki. ChainerMN: Scalable Distributed Deep Learning Framework. arXiv e-prints, art. arXiv:1710.11351, Oct 2017

[2]

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: training resnet-50 on imagenet in 15 minutes. CoRR, abs/1711.04325, 2017. URL http://arxiv.org/abs/1711.04325

[3]

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv e-prints, art. arXiv:1710.11029, Oct 2017

[4]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. CoRR, abs/1611.01838, 2016. URL http://arxiv.org/abs/1611.01838

[5]

Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train

Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv e-prints, art. arXiv:1711.04291, Nov 2017

[6]

Escaping saddles with stochastic gradients

Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1155–1164, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR....

[7]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677

[8]

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385

[9]

FireCaffe: near-linear acceleration of deep neural network training on compute clusters

Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. CoRR, abs/1511.00175, 2015. URL http://arxiv.org/abs/1511.00175

[10]

On the relation between the sharpest directions of DNN loss and the SGD step length

Stanisław Jastrzębski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkgEaj05t7

[11]

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. CoRR, abs/1807.11205, 2018

[13]

URL http://arxiv.org/abs/1609.04836

[14]

Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2698–2707, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://...

[15]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009

[16]

Visualizing the Loss Landscape of Neural Nets

Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. CoRR, abs/1712.09913, 2017. URL http://arxiv.org/abs/1712.09913

[17]

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks. arXiv e-prints, art. arXiv:1811.12019, Nov 2018

[18]

ImageNet Large Scale Visual Recognition Challenge

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, Dec 2015. ISSN 1573-... doi: 10.1007/s11263-015-0816-y. URL https://doi.org/10.1007/s11263-015-0816-y

[20]

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

Levent Sagun, Utku Evci, V. Ugur Güney, Yann Dauphin, and Léon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. CoRR, abs/1706.04454, 2017. URL http://arxiv.org/abs/1706.04454

[21]

How Does Batch Normalization Help Optimization?

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How Does Batch Normalization Help Optimization? arXiv e-prints, art. arXiv:1805.11604, May 2018

[22]

Chainer : a next-generation open source framework for deep learning

Seiya Tokui and Kenta Oono. Chainer : a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS-W), 2015

[23]

SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning

Wei Wen, Yandan Wang, Feng Yan, Cong Xu, Yiran Chen, and Hai Li. Smoothout: Smoothing out sharp minima for generalization in large-batch deep learning. CoRR, abs/1805.07898, 2018. URL http://arxiv.org/abs/1805.07898

[24]

Interplay between optimization and generalization of stochastic gradient descent with covariance noise

Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. Interplay between optimization and generalization of stochastic gradient descent with covariance noise. CoRR, abs/1902.08234, 2019. URL http://arxiv.org/abs/1902.08234

[25]

Image Classification at Supercomputer Scale

Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. CoRR, abs/1811.06992, 2018. URL http://arxiv.org/abs/1811.06992

[26]

Large Batch Training of Convolutional Networks

Yang You, Igor Gitman, and Boris Ginsburg. Large Batch Training of Convolutional Networks. arXiv e-prints, art. arXiv:1708.03888, Aug 2017

[27]

ImageNet Training in Minutes

Yang You, Zhao Zhang, Cho-Jui Hsieh, and James Demmel. Imagenet training in minutes. CoRR, abs/1709.05011v10, 2018. URL https://arxiv.org/abs/1709.05011v10

[28]

Random Erasing Data Augmentation

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. CoRR, abs/1708.04896, 2017. URL http://arxiv.org/abs/1708.04896

[29]

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects. ArXiv e-prints, art. arXiv:1803.00195, February 2018.

[30]

... and Ying et al. [23] did not reveal the details of their experimental settings, so those results are beyond the scope of this paper.

|  | Epochs | Batch size | Accuracy (%) |
| --- | --- | --- | --- |
| Goyal et al. [7] | 90 | 32,768 | 72.45 |
| Akiba et al. [2] | 90 | 32,768 | 74.94 |
| Codreanu et al. [5] | 100 | 32,768 | 75.31 |
| You et al. [25] | 90 | 32,768 | 75.4 |
| Jia et al. [11] | 90 | 65,536 | 76.2 |
| ∗ Ying et al. [23] | 90 | 32,768 | ... |

[1] [1]

ChainerMN: Scalable Distributed Deep Learning Framework

Takuya Akiba, Keisuke Fukuda, and Shuji Suzuki. ChainerMN: Scalable Distributed Deep Learning Framework. arXiv e-prints, art. arXiv:1710.11351, Oct 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[2] [2]

Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes

Takuya Akiba, Shuji Suzuki, and Keisuke Fukuda. Extremely large minibatch SGD: training resnet-50 on imagenet in 15 minutes. CoRR, abs/1711.04325, 2017. URL http://arxiv.org/abs/1711.04325

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks

Pratik Chaudhari and Stefano Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. arXiv e-prints, art. arXiv:1710.11029, Oct 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[4] [4]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer T. Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. CoRR, abs/1611.01838, 2016. URL http://arxiv.org/abs/1611.01838

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train

Valeriu Codreanu, Damian Podareanu, and Vikram Saletore. Scale out for large minibatch SGD: Residual network training on ImageNet-1K with improved accuracy and reduced time to train. arXiv e-prints, art. arXiv:1711.04291, Nov 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[6] [6]

Escaping saddles with stochastic gradients

Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning , volume 80 of Proceedings of Machine Learning Research , pages 1155–1164, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR....

work page 2018

[7] [7]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training imagenet in 1 hour. CoRR, abs/1706.02677, 2017. URL http://arxiv.org/abs/1706.02677

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Deep Residual Learning for Image Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385

work page internal anchor Pith review Pith/arXiv arXiv 2015

[9] [9]

FireCaffe: near-linear acceleration of deep neural network training on compute clusters

Forrest N. Iandola, Khalid Ashraf, Matthew W. Moskewicz, and Kurt Keutzer. Firecaffe: near-linear acceleration of deep neural network training on compute clusters. CoRR, abs/1511.00175, 2015. URL http://arxiv.org/abs/1511.00175

work page internal anchor Pith review Pith/arXiv arXiv 2015

[10] [10]

On the relation between the sharpest directions of DNN loss and the SGD step length

Stanisław Jastrz˛ ebski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amost Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. InInternational Con- ference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkgEaj05t7

work page 2019

[11] [11]

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

Xianyan Jia, Shutao Song, Wei He, Yangzihao Wang, Haidong Rong, Feihu Zhou, Liqiang Xie, Zhenyu Guo, Yuanzhou Yang, Liwei Yu, Tiegang Chen, Guangxiao Hu, Shaohuai Shi, and Xiaowen Chu. Highly scalable deep learning training system with mixed-precision: Training imagenet in four minutes. CoRR, abs/1807.11205, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[12] [13]

URL http://arxiv.org/abs/1609.04836

work page internal anchor Pith review Pith/arXiv arXiv

[13] [14]

Bobby Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does SGD escape local minima? In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning , volume 80 of Proceedings of Machine Learning Research , pages 2698–2707, Stockholmsmässan, Stockholm Sweden, 10–15 Jul 2018. PMLR. URL http://...

work page 2018

[14] [15]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009

work page 2009

[15] [16]

Visualizing the Loss Landscape of Neural Nets

Hao Li, Zheng Xu, Gavin Taylor, and Tom Goldstein. Visualizing the loss landscape of neural nets. CoRR, abs/1712.09913, 2017. URL http://arxiv.org/abs/1712.09913. 9

work page internal anchor Pith review Pith/arXiv arXiv 2017

[16] [17]

Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks

Kazuki Osawa, Yohei Tsuji, Yuichiro Ueno, Akira Naruse, Rio Yokota, and Satoshi Matsuoka. Large-Scale Distributed Second-Order Optimization Using Kronecker-Factored Approximate Curvature for Deep Convolutional Neural Networks. arXiv e-prints, art. arXiv:1811.12019, Nov 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[17] [18]

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, Dec 2015. ISSN 1573- doi: 10.1007/s11263-015-0816-y. URL https://doi.org/10.1007/s11263-015-0816-y

[19] [20]

Levent Sagun, Utku Evci, V. Ugur Güney, Yann Dauphin, and Léon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. CoRR, abs/1706.04454, 2017. URL http://arxiv.org/abs/1706.04454

[20] [21]

Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How Does Batch Normalization Help Optimization? arXiv e-prints, art. arXiv:1805.11604, May 2018

[21] [22]

Seiya Tokui and Kenta Oono. Chainer: a next-generation open source framework for deep learning. In Proceedings of Workshop on Machine Learning Systems (LearningSys) in The Twenty-ninth Annual Conference on Neural Information Processing Systems (NIPS-W), 2015

[22] [23]

Wei Wen, Yandan Wang, Feng Yan, Cong Xu, Yiran Chen, and Hai Li. Smoothout: Smoothing out sharp minima for generalization in large-batch deep learning. CoRR, abs/1805.07898, 2018. URL http://arxiv.org/abs/1805.07898

[23] [24]

Yeming Wen, Kevin Luk, Maxime Gazeau, Guodong Zhang, Harris Chan, and Jimmy Ba. Interplay between optimization and generalization of stochastic gradient descent with covariance noise. CoRR, abs/1902.08234, 2019. URL http://arxiv.org/abs/1902.08234

[24] [25]

Chris Ying, Sameer Kumar, Dehao Chen, Tao Wang, and Youlong Cheng. Image classification at supercomputer scale. CoRR, abs/1811.06992, 2018. URL http://arxiv.org/abs/1811.06992

[25] [26]

Yang You, Igor Gitman, and Boris Ginsburg. Large Batch Training of Convolutional Networks. arXiv e-prints, art. arXiv:1708.03888, Aug 2017

[26] [27]

Yang You, Zhao Zhang, Cho-Jui Hsieh, and James Demmel. Imagenet training in minutes. CoRR, abs/1709.05011v10, 2018. URL https://arxiv.org/abs/1709.05011v10

[27] [28]

Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. Random erasing data augmentation. CoRR, abs/1708.04896, 2017. URL http://arxiv.org/abs/1708.04896

[28] [29]

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects. ArXiv e-prints, art. arXiv:1803.00195, February 2018.

A Detailed Experimental Setup

A.1 Datasets

We empirically evaluated the performance of GNC in large scale distributed trai...

and Ying et al. [23] did not reveal the details of their experimental settings, so those results are beyond the scope of this paper.

                       Epochs    Batch size    Accuracy(%)
Goyal et al. [7]       90        32,768        72.45
Akiba et al. [2]       90        32,768        74.94
Codreanu et al. [5]    100       32,768        75.31
You et al. [25]        90        32,768        75.4
Jia et al. [11]        90        65,536        76.2 ∗
Ying et al. [23]       90        32,768        ...