SmoothOut: Smoothing Out Sharp Minima to Improve Generalization in Deep Learning

Chunpeng Wu; Cong Xu; Feng Yan; Hai Li; Wei Wen; Yandan Wang; Yiran Chen

arxiv: 1805.07898 · v3 · pith:CWPYLZDWnew · submitted 2018-05-21 · 💻 cs.LG · stat.ML


Wei Wen, Yandan Wang, Feng Yan, Cong Xu, Chunpeng Wu, Yiran Chen, Hai Li


keywords: smoothout, generalization, minima, noise, sharp, improve, copies, deep



In deep learning, Stochastic Gradient Descent (SGD) is usually chosen as the training method because of its efficiency. Recently, however, one problem with SGD has attracted research interest: sharp minima in Deep Neural Networks (DNNs) generalize poorly, and large-batch SGD in particular tends to converge to sharp minima. Whether escaping sharp minima can improve generalization is an open question. To answer it, we propose the SmoothOut framework, which smooths out sharp minima in DNNs and thereby improves generalization. In a nutshell, SmoothOut perturbs multiple copies of the DNN by noise injection and averages these copies. Injecting noise into SGD is widely used in the literature, but SmoothOut differs in several ways: (1) a de-noising process is applied before the parameter update; (2) the noise strength is adapted to the filter norm; (3) it offers an alternative interpretation of the advantage of noise injection, from the perspective of sharpness and generalization; (4) it uses uniform noise instead of Gaussian noise. We prove that SmoothOut can eliminate sharp minima. Because training multiple DNN copies is inefficient, we further propose an unbiased stochastic SmoothOut, which introduces only the overhead of noise injection and de-noising per batch. An adaptive variant of SmoothOut, AdaSmoothOut, is also proposed to improve generalization. In a variety of experiments, SmoothOut and AdaSmoothOut consistently improve generalization in both small-batch and large-batch training on top of state-of-the-art solutions.
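The per-batch procedure the abstract describes for stochastic SmoothOut (inject uniform noise, compute the gradient, de-noise, then update) might look roughly like the sketch below. This is a minimal illustration in pure Python, not the authors' implementation: the function name `smoothout_step`, the parameters `lr` and `a`, and the toy quadratic loss are all assumptions made for the example, and the adaptation of noise strength to the filter norm is not shown.

```python
import random


def smoothout_step(w, grad_fn, lr, a, rng):
    """One hypothetical stochastic SmoothOut-style update on a parameter list.

    w       : list of floats (current parameters)
    grad_fn : function mapping a parameter list to its gradient list
    lr      : learning rate (assumed name)
    a       : half-width of the uniform noise range [-a, a] (assumed name)
    rng     : random.Random instance, for reproducibility
    """
    # 1. Noise injection: perturb every parameter with uniform noise.
    noise = [rng.uniform(-a, a) for _ in w]
    w_noisy = [wi + ni for wi, ni in zip(w, noise)]

    # 2. Gradient is evaluated at the perturbed parameters.
    g = grad_fn(w_noisy)

    # 3. De-noising: subtract the same noise before the update.
    w_denoised = [wi - ni for wi, ni in zip(w_noisy, noise)]

    # 4. Plain SGD update on the de-noised parameters.
    return [wi - lr * gi for wi, gi in zip(w_denoised, g)]


def quadratic_grad(w):
    # Gradient of the toy loss f(w) = 0.5 * sum(w_i ** 2); illustration only.
    return list(w)


if __name__ == "__main__":
    rng = random.Random(0)
    w = [1.0, -2.0]
    for _ in range(200):
        w = smoothout_step(w, quadratic_grad, lr=0.1, a=0.1, rng=rng)
    print(w)  # parameters end up close to the minimum at the origin
```

In this sketch the only extra work per step relative to plain SGD is drawing the noise and adding and subtracting it, which matches the abstract's description of the overhead. Setting `a=0` reduces the step to ordinary gradient descent.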



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Gradient Noise Convolution (GNC): Smoothing Loss Function for Distributed Large-Batch SGD
cs.LG 2019-06 unverdicted novelty 5.0

GNC convolves stochastic gradient noise to smooth sharp minima in large-batch SGD, outperforming isotropic noise for better generalization in distributed deep learning.