pith. machine review for the scientific record.

arxiv: 1605.08361 · v2 · submitted 2016-05-26 · 📊 stat.ML · cs.LG · cs.NE

Recognition: unknown

No bad local minima: Data independent training error guarantees for multilayer neural networks

Authors on Pith: no claims yet
classification 📊 stat.ML · cs.LG · cs.NE
keywords local · training · guarantees · loss · MNNs · data · differentiable · error
original abstract

We use smoothed analysis techniques to provide guarantees on the training loss of Multilayer Neural Networks (MNNs) at differentiable local minima. Specifically, we examine MNNs with piecewise linear activation functions, quadratic loss and a single output, under mild over-parametrization. We prove that for a MNN with one hidden layer, the training error is zero at every differentiable local minimum, for almost every dataset and dropout-like noise realization. We then extend these results to the case of more than one hidden layer. Our theoretical guarantees assume essentially nothing on the training data, and are verified numerically. These results suggest why the highly non-convex loss of such MNNs can be easily optimized using local updates (e.g., stochastic gradient descent), as observed empirically.
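Not part of the paper, but a minimal sketch of the kind of numerical check the abstract alludes to: train an over-parameterized one-hidden-layer network with a piecewise linear activation, quadratic loss, and a single output on an arbitrary dataset, and observe that plain gradient descent drives the training loss to (numerically) zero. The dataset, hidden width, learning rate, and step count below are illustrative assumptions, PyTorch is used for convenience, and the dropout-like noise appearing in the formal statement is omitted.

```python
# Hedged sketch, not the authors' code: over-parameterized MNN with one hidden
# layer, leaky-ReLU (piecewise linear) activation, quadratic loss, single output.
import torch

torch.manual_seed(0)

n, d, h = 50, 10, 200            # n samples, input dim d, hidden width h >> n (mild over-parametrization)
X = torch.randn(n, d)            # essentially arbitrary training inputs
y = torch.randn(n, 1)            # arbitrary real-valued targets

model = torch.nn.Sequential(
    torch.nn.Linear(d, h),
    torch.nn.LeakyReLU(0.1),     # piecewise linear activation
    torch.nn.Linear(h, 1),       # single output
)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)   # full-batch gradient descent here
loss_fn = torch.nn.MSELoss()                          # quadratic loss

for step in range(20000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()

print(f"final training loss: {loss.item():.2e}")      # expected to approach 0
```

Under these assumptions the final loss should be close to machine precision, consistent with the claim that differentiable local minima of such over-parameterized networks have zero training error for almost every dataset.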

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work, sorted by Pith novelty score.

  1. Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

    cs.LG 2024-01 unverdicted novelty 6.0

    SPIN lets weak LLMs become strong by self-generating training data from previous model versions and training to prefer human-annotated responses over its own outputs, outperforming DPO even with extra GPT-4 data on be...

  2. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    cs.LG 2016-09 unverdicted novelty 6.0

    Large-batch methods converge to sharp minima causing a generalization gap, while small-batch methods reach flat minima due to inherent gradient noise.