pith. machine review for the scientific record.

arxiv: 1703.11008 · v2 · submitted 2017-03-31 · 💻 cs.LG

Recognition: unknown

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Authors on Pith: no claims yet
classification 💻 cs.LG
keywords: bounds, generalization, deep, nonvacuous, data, learning, many, networks
original abstract

One of the defining properties of deep learning is that models are chosen to have many more parameters than available training data. In light of this capacity for overfitting, it is remarkable that simple algorithms like SGD reliably return solutions with low test error. One roadblock to explaining these phenomena in terms of implicit regularization, structural properties of the solution, and/or easiness of the data is that many learning bounds are quantitatively vacuous when applied to networks learned by SGD in this "deep learning" regime. Logically, in order to explain generalization, we need nonvacuous bounds. We return to an idea by Langford and Caruana (2001), who used PAC-Bayes bounds to compute nonvacuous numerical bounds on generalization error for stochastic two-layer two-hidden-unit neural networks via a sensitivity analysis. By optimizing the PAC-Bayes bound directly, we are able to extend their approach and obtain nonvacuous generalization bounds for deep stochastic neural network classifiers with millions of parameters trained on only tens of thousands of examples. We connect our findings to recent and old work on flat minima and MDL-based explanations of generalization.
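The core computation the abstract describes is evaluating (and then optimizing) a PAC-Bayes bound numerically. As a rough sketch of the evaluation step only, the classic PAC-Bayes-kl inequality bounds the true error of a stochastic classifier by the largest p satisfying kl(ê‖p) ≤ (KL(Q‖P) + ln(2√n/δ))/n, which can be inverted by bisection. The function names and the example numbers below are illustrative assumptions, not values from the paper:

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p), in nats."""
    eps = 1e-12  # clamp to avoid log(0)
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_kl_bound(emp_err, kl_posterior_prior, n, delta=0.05):
    """Invert the PAC-Bayes-kl inequality by bisection:
    return the largest p in [emp_err, 1] with
        kl(emp_err || p) <= (KL(Q||P) + ln(2*sqrt(n)/delta)) / n,
    an upper bound on the true error holding with prob. >= 1 - delta.
    """
    rhs = (kl_posterior_prior + math.log(2 * math.sqrt(n) / delta)) / n
    lo, hi = emp_err, 1.0
    for _ in range(100):  # kl(emp_err || p) is increasing in p >= emp_err
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

# Illustrative (made-up) numbers at MNIST scale: 3% empirical error,
# KL(Q||P) = 5000 nats, n = 55000 training examples.
bound = pac_bayes_kl_bound(0.03, 5000.0, 55000)
```

The paper's contribution goes further than this evaluation step: it treats the bound itself as a differentiable objective over the posterior Q and optimizes it with SGD, which this sketch does not attempt.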

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Are Flat Minima an Illusion?

    cs.LG 2026-03 unverdicted novelty 8.0

    Flat minima are illusory; generalization is driven by weakness, a reparameterization-invariant measure of compatible completions that predicts performance better than sharpness on MNIST and Fashion-MNIST.

  2. Fix the Loss, Not the Radius: Rethinking the Adversarial Perturbation of Sharpness-Aware Minimization

    cs.LG 2026-05 unverdicted novelty 7.0

    LE-SAM inverts SAM by fixing the loss budget instead of the parameter-space radius, yielding better generalization across benchmarks.

  3. ConquerNet: Convolution-Smoothed Quantile ReLU Neural Networks with Minimax Guarantees

    stat.ML 2026-05 unverdicted novelty 7.0

    ConquerNet smooths quantile ReLU networks with convolution for easier training and establishes minimax-optimal nonasymptotic risk bounds over Besov function classes.

  4. Topology-Aware PAC-Bayesian Generalization Analysis for Graph Neural Networks

    cs.LG 2026-04 unverdicted novelty 7.0

    A new PAC-Bayesian framework for GCNs derives a family of generalization bounds that embed graph topology via structured sensitivity matrices from spatial and spectral perspectives, recovering prior bounds as special ...

  5. On the Generalization Bounds of Symbolic Regression with Genetic Programming

    cs.LG 2026-04 unverdicted novelty 6.0

    Derives a generalization bound for GP-based symbolic regression that decomposes the gap into structure-selection complexity and constant-fitting complexity under tree constraints.

  6. Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

    cs.LG 2026-04 unverdicted novelty 6.0

    Path-norm initialization-dependent bounds with a new peeling technique give non-vacuous generalization guarantees for overparameterized shallow networks with Lipschitz activations.

  7. Generalization error bounds for two-layer neural networks with Lipschitz loss function

    stat.ML 2026-04 unverdicted novelty 4.0

    Generalization error bounds of order O(n^{-1/2}) (dimension-free) are derived for two-layer neural networks with Lipschitz losses under independent test data, and O(n^{-1/(d_in + d_out)}) without independence, using W...