Understanding deep learning requires rethinking generalization
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 11:51 UTC · model grok-4.3
The pith
Deep neural networks easily fit random training labels, even with explicit regularization or with pure-noise inputs, showing that generalization on real data must come from training dynamics rather than model capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. Simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points.
What carries the argument
The experimental protocol of training networks to zero error on random labelings, together with the theoretical construction proving that depth-two networks can realize any labeling once parameters exceed sample count.
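A minimal sketch of that protocol, for intuition only: a small fully connected network trained on Gaussian-noise inputs with uniformly random labels. The sizes, optimizer settings, and step budget below are illustrative assumptions, not the paper's CIFAR-10 or ImageNet configurations; the point is that once the parameter count comfortably exceeds the sample count, training error reaches zero even though the labels carry no information about the inputs.

```python
# Randomization-test sketch: drive training error to zero on labels that carry
# no information about the inputs. All sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 512, 64, 10                          # samples, input dim, classes
X = torch.randn(n, d)                          # unstructured Gaussian "images"
y = torch.randint(0, k, (n,))                  # uniformly random labels

model = nn.Sequential(                         # parameter count far exceeds n
    nn.Linear(d, 2048), nn.ReLU(),
    nn.Linear(2048, k),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
    if train_acc == 1.0:
        print(f"zero training error on random labels at step {step}")
        break
```

Explicit regularizers (weight decay on the optimizer, dropout between the layers) can be added to the same sketch; the paper's finding is that memorization persists essentially unchanged.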
If this is right
- Generalization on real tasks must be explained by implicit biases induced by the optimizer or by the alignment between network architecture and natural data structure (a standard linear-model instance of such a bias is sketched after this list).
- Explicit regularization techniques such as weight decay or dropout do not prevent the network from memorizing arbitrary labelings.
- The classical bias-variance tradeoff does not apply in the usual way because the model class can already realize every possible labeling of the training points.
- New theoretical tools are needed that analyze the specific trajectory taken by stochastic gradient descent rather than only the final hypothesis class.
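One concrete, standard instance of the optimizer-induced bias the first item points to, of the kind the paper's own appeal to linear models makes: on an underdetermined least-squares problem, gradient descent started from zero converges to the minimum-norm interpolant, so the trajectory, not the hypothesis class, does the selecting.

```latex
% Implicit bias of gradient descent on underdetermined least squares (standard
% result, shown only as an illustration of an optimizer-induced bias).
% Let $X \in \mathbb{R}^{n \times p}$ with $n < p$ and full row rank, start at
% $w_0 = 0$, and take a small enough step size $\eta$.
\begin{align*}
  w_{t+1} &= w_t - \eta\, X^{\top}(X w_t - y)
    && \text{so every iterate stays in the row space of } X, \\
  w_t &\;\longrightarrow\; w_{\star}
    = X^{\top}(X X^{\top})^{-1} y
    = \operatorname*{arg\,min}_{w \,:\, Xw = y} \lVert w \rVert_2 .
\end{align*}
% Among the infinitely many zero-training-error solutions, the dynamics pick out
% this one; a capacity argument alone cannot distinguish it from the rest.
```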
Where Pith is reading between the lines
- The same memorization behavior may appear in other high-capacity function classes such as wide kernel machines or decision-tree ensembles.
- Future analyses could measure how quickly the training dynamics separate real-data minima from random-label minima in the loss landscape.
- Practical model selection might benefit from explicit checks of how easily a candidate architecture fits random labels at the target dataset size (a sketch of such a check follows this list).
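A sketch of the check that the last item suggests, assuming scikit-learn's MLPClassifier and the small digits dataset as illustrative stand-ins for the candidate architecture and target data (the paper's own experiments use full CIFAR-10 and ImageNet networks): fit the same model to true and to shuffled labels and compare training accuracy and iterations to fit.

```python
# Random-label fitting check: how easily does a candidate model interpolate
# shuffled labels at the target sample size? Dataset and model are illustrative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)        # 1797 samples, 64 features, 10 classes
y_shuffled = rng.permutation(y)            # destroys any input-label relationship

for name, labels in [("true labels", y), ("shuffled labels", y_shuffled)]:
    clf = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0,
                        max_iter=2000, random_state=0)
    clf.fit(X, labels)
    print(f"{name}: train accuracy {clf.score(X, labels):.3f} "
          f"after {clf.n_iter_} iterations")
```

A large gap in iterations-to-fit, or a failure to interpolate the shuffled labels at all, indicates how much the architecture is relying on structure in the data at that sample size.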
Load-bearing premise
That the networks' ability to fit random labels in these finite-sample regimes directly rules out capacity or explicit regularization as the main reason they generalize on natural data.
What would settle it
An architecture and training procedure that reaches low test error on natural images yet fails to reach zero training error on a random labeling of the same training set.
Original abstract
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that conventional explanations for the strong generalization of deep neural networks—based on model family properties or explicit regularization—fail to account for observed behavior. Through systematic experiments it shows that state-of-the-art convolutional networks trained by SGD achieve zero training error on randomly labeled data and even on unstructured random inputs; this holds across multiple datasets and is largely unaffected by regularization. A supporting theoretical construction demonstrates that depth-two networks already possess perfect finite-sample expressivity once the number of parameters exceeds the number of training points.
Significance. If the central empirical and theoretical results hold, the work is significant because it directly undermines capacity-based and regularization-based accounts of generalization and shifts attention to training dynamics. Credit is due for the use of real architectures on standard image-classification benchmarks, consistent results across random-label and random-input regimes, and the clean finite-sample argument for depth-two networks that requires no post-hoc adjustments.
minor comments (3)
- [Section 2] In the experimental section the precise hyper-parameter settings (learning-rate schedule, batch size, weight-decay values) used for the random-label runs should be tabulated for reproducibility.
- [Section 2] Figure 1 and Figure 2 would benefit from explicit axis labels indicating that the y-axis is training error (not test error) in the random-label panels.
- [Section 3] The statement of the depth-two construction (Theorem 1) should explicitly list the activation-function assumptions required for the finite-sample expressivity claim.
Simulated Author's Rebuttal
We thank the referee for the positive review. The summary correctly identifies the core empirical finding that modern CNNs achieve zero training error on random labels and unstructured noise, as well as the supporting finite-sample expressivity result for depth-two networks.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's core argument consists of direct experimental measurements (training error on random labels and random inputs) and an independent constructive proof of finite-sample expressivity for depth-2 networks when parameter count exceeds sample size. Neither reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The experiments compare observed behavior against explicit random baselines without circular fitting, and the theoretical result is a standalone construction that does not presuppose the empirical findings or rely on prior author work for its validity. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] A depth-two neural network with more parameters than training points can represent any labeling of those points (the standard construction is sketched below).
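A compact sketch of the standard construction behind that axiom, stated for ReLU activations as in the paper's Theorem 1 (the notation below is ours):

```latex
% Depth-two finite-sample expressivity with ReLU units and $2n + d$ weights.
% Given distinct $x_1, \dots, x_n \in \mathbb{R}^d$ and arbitrary targets
% $y_1, \dots, y_n$:
\begin{align*}
  &\text{(1) choose } a \in \mathbb{R}^{d} \text{ with distinct projections }
     z_i = a^{\top} x_i, \text{ ordered } z_1 < \dots < z_n; \\
  &\text{(2) choose biases } b_1 < z_1 \text{ and } z_{j-1} \le b_j < z_j
     \text{ for } j \ge 2, \text{ and set }
     f(x) = \sum_{j=1}^{n} w_j \max(a^{\top} x - b_j,\, 0); \\
  &\text{(3) the matrix } A_{ij} = \max(z_i - b_j,\, 0) \text{ is lower triangular
     with a positive diagonal, hence invertible,} \\
  &\qquad \text{so solving } A w = y \text{ gives } f(x_i) = y_i \text{ for every } i.
\end{align*}
% Weight count: $d$ entries of $a$, plus $n$ biases, plus $n$ output weights $= 2n + d$.
```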
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel (echoes): "state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise."
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi (unclear): "simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points"
Forward citations
Cited by 25 Pith papers
- When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias
ℓ₂-Boosting exhibits benign overfitting with logarithmic excess variance decay Θ(σ²/log(p/n)) under isotropic noise due to ℓ₁ bias, and a subdifferential early stopping rule recovers minimax-optimal ℓ₁ rates.
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
- Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
Asymmetric Langevin Unlearning uses public data to suppress unlearning noise costs by O(1/n_pub²), enabling practical mass unlearning with preserved utility under distribution mismatch.
- Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
- When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias
ℓ₂-boosting localizes noise into sparse sets under isotropic pure-noise models, yielding excess variance Θ(σ²/log(p/n)) instead of linear decay, with a tuning-free early stopping rule attaining minimax ℓ₁ rates.
- Stochastic Trust-Region Methods for Over-parameterized Models
Stochastic trust-region methods achieve O(ε^{-2} log(1/ε)) complexity for unconstrained problems and O(ε^{-4} log(1/ε)) for equality-constrained problems under the strong growth condition, with experiments showing sta...
- Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces
SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.
- Deep Learning Scaling is Predictable, Empirically
Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
- Understanding intermediate layers using linear classifier probes
Linear probes demonstrate that feature separability for classification increases monotonically with network depth in Inception v3 and ResNet-50.
- Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
- ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder
ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and...
- Optimizer-Induced Mode Connectivity: From AdamW to Muon
Optimizer choice induces distinct connected regions in the loss landscape of two-layer ReLU networks, with AdamW and Muon sometimes separated by provable barriers.
- The Propagation Field: A Geometric Substrate Theory of Deep Learning
Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting i...
- Distributional simplicity bias and effective convexity in Energy Based Models
Gradient flow in energy-based models for strictly positive binary distributions produces stable data-consistent fixed points and a learning hierarchy that favors lower-order interactions first, mechanistically explain...
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
- Adversarial Robustness of NTK Neural Networks
NTK networks achieve minimax optimal adversarial regression rates in Sobolev spaces with early stopping, but minimum-norm interpolants are vulnerable.
- Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...
- Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.
- Misspecified Universal Learning
Minimax regret is characterized for misspecified universal learning with log-loss, yielding the optimal universal learner as a unified framework for any uncertainty in the data-generating process.
- DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models
DP-LAC provides a new adaptive clipping technique for DP-SGD in federated LLM fine-tuning that improves accuracy by 6.6% on average without consuming additional privacy budget or requiring new hyperparameters.
- Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
- WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval
WRF4CIR uses weight-regularized fine-tuning with adversarial perturbations to mitigate overfitting in composed image retrieval and narrows the generalization gap on benchmarks.
- Spectral methods: crucial for machine learning, natural for quantum computers?
Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.
- A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots
OpenCLIP-based gesture classification with linear probing controls AcoustoBot swarms at 87.8% accuracy and 3.95 s latency in controlled tests.
- Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization
Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gai...
Reference graph
Works this paper leans on
- [2] Hrushikesh Narhar Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1(1):61–80, 1993.
- [3] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Statistical learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Technical Report AI Memo 2002-024, Massachusetts Institute of Technology, 2002.
- [4] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. CoRR, abs/1412.6614, 2014.
- [5] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In COLT, 2001.
- [6] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [7] Matus Telgarsky. Benefits of depth in neural networks. In COLT, 2016.
- [8] ILSVRC 2012 / CIFAR10 experimental setup: CIFAR10 contains 50,000 training and 10,000 validation images in 10 classes; each image is 32x32 with 3 color channels, with pixel values scaled to [0, 1], center-cropped to 28x28, and normalized by subtracting the mean and dividing by the adjusted standard deviation.
- [9] TensorFlow experimental setup: the data preprocessing and pipeline are reused from the TensorFlow package, extended to disable data augmentation and feed random labels that are consistent across epochs; the ImageNet experiment runs in a distributed asynchronous SGD system with 50 workers.