Understanding deep learning requires rethinking generalization
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-13 11:51 UTC · model grok-4.3
The pith
Deep neural networks easily fit random training labels, even with explicit regularization or with pure-noise inputs, showing that generalization on real data must come from training dynamics rather than model capacity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
State-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. Simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points.
What carries the argument
The experimental protocol of training networks to zero error on random labelings, together with the theoretical construction proving that depth-two networks can realize any labeling once parameters exceed sample count.
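A minimal sketch of that protocol, for intuition only: a small fully connected network trained on Gaussian-noise inputs with uniformly random labels. The sizes, optimizer settings, and step budget below are illustrative assumptions, not the paper's CIFAR-10 or ImageNet configurations; the point is that once the parameter count comfortably exceeds the sample count, training error reaches zero even though the labels carry no information about the inputs.

```python
# Randomization-test sketch: drive training error to zero on labels that carry
# no information about the inputs. All sizes and hyperparameters are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 512, 64, 10                          # samples, input dim, classes
X = torch.randn(n, d)                          # unstructured Gaussian "images"
y = torch.randint(0, k, (n,))                  # uniformly random labels

model = nn.Sequential(                         # parameter count far exceeds n
    nn.Linear(d, 2048), nn.ReLU(),
    nn.Linear(2048, k),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    opt.step()
    train_acc = (model(X).argmax(dim=1) == y).float().mean().item()
    if train_acc == 1.0:
        print(f"zero training error on random labels at step {step}")
        break
```

Explicit regularizers (weight decay on the optimizer, dropout between the layers) can be added to the same sketch; the paper's finding is that memorization persists essentially unchanged.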
If this is right
- Generalization on real tasks must be explained by implicit biases induced by the optimizer or by the alignment between network architecture and natural data structure (a standard linear-model instance of such a bias is sketched after this list).
- Explicit regularization techniques such as weight decay or dropout do not prevent the network from memorizing arbitrary labelings.
- The classical bias-variance tradeoff does not apply in the usual way because the model class can already realize every possible labeling of the training points.
- New theoretical tools are needed that analyze the specific trajectory taken by stochastic gradient descent rather than only the final hypothesis class.
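One concrete, standard instance of the optimizer-induced bias the first item points to, of the kind the paper's own appeal to linear models makes: on an underdetermined least-squares problem, gradient descent started from zero converges to the minimum-norm interpolant, so the trajectory, not the hypothesis class, does the selecting.

```latex
% Implicit bias of gradient descent on underdetermined least squares (standard
% result, shown only as an illustration of an optimizer-induced bias).
% Let $X \in \mathbb{R}^{n \times p}$ with $n < p$ and full row rank, start at
% $w_0 = 0$, and take a small enough step size $\eta$.
\begin{align*}
  w_{t+1} &= w_t - \eta\, X^{\top}(X w_t - y)
    && \text{so every iterate stays in the row space of } X, \\
  w_t &\;\longrightarrow\; w_{\star}
    = X^{\top}(X X^{\top})^{-1} y
    = \operatorname*{arg\,min}_{w \,:\, Xw = y} \lVert w \rVert_2 .
\end{align*}
% Among the infinitely many zero-training-error solutions, the dynamics pick out
% this one; a capacity argument alone cannot distinguish it from the rest.
```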
Where Pith is reading between the lines
- The same memorization behavior may appear in other high-capacity function classes such as wide kernel machines or decision-tree ensembles.
- Future analyses could measure how quickly the training dynamics separate real-data minima from random-label minima in the loss landscape.
- Practical model selection might benefit from explicit checks of how easily a candidate architecture fits random labels at the target dataset size (a sketch of such a check follows this list).
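A sketch of the check that the last item suggests, assuming scikit-learn's MLPClassifier and the small digits dataset as illustrative stand-ins for the candidate architecture and target data (the paper's own experiments use full CIFAR-10 and ImageNet networks): fit the same model to true and to shuffled labels and compare training accuracy and iterations to fit.

```python
# Random-label fitting check: how easily does a candidate model interpolate
# shuffled labels at the target sample size? Dataset and model are illustrative.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X, y = load_digits(return_X_y=True)        # 1797 samples, 64 features, 10 classes
y_shuffled = rng.permutation(y)            # destroys any input-label relationship

for name, labels in [("true labels", y), ("shuffled labels", y_shuffled)]:
    clf = MLPClassifier(hidden_layer_sizes=(512,), alpha=0.0,
                        max_iter=2000, random_state=0)
    clf.fit(X, labels)
    print(f"{name}: train accuracy {clf.score(X, labels):.3f} "
          f"after {clf.n_iter_} iterations")
```

A large gap in iterations-to-fit, or a failure to interpolate the shuffled labels at all, indicates how much the architecture is relying on structure in the data at that sample size.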
Load-bearing premise
That the networks' ability to fit random labels in these finite-sample regimes directly rules out capacity or explicit regularization as the main reason they generalize on natural data.
What would settle it
An architecture and training procedure that reaches low test error on natural images yet fails to reach zero training error on a random labeling of the same training set.
Original abstract
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that conventional explanations for the strong generalization of deep neural networks—based on model family properties or explicit regularization—fail to account for observed behavior. Through systematic experiments it shows that state-of-the-art convolutional networks trained by SGD achieve zero training error on randomly labeled data and even on unstructured random inputs; this holds across multiple datasets and is largely unaffected by regularization. A supporting theoretical construction demonstrates that depth-two networks already possess perfect finite-sample expressivity once the number of parameters exceeds the number of training points.
Significance. If the central empirical and theoretical results hold, the work is significant because it directly undermines capacity-based and regularization-based accounts of generalization and shifts attention to training dynamics. Credit is due for the use of real architectures on standard image-classification benchmarks, consistent results across random-label and random-input regimes, and the clean finite-sample argument for depth-two networks that requires no post-hoc adjustments.
minor comments (3)
- [Section 2] In the experimental section the precise hyper-parameter settings (learning-rate schedule, batch size, weight-decay values) used for the random-label runs should be tabulated for reproducibility.
- [Section 2] Figure 1 and Figure 2 would benefit from explicit axis labels indicating that the y-axis is training error (not test error) in the random-label panels.
- [Section 3] The statement of the depth-two construction (Theorem 1) should explicitly list the activation-function assumptions required for the finite-sample expressivity claim.
Simulated Author's Rebuttal
We thank the referee for the positive review. The summary correctly identifies the core empirical finding that modern CNNs achieve zero training error on random labels and unstructured noise, as well as the supporting finite-sample expressivity result for depth-two networks.
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's core argument consists of direct experimental measurements (training error on random labels and random inputs) and an independent constructive proof of finite-sample expressivity for depth-2 networks when parameter count exceeds sample size. Neither reduces to a self-definition, a fitted parameter renamed as a prediction, or a load-bearing self-citation. The experiments compare observed behavior against explicit random baselines without circular fitting, and the theoretical result is a standalone construction that does not presuppose the empirical findings or rely on prior author work for its validity. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] A depth-two neural network with more parameters than training points can represent any labeling of those points (the standard construction is sketched below).
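A compact sketch of the standard construction behind that axiom, stated for ReLU activations as in the paper's Theorem 1 (the notation below is ours):

```latex
% Depth-two finite-sample expressivity with ReLU units and $2n + d$ weights.
% Given distinct $x_1, \dots, x_n \in \mathbb{R}^d$ and arbitrary targets
% $y_1, \dots, y_n$:
\begin{align*}
  &\text{(1) choose } a \in \mathbb{R}^{d} \text{ with distinct projections }
     z_i = a^{\top} x_i, \text{ ordered } z_1 < \dots < z_n; \\
  &\text{(2) choose biases } b_1 < z_1 \text{ and } z_{j-1} \le b_j < z_j
     \text{ for } j \ge 2, \text{ and set }
     f(x) = \sum_{j=1}^{n} w_j \max(a^{\top} x - b_j,\, 0); \\
  &\text{(3) the matrix } A_{ij} = \max(z_i - b_j,\, 0) \text{ is lower triangular
     with a positive diagonal, hence invertible,} \\
  &\qquad \text{so solving } A w = y \text{ gives } f(x_i) = y_i \text{ for every } i.
\end{align*}
% Weight count: $d$ entries of $a$, plus $n$ biases, plus $n$ output weights $= 2n + d$.
```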
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel (echoes): "state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise."
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi (unclear): "simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points"
Forward citations
Cited by 25 Pith papers
- When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias
ℓ₂-Boosting exhibits benign overfitting with logarithmic excess variance decay Θ(σ²/log(p/n)) under isotropic noise due to ℓ₁ bias, and a subdifferential early stopping rule recovers minimax-optimal ℓ₁ rates.
- Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
- Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data
Asymmetric Langevin Unlearning uses public data to suppress unlearning noise costs by O(1/n_pub²), enabling practical mass unlearning with preserved utility under distribution mismatch.
- Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
- When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias
ℓ₂-boosting localizes noise into sparse sets under isotropic pure-noise models, yielding excess variance Θ(σ²/log(p/n)) instead of linear decay, with a tuning-free early stopping rule attaining minimax ℓ₁ rates.
- Stochastic Trust-Region Methods for Over-parameterized Models
Stochastic trust-region methods achieve O(ε^{-2} log(1/ε)) complexity for unconstrained problems and O(ε^{-4} log(1/ε)) for equality-constrained problems under the strong growth condition, with experiments showing sta...
- Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces
SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.
- Deep Learning Scaling is Predictable, Empirically
Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
- Understanding intermediate layers using linear classifier probes
Linear probes demonstrate that feature separability for classification increases monotonically with network depth in Inception v3 and ResNet-50.
- Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and ...
- ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder
ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and...
- Optimizer-Induced Mode Connectivity: From AdamW to Muon
Optimizer choice induces distinct connected regions in the loss landscape of two-layer ReLU networks, with AdamW and Muon sometimes separated by provable barriers.
- The Propagation Field: A Geometric Substrate Theory of Deep Learning
Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting i...
- Distributional simplicity bias and effective convexity in Energy Based Models
Gradient flow in energy-based models for strictly positive binary distributions produces stable data-consistent fixed points and a learning hierarchy that favors lower-order interactions first, mechanistically explain...
- Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
- Adversarial Robustness of NTK Neural Networks
NTK networks achieve minimax optimal adversarial regression rates in Sobolev spaces with early stopping, but minimum-norm interpolants are vulnerable.
- Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization
Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex hum...
- Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima
Nexus optimizer improves LLM downstream performance by converging to common minima across data sources despite identical pretraining loss.
- Misspecified Universal Learning
Minimax regret is characterized for misspecified universal learning with log-loss, yielding the optimal universal learner as a unified framework for any uncertainty in the data-generating process.
- DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models
DP-LAC provides a new adaptive clipping technique for DP-SGD in federated LLM fine-tuning that improves accuracy by 6.6% on average without consuming additional privacy budget or requiring new hyperparameters.
- Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
- WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval
WRF4CIR uses weight-regularized fine-tuning with adversarial perturbations to mitigate overfitting in composed image retrieval and narrows the generalization gap on benchmarks.
- Spectral methods: crucial for machine learning, natural for quantum computers?
Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.
- A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots
OpenCLIP-based gesture classification with linear probing controls AcoustoBot swarms at 87.8% accuracy and 3.95 s latency in controlled tests.
- Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization
Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gai...
Reference graph
Works this paper leans on
- [2] Hrushikesh Narhar Mhaskar. Approximation properties of a multilayered feedforward artificial neural network. Advances in Computational Mathematics, 1(1):61–80, 1993.
- [3] Sayan Mukherjee, Partha Niyogi, Tomaso Poggio, and Ryan Rifkin. Statistical learning: Stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Technical Report AI Memo 2002-024, Massachusetts Institute of Technology, 2002.
- [4] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. CoRR, abs/1412.6614, 2014.
- [5] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In COLT, 2001.
- [6] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- [7] Matus Telgarsky. Benefits of depth in neural networks. In COLT, 2016.
- [8] ILSVRC 2012 / CIFAR10 experimental setup: CIFAR10 contains 50,000 training and 10,000 validation images in 10 classes; each image is 32x32 with 3 color channels, with pixel values scaled to [0, 1], center-cropped to 28x28, and normalized by subtracting the mean and dividing by the adjusted standard deviation.
- [9] TensorFlow experimental setup: the data preprocessing and pipeline are reused from the TensorFlow package, extended to disable data augmentation and feed random labels that are consistent across epochs; the ImageNet experiment runs in a distributed asynchronous SGD system with 50 workers.