hub Canonical reference

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals · 2016 · cs.LG · arXiv 1611.03530

46 Pith papers cite this work. 100% of the classified citations cite it as background.

abstract

Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small difference between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family, or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization, and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points as it usually does in practice. We interpret our experimental findings by comparison with traditional models.
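
The finite-sample expressivity claim in the abstract can be illustrated with a small numerical sketch. The abstract does not spell out the construction, so everything below is an assumption made for illustration: a depth-two ReLU network of the form f(x) = sum_j w[j] * relu(a·x - b[j]), a random projection `a`, thresholds `b` interleaved with the projected samples, and the hypothetical helper names `fit_depth_two_relu` and `predict`. With n samples in d dimensions this uses 2n + d weights, and it fits labels that are independent of the inputs exactly.

```python
import numpy as np

def fit_depth_two_relu(X, y, rng):
    """Return (a, b, w) such that sum_j w[j] * relu(a @ x_i - b[j]) equals y_i for every sample."""
    n, d = X.shape
    a = rng.standard_normal(d)            # random projection; projected values are distinct almost surely
    z = X @ a
    order = np.argsort(z)
    z_sorted = z[order]
    # Interleave thresholds so that b[0] < z[0] < b[1] < z[1] < ... (after sorting).
    b = np.empty(n)
    b[0] = z_sorted[0] - 1.0
    b[1:] = 0.5 * (z_sorted[:-1] + z_sorted[1:])
    # A[i, j] = relu(z_sorted[i] - b[j]) is lower triangular with a positive diagonal, hence invertible.
    A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, y[order])
    return a, b, w

def predict(X, a, b, w):
    """Evaluate the depth-two ReLU network on the rows of X."""
    return np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w

rng = np.random.default_rng(0)
n, d = 30, 5
X = rng.standard_normal((n, d))                  # unstructured random inputs
y = rng.integers(0, 10, size=n).astype(float)    # labels drawn independently of X

a, b, w = fit_depth_two_relu(X, y, rng)
max_err = np.max(np.abs(predict(X, a, b, w) - y))
n_weights = a.size + b.size + w.size             # 2n + d weights in total
print(f"weights: {n_weights}, max training error: {max_err:.2e}")
```

The fit is exact up to floating-point error even though the labels carry no information about the inputs, so zero training error alone says nothing about generalization. This sketch solves a linear system directly; it does not reproduce the paper's experiments, which train convolutional networks with stochastic gradient methods.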

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias

cs.LG · 2026-05-07 · unverdicted · novelty 8.0 · 2 refs

ℓ₂-Boosting exhibits benign overfitting with logarithmic excess variance decay Θ(σ²/log(p/n)) under isotropic noise due to ℓ₁ bias, and a subdifferential early stopping rule recovers minimax-optimal ℓ₁ rates.

Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

cs.LG · 2022-01-06 · unverdicted · novelty 8.0

Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.

Stochastic Trust-Region Methods for Over-parameterized Models

math.OC · 2026-04-15 · unverdicted · novelty 7.0

Stochastic trust-region methods achieve O(ε^{-2} log(1/ε)) complexity for unconstrained problems and O(ε^{-4} log(1/ε)) for equality-constrained problems under the strong growth condition, with experiments showing stable performance comparable to tuned baselines without learning-rate scheduling.

Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces

math.OC · 2026-04-12 · unverdicted · novelty 7.0

SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.

Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

cs.CL · 2026-03-16 · unverdicted · novelty 7.0

IQA is a pragmatically difficult task where multilingual models achieve low performance and overfit severely, even for English, and GPT-4o-mini cannot generate high-quality training data for it.

ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis

cs.CV · 2024-04-15 · unverdicted · novelty 7.0

ANCHOR dataset exposes T2I model weaknesses on multi-subject abstractive captions; SAFE uses LLMs for subject extraction and embedding enhancement to improve consistency.

Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape

cs.LG · 2019-07-05 · conditional · novelty 7.0

Permutation symmetries generate permutation saddles and equal-loss valleys linking equivalent global minima, yielding a lower bound on symmetry-induced critical points.

Deep Learning Scaling is Predictable, Empirically

cs.LG · 2017-12-01 · unverdicted · novelty 7.0

Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.

Understanding intermediate layers using linear classifier probes

stat.ML · 2016-10-05 · accept · novelty 7.0

Linear probes demonstrate that feature separability for classification increases monotonically with network depth in Inception v3 and ResNet-50.

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Controlled experiments on MNIST show human soft-labels act as a regularizer that improves calibration on hard samples and aligns model uncertainty with humans, beyond accuracy gains from correcting mislabels.

Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.

ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and dissociation between accuracy and calibration.

Optimizer-Induced Mode Connectivity: From AdamW to Muon

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Optimizer choice induces distinct connected regions in the loss landscape of two-layer ReLU networks, with AdamW and Muon sometimes separated by provable barriers.

The Propagation Field: A Geometric Substrate Theory of Deep Learning

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting in continual learning.

Distributional simplicity bias and effective convexity in Energy Based Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Gradient flow in energy-based models for strictly positive binary distributions produces stable data-consistent fixed points and a learning hierarchy that favors lower-order interactions first, mechanistically explaining distributional simplicity bias.

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex human preferences.

Prediction horizon shapes representations in predictive learning

cs.LG · 2025-11-12 · unverdicted · novelty 6.0

Longer prediction horizons in predictive learning interact with model biases to recover the latent geometry of the task.

Scaling and renormalization in high-dimensional regression

stat.ML · 2024-05-01 · unverdicted · novelty 6.0

Ridge regression in high dimensions exhibits power-law scalings because covariance fluctuations renormalize the ridge parameter, allowing closed-form error expressions and bias-variance decompositions for random feature models via free probability.

Sharpness-Aware Minimization for Efficiently Improving Generalization

cs.LG · 2020-10-03 · conditional · novelty 6.0

SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.

Bayesian Inference with Shaped Deep Non-linear MLPs

math.ST · 2026-05-29 · unverdicted · novelty 5.0

In the LP/N = Θ(1) regime, Bayesian predictive posteriors for deep MLPs equal those of data-dependent kernels to first order, with a criterion identifying data processes that benefit from larger effective depth.

A Rigorous, Tractable Measure of Model Complexity

stat.ML · 2026-05-20 · unverdicted · novelty 5.0

A gradient-similarity complexity measure that generalizes polynomial degree, kernel length scale, neighbor count, tree splits, and forest size while offering insights into double descent.

Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints

cs.MA · 2026-05-20 · unverdicted · novelty 5.0

SLIM decouples inter-agent communication from policy execution in MARL via a dedicated pathway and a normalized bandwidth budget β, yielding robust performance under tight communication limits on standard benchmarks.

citing papers explorer

Showing 46 of 46 citing papers.

When Does $\ell_2$-Boosting Overfit Benignly? High-Dimensional Risk Asymptotics and the $\ell_1$ Implicit Bias cs.LG · 2026-05-07 · unverdicted · none · ref 35 · 2 links · internal anchor
ℓ₂-Boosting exhibits benign overfitting with logarithmic excess variance decay Θ(σ²/log(p/n)) under isotropic noise due to ℓ₁ bias, and a subdifferential early stopping rule recovers minimax-optimal ℓ₁ rates.
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets cs.LG · 2022-01-06 · unverdicted · none · ref 19 · internal anchor
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
Fin-Bias: Comprehensive Evaluation for LLM Decision-Making under human bias in Finance Domain cs.CL · 2026-05-09 · unverdicted · none · ref 36 · internal anchor
LLMs copy biased analyst ratings in investment decisions but a new detection method encourages independent reasoning and can improve stock return predictions beyond human levels.
Stochastic Trust-Region Methods for Over-parameterized Models math.OC · 2026-04-15 · unverdicted · none · ref 41 · internal anchor
Stochastic trust-region methods achieve O(ε^{-2} log(1/ε)) complexity for unconstrained problems and O(ε^{-4} log(1/ε)) for equality-constrained problems under the strong growth condition, with experiments showing stable performance comparable to tuned baselines without learning-rate scheduling.
Stochastic Modified Equations for Stochastic Gradient Descent in Infinite-Dimensional Hilbert Spaces math.OC · 2026-04-12 · unverdicted · none · ref 29 · internal anchor
SGD dynamics in Hilbert spaces are approximated by an SDE with cylindrical noise, with the weak error between discrete and continuous versions shown to be second order in the step size.
Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike cs.CL · 2026-03-16 · unverdicted · none · ref 16 · internal anchor
IQA is a pragmatically difficult task where multilingual models achieve low performance and overfit severely, even for English, and GPT-4o-mini cannot generate high-quality training data for it.
ANCHOR: LLM-driven Subject Conditioning for Text-to-Image Synthesis cs.CV · 2024-04-15 · unverdicted · none · ref 18 · internal anchor
ANCHOR dataset exposes T2I model weaknesses on multi-subject abstractive captions; SAFE uses LLMs for subject extraction and embedding enhancement to improve consistency.
Weight-space symmetry in deep networks gives rise to permutation saddles, connected by equal-loss valleys across the loss landscape cs.LG · 2019-07-05 · conditional · none · ref 14 · internal anchor
Permutation symmetries generate permutation saddles and equal-loss valleys linking equivalent global minima, yielding a lower bound on symmetry-induced critical points.
Deep Learning Scaling is Predictable, Empirically cs.LG · 2017-12-01 · unverdicted · none · ref 11 · internal anchor
Deep learning generalization error follows power-law scaling with training set size across multiple domains, with model size scaling sublinearly with data size.
Understanding intermediate layers using linear classifier probes stat.ML · 2016-10-05 · accept · none · ref 28 · internal anchor
Linear probes demonstrate that feature separability for classification increases monotonically with network depth in Inception v3 and ResNet-50.
An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration cs.LG · 2026-05-18 · unverdicted · none · ref 47 · internal anchor
Controlled experiments on MNIST show human soft-labels act as a regularizer that improves calibration on hard samples and aligns model uncertainty with humans, beyond accuracy gains from correcting mislabels.
Rescaled Asynchronous SGD: Optimal Distributed Optimization under Data and System Heterogeneity cs.LG · 2026-05-13 · unverdicted · none · ref 22 · internal anchor
Rescaled ASGD recovers convergence to the true global objective by rescaling worker stepsizes proportional to computation times, matching the known time lower bound in the leading term under non-convex smoothness and bounded heterogeneity.
ASD-Bench: A Four-Axis Comprehensive Benchmark of AI Models for Autism Spectrum Disorder cs.LG · 2026-05-11 · unverdicted · none · ref 15 · internal anchor
ASD-Bench evaluates 17 ML and deep learning models on 4,068 AQ-10 records across child, adolescent, and adult cohorts, showing high adult performance, harder adolescent classification, shifting feature importance, and dissociation between accuracy and calibration.
Optimizer-Induced Mode Connectivity: From AdamW to Muon cs.AI · 2026-05-11 · unverdicted · none · ref 146 · internal anchor
Optimizer choice induces distinct connected regions in the loss landscape of two-layer ReLU networks, with AdamW and Muon sometimes separated by provable barriers.
The Propagation Field: A Geometric Substrate Theory of Deep Learning cs.LG · 2026-05-08 · unverdicted · none · ref 3 · internal anchor
Neural networks possess a propagation field of trajectories and Jacobians whose quality can be measured and optimized independently of endpoint loss, yielding better unseen-path generalization and reduced forgetting in continual learning.
Distributional simplicity bias and effective convexity in Energy Based Models cs.LG · 2026-05-08 · unverdicted · none · ref 22 · internal anchor
Gradient flow in energy-based models for strictly positive binary distributions produces stable data-consistent fixed points and a learning hierarchy that favors lower-order interactions first, mechanistically explaining distributional simplicity bias.
Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less cs.LG · 2026-05-07 · unverdicted · none · ref 45 · internal anchor
Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.
Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization cs.CV · 2026-04-27 · unverdicted · none · ref 14 · internal anchor
Semi-DPO applies semi-supervised learning to noisy preference data in diffusion DPO by training first on consensus pairs then iteratively pseudo-labeling conflicts, yielding state-of-the-art alignment with complex human preferences.
Prediction horizon shapes representations in predictive learning cs.LG · 2025-11-12 · unverdicted · none · ref 6 · internal anchor
Longer prediction horizons in predictive learning interact with model biases to recover the latent geometry of the task.
Scaling and renormalization in high-dimensional regression stat.ML · 2024-05-01 · unverdicted · none · ref 25 · internal anchor
Ridge regression in high dimensions exhibits power-law scalings because covariance fluctuations renormalize the ridge parameter, allowing closed-form error expressions and bias-variance decompositions for random feature models via free probability.
Sharpness-Aware Minimization for Efficiently Improving Generalization cs.LG · 2020-10-03 · conditional · none · ref 48 · internal anchor
SAM solves a min-max problem to locate flat low-loss regions, improving generalization on CIFAR, ImageNet and label-noise tasks.
Bayesian Inference with Shaped Deep Non-linear MLPs math.ST · 2026-05-29 · unverdicted · none · ref 23 · internal anchor
In the LP/N = Θ(1) regime, Bayesian predictive posteriors for deep MLPs equal those of data-dependent kernels to first order, with a criterion identifying data processes that benefit from larger effective depth.
A Rigorous, Tractable Measure of Model Complexity stat.ML · 2026-05-20 · unverdicted · none · ref 44 · internal anchor
A gradient-similarity complexity measure that generalizes polynomial degree, kernel length scale, neighbor count, tree splits, and forest size while offering insights into double descent.
Decoupling Communication from Policy: Robust MARL under Bandwidth Constraints cs.MA · 2026-05-20 · unverdicted · none · ref 46 · internal anchor
SLIM decouples inter-agent communication from policy execution in MARL via a dedicated pathway and a normalized bandwidth budget β, yielding robust performance under tight communication limits on standard benchmarks.
Axiomatizing Neural Networks via Pursuit of Subspaces cs.LG · 2026-05-19 · unverdicted · none · ref 77 · internal anchor
Authors introduce the Pursuit of Subspaces (PoS) hypothesis, an axiomatic geometric framework that unifies explanations for representation, computation, and generalization in shallow and deep neural networks.
DP-LAC: Lightweight Adaptive Clipping for Differentially Private Federated Fine-tuning of Language Models cs.LG · 2026-05-11 · unverdicted · none · ref 16 · internal anchor
DP-LAC provides a new adaptive clipping technique for DP-SGD in federated LLM fine-tuning that improves accuracy by 6.6% on average without consuming additional privacy budget or requiring new hyperparameters.
Trust, but Verify: Peeling Low-Bit Transformer Networks for Training Monitoring cs.LG · 2026-05-04 · unverdicted · none · ref 6 · internal anchor
A layer-wise peeling framework creates reference bounds to diagnose under-optimized layers in trained decoder-only transformers, including low-bit and quantized versions.
WRF4CIR: Weight-Regularized Fine-Tuning Network for Composed Image Retrieval cs.CV · 2026-04-07 · unverdicted · none · ref 77 · internal anchor
WRF4CIR uses weight-regularized fine-tuning with adversarial perturbations to mitigate overfitting in composed image retrieval and narrows the generalization gap on benchmarks.
Spectral methods: crucial for machine learning, natural for quantum computers? quant-ph · 2026-03-25 · unverdicted · none · ref 67 · internal anchor
Quantum computers may enable more natural manipulation of Fourier spectra in ML models via the Quantum Fourier Transform, potentially leading to resource-efficient spectral methods.
The Platonic Representation Hypothesis cs.LG · 2024-05-13 · unverdicted · none · ref 169 · internal anchor
Representations learned by large AI models are converging toward a shared statistical model of reality.
On the Role of Geometry in Geo-Localization cs.CV · 2019-06-26 · unverdicted · none · ref 33 · internal anchor
CNNs recover camera pose from lean geometric images of a city by learning its geometry rather than memorizing textures.
A Survey on Data-Dependent Worst-Case Generalization Bounds stat.ML · 2026-05-13 · unverdicted · none · ref 12 · internal anchor
The survey unifies extensions of PAC-Bayesian theory to data-dependent sets, geometric and topological complexity measures of optimization trajectories, and stability replacements for information terms into one template inequality with comparative evaluation.
A Gesture-Based Visual Learning Model for Acoustophoretic Interactions using a Swarm of AcoustoBots cs.RO · 2026-04-21 · unverdicted · none · ref 13 · internal anchor
OpenCLIP-based gesture classification with linear probing controls AcoustoBot swarms at 87.8% accuracy and 3.95 s latency in controlled tests.
Online Learning-Enhanced High Order Adaptive Safety Control cs.RO · 2025-11-24 · unverdicted · none · ref 17 · internal anchor
An online learning-enhanced high-order adaptive CBF with Neural ODEs maintains safety for a 38g nano quadrotor against 18km/h wind by adapting to time-varying perturbations on the fly.
DNNs, Dataset Statistics, and Correlation Functions physics.hist-ph · 2025-11-18 · unverdicted · none · ref 31 · internal anchor
DNNs succeed by capturing high-order correlation structures in datasets, similar to mesoscale methods in physics.
Sharpness-Aware Minimization with Z-Score Gradient Filtering cs.LG · 2025-05-05 · unverdicted · none · ref 46 · internal anchor
Z-Score Filtered SAM retains only high absolute Z-score gradient components per layer during the ascent step and reports higher test accuracy than standard SAM on CIFAR and Tiny-ImageNet benchmarks.
Single-bit-per-weight deep convolutional neural networks without batch-normalization layers for embedded systems cs.LG · 2019-07-16 · unverdicted · none · ref 14 · internal anchor
Experiments show that shifted-ReLU layers can replace batch-normalization in single-bit-weight wide residual networks on CIFAR-10/100 and ImageNet without consistent accuracy penalty.
Mean Spectral Normalization of Deep Neural Networks for Embedded Automation cs.LG · 2019-07-09 · unverdicted · none · ref 14 · internal anchor
Proposes MSN reparameterization to address mean-drift in SN, claiming ~16% faster inference than BN with fewer parameters on CNNs and GANs.
Further advantages of data augmentation on convolutional neural networks cs.CV · 2019-06-26 · unverdicted · none · ref 34 · internal anchor
Data augmentation enables CNNs to adapt to varying architectures and data amounts without hyperparameter fine-tuning, unlike weight decay and dropout.
Machine Learning Approaches for Improved Scalability of Metallic Magnetic Calorimeters physics.ins-det · 2026-06-23 · unverdicted · none · ref 63 · internal anchor
Machine learning methods are explored for pulse classification, artifact rejection, and shape analysis in metallic magnetic calorimeters to improve scalability over traditional signal processing.
Benefits of Low-Cost Bio-Inspiration in the Age of Overparametrization cs.RO · 2026-04-22 · unverdicted · none · ref 45 · internal anchor
Shallow MLPs and dense CPGs outperform deeper MLPs and Actor-Critic RL in bounded robot control tasks with limited proprioception, with a Parameter Impact metric indicating extra RL parameters yield no performance gain over evolutionary strategies.
Enhancing Computer Vision Model Generalization in Warehouse Facilities: A Case Study on Anomaly Detection in Vertical Material Handling Systems cs.CV · 2026-05-29 · unverdicted · none · ref 5 · internal anchor
Lab-only training of CV models for fork anomaly detection generalizes to warehouses via camera optimization, triggering strategy, model choice, and ensembling.
Unlearning with Asymmetric Sources: Improved Unlearning-Utility Trade-off with Public Data cs.LG · 2026-05-11 · unreviewed · ref 212 · internal anchor
Misspecified Universal Learning cs.IT · 2026-05-11 · unreviewed · ref 7 · internal anchor
Adversarial Robustness of NTK Neural Networks stat.ML · 2026-04-28 · unreviewed · ref 16 · internal anchor
Nexus: Same Pretraining Loss, Better Downstream Generalization via Common Minima cs.LG · 2026-04-10 · unreviewed · ref 48 · internal anchor
