A Convergence Theory for Deep Learning via Over-Parameterization

Yuanzhi Li; Zeyuan Allen-Zhu; Zhao Song

arxiv: 1811.03962 · v5 · pith:CKEBGIFRnew · submitted 2018-11-09 · 💻 cs.LG · cs.DS· cs.NE· math.OC· stat.ML

A Convergence Theory for Deep Learning via Over-Parameterization

Zeyuan Allen-Zhu , Yuanzhi Li , Zhao Song This is my paper

classification 💻 cs.LG cs.DScs.NEmath.OCstat.ML

keywords networksneuralpolynomialtheorynetworktextittrainingapplies

0 comments

read the original abstract

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 7 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning
math.ST 2026-05 unverdicted novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
Limitations of Lazy Training of Two-layers Neural Networks
stat.ML 2019-06 unverdicted novelty 8.0

For quadratic targets in d dimensions, two-layer quadratic networks achieve lower risk when fully trained than in random features or neural tangent regimes if hidden units < d.
Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
cs.LG 2025-09 unverdicted novelty 7.0

Introduces alignment-sensitive effective span dimension (ESD) for learned-kernel spectral algorithms and proves minimax excess risk bounds of order sigma^2 * ESD, with gradient flow shown to reduce ESD.
LoRA: Low-Rank Adaptation of Large Language Models
cs.CL 2021-06 accept novelty 7.0

Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
ID3 Learns Juntas for Smoothed Product Distributions
cs.LG 2019-06 unverdicted novelty 6.0

ID3 learns log n-juntas in polynomial time under the smoothed analysis model for product distributions.
On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains
stat.ML 2023-05 unverdicted novelty 5.0

A method is given to determine eigenvalue decay rates of NTK and related kernels on general domains, leading to minimax optimality results for wide neural networks under smoothness assumptions on the target function.
Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation
cs.LG 2021-02 unverdicted novelty 4.0

Batch gradient descent achieves linear convergence to zero MSE with high probability for sufficiently wide shallow NNs with non-affine piecewise affine activations and distinct inputs.