Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Junyu Zhou; Marius Kloft; Puyu Wang; Yiming Ying; Yunwen Lei

arxiv: 2606.06772 · v1 · pith:B5E63NMUnew · submitted 2026-06-04 · 📊 stat.ML · cs.AI· cs.LG

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Junyu Zhou , Puyu Wang , Yunwen Lei , Marius Kloft , Yiming Ying This is my paper

Pith reviewed 2026-06-27 23:02 UTC · model grok-4.3

classification 📊 stat.ML cs.AIcs.LG

keywords generalizationdeep neural networksgradient descentstochastic gradient descentminimax ratesneural tangent kerneloverparameterized networksexcess risk

0 comments

The pith

Overparameterized DNNs trained by gradient descent or SGD achieve the first minimax-optimal excess risk rates matching kernels when width grows polynomially with sample size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first shows that the training trajectory of a deep network with smooth activations under gradient-based updates is equivalent to the dynamics of a kernel method. It then uses this link to transfer known optimal rates from kernels and obtain matching minimax rates for the excess population risk of both full gradient descent and stochastic gradient descent. The rates require only that network width scale as a polynomial of the number of training samples. A sympathetic reader cares because the result supplies the first rigorous guarantee that sufficiently wide neural networks trained by standard optimizers can generalize as well as the best kernel methods on regression tasks.

Core claim

By establishing that gradient-based methods on over-parameterized DNNs with smooth activations fully inherit the learning dynamics of their kernel counterparts, the authors derive the first known minimax-optimal rates for the excess population risk of both GD and SGD, under the assumption that network width scales polynomially with the sample size, showing that such DNNs achieve generalization performance comparable to kernel-based methods.

What carries the argument

The equivalence between the learning dynamics of DNNs with smooth activations trained via gradient methods and those of kernel methods, which transfers optimality from kernels to DNNs.

If this is right

DNNs with polynomially growing width attain the same excess-risk upper bounds as optimal kernel methods.
Both gradient descent and stochastic gradient descent reach these rates.
The results hold for regression tasks under standard smoothness assumptions on the activation.
The polynomial width scaling is sufficient to close the gap between neural networks and kernels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the dynamics equivalence extends to deeper or non-smooth activations, the same rates could apply to modern architectures beyond the shallow case.
Empirical checks on whether observed excess risk saturates the derived bounds at polynomial widths would test the practical relevance of the rates.
The connection suggests that other gradient-based algorithms might inherit kernel optimality once width is large enough.

Load-bearing premise

The learning dynamics of a DNN with smooth activation functions trained via gradient-based methods are equivalent to those of kernel methods.

What would settle it

An explicit construction of a smooth activation, a regression problem, and a polynomially wide network where GD or SGD produces excess risk strictly larger than the known kernel minimax rate would falsify the claim.

read the original abstract

Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims first minimax rates for deep nets under GD/SGD via a kernel dynamics link, but the approximation error at polynomial width is the load-bearing step that needs verification.

read the letter

The abstract states they connect the gradient dynamics of over-parameterized DNNs with smooth activations to kernel methods, then use that to transfer known kernel optimality and obtain the first minimax rates for both GD and SGD when width grows polynomially in sample size.

What is new is the extension of NTK-style analysis past shallow nets to deeper architectures while targeting exact optimality rather than just consistency. The abstract is explicit that this is the first such result and that the connection is the foundational step.

The soft spot is exactly where the stress-test note lands. Equivalence results between DNN trajectories and kernels are typically derived in the infinite-width limit or with error terms that decay in width. At only polynomial width the residual must be shown o of the target excess-risk rate; otherwise the transferred minimax claim does not follow. The abstract presents the connection as established but gives no quantitative bound here, so it is impossible to tell whether that gap is closed.

This paper is for people working on statistical generalization bounds for over-parameterized models. A reader who follows NTK developments and wants to see whether optimality carries to deep nets would get value from the proofs if they hold.

It deserves a serious referee to check the error analysis in the dynamics connection. I would send it to peer review.

Referee Report

1 major / 0 minor

Summary. The paper establishes a connection between the learning dynamics of over-parameterized DNNs with smooth activations trained by gradient-based methods and those of kernel methods. It then uses this connection, together with known optimality results for kernels, to derive the first minimax-optimal rates for the excess population risk of both GD and SGD, under the assumption that network width scales polynomially with sample size n.

Significance. If the quantitative error bounds in the dynamics connection hold, the result would be significant: it would supply the first explicit minimax rates for gradient-trained DNNs and demonstrate that polynomial-width networks can match the generalization performance of kernel methods. The work directly addresses a central open question in the NTK and mean-field regimes.

major comments (1)

[connection result between DNN dynamics and kernel methods (foundational step)] The transfer of minimax optimality from kernels to DNNs rests on the claim that the DNN trajectory under GD/SGD is sufficiently close to the kernel dynamics. The manuscript must supply an explicit quantitative bound showing that the approximation error is o(n^{-r}) (where n^{-r} is the target excess-risk rate) under the stated polynomial-width assumption; the abstract and the description of the foundational step locate the connection as the load-bearing step, but without this o(rate) control the optimality claim does not follow.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the careful review and for identifying the need for explicit control on the approximation error. We address the single major comment below.

read point-by-point responses

Referee: [connection result between DNN dynamics and kernel methods (foundational step)] The transfer of minimax optimality from kernels to DNNs rests on the claim that the DNN trajectory under GD/SGD is sufficiently close to the kernel dynamics. The manuscript must supply an explicit quantitative bound showing that the approximation error is o(n^{-r}) (where n^{-r} is the target excess-risk rate) under the stated polynomial-width assumption; the abstract and the description of the foundational step locate the connection as the load-bearing step, but without this o(rate) control the optimality claim does not follow.

Authors: We agree that an explicit o(n^{-r}) guarantee is required for the optimality transfer to be fully rigorous. Our existing trajectory approximation results already yield polynomial decay in 1/n (under the stated width scaling), which is faster than the target kernel rates r in the regimes we consider. In the revision we will insert a short corollary (or remark) that directly combines the approximation theorem with the specific minimax rates from the kernel literature, confirming the error is o(n^{-r}). This addition clarifies the argument without altering any proofs or assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation builds on a claimed novel connection plus external kernel results.

full rationale

The paper states it first establishes (within the manuscript) a connection between DNN gradient dynamics for smooth activations and kernel methods, then transfers minimax optimality from the 'well-established' kernel literature. No load-bearing step reduces by definition, by fitting, or by self-citation chain to the target rates. The connection is presented as newly derived rather than presupposed, and kernel optimality is invoked as independent prior knowledge. No equations or sections exhibit self-definitional, fitted-input, or ansatz-smuggling patterns. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; the polynomial width scaling is presented as an explicit assumption required for the rates, but no further free parameters, axioms, or invented entities can be extracted.

axioms (1)

domain assumption network width scales polynomially with the sample size
Explicitly required for the minimax rates to hold, as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5755 in / 1237 out tokens · 21758 ms · 2026-06-27T23:02:15.393392+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 4 linked inside Pith

[1]

A convergence theory for deep learning via over-parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

2019
[2]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

2019
[3]

Spectrally-normalized margin bounds for neural networks

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017

2017
[4]

Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024

Alina Braun, Michael Kohler, Sophie Langer, and Harro Walk. Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024

2024
[5]

Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

arXiv 2024
[6]

Generalization bounds of stochastic gradient descent for wide and deep neural networks

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019
[7]

Generalization error bounds of gradient descent for learning over-parameterized deep relu networks

Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349–3356, 2020

2020
[8]

Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

2007
[9]

Learning with sgd and random features

Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018

2018
[10]

How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

2021
[11]

Cambridge University Press, 2007

Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge University Press, 2007

2007
[12]

Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

2016
[13]

Analysis of the expected l 2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization.arXiv preprint arXiv:2311.14609, 2023

Selina Drews and Michael Kohler. Analysis of the expected l 2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization.arXiv preprint arXiv:2311.14609, 2023

arXiv 2023
[14]

Selina Drews and Michael Kohler. On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent.Annals of the Institute of Statistical Mathematics, 76(3):361–391, 2024. 10

2024
[15]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

2019
[16]

Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

2023
[17]

Size-independent sample complexity of neural networks

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018

2018
[18]

Springer Science & Business Media, 2006

L´aszl´o Gy¨orfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

2006
[19]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[20]

Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Cl´ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

2018
[21]

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

2020
[22]

On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent.arXiv preprint arXiv:2504.03405, 2025

Michael Kohler. On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent.arXiv preprint arXiv:2504.03405, 2025

arXiv 2025
[23]

Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

Ilja Kuzborskij and Csaba Szepesv´ari. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

arXiv 2022
[24]

Stability and generalization analysis of gradient methods for shallow neural networks

Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

2022
[25]

Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

2026
[26]

Optimal rates for generalization of gradient descent for deep relu classification

Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, and Yiming Ying. Optimal rates for generalization of gradient descent for deep relu classification. InAdvances in Neural Information Processing Systems, volume 38, pages 28623–28663, 2025

2025
[27]

Learning overparameterized neural networks via stochastic gradient descent on structured data

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018

2018
[28]

Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

2017
[29]

On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020

Chaoyue Liu, Libin Zhu, and Misha Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020

2020
[30]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015

2015
[31]

Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

Mike Nguyen and Nicole M¨ucke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

arXiv 2023
[32]

How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

Mike Nguyen and Nicole M¨ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

2024
[33]

Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

arXiv 1905
[34]

Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021. 11

2021
[35]

Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

2022
[36]

Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008

2008
[37]

Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

Pith/arXiv arXiv 2017
[38]

Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel

Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021

2021
[39]

Springer Science & Business Media, 2008

Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008

2008
[40]

Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

2024
[41]

Sharper guarantees for learning neural network classifiers with gradient methods, 2025

Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods, 2025

2025
[42]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

2018
[43]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

2019
[44]

Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

2025
[45]

Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Pith/arXiv arXiv 2026
[46]

Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Pith/arXiv arXiv 2026
[47]

Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94):1–83, 2024

2024
[48]

Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005

Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005

2077
[49]

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018. 12 Appendix for “Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods” A Problem Reformulation We first introduce some useful notations and express...

Pith/arXiv arXiv 2018
[50]

Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx

Further, their Assumption 3 is guaranteed by Assumption 2, their equation (3) holds true with κ2 =∥K∥ ∞ due to ⟨Kx, Kx′⟩HK =K(x,x ′)≤ ∥K∥ ∞. Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx. In addition, by taking the step-size ηk =η for all k∈[T] , we know Sρνk+1 − Sρµk+1 in [28] is equivalent to our...

[1] [1]

A convergence theory for deep learning via over-parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

2019

[2] [2]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

2019

[3] [3]

Spectrally-normalized margin bounds for neural networks

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017

2017

[4] [4]

Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024

Alina Braun, Michael Kohler, Sophie Langer, and Harro Walk. Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024

2024

[5] [5]

Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

arXiv 2024

[6] [6]

Generalization bounds of stochastic gradient descent for wide and deep neural networks

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019

[7] [7]

Generalization error bounds of gradient descent for learning over-parameterized deep relu networks

Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349–3356, 2020

2020

[8] [8]

Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

2007

[9] [9]

Learning with sgd and random features

Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018

2018

[10] [10]

How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

2021

[11] [11]

Cambridge University Press, 2007

Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge University Press, 2007

2007

[12] [12]

Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

2016

[13] [13]

Analysis of the expected l 2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization.arXiv preprint arXiv:2311.14609, 2023

Selina Drews and Michael Kohler. Analysis of the expected l 2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization.arXiv preprint arXiv:2311.14609, 2023

arXiv 2023

[14] [14]

Selina Drews and Michael Kohler. On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent.Annals of the Institute of Statistical Mathematics, 76(3):361–391, 2024. 10

2024

[15] [15]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

2019

[16] [16]

Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

2023

[17] [17]

Size-independent sample complexity of neural networks

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018

2018

[18] [18]

Springer Science & Business Media, 2006

L´aszl´o Gy¨orfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

2006

[19] [19]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016

[20] [20]

Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Cl´ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

2018

[21] [21]

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

2020

[22] [22]

On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent.arXiv preprint arXiv:2504.03405, 2025

Michael Kohler. On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent.arXiv preprint arXiv:2504.03405, 2025

arXiv 2025

[23] [23]

Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

Ilja Kuzborskij and Csaba Szepesv´ari. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

arXiv 2022

[24] [24]

Stability and generalization analysis of gradient methods for shallow neural networks

Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

2022

[25] [25]

Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

2026

[26] [26]

Optimal rates for generalization of gradient descent for deep relu classification

Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, and Yiming Ying. Optimal rates for generalization of gradient descent for deep relu classification. InAdvances in Neural Information Processing Systems, volume 38, pages 28623–28663, 2025

2025

[27] [27]

Learning overparameterized neural networks via stochastic gradient descent on structured data

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018

2018

[28] [28]

Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

2017

[29] [29]

On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020

Chaoyue Liu, Libin Zhu, and Misha Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020

2020

[30] [30]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015

2015

[31] [31]

Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

Mike Nguyen and Nicole M¨ucke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

arXiv 2023

[32] [32]

How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

Mike Nguyen and Nicole M¨ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

2024

[33] [33]

Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

arXiv 1905

[34] [34]

Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021. 11

2021

[35] [35]

Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

2022

[36] [36]

Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008

Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008

2008

[37] [37]

Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

Pith/arXiv arXiv 2017

[38] [38]

Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel

Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021

2021

[39] [39]

Springer Science & Business Media, 2008

Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008

2008

[40] [40]

Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

2024

[41] [41]

Sharper guarantees for learning neural network classifiers with gradient methods, 2025

Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods, 2025

2025

[42] [42]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

2018

[43] [43]

Cambridge university press, 2019

Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

2019

[44] [44]

Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

2025

[45] [45]

Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Pith/arXiv arXiv 2026

[46] [46]

Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Pith/arXiv arXiv 2026

[47] [47]

Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94):1–83, 2024

2024

[48] [48]

Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005

Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005

2077

[49] [49]

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018. 12 Appendix for “Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods” A Problem Reformulation We first introduce some useful notations and express...

Pith/arXiv arXiv 2018

[50] [50]

Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx

Further, their Assumption 3 is guaranteed by Assumption 2, their equation (3) holds true with κ2 =∥K∥ ∞ due to ⟨Kx, Kx′⟩HK =K(x,x ′)≤ ∥K∥ ∞. Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx. In addition, by taking the step-size ηk =η for all k∈[T] , we know Sρνk+1 − Sρµk+1 in [28] is equivalent to our...