pith. sign in

arxiv: 2606.06764 · v1 · pith:VYYDH6T3new · submitted 2026-06-04 · 📊 stat.ML · cs.AI· cs.LG

Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks

Pith reviewed 2026-06-27 23:04 UTC · model grok-4.3

classification 📊 stat.ML cs.AIcs.LG
keywords deep neural networksReLU activationgradient descentstochastic gradient descentgeneralization analysisminimax ratesexcess riskoverparameterized networks
0
0 comments X

The pith

Gradient descent on deep ReLU networks achieves minimax-optimal generalization rates under polynomial width scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes minimax-optimal rates for the excess population risk of gradient descent and stochastic gradient descent applied to deep ReLU networks. The key condition is that the network width grows polynomially in both the depth and the number of training samples. Under this condition the rates match the best known rates for kernel methods on regression tasks. Prior analyses were mostly restricted to shallow networks, so this work fills the gap for deeper architectures. The results show that overparameterized deep networks trained by gradient methods can generalize optimally without explicit kernel computations.

Core claim

We establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

What carries the argument

Polynomial width scaling with depth and sample size, which allows the derivation of tight generalization bounds matching kernel methods for deep ReLU networks.

Load-bearing premise

The network width scales polynomially with respect to the network depth and training sample size.

What would settle it

Finding a deep ReLU network trained by GD or SGD where the width does not scale polynomially but still achieves the claimed minimax rates, or conversely where polynomial scaling fails to yield optimal rates, would test the claim.

read the original abstract

Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks in the NTK regime. This is achieved under the assumption that network width scales polynomially with network depth and training sample size, allowing the methods to match the generalization performance of kernel methods.

Significance. If the derivations hold, the result would be significant as it extends existing NTK generalization theory from shallow to deep networks, providing a theoretical basis for why sufficiently wide deep ReLU networks can achieve optimal rates. The work highlights the role of width scaling in controlling approximation and optimization errors.

major comments (2)
  1. [Abstract] Abstract, paragraph on main results: the minimax-optimal rates are explicitly conditioned on the network width scaling polynomially with depth and sample size. This assumption is load-bearing for the central claim, as it is required for the NTK approximation error and optimization dynamics to yield rates on par with kernel methods; without it the optimality does not follow from the stated analysis.
  2. [Abstract] Abstract: the rates and optimality claim are stated without derivation steps, error bounds, or proof sketches. The full manuscript must supply these to allow verification that the rates are indeed minimax optimal and not post-hoc.
minor comments (1)
  1. The abstract should state the precise form of the excess risk rate (e.g., dependence on n, depth, width) to make the optimality claim more concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for raising these points about the abstract. We address each major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract, paragraph on main results: the minimax-optimal rates are explicitly conditioned on the network width scaling polynomially with depth and sample size. This assumption is load-bearing for the central claim, as it is required for the NTK approximation error and optimization dynamics to yield rates on par with kernel methods; without it the optimality does not follow from the stated analysis.

    Authors: The abstract already states the results under the explicit assumption of polynomial width scaling with depth and sample size. This scaling is necessary to ensure the NTK approximation holds with controlled error and that optimization dynamics achieve the desired rates; the paper does not claim optimality without it. The main text (Section 2 and Theorems 3.1, 4.1) derives the precise polynomial exponents needed to balance approximation, optimization, and generalization errors, showing they match kernel minimax rates under this condition. revision: no

  2. Referee: [Abstract] Abstract: the rates and optimality claim are stated without derivation steps, error bounds, or proof sketches. The full manuscript must supply these to allow verification that the rates are indeed minimax optimal and not post-hoc.

    Authors: The abstract serves as a high-level summary. The full manuscript supplies the required derivations: error decompositions, explicit bounds on each term, and proof sketches appear in Sections 3 and 4, with complete proofs in the appendix. Minimax optimality is verified by matching the derived upper bounds to known kernel lower bounds once the width condition holds. revision: no

Circularity Check

0 steps flagged

No significant circularity; rates derived from NTK analysis under explicit width assumption

full rationale

The paper presents minimax-optimal excess risk rates for GD/SGD on deep ReLU networks as derived results within the NTK regime, conditioned on the polynomial width scaling assumption (abstract). This scaling is stated as a prerequisite for the analysis to hold and is not obtained by fitting or self-definition from the target rates. No quoted steps reduce a claimed prediction to an input by construction, nor do self-citations form a load-bearing chain that substitutes for independent derivation. The result is positioned as extending prior NTK work to deep nets with sufficient width, remaining self-contained against external kernel-method benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the NTK regime for overparameterized networks and the polynomial width scaling; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Neural tangent kernel regime holds for the deep ReLU networks under study
    Abstract opens by referencing recent progress in the NTK regime and builds the deep-network result on that foundation.
  • standard math Standard minimax lower bounds for kernel methods apply as comparison baseline
    The optimality claim is relative to kernel-method rates already established in the literature.

pith-pipeline@v0.9.1-grok · 5684 in / 1340 out tokens · 19728 ms · 2026-06-27T23:04:21.173351+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 4 linked inside Pith

  1. [1]

    A convergence theory for deep learning via over- parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

  2. [2]

    Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

    Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

  3. [3]

    Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017

    Francis Bach. Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017

  4. [4]

    Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

  5. [5]

    Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017

  6. [6]

    Deep equals shallow for relu networks in kernel regimes

    Alberto Bietti and Francis Bach. Deep equals shallow for relu networks in kernel regimes. InInternational Conference on Learning Representations, 2021

  7. [7]

    On the inductive bias of neural tangent kernels

    Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. InAdvances in Neural Information Processing Systems, volume 32, 2019. 36

  8. [8]

    Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989

    Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989

  9. [9]

    Sgd learns over-parameterized net- works that provably generalize on linearly separable data

    Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized net- works that provably generalize on linearly separable data. InInternational Conference on Learning Representa- tions, 2018

  10. [10]

    Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

    Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

  11. [11]

    Generalization bounds of stochastic gradient descent for wide and deep neural networks

    Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

  12. [12]

    Generalization error bounds of gradient descent for learning over-parameterized deep relu networks

    Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349– 3356, 2020

  13. [13]

    Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

    Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

  14. [14]

    Learning with sgd and random features

    Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018

  15. [15]

    How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021

    Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021

  16. [16]

    On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002

    Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002

  17. [17]

    Cambridge Uni- versity Press, 2007

    Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge Uni- versity Press, 2007

  18. [18]

    Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

    Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

  19. [19]

    Gradient descent finds global minima of deep neural networks

    Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

  20. [20]

    Gradient descent provably optimizes over- parameterized neural networks

    Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over- parameterized neural networks. InInternational Conference on Learning Representations, 2018

  21. [21]

    Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

    Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

  22. [22]

    Size-independent sample complexity of neural networks

    Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018

  23. [23]

    Springer Science & Business Media, 2006

    László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

  24. [24]

    Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012

    Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012

  25. [25]

    Regularization matters: A nonparametric perspective on overparametrized neural network

    Tianyang Hu, Wenjia Wang, Cong Lin, and Guang Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. InInternational Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021

  26. [26]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

  27. [27]

    Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks

    Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

  28. [28]

    Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017

  29. [29]

    Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

    Ilja Kuzborskij and Csaba Szepesvári. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

  30. [30]

    Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023

    Jianfa Lai, Manyun Xu, Rui Chen, and Qian Lin. Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023

  31. [31]

    Stability and generalization analysis of gradient methods for shallow neural networks

    Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

  32. [32]

    Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

    Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

  33. [33]

    Learning overparameterized neural networks via stochastic gradient descent on structured data

    Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018

  34. [34]

    Optimal learning for multi-pass stochastic gradient methods

    Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. InAdvances in Neural Information Processing Systems, pages 4556–4564, 2016

  35. [35]

    Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

    Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

  36. [36]

    Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017

    Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017

  37. [37]

    Norm-based capacity control in neural networks

    Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015

  38. [38]

    Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

    Mike Nguyen and Nicole Mücke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

  39. [39]

    How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

    Mike Nguyen and Nicole Mücke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

  40. [40]

    Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

    Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021

  41. [41]

    Gradient descent can learn less over-parameterized two- layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

    Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two- layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

  42. [42]

    Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

    Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

  43. [43]

    A spectral analysis of dot-product kernels

    Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. InInternational Conference on Artificial Intelligence and Statistics, pages 3394–3402. PMLR, 2021

  44. [44]

    Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

    David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

  45. [45]

    Springer Science & Business Media, 2008

    Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008

  46. [46]

    On learning over-parameterized neural networks: A functional approximation per- spective

    Lili Su and Pengkun Yang. On learning over-parameterized neural networks: A functional approximation per- spective. InAdvances in Neural Information Processing Systems, volume 32, 2019

  47. [47]

    Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

    Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

  48. [48]

    Sharper guarantees for learning neural network classifiers with gradient methods

    Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025. 38

  49. [49]

    Cambridge university press, 2018

    Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

  50. [50]

    Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

    Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

  51. [51]

    Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

    Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

  52. [52]

    Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

    Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

  53. [53]

    Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94): 1–83, 2024

  54. [54]

    On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007

    Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007

  55. [55]

    Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008

    Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008

  56. [56]

    Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006

    Yiming Ying and Ding-Xuan Zhou. Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006

  57. [57]

    Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021

  58. [58]

    Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005

    Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005

  59. [59]

    Stochastic gradient descent optimizes over- parameterized deep relu networks

    Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over- parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018

  60. [60]

    Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020

    Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 39