pith. sign in

arxiv: 2606.06772 · v1 · pith:B5E63NMUnew · submitted 2026-06-04 · 📊 stat.ML · cs.AI· cs.LG

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

Pith reviewed 2026-06-27 23:02 UTC · model grok-4.3

classification 📊 stat.ML cs.AIcs.LG
keywords generalizationdeep neural networksgradient descentstochastic gradient descentminimax ratesneural tangent kerneloverparameterized networksexcess risk
0
0 comments X

The pith

Overparameterized DNNs trained by gradient descent or SGD achieve the first minimax-optimal excess risk rates matching kernels when width grows polynomially with sample size.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first shows that the training trajectory of a deep network with smooth activations under gradient-based updates is equivalent to the dynamics of a kernel method. It then uses this link to transfer known optimal rates from kernels and obtain matching minimax rates for the excess population risk of both full gradient descent and stochastic gradient descent. The rates require only that network width scale as a polynomial of the number of training samples. A sympathetic reader cares because the result supplies the first rigorous guarantee that sufficiently wide neural networks trained by standard optimizers can generalize as well as the best kernel methods on regression tasks.

Core claim

By establishing that gradient-based methods on over-parameterized DNNs with smooth activations fully inherit the learning dynamics of their kernel counterparts, the authors derive the first known minimax-optimal rates for the excess population risk of both GD and SGD, under the assumption that network width scales polynomially with the sample size, showing that such DNNs achieve generalization performance comparable to kernel-based methods.

What carries the argument

The equivalence between the learning dynamics of DNNs with smooth activations trained via gradient methods and those of kernel methods, which transfers optimality from kernels to DNNs.

If this is right

  • DNNs with polynomially growing width attain the same excess-risk upper bounds as optimal kernel methods.
  • Both gradient descent and stochastic gradient descent reach these rates.
  • The results hold for regression tasks under standard smoothness assumptions on the activation.
  • The polynomial width scaling is sufficient to close the gap between neural networks and kernels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the dynamics equivalence extends to deeper or non-smooth activations, the same rates could apply to modern architectures beyond the shallow case.
  • Empirical checks on whether observed excess risk saturates the derived bounds at polynomial widths would test the practical relevance of the rates.
  • The connection suggests that other gradient-based algorithms might inherit kernel optimality once width is large enough.

Load-bearing premise

The learning dynamics of a DNN with smooth activation functions trained via gradient-based methods are equivalent to those of kernel methods.

What would settle it

An explicit construction of a smooth activation, a regression problem, and a polynomially wide network where GD or SGD produces excess risk strictly larger than the known kernel minimax rate would falsify the claim.

read the original abstract

Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper establishes a connection between the learning dynamics of over-parameterized DNNs with smooth activations trained by gradient-based methods and those of kernel methods. It then uses this connection, together with known optimality results for kernels, to derive the first minimax-optimal rates for the excess population risk of both GD and SGD, under the assumption that network width scales polynomially with sample size n.

Significance. If the quantitative error bounds in the dynamics connection hold, the result would be significant: it would supply the first explicit minimax rates for gradient-trained DNNs and demonstrate that polynomial-width networks can match the generalization performance of kernel methods. The work directly addresses a central open question in the NTK and mean-field regimes.

major comments (1)
  1. [connection result between DNN dynamics and kernel methods (foundational step)] The transfer of minimax optimality from kernels to DNNs rests on the claim that the DNN trajectory under GD/SGD is sufficiently close to the kernel dynamics. The manuscript must supply an explicit quantitative bound showing that the approximation error is o(n^{-r}) (where n^{-r} is the target excess-risk rate) under the stated polynomial-width assumption; the abstract and the description of the foundational step locate the connection as the load-bearing step, but without this o(rate) control the optimality claim does not follow.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the careful review and for identifying the need for explicit control on the approximation error. We address the single major comment below.

read point-by-point responses
  1. Referee: [connection result between DNN dynamics and kernel methods (foundational step)] The transfer of minimax optimality from kernels to DNNs rests on the claim that the DNN trajectory under GD/SGD is sufficiently close to the kernel dynamics. The manuscript must supply an explicit quantitative bound showing that the approximation error is o(n^{-r}) (where n^{-r} is the target excess-risk rate) under the stated polynomial-width assumption; the abstract and the description of the foundational step locate the connection as the load-bearing step, but without this o(rate) control the optimality claim does not follow.

    Authors: We agree that an explicit o(n^{-r}) guarantee is required for the optimality transfer to be fully rigorous. Our existing trajectory approximation results already yield polynomial decay in 1/n (under the stated width scaling), which is faster than the target kernel rates r in the regimes we consider. In the revision we will insert a short corollary (or remark) that directly combines the approximation theorem with the specific minimax rates from the kernel literature, confirming the error is o(n^{-r}). This addition clarifies the argument without altering any proofs or assumptions. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation builds on a claimed novel connection plus external kernel results.

full rationale

The paper states it first establishes (within the manuscript) a connection between DNN gradient dynamics for smooth activations and kernel methods, then transfers minimax optimality from the 'well-established' kernel literature. No load-bearing step reduces by definition, by fitting, or by self-citation chain to the target rates. The connection is presented as newly derived rather than presupposed, and kernel optimality is invoked as independent prior knowledge. No equations or sections exhibit self-definitional, fitted-input, or ansatz-smuggling patterns. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Only abstract available; the polynomial width scaling is presented as an explicit assumption required for the rates, but no further free parameters, axioms, or invented entities can be extracted.

axioms (1)
  • domain assumption network width scales polynomially with the sample size
    Explicitly required for the minimax rates to hold, as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5755 in / 1237 out tokens · 21758 ms · 2026-06-27T23:02:15.393392+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

50 extracted references · 4 linked inside Pith

  1. [1]

    A convergence theory for deep learning via over-parameterization

    Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

  2. [2]

    Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

    Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

  3. [3]

    Spectrally-normalized margin bounds for neural networks

    Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017

  4. [4]

    Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024

    Alina Braun, Michael Kohler, Sophie Langer, and Harro Walk. Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024

  5. [5]

    Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

    Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

  6. [6]

    Generalization bounds of stochastic gradient descent for wide and deep neural networks

    Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

  7. [7]

    Generalization error bounds of gradient descent for learning over-parameterized deep relu networks

    Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349–3356, 2020

  8. [8]

    Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

    Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

  9. [9]

    Learning with sgd and random features

    Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018

  10. [10]

    How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

    Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021

  11. [11]

    Cambridge University Press, 2007

    Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge University Press, 2007

  12. [12]

    Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

    Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

  13. [13]

    Analysis of the expected l 2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization.arXiv preprint arXiv:2311.14609, 2023

    Selina Drews and Michael Kohler. Analysis of the expected l 2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization.arXiv preprint arXiv:2311.14609, 2023

  14. [14]

    Selina Drews and Michael Kohler. On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent.Annals of the Institute of Statistical Mathematics, 76(3):361–391, 2024. 10

  15. [15]

    Gradient descent finds global minima of deep neural networks

    Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

  16. [16]

    Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

    Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

  17. [17]

    Size-independent sample complexity of neural networks

    Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018

  18. [18]

    Springer Science & Business Media, 2006

    L´aszl´o Gy¨orfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

  19. [19]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  20. [20]

    Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

    Arthur Jacot, Franck Gabriel, and Cl´ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

  21. [21]

    Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks

    Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

  22. [22]

    On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent.arXiv preprint arXiv:2504.03405, 2025

    Michael Kohler. On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent.arXiv preprint arXiv:2504.03405, 2025

  23. [23]

    Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

    Ilja Kuzborskij and Csaba Szepesv´ari. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

  24. [24]

    Stability and generalization analysis of gradient methods for shallow neural networks

    Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

  25. [25]

    Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

    Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

  26. [26]

    Optimal rates for generalization of gradient descent for deep relu classification

    Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, and Yiming Ying. Optimal rates for generalization of gradient descent for deep relu classification. InAdvances in Neural Information Processing Systems, volume 38, pages 28623–28663, 2025

  27. [27]

    Learning overparameterized neural networks via stochastic gradient descent on structured data

    Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018

  28. [28]

    Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

    Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

  29. [29]

    On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020

    Chaoyue Liu, Libin Zhu, and Misha Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020

  30. [30]

    Norm-based capacity control in neural networks

    Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015

  31. [31]

    Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

    Mike Nguyen and Nicole M¨ucke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

  32. [32]

    How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

    Mike Nguyen and Nicole M¨ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

  33. [33]

    Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

    Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

  34. [34]

    Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

    Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021. 11

  35. [35]

    Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

    Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

  36. [36]

    Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008

    Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008

  37. [37]

    Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

    Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017

  38. [38]

    Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel

    Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021

  39. [39]

    Springer Science & Business Media, 2008

    Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008

  40. [40]

    Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

    Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

  41. [41]

    Sharper guarantees for learning neural network classifiers with gradient methods, 2025

    Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods, 2025

  42. [42]

    Cambridge university press, 2018

    Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

  43. [43]

    Cambridge university press, 2019

    Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019

  44. [44]

    Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

    Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

  45. [45]

    Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

    Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

  46. [46]

    Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

    Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

  47. [47]

    Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94):1–83, 2024

  48. [48]

    Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005

    Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005

  49. [49]

    Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

    Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018. 12 Appendix for “Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods” A Problem Reformulation We first introduce some useful notations and express...

  50. [50]

    Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx

    Further, their Assumption 3 is guaranteed by Assumption 2, their equation (3) holds true with κ2 =∥K∥ ∞ due to ⟨Kx, Kx′⟩HK =K(x,x ′)≤ ∥K∥ ∞. Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx. In addition, by taking the step-size ηk =η for all k∈[T] , we know Sρνk+1 − Sρµk+1 in [28] is equivalent to our...