Generalization in Nonlinear Least Squares via Learned Feature Geometry

Ayub Kharel; Ilja Kuzborskij; Patrick Rebeschini; Yasin Abbasi-Yadkori

arxiv: 2606.08799 · v2 · pith:6HWJNTT6new · submitted 2026-06-07 · 📊 stat.ML · cs.LG

Generalization in Nonlinear Least Squares via Learned Feature Geometry

Ayub Kharel , Ilja Kuzborskij , Patrick Rebeschini , Yasin Abbasi-Yadkori This is my paper

Pith reviewed 2026-06-27 17:42 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords generalization boundsnonlinear least squaresalgorithmic stabilityJacobian Gram matrixeffective dimensionridge regularizationlocal minimizerslearned feature geometry

0 comments

The pith

Generalization error bounds for nonlinear least squares depend on the geometry of the trained Jacobian rather than parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives generalization bounds for local minimizers of ridge-regularized nonlinear least-squares problems using on-average algorithmic stability. The bounds are expressed through a data-dependent effective dimension computed from the empirical Jacobian Gram matrix evaluated at the trained parameters together with a residual-curvature term. In the linear case the curvature term disappears and the bound reduces to the classical effective dimension but taken at the solution instead of at initialization. The same effective dimension is further controlled by the covering numbers of the learned gradient features, yielding guarantees that scale with intrinsic dimension for manifold data and with counts of activation-stable regions for shallow ReLU networks. The derivation rests on the Brascamp-Lieb inequality applied to strongly log-concave noise.

Core claim

For ridge-regularized nonlinear least squares, the generalization error of any local minimizer is bounded by a term whose leading factor is the effective dimension trace((J^T J + lambda I)^{-1} J^T J) plus a residual-curvature correction, where J is the Jacobian of the model evaluated at the trained parameters; this quantity is data-dependent and reflects the geometry of the learned gradient features rather than the ambient parameter dimension.

What carries the argument

The empirical Jacobian Gram matrix at the trained parameters, which defines a data-dependent effective dimension together with the residual-curvature term.

If this is right

In the linear case the bound recovers the classical effective dimension evaluated at the trained model instead of at initialization.
For data supported on a manifold and piecewise Lipschitz Jacobians the bound scales with intrinsic rather than ambient dimension.
For one-hidden-layer ReLU networks the bound can be expressed explicitly in terms of the number of activation-stable regions.
The effective dimension itself can be upper-bounded by the covering complexity of the gradient features.
The bounds are obtained directly from first-principles stability arguments without reference to uniform convergence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training procedures that compress the rank or condition number of the trained Jacobian would be expected to improve the generalization bound.
The same stability-plus-Jacobian approach could be applied to other convex or locally convex losses beyond squared loss.
On clustered or low-dimensional data the observed compression of the Jacobian Gram matrix during training should correlate with the size of the generalization gap.

Load-bearing premise

The noise distribution must be strongly log-concave so that the Brascamp-Lieb inequality can be applied to the stability analysis.

What would settle it

Compute the proposed stability bound on a dataset where the noise is known to be strongly log-concave and check whether the observed generalization gap exceeds the bound by more than a small constant factor across multiple random seeds.

Figures

Figures reproduced from arXiv: 2606.08799 by Ayub Kharel, Ilja Kuzborskij, Patrick Rebeschini, Yasin Abbasi-Yadkori.

**Figure 1.** Figure 1: The nonlinear effective dimension is controlled by trained Jacobian geometry. Left: across noisy manifold regression tasks, dlin(G, t b − ρ) closely tracks the directly estimated nonlinear effective dimension, validating the residual-curvature reduction used in the theory. Right: on clusteredsphere data, the trained effective dimension remains below the initialization geometry as the number of displayed c… view at source ↗

**Figure 2.** Figure 2: The number of occupied activation regions is small. The left panel visualizes a twodimensional ReLU input-domain partition; only cells containing data matter for the bound. The right panel measures the same count based on the frequency. The number of occupied regions is orders of magnitude smaller compared with the parameter count p = 51,712. synth m = 4, n = 512 synth m = 4, n = 1024 synth m = 8, n = 102… view at source ↗

**Figure 3.** Figure 3: The effective-dimension stability bound upper-bounds the observed gaps. In every configuration, the trained deff bound remains above but close to the observed gap. initialization Jacobian with our more general effective dimension, which let us study the effect of feature learning and activation-region complexity that fixed-kernel analyses cannot. We also note some limitations. Our analysis focuses on fixed… view at source ↗

**Figure 4.** Figure 4: Schematic of results in Section 3. Roadmap. The goal of this appendix is to prove Proposition 8. We do so via three results whose composition gives the manifold bound on deff. First, we reduce deff to the classical effective dimension of Gb at margin λ − ρ (Theorem 17, proving Proposition 5). Second, we bound this classical effective dimension by an empirical Jacobian-feature covering number (Theorem 19, p… view at source ↗

**Figure 5.** Figure 5: Synthetic regression targets. The manifold experiments avoid intrinsic dimensions lower than 4 where results can quickly become trivial; the targets shown here use m = 4 one-dimensional cuts or m = 4 two-dimensional slices. D Experimental Details D.1 Compute details All experiments were run on a MacBook (14-inch, 2024) with an M4 Pro chip and 24GB memory. Running all experiments takes approximately 6 hours… view at source ↗

**Figure 6.** Figure 6: Training compresses the relevant Jacobian geometry. On the same m = 4 manifold regression task, the trained Jacobian Gram has a much faster spectral decay than the initialization Gram. At t ≈ 10−2 , the median effective dimension drops from 12.6 at initialization to 3.52 after training, even though p = 51,712 and n = 4096. This directly supports the theory’s use of the trained Jacobian rather than a fixed-… view at source ↗

**Figure 7.** Figure 7: Empirical Jacobian-feature cover bound. At t = 3.16 · 10−4 , the best cover value is 598, well below n = 4096 and p = 51,712, while the measured deff is 36.7. Here we see the characteristic U-shaped behaviour of the cover bound. As the radius ε increases, the covering number K(ε) decreases, while the approximation term ε 2/t increases. The turning point reflects the tradeoff between using fewer Jacobian-fe… view at source ↗

**Figure 8.** Figure 8: California Housing geometry diagnostics. The panels report trained and initialization spectra, effective dimensions, cover curves, and stability-bound diagnostics for the eight-dimensional California Housing benchmark. 2 4 6 8 10 12 Principal component 0.0 0.2 0.4 0.6 0.8 1.0 Variance fraction ambient d=12 PCA95=9 participation=5.54 TwoNN=0.28 A. Ambient and PCA Dimension Cumulative Explained variance 10 1… view at source ↗

**Figure 9.** Figure 9: UCI Wine Quality geometry diagnostics. The retained Wine Quality benchmark combines red and white wines, and the panels report spectra, effective dimensions, cover curves, and stabilitybound diagnostics for the trained Jacobian geometry. 37 [PITH_FULL_IMAGE:figures/full_fig_p037_9.png] view at source ↗

**Figure 10.** Figure 10: UCI Superconductivity geometry diagnostics. The panels report spectra, effective dimensions, cover curves, and stability-bound diagnostics for the 81-dimensional superconductivity benchmark. 10 0 10 1 10 2 Cover radius 10 3 10 4 10 5 10 6 C o v er b o u n d K( ) + 2 /t at t = 0.0 1 California Housing Input cover Jacobian-feature cover 10 1 Cover radius UCI Wine Quality 10 1 Cover radius UCI Superconductiv… view at source ↗

**Figure 11.** Figure 11: Real-data cover-bound curves. For each real-data benchmark, the plot compares inputspace covers and trained Jacobian-feature covers through the cover bound K(ε)+ε 2/t. The U-shaped curves make explicit the tradeoff in the theory: small radii require many centers, while large radii pay a larger within-ball approximation penalty. D.8 Generalization-bound numbers For [PITH_FULL_IMAGE:figures/full_fig_p038_… view at source ↗

read the original abstract

We study the generalization of ridge-regularized nonlinear least-squares models via on-average algorithmic stability, deriving error bounds for local minimizers in terms of a data-dependent effective dimension that reflects the geometry of the gradient model at the trained parameters, through the empirical Jacobian Gram matrix and a residual-curvature term. In the linear case, where the curvature term vanishes, this recovers the classical effective dimension of the Jacobian kernel covariance, but evaluated at the trained model rather than at initialization as is typical in neural tangent kernel analyses. We further bound this effective dimension via covering complexity of the gradient features, leading to guarantees that depend on learned geometry rather than parameter count. In particular, for manifold-supported data and piecewise Lipschitz Jacobians, the bounds scale with intrinsic dimension, while for one-hidden-layer ReLU networks, the mechanism can be made explicit through counts of activation-stable regions. Experiments on synthetic manifolds, clustered distributions, and benchmark datasets illustrate trained-Jacobian compression, the tightness of the residual-curvature linearization, and agreement between the stability bound and observed generalization gaps. A key feature of our bounds is the simplicity of their derivation, which follows from first principles using the Brascamp-Lieb inequality under strongly log-concave noise.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives stability bounds for nonlinear least squares that evaluate effective dimension from the trained Jacobian Gram matrix plus residual curvature, yielding intrinsic-dimension scaling under manifold assumptions, but only when noise is strongly log-concave.

read the letter

The main thing here is a derivation of on-average algorithmic stability bounds for ridge-regularized nonlinear least squares at local minimizers. The bound is expressed through an effective dimension taken from the empirical Jacobian Gram matrix at the trained point, plus a residual-curvature correction. In the linear case the curvature term drops and you recover the usual effective dimension but evaluated after training rather than at initialization.

What is new is the combination of the trained-point evaluation with a covering-complexity argument that produces explicit scaling with intrinsic dimension for manifold-supported data and for one-hidden-layer ReLU networks via activation-region counts. The experiments on synthetic manifolds, clustered data, and benchmarks are used to show Jacobian compression after training, that the linearization around the residual is reasonable, and that the numerical value of the stability bound lines up with observed gaps.

The derivation is presented as first-principles via the Brascamp-Lieb inequality. That step requires the noise to be strongly log-concave so the posterior remains strongly log-concave. This is stated up front, but it excludes several standard least-squares noise models (Laplace, uniform, or heavier tails) that are log-concave yet not strongly so. The bounds therefore hold only conditionally on that assumption, and the paper does not appear to relax it or provide matching lower bounds.

The effective dimension is computed from the trained model, so the bound is more of a post-hoc explanation than an a-priori guarantee. Whether the local-minimizer setting is the right one for overparameterized practice is left open.

This is worth sending to referees. The angle on learned geometry is distinct from standard NTK work, the experiments are on point, and the assumption is clearly flagged. A serious review can check the derivation details and see how much the strong log-concavity restricts the practical reach.

Referee Report

1 major / 2 minor

Summary. The paper derives on-average algorithmic stability bounds for generalization of ridge-regularized nonlinear least-squares models at local minimizers. The bounds are expressed via a data-dependent effective dimension constructed from the empirical Jacobian Gram matrix plus a residual-curvature term evaluated at the trained parameters. The derivation proceeds from first principles via the Brascamp-Lieb inequality under strongly log-concave noise; the linear case recovers the classical Jacobian-kernel effective dimension (now at the trained point), while nonlinear cases are further bounded via covering numbers of the gradient features. Experiments on synthetic manifolds, clustered data, and benchmarks illustrate Jacobian compression, residual-curvature linearization tightness, and numerical agreement between the stability bound and observed gaps.

Significance. If the derivation is valid, the work supplies a simple, first-principles stability analysis that ties generalization directly to learned feature geometry rather than parameter count or initialization. The explicit recovery of classical effective-dimension results and the geometry-based scaling (intrinsic dimension for manifold data, activation-region counts for ReLU networks) are concrete strengths. The use of Brascamp-Lieb under the stated noise assumption is a clean technical route when the assumption holds.

major comments (1)

[Abstract / derivation section] Abstract and derivation (Brascamp-Lieb application): the central stability bound for local minimizers is obtained only when the negative log-likelihood plus ridge term induces a strongly log-concave measure, which requires the noise distribution itself to be strongly log-concave. This excludes standard least-squares noise models (Laplace, uniform, or any log-concave but not strongly log-concave density) for which the inequality does not deliver the claimed finite, data-dependent bound; the guarantees are therefore conditional on an assumption that is not satisfied by many practical residual distributions.

minor comments (2)

Notation: the effective dimension is defined from the trained Jacobian Gram matrix; a short remark clarifying that this quantity is computed post-training (and is therefore not available before optimization) would help readers distinguish it from NTK-style quantities evaluated at initialization.
Experiments: the synthetic-manifold and ReLU-network sections would benefit from an explicit statement of how the residual-curvature term is estimated in practice and whether its magnitude is reported relative to the Jacobian term.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and for highlighting the scope of the noise assumption in our derivation. We address the major comment below and propose targeted revisions to improve clarity without altering the technical contribution.

read point-by-point responses

Referee: [Abstract / derivation section] Abstract and derivation (Brascamp-Lieb application): the central stability bound for local minimizers is obtained only when the negative log-likelihood plus ridge term induces a strongly log-concave measure, which requires the noise distribution itself to be strongly log-concave. This excludes standard least-squares noise models (Laplace, uniform, or any log-concave but not strongly log-concave density) for which the inequality does not deliver the claimed finite, data-dependent bound; the guarantees are therefore conditional on an assumption that is not satisfied by many practical residual distributions.

Authors: We agree that the central bound relies on strong log-concavity of the noise distribution to apply the Brascamp-Lieb inequality and obtain a finite, data-dependent stability guarantee. This assumption is stated in the abstract and derivation section. The manuscript focuses on the Gaussian case (which satisfies strong log-concavity) as the canonical model for least-squares residuals, but the referee is correct that the result does not automatically extend to merely log-concave densities such as Laplace or uniform. We will revise the abstract to foreground the assumption more explicitly and add a short paragraph in the introduction and conclusion discussing its scope, including that extensions to other log-concave noises would require different concentration tools. This does not change the validity of the existing derivation under the stated condition. revision: yes

Circularity Check

0 steps flagged

No circularity; bounds derived independently via stability and Brascamp-Lieb

full rationale

The paper derives on-average algorithmic stability bounds for local minimizers of the ridge-regularized nonlinear least-squares objective using the Brascamp-Lieb inequality applied to the posterior under strongly log-concave noise. The data-dependent effective dimension (Jacobian Gram matrix plus residual-curvature term) appears as an explicit term in the resulting generalization bound rather than being fitted to the gap or defined in terms of the bound itself. No self-citations, ansatzes, or renamings are invoked as load-bearing steps in the provided text; the derivation is stated to follow from first principles. The central claim therefore remains independent of its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the standard Brascamp-Lieb inequality applied to strongly log-concave noise and on the definition of effective dimension from the Jacobian at the trained point; no new free parameters or invented entities appear in the abstract.

axioms (1)

standard math Brascamp-Lieb inequality holds for the noise distribution
Invoked to derive the stability bounds from first principles under strongly log-concave noise.

pith-pipeline@v0.9.1-grok · 5759 in / 1335 out tokens · 27984 ms · 2026-06-27T17:42:24.253186+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

91 extracted references · 1 canonical work pages

[1]

Bartlett.Neural network learning: Theoretical foundations

Martin Anthony and Peter L. Bartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 1999

1999
[2]

On Exact Computation with an Infinitely Wide Neural Net

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On Exact Computation with an Infinitely Wide Neural Net. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019
[3]

Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees

Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees. InInternational Conference on Machine Learning (ICML), volume 70 ofProceedings of Machine Learning Research, pages 253–262. PMLR, 2017

2017
[4]

Sharp analysis of low-rank kernel matrix approximations

Francis Bach. Sharp analysis of low-rank kernel matrix approximations. InProceedings of the 26th Annual Conference on Learning Theory, volume 30 ofProceedings of Machine Learning Research, pages 185–209, Princeton, NJ, USA, 12–14 Jun 2013. PMLR

2013
[5]

Baraniuk

Randall Balestriero and Richard G. Baraniuk. A Spline Theory of Deep Learning. InProceed- ings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 374–383. PMLR, 2018

2018
[6]

Bartlett, Dylan J

Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-Normalized Margin Bounds for Neural Networks. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017
[7]

Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian

Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC- dimension and pseudodimension bounds for piecewise linear neural networks.Journal of Machine Learning Research, 20(63):1–17, 2019

2019
[8]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign Overfitting in Linear Regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

2020
[9]

Bartlett and Shahar Mendelson

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.Journal of Machine Learning Research, 3:463–482, 2002

2002
[10]

Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 4381–4391. Curran Associates, Inc., 2020

2020
[11]

Reconciling modern machine- learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine- learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

2019
[12]

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.Neural Computation, 15(6):1373–1396, 2003

Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.Neural Computation, 15(6):1373–1396, 2003

2003
[13]

Towards a Theoretical Foundation for Laplacian-Based Manifold Methods.Journal of Computer and System Sciences, 74(8):1289–1308, 2008

Mikhail Belkin and Partha Niyogi. Towards a Theoretical Foundation for Laplacian-Based Manifold Methods.Journal of Computer and System Sciences, 74(8):1289–1308, 2008

2008
[14]

Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples.Journal of Machine Learning Research, 7:2399–2434, 2006

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples.Journal of Machine Learning Research, 7:2399–2434, 2006

2006
[15]

Bickel and Bo Li

Peter J. Bickel and Bo Li. Local Polynomial Regression on Unknown Manifolds. InComplex Datasets and Inverse Problems: Tomography, Networks and Beyond, volume 54 ofIMS Lecture Notes–Monograph Series, pages 177–186. Institute of Mathematical Statistics, 2007. 10

2007
[16]

Simplicity Bias and Optimization Threshold in Two-Layer ReLU Networks

Etienne Boursier and Nicolas Flammarion. Simplicity Bias and Optimization Threshold in Two-Layer ReLU Networks. InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 5241–5275. PMLR, 13–19 Jul 2025

2025
[17]

Stability and generalization.Journal of Machine Learning Research, 2(Mar):499–526, 2002

Olivier Bousquet and André Elisseeff. Stability and generalization.Journal of Machine Learning Research, 2(Mar):499–526, 2002

2002
[18]

Herm Jan Brascamp and Elliott H Lieb. On extensions of the Brunn-Minkowski and Prékopa- Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation.Journal of Functional Analysis, 22(4):366–389, 1976

1976
[19]

Fractal structure and generalization properties of stochastic optimization algorithms.Advances in Neural Information Processing Systems, 34:18774–18788, 2021

Alexander Camuto, George Deligiannidis, Murat A Erdogdu, Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. Fractal structure and generalization properties of stochastic optimization algorithms.Advances in Neural Information Processing Systems, 34:18774–18788, 2021

2021
[20]

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Yuan Cao and Quanquan Gu. Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks. InAdvances in Neural Information Processing Systems, 2019

2019
[21]

Optimal Rates for the Regularized Least-Squares Algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto De Vito. Optimal Rates for the Regularized Least-Squares Algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

2007
[22]

Cambridge University Press, Cambridge, 1990

Bernd Carl and Irmtraud Stephani.Entropy, Compactness and the Approximation of Operators. Cambridge University Press, Cambridge, 1990

1990
[23]

Carlen, Dario Cordero-Erausquin, and Elliott H

Eric A. Carlen, Dario Cordero-Erausquin, and Elliott H. Lieb. Asymmetric covariance estimates of Brascamp-Lieb type and related inequalities for log-concave measures.Annales de l’I.H.P . Probabilités et statistiques, 49(1):1–12, 2013

2013
[24]

Stability and generalization of learning algorithms that converge to global optima

Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. InInternational Conference on Machine Learning (ICML), pages 745–754. PMLR, 2018

2018
[25]

Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric Regression on Low-Dimensional Manifolds Using Deep ReLU Networks: Function Approximation and Statistical Recovery.Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022

2022
[26]

Local Linear Regression on Manifolds and Its Geometric Interpretation.Journal of the American Statistical Association, 108(504):1421–1434, 2013

Ming-Yen Cheng and Hau-Tieng Wu. Local Linear Regression on Manifolds and Its Geometric Interpretation.Journal of the American Statistical Association, 108(504):1421–1434, 2013

2013
[27]

On Lazy Training in Differentiable Programming

Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019
[28]

Coifman and Stéphane Lafon

Ronald R. Coifman and Stéphane Lafon. Diffusion Maps.Applied and Computational Harmonic Analysis, 21(1):5–30, 2006

2006
[29]

Cerdeira, F

Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Wine Quality. UCI Machine Learning Repository, 2009. Dataset

2009
[30]

Donoho and Carrie Grimes

David L. Donoho and Carrie Grimes. Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data.Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003

2003
[31]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. volume 97 ofProceedings of Machine Learning Research, pages 1675–1685. PMLR, 09–15 Jun 2019

2019
[32]

Richard M. Dudley. The Sizes of Compact Subsets of Hilbert Space and Continuity of Gaussian Processes.Journal of Functional Analysis, 1(3):290–330, 1967

1967
[33]

Dudley.Uniform Central Limit Theorems

Richard M. Dudley.Uniform Central Limit Theorems. Cambridge University Press, Cambridge, 1999. 11

1999
[34]

Uniform gener- alization bounds on data-dependent hypothesis sets via PAC-Bayesian theory on random sets

Benjamin Dupuis, Paul Viallard, George Deligiannidis, and Umut Simsekli. Uniform gener- alization bounds on data-dependent hypothesis sets via PAC-Bayesian theory on random sets. Journal of Machine Learning Research, 25(409):1–55, 2024

2024
[35]

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Gintare Karolina Dziugaite and Daniel M Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Uncertainty in Artificial Intelligence (UAI), 2017

2017
[36]

On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I*.Pro- ceedings of the National Academy of Sciences, 35(11):652–655, 1949

Ky Fan. On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I*.Pro- ceedings of the National Academy of Sciences, 35(11):652–655, 1949

1949
[37]

Efficient Classification for Metric Data.IEEE Transactions on Information Theory, 60(9):5750–5759, 2014

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient Classification for Metric Data.IEEE Transactions on Information Theory, 60(9):5750–5759, 2014

2014
[38]

Efficient Regression in Metric Spaces via Approximate Lipschitz Extension.IEEE Transactions on Information Theory, 63(8):4838–4849, 2017

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient Regression in Metric Spaces via Approximate Lipschitz Extension.IEEE Transactions on Information Theory, 63(8):4838–4849, 2017

2017
[39]

Superconductivty Data

Kam Hamidieh. Superconductivty Data. UCI Machine Learning Repository, 2018. Dataset

2018
[40]

Complexity of Linear Regions in Deep Networks

Boris Hanin and David Rolnick. Complexity of Linear Regions in Deep Networks. volume 97 ofProceedings of Machine Learning Research, pages 2596–2604. PMLR, 09–15 Jun 2019

2019
[41]

Deep ReLU Networks Have Surprisingly Few Activation Patterns

Boris Hanin and David Rolnick. Deep ReLU Networks Have Surprisingly Few Activation Patterns. InAdvances in Neural Information Processing Systems, volume 32, pages 361–370. Curran Associates, Inc., 2019

2019
[42]

Train faster, generalize better: Stability of stochastic gradient descent

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. volume 48 ofProceedings of Machine Learning Research, pages 1225–1234, New York, New York, USA, 20–22 Jun 2016. PMLR

2016
[43]

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. InAdvances in Neural Information Processing Systems 31, pages 8571–8580, 2018

2018
[44]

Directional convergence and alignment in deep learning.Ad- vances in Neural Information Processing Systems, 2020

Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning.Ad- vances in Neural Information Processing Systems, 2020

2020
[45]

Deep Nonparametric Regression on Approximate Manifolds: Nonasymptotic Error Bounds with Polynomial Prefactors.The Annals of Statistics, 51(2):691–716, 2023

Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep Nonparametric Regression on Approximate Manifolds: Nonasymptotic Error Bounds with Polynomial Prefactors.The Annals of Statistics, 51(2):691–716, 2023

2023
[46]

Kelley Pace and Ronald Barry

R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions.Statistics & Probability Letters, 33(3):291–297, 1997

1997
[47]

Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting.Advances in Neural Information Processing Systems, 34:20657–20668, 2021

Frederic Koehler, Lijia Zhou, Danica J Sutherland, and Nathan Srebro. Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting.Advances in Neural Information Processing Systems, 34:20657–20668, 2021

2021
[48]

Kolmogorov and Vladimir M

Andrey N. Kolmogorov and Vladimir M. Tikhomirov. ϵ-Entropy and ϵ-Capacity of Sets in Functional Spaces.American Mathematical Society Translations, Series 2, 17:277–364, 1961

1961
[49]

Rademacher penalties and structural risk minimization.IEEE Transac- tions on Information Theory, 47(5):1902–1914, 2002

Vladimir Koltchinskii. Rademacher penalties and structural risk minimization.IEEE Transac- tions on Information Theory, 47(5):1902–1914, 2002

1902
[50]

Distribution-Dependent Analysis of Gibbs-ERM Principle

Ilja Kuzborskij, Nicolò Cesa-Bianchi, and Csaba Szepesvári. Distribution-Dependent Analysis of Gibbs-ERM Principle. InConference on Computational Learning Theory (COLT), volume 99, pages 2028–2054. PMLR, 2019

2028
[51]

Pointwise confidence estimation in the non-linear ℓ2-regularized least squares

Ilja Kuzborskij and Yasin Abbasi Yadkori. Pointwise confidence estimation in the non-linear ℓ2-regularized least squares. arxiv preprint 2506.07088, 2025. 12

arXiv 2025
[52]

Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019
[53]

Stability and Generalization of Stochastic Optimization with Nonconvex and Nonsmooth Problems

Yunwen Lei. Stability and Generalization of Stochastic Optimization with Nonconvex and Nonsmooth Problems. InProceedings of Thirty Sixth Conference on Learning Theory, volume 195 ofProceedings of Machine Learning Research, pages 191–227. PMLR, 2023

2023
[54]

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent

Yunwen Lei and Yiming Ying. Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 5809–5819. PMLR, 2020

2020
[55]

Elizaveta Levina and Peter J. Bickel. Maximum Likelihood Estimation of Intrinsic Dimension. InAdvances in Neural Information Processing Systems, volume 17, pages 777–784, 2004

2004
[56]

Ridgeless

Tengyuan Liang and Alexander Rakhlin. Just Interpolate: Kernel “Ridgeless” Regression Can Generalize.The Annals of Statistics, 48(3):1329–1347, 2020

2020
[57]

PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Sanae Lotfi, Marc Anton Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and Andrew Gordon Wilson. PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization. InAdvances in Neural Information Processing Systems, volume 35, 2022

2022
[58]

McAllester

David A. McAllester. Some PAC-Bayesian theorems. InConference on Computational Learning Theory (COLT), 1998

1998
[59]

Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio

Guido F. Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. InAdvances in Neural Information Processing Systems, pages 2924–2932, 2014

2014
[60]

Uniform convergence may be unable to explain generalization in deep learning.Advances in Neural Information Processing Systems, 32, 2019

Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning.Advances in Neural Information Processing Systems, 32, 2019

2019
[61]

Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality.Journal of Machine Learning Research, 21(174):1–38, 2020

Ryumei Nakada and Masaaki Imaizumi. Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality.Journal of Machine Learning Research, 21(174):1–38, 2020

2020
[62]

Iterate averaging as regularization for stochastic gradient descent

Gergely Neu and Lorenzo Rosasco. Iterate averaging as regularization for stochastic gradient descent. InConference on Computational Learning Theory (COLT), pages 3222–3242. PMLR, 2018

2018
[63]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks. InInternational Conference on Learning Representations (ICLR), 2018

2018
[64]

Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein. Sensitivity and Generalization in Neural Networks: An Empirical Study. In International Conference on Learning Representations (ICLR), 2018

2018
[65]

Using Local Complexity to Evaluate Out-of-Distribution Generalization

Grace O’Brien, Andrew Aguilar, Robert Jasper, Henry Kvinge, Sarah McGuire Scullen, and Helen Jenne. Using Local Complexity to Evaluate Out-of-Distribution Generalization. In Topology, Algebra, and Geometry in Data Science, 2025

2025
[66]

Vardan Papyan, X. Y . Han, and David L. Donoho. Prevalence of Neural Collapse During the Terminal Phase of Deep Learning Training.Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020

2020
[67]

Montúfar, and Yoshua Bengio

Razvan Pascanu, Guido F. Montúfar, and Yoshua Bengio. On the Number of Response Regions of Deep Feed Forward Networks with Piece-Wise Linear Activations. arXiv preprint 1312.6098, 2013

Pith/arXiv arXiv 2013
[68]

On the Local Complexity of Linear Regions in Deep ReLU Networks

Niket Nikul Patel and Guido Montufar. On the Local Complexity of Linear Regions in Deep ReLU Networks. InInternational Conference on Machine Learning (ICML), pages 48335– 48370, 2025. 13

2025
[69]

Generalization error bounds for noisy, iterative algorithms

Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018

2018
[70]

The inductive bias of ReLU networks on orthogonally separable data

Mary Phuong and Christoph H Lampert. The inductive bias of ReLU networks on orthogonally separable data. InInternational Conference on Learning Representations, 2021

2021
[71]

Springer, New York, 1984

David Pollard.Convergence of Stochastic Processes. Springer, New York, 1984

1984
[72]

The Intrinsic Dimension of Images and Its Impact on Learning

Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The Intrinsic Dimension of Images and Its Impact on Learning. InInternational Conference on Learning Representations (ICLR), 2021

2021
[73]

On the Expressive Power of Deep Neural Networks

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the Expressive Power of Deep Neural Networks. InInternational Conference on Machine Learning (ICML), volume 70 ofProceedings of Machine Learning Research, pages 2847–2854, 2017

2017
[74]

Roweis and Lawrence K

Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding.Science, 290(5500):2323–2326, 2000

2000
[75]

Generalization properties of learning with random features

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017
[76]

Deep ReLU Network Approximation of Functions on a Manifold

Johannes Schmidt-Hieber. Deep ReLU Network Approximation of Functions on a Manifold. arXiv preprint arXiv:1908.00695, 2019

arXiv 1908
[77]

Bounding and Counting Linear Regions of Deep Neural Networks

Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and Counting Linear Regions of Deep Neural Networks. InInternational Conference on Machine Learning (ICML), volume 80 ofProceedings of Machine Learning Research, pages 4558–4566. PMLR, 2018

2018
[78]

Shalev-Shwartz and S

S. Shalev-Shwartz and S. Ben-David.Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014

2014
[79]

Jure Sokoli´c, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Robust Large Margin Deep Neural Networks.IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017

2017
[80]

Lampert, and Marco Mondelli

Peter Súkeník, Christoph H. Lampert, and Marco Mondelli. Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers. InAdvances in Neural Information Processing Systems, 2025

2025

Showing first 80 references.

[1] [1]

Bartlett.Neural network learning: Theoretical foundations

Martin Anthony and Peter L. Bartlett.Neural network learning: Theoretical foundations. Cambridge University Press, 1999

1999

[2] [2]

On Exact Computation with an Infinitely Wide Neural Net

Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On Exact Computation with an Infinitely Wide Neural Net. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019

[3] [3]

Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees

Haim Avron, Michael Kapralov, Cameron Musco, Christopher Musco, Ameya Velingker, and Amir Zandieh. Random Fourier Features for Kernel Ridge Regression: Approximation Bounds and Statistical Guarantees. InInternational Conference on Machine Learning (ICML), volume 70 ofProceedings of Machine Learning Research, pages 253–262. PMLR, 2017

2017

[4] [4]

Sharp analysis of low-rank kernel matrix approximations

Francis Bach. Sharp analysis of low-rank kernel matrix approximations. InProceedings of the 26th Annual Conference on Learning Theory, volume 30 ofProceedings of Machine Learning Research, pages 185–209, Princeton, NJ, USA, 12–14 Jun 2013. PMLR

2013

[5] [5]

Baraniuk

Randall Balestriero and Richard G. Baraniuk. A Spline Theory of Deep Learning. InProceed- ings of the 35th International Conference on Machine Learning, volume 80 ofProceedings of Machine Learning Research, pages 374–383. PMLR, 2018

2018

[6] [6]

Bartlett, Dylan J

Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-Normalized Margin Bounds for Neural Networks. InAdvances in Neural Information Processing Systems, volume 30, 2017

2017

[7] [7]

Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian

Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC- dimension and pseudodimension bounds for piecewise linear neural networks.Journal of Machine Learning Research, 20(63):1–17, 2019

2019

[8] [8]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, Gábor Lugosi, and Alexander Tsigler. Benign Overfitting in Linear Regression.Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020

2020

[9] [9]

Bartlett and Shahar Mendelson

Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.Journal of Machine Learning Research, 3:463–482, 2002

2002

[10] [10]

Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses

Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems, volume 33, pages 4381–4391. Curran Associates, Inc., 2020

2020

[11] [11]

Reconciling modern machine- learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine- learning practice and the classical bias–variance trade-off.Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019

2019

[12] [12]

Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.Neural Computation, 15(6):1373–1396, 2003

Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation.Neural Computation, 15(6):1373–1396, 2003

2003

[13] [13]

Towards a Theoretical Foundation for Laplacian-Based Manifold Methods.Journal of Computer and System Sciences, 74(8):1289–1308, 2008

Mikhail Belkin and Partha Niyogi. Towards a Theoretical Foundation for Laplacian-Based Manifold Methods.Journal of Computer and System Sciences, 74(8):1289–1308, 2008

2008

[14] [14]

Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples.Journal of Machine Learning Research, 7:2399–2434, 2006

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples.Journal of Machine Learning Research, 7:2399–2434, 2006

2006

[15] [15]

Bickel and Bo Li

Peter J. Bickel and Bo Li. Local Polynomial Regression on Unknown Manifolds. InComplex Datasets and Inverse Problems: Tomography, Networks and Beyond, volume 54 ofIMS Lecture Notes–Monograph Series, pages 177–186. Institute of Mathematical Statistics, 2007. 10

2007

[16] [16]

Simplicity Bias and Optimization Threshold in Two-Layer ReLU Networks

Etienne Boursier and Nicolas Flammarion. Simplicity Bias and Optimization Threshold in Two-Layer ReLU Networks. InProceedings of the 42nd International Conference on Machine Learning, volume 267, pages 5241–5275. PMLR, 13–19 Jul 2025

2025

[17] [17]

Stability and generalization.Journal of Machine Learning Research, 2(Mar):499–526, 2002

Olivier Bousquet and André Elisseeff. Stability and generalization.Journal of Machine Learning Research, 2(Mar):499–526, 2002

2002

[18] [18]

Herm Jan Brascamp and Elliott H Lieb. On extensions of the Brunn-Minkowski and Prékopa- Leindler theorems, including inequalities for log concave functions, and with an application to the diffusion equation.Journal of Functional Analysis, 22(4):366–389, 1976

1976

[19] [19]

Fractal structure and generalization properties of stochastic optimization algorithms.Advances in Neural Information Processing Systems, 34:18774–18788, 2021

Alexander Camuto, George Deligiannidis, Murat A Erdogdu, Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. Fractal structure and generalization properties of stochastic optimization algorithms.Advances in Neural Information Processing Systems, 34:18774–18788, 2021

2021

[20] [20]

Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks

Yuan Cao and Quanquan Gu. Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks. InAdvances in Neural Information Processing Systems, 2019

2019

[21] [21]

Optimal Rates for the Regularized Least-Squares Algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto De Vito. Optimal Rates for the Regularized Least-Squares Algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

2007

[22] [22]

Cambridge University Press, Cambridge, 1990

Bernd Carl and Irmtraud Stephani.Entropy, Compactness and the Approximation of Operators. Cambridge University Press, Cambridge, 1990

1990

[23] [23]

Carlen, Dario Cordero-Erausquin, and Elliott H

Eric A. Carlen, Dario Cordero-Erausquin, and Elliott H. Lieb. Asymmetric covariance estimates of Brascamp-Lieb type and related inequalities for log-concave measures.Annales de l’I.H.P . Probabilités et statistiques, 49(1):1–12, 2013

2013

[24] [24]

Stability and generalization of learning algorithms that converge to global optima

Zachary Charles and Dimitris Papailiopoulos. Stability and generalization of learning algorithms that converge to global optima. InInternational Conference on Machine Learning (ICML), pages 745–754. PMLR, 2018

2018

[25] [25]

Minshuo Chen, Haoming Jiang, Wenjing Liao, and Tuo Zhao. Nonparametric Regression on Low-Dimensional Manifolds Using Deep ReLU Networks: Function Approximation and Statistical Recovery.Information and Inference: A Journal of the IMA, 11(4):1203–1253, 2022

2022

[26] [26]

Local Linear Regression on Manifolds and Its Geometric Interpretation.Journal of the American Statistical Association, 108(504):1421–1434, 2013

Ming-Yen Cheng and Hau-Tieng Wu. Local Linear Regression on Manifolds and Its Geometric Interpretation.Journal of the American Statistical Association, 108(504):1421–1434, 2013

2013

[27] [27]

On Lazy Training in Differentiable Programming

Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On Lazy Training in Differentiable Programming. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019

[28] [28]

Coifman and Stéphane Lafon

Ronald R. Coifman and Stéphane Lafon. Diffusion Maps.Applied and Computational Harmonic Analysis, 21(1):5–30, 2006

2006

[29] [29]

Cerdeira, F

Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Wine Quality. UCI Machine Learning Repository, 2009. Dataset

2009

[30] [30]

Donoho and Carrie Grimes

David L. Donoho and Carrie Grimes. Hessian Eigenmaps: Locally Linear Embedding Techniques for High-Dimensional Data.Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003

2003

[31] [31]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. volume 97 ofProceedings of Machine Learning Research, pages 1675–1685. PMLR, 09–15 Jun 2019

2019

[32] [32]

Richard M. Dudley. The Sizes of Compact Subsets of Hilbert Space and Continuity of Gaussian Processes.Journal of Functional Analysis, 1(3):290–330, 1967

1967

[33] [33]

Dudley.Uniform Central Limit Theorems

Richard M. Dudley.Uniform Central Limit Theorems. Cambridge University Press, Cambridge, 1999. 11

1999

[34] [34]

Uniform gener- alization bounds on data-dependent hypothesis sets via PAC-Bayesian theory on random sets

Benjamin Dupuis, Paul Viallard, George Deligiannidis, and Umut Simsekli. Uniform gener- alization bounds on data-dependent hypothesis sets via PAC-Bayesian theory on random sets. Journal of Machine Learning Research, 25(409):1–55, 2024

2024

[35] [35]

Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data

Gintare Karolina Dziugaite and Daniel M Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Uncertainty in Artificial Intelligence (UAI), 2017

2017

[36] [36]

On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I*.Pro- ceedings of the National Academy of Sciences, 35(11):652–655, 1949

Ky Fan. On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I*.Pro- ceedings of the National Academy of Sciences, 35(11):652–655, 1949

1949

[37] [37]

Efficient Classification for Metric Data.IEEE Transactions on Information Theory, 60(9):5750–5759, 2014

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient Classification for Metric Data.IEEE Transactions on Information Theory, 60(9):5750–5759, 2014

2014

[38] [38]

Efficient Regression in Metric Spaces via Approximate Lipschitz Extension.IEEE Transactions on Information Theory, 63(8):4838–4849, 2017

Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient Regression in Metric Spaces via Approximate Lipschitz Extension.IEEE Transactions on Information Theory, 63(8):4838–4849, 2017

2017

[39] [39]

Superconductivty Data

Kam Hamidieh. Superconductivty Data. UCI Machine Learning Repository, 2018. Dataset

2018

[40] [40]

Complexity of Linear Regions in Deep Networks

Boris Hanin and David Rolnick. Complexity of Linear Regions in Deep Networks. volume 97 ofProceedings of Machine Learning Research, pages 2596–2604. PMLR, 09–15 Jun 2019

2019

[41] [41]

Deep ReLU Networks Have Surprisingly Few Activation Patterns

Boris Hanin and David Rolnick. Deep ReLU Networks Have Surprisingly Few Activation Patterns. InAdvances in Neural Information Processing Systems, volume 32, pages 361–370. Curran Associates, Inc., 2019

2019

[42] [42]

Train faster, generalize better: Stability of stochastic gradient descent

Moritz Hardt, Ben Recht, and Yoram Singer. Train faster, generalize better: Stability of stochastic gradient descent. volume 48 ofProceedings of Machine Learning Research, pages 1225–1234, New York, New York, USA, 20–22 Jun 2016. PMLR

2016

[43] [43]

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. InAdvances in Neural Information Processing Systems 31, pages 8571–8580, 2018

2018

[44] [44]

Directional convergence and alignment in deep learning.Ad- vances in Neural Information Processing Systems, 2020

Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning.Ad- vances in Neural Information Processing Systems, 2020

2020

[45] [45]

Deep Nonparametric Regression on Approximate Manifolds: Nonasymptotic Error Bounds with Polynomial Prefactors.The Annals of Statistics, 51(2):691–716, 2023

Yuling Jiao, Guohao Shen, Yuanyuan Lin, and Jian Huang. Deep Nonparametric Regression on Approximate Manifolds: Nonasymptotic Error Bounds with Polynomial Prefactors.The Annals of Statistics, 51(2):691–716, 2023

2023

[46] [46]

Kelley Pace and Ronald Barry

R. Kelley Pace and Ronald Barry. Sparse spatial autoregressions.Statistics & Probability Letters, 33(3):291–297, 1997

1997

[47] [47]

Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting.Advances in Neural Information Processing Systems, 34:20657–20668, 2021

Frederic Koehler, Lijia Zhou, Danica J Sutherland, and Nathan Srebro. Uniform convergence of interpolators: Gaussian width, norm bounds and benign overfitting.Advances in Neural Information Processing Systems, 34:20657–20668, 2021

2021

[48] [48]

Kolmogorov and Vladimir M

Andrey N. Kolmogorov and Vladimir M. Tikhomirov. ϵ-Entropy and ϵ-Capacity of Sets in Functional Spaces.American Mathematical Society Translations, Series 2, 17:277–364, 1961

1961

[49] [49]

Rademacher penalties and structural risk minimization.IEEE Transac- tions on Information Theory, 47(5):1902–1914, 2002

Vladimir Koltchinskii. Rademacher penalties and structural risk minimization.IEEE Transac- tions on Information Theory, 47(5):1902–1914, 2002

1902

[50] [50]

Distribution-Dependent Analysis of Gibbs-ERM Principle

Ilja Kuzborskij, Nicolò Cesa-Bianchi, and Csaba Szepesvári. Distribution-Dependent Analysis of Gibbs-ERM Principle. InConference on Computational Learning Theory (COLT), volume 99, pages 2028–2054. PMLR, 2019

2028

[51] [51]

Pointwise confidence estimation in the non-linear ℓ2-regularized least squares

Ilja Kuzborskij and Yasin Abbasi Yadkori. Pointwise confidence estimation in the non-linear ℓ2-regularized least squares. arxiv preprint 2506.07088, 2025. 12

arXiv 2025

[52] [52]

Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington

Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019

[53] [53]

Stability and Generalization of Stochastic Optimization with Nonconvex and Nonsmooth Problems

Yunwen Lei. Stability and Generalization of Stochastic Optimization with Nonconvex and Nonsmooth Problems. InProceedings of Thirty Sixth Conference on Learning Theory, volume 195 ofProceedings of Machine Learning Research, pages 191–227. PMLR, 2023

2023

[54] [54]

Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent

Yunwen Lei and Yiming Ying. Fine-Grained Analysis of Stability and Generalization for Stochastic Gradient Descent. InProceedings of the 37th International Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 5809–5819. PMLR, 2020

2020

[55] [55]

Elizaveta Levina and Peter J. Bickel. Maximum Likelihood Estimation of Intrinsic Dimension. InAdvances in Neural Information Processing Systems, volume 17, pages 777–784, 2004

2004

[56] [56]

Ridgeless

Tengyuan Liang and Alexander Rakhlin. Just Interpolate: Kernel “Ridgeless” Regression Can Generalize.The Annals of Statistics, 48(3):1329–1347, 2020

2020

[57] [57]

PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization

Sanae Lotfi, Marc Anton Finzi, Sanyam Kapoor, Andres Potapczynski, Micah Goldblum, and Andrew Gordon Wilson. PAC-Bayes Compression Bounds So Tight That They Can Explain Generalization. InAdvances in Neural Information Processing Systems, volume 35, 2022

2022

[58] [58]

McAllester

David A. McAllester. Some PAC-Bayesian theorems. InConference on Computational Learning Theory (COLT), 1998

1998

[59] [59]

Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio

Guido F. Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the Number of Linear Regions of Deep Neural Networks. InAdvances in Neural Information Processing Systems, pages 2924–2932, 2014

2014

[60] [60]

Uniform convergence may be unable to explain generalization in deep learning.Advances in Neural Information Processing Systems, 32, 2019

Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning.Advances in Neural Information Processing Systems, 32, 2019

2019

[61] [61]

Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality.Journal of Machine Learning Research, 21(174):1–38, 2020

Ryumei Nakada and Masaaki Imaizumi. Adaptive Approximation and Generalization of Deep Neural Network with Intrinsic Dimensionality.Journal of Machine Learning Research, 21(174):1–38, 2020

2020

[62] [62]

Iterate averaging as regularization for stochastic gradient descent

Gergely Neu and Lorenzo Rosasco. Iterate averaging as regularization for stochastic gradient descent. InConference on Computational Learning Theory (COLT), pages 3222–3242. PMLR, 2018

2018

[63] [63]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks. InInternational Conference on Learning Representations (ICLR), 2018

2018

[64] [64]

Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein

Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, and Jascha Sohl- Dickstein. Sensitivity and Generalization in Neural Networks: An Empirical Study. In International Conference on Learning Representations (ICLR), 2018

2018

[65] [65]

Using Local Complexity to Evaluate Out-of-Distribution Generalization

Grace O’Brien, Andrew Aguilar, Robert Jasper, Henry Kvinge, Sarah McGuire Scullen, and Helen Jenne. Using Local Complexity to Evaluate Out-of-Distribution Generalization. In Topology, Algebra, and Geometry in Data Science, 2025

2025

[66] [66]

Vardan Papyan, X. Y . Han, and David L. Donoho. Prevalence of Neural Collapse During the Terminal Phase of Deep Learning Training.Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020

2020

[67] [67]

Montúfar, and Yoshua Bengio

Razvan Pascanu, Guido F. Montúfar, and Yoshua Bengio. On the Number of Response Regions of Deep Feed Forward Networks with Piece-Wise Linear Activations. arXiv preprint 1312.6098, 2013

Pith/arXiv arXiv 2013

[68] [68]

On the Local Complexity of Linear Regions in Deep ReLU Networks

Niket Nikul Patel and Guido Montufar. On the Local Complexity of Linear Regions in Deep ReLU Networks. InInternational Conference on Machine Learning (ICML), pages 48335– 48370, 2025. 13

2025

[69] [69]

Generalization error bounds for noisy, iterative algorithms

Ankit Pensia, Varun Jog, and Po-Ling Loh. Generalization error bounds for noisy, iterative algorithms. In2018 IEEE International Symposium on Information Theory (ISIT), pages 546–550. IEEE, 2018

2018

[70] [70]

The inductive bias of ReLU networks on orthogonally separable data

Mary Phuong and Christoph H Lampert. The inductive bias of ReLU networks on orthogonally separable data. InInternational Conference on Learning Representations, 2021

2021

[71] [71]

Springer, New York, 1984

David Pollard.Convergence of Stochastic Processes. Springer, New York, 1984

1984

[72] [72]

The Intrinsic Dimension of Images and Its Impact on Learning

Phillip Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The Intrinsic Dimension of Images and Its Impact on Learning. InInternational Conference on Learning Representations (ICLR), 2021

2021

[73] [73]

On the Expressive Power of Deep Neural Networks

Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the Expressive Power of Deep Neural Networks. InInternational Conference on Machine Learning (ICML), volume 70 ofProceedings of Machine Learning Research, pages 2847–2854, 2017

2017

[74] [74]

Roweis and Lawrence K

Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding.Science, 290(5500):2323–2326, 2000

2000

[75] [75]

Generalization properties of learning with random features

Alessandro Rudi and Lorenzo Rosasco. Generalization properties of learning with random features. InAdvances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017

2017

[76] [76]

Deep ReLU Network Approximation of Functions on a Manifold

Johannes Schmidt-Hieber. Deep ReLU Network Approximation of Functions on a Manifold. arXiv preprint arXiv:1908.00695, 2019

arXiv 1908

[77] [77]

Bounding and Counting Linear Regions of Deep Neural Networks

Thiago Serra, Christian Tjandraatmadja, and Srikumar Ramalingam. Bounding and Counting Linear Regions of Deep Neural Networks. InInternational Conference on Machine Learning (ICML), volume 80 ofProceedings of Machine Learning Research, pages 4558–4566. PMLR, 2018

2018

[78] [78]

Shalev-Shwartz and S

S. Shalev-Shwartz and S. Ben-David.Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014

2014

[79] [79]

Jure Sokoli´c, Raja Giryes, Guillermo Sapiro, and Miguel R. D. Rodrigues. Robust Large Margin Deep Neural Networks.IEEE Transactions on Signal Processing, 65(16):4265–4280, 2017

2017

[80] [80]

Lampert, and Marco Mondelli

Peter Súkeník, Christoph H. Lampert, and Marco Mondelli. Neural Collapse is Globally Optimal in Deep Regularized ResNets and Transformers. InAdvances in Neural Information Processing Systems, 2025

2025