pith. machine review for the scientific record.

arxiv: 2605.12648 · v1 · submitted 2026-05-12 · 💻 cs.LG · stat.ML

Recognition: no theorem link

Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise

Christoph Lampert, Jan Schuchardt, Junyu Zhou, Marius Kloft, Nikita Kalinin, Puyu Wang, Sophie Fellenz


Pith reviewed 2026-05-14 21:34 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords Kolmogorov-Arnold Networks · population risk bounds · DP-SGD · correlated noise · mini-batch SGD · gradient clipping · differential privacy · non-convex optimization

The pith

Kolmogorov-Arnold Networks receive population risk bounds under mini-batch DP-SGD with correlated noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first population risk bounds for Kolmogorov-Arnold Networks trained by mini-batch stochastic gradient descent with gradient clipping. These bounds hold for non-private SGD as well as for differentially private SGD using Gaussian noise that can be independent or temporally correlated. This setup matches practical training more closely than earlier work because it uses mini-batch updates instead of full-batch gradients and allows correlated noise, which often improves the privacy-utility tradeoff. The proof develops a new way to analyze optimization in non-convex problems with temporal noise dependence by using an auxiliary unprojected process and a shifted iterate. A stability argument then turns the optimization guarantee into a population risk bound.
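
To ground the training setup, here is a minimal sketch of one mini-batch DP-SGD step with per-example gradient clipping and temporally correlated Gaussian noise. This is an illustration, not the paper's code; the function name, the correlation parameter lam, the clip norm, and the noise scale are assumptions chosen for readability, and lam = 0 recovers the independent-noise case.

    import numpy as np

    def dp_sgd_step(w, per_example_grads, prev_noise, lr, clip, sigma, lam, rng):
        """One illustrative mini-batch DP-SGD step with lambda-correlated noise."""
        # Clip each per-example gradient to norm at most `clip`, then average.
        clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
                   for g in per_example_grads]
        avg_grad = np.mean(clipped, axis=0)

        # Fresh Gaussian draw; the injected perturbation is z_t - lam * z_{t-1},
        # which partially cancels the previous step's noise when lam > 0.
        z = sigma * clip / len(per_example_grads) * rng.standard_normal(w.shape)
        w_new = w - lr * (avg_grad + z - lam * prev_noise)
        return w_new, z  # pass z back in as prev_noise at the next step

A projected variant would additionally map w_new back onto a norm ball of radius R; the analysis argues that, with high probability, this projection never activates.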

Core claim

We establish the first population risk bounds for KANs trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as DP-SGD with Gaussian perturbations that interpolate between independent and temporally correlated noise. The results recover prior full-batch GD and independent-noise DP-GD results for KANs as special cases, while giving sharper bounds when the second layer is fixed. The technical core is a new analysis route using an auxiliary unprojected dynamics, a shifted iterate absorbing noise, and a high-probability bootstrap certifying projection inactivity to handle temporal dependence and projection in the correlated-noise case.

What carries the argument

An auxiliary unprojected dynamics, a shifted iterate that absorbs the current noise perturbation, and a high-probability bootstrap that certifies that the projection step remains inactive.
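
As a rough illustration of why the shifted iterate helps, consider an unprojected update whose injected perturbation is z_t − λz_{t−1}, as described in the figure caption below; the exact construction in the paper may differ, so the following is a sketch under that assumption.

    % Illustrative only; the notation is assumed, not quoted from the paper.
    % Unprojected correlated-noise update with clipped mini-batch gradient \tilde g_t:
    \[
      w_{t+1} = w_t - \eta\bigl(\tilde g_t + z_t - \lambda z_{t-1}\bigr),
      \qquad z_t \sim \mathcal{N}(0,\sigma^2 I)\ \text{i.i.d.}
    \]
    % Shifted iterate absorbing the current perturbation:
    \[
      v_t := w_t + \eta\lambda z_{t-1}
      \quad\Longrightarrow\quad
      v_{t+1} = v_t - \eta\,\tilde g_t - \eta(1-\lambda)\,z_t .
    \]
    % The shifted sequence is driven by independent noise of reduced scale
    % (1-\lambda)\sigma, restoring a conditional-centering structure; the
    % high-probability bootstrap then certifies that projection onto the
    % radius-R ball never activates, so projected and unprojected iterates agree.

Note that \tilde g_t is still evaluated at w_t rather than v_t; bounding that mismatch is part of what the auxiliary unprojected dynamics is for.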

If this is right

  • The bounds apply directly to mini-batch training used in practice.
  • They cover DP-SGD mechanisms with temporally correlated Gaussian noise.
  • Sharper specializations exist for KANs with a fixed second layer.
  • The analysis extends to cover the corresponding full-batch cases as well.
  • These are the first such bounds beyond convex learning for correlated-noise DP training of neural networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The technique for handling correlated noise could apply to other non-convex neural network training under DP.
  • Empirical validation could involve checking if the projection remains inactive in typical KAN training runs.
  • Practitioners might use these bounds to select noise correlation levels that balance privacy and accuracy.
  • The results suggest that correlated noise does not necessarily worsen the theoretical guarantees compared to independent noise.

Load-bearing premise

The high-probability bootstrap must successfully certify that the projection step is inactive so that the shifted iterate can absorb the noise without interference from clipping.

What would settle it

Training runs of KANs under the correlated noise model where the projection step activates frequently enough to violate the population risk bounds derived in the analysis.
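
A concrete form such a check could take, sketched here as a small monitor that counts how often per-example gradient norms exceed the clip threshold and how often the iterate would be pulled back by the projection. The class name, clip value, and radius are illustrative assumptions, not quantities from the paper.

    import numpy as np

    class ProjectionClipMonitor:
        """Tracks clip and projection activations along a training run (illustrative)."""

        def __init__(self, clip=1.0, radius=10.0):
            self.clip, self.radius = clip, radius
            self.steps = self.clip_hits = self.proj_hits = 0

        def update(self, per_example_grad_norms, w):
            self.steps += 1
            if np.any(np.asarray(per_example_grad_norms) > self.clip):
                self.clip_hits += 1
            if np.linalg.norm(w) > self.radius:
                self.proj_hits += 1

        def report(self):
            return {
                "steps": self.steps,
                "frac_steps_with_clipping": self.clip_hits / max(self.steps, 1),
                "frac_steps_projection_active": self.proj_hits / max(self.steps, 1),
            }

If the projection-active fraction stays essentially zero across typical correlated-noise runs, the bootstrap premise is consistent with practice; frequent activation would be the kind of evidence that undermines the bounds.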

Figures

Figures reproduced from arXiv: 2605.12648 by Christoph Lampert, Jan Schuchardt, Junyu Zhou, Marius Kloft, Nikita Kalinin, Puyu Wang, Sophie Fellenz.

Figure 1
Figure 1. CNN on CIFAR-10. Left: Moderate noise correlation improves the accuracy of DP-SGD over independent noise (λ = 0), especially for larger privacy budgets ϵ. However, the gain is not monotone in λ, and accuracy can drop when λ → 1. Right: Subtracting a λ-fraction of the previous noise partially cancels consecutive noise perturbations, slowing cumulative-noise growth and thus preserving accuracy.
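
The cancellation effect described in the right panel is easy to illustrate numerically. The following sketch (parameters arbitrary) compares the norm of the cumulative injected perturbation, i.e. the running sum of z_t − λz_{t−1}, for several values of λ; because the sum telescopes, it grows roughly like (1 − λ) times the independent-noise sum. This only illustrates the cancellation mechanism and ignores how σ must be recalibrated to λ to hold the privacy guarantee fixed.

    import numpy as np

    def cumulative_noise_norm(lam, steps=1000, dim=100, sigma=1.0, seed=0):
        """Norm of the cumulative injected perturbation sum_t (z_t - lam * z_{t-1})."""
        rng = np.random.default_rng(seed)
        z_prev = np.zeros(dim)
        total = np.zeros(dim)
        for _ in range(steps):
            z = sigma * rng.standard_normal(dim)
            total += z - lam * z_prev
            z_prev = z
        return np.linalg.norm(total)

    for lam in (0.0, 0.5, 0.9):
        print(f"lambda = {lam}: cumulative noise norm ~ {cumulative_noise_norm(lam):.1f}")
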
original abstract

We establish the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as differentially private SGD (DP-SGD) with Gaussian perturbations that interpolate between independent and temporally correlated noise. This setting is substantially closer to practice than prior KAN theory along two axes: training is by mini-batch SGD, the standard recipe for modern networks, rather than full-batch gradient descent (GD); and correlated-noise mechanisms have empirically shown a more favorable privacy-utility tradeoff than independent-noise mechanisms. Our results cover the corresponding full-batch GD and independent-noise DP-GD results for KANs by Wang et al. (2026), while yielding sharper fixed-second-layer specializations. The technical core is a new analysis route for correlated-noise DP training in the non-convex regime. Temporal dependence breaks the conditional-centering structure underlying standard one-step SGD arguments, and the projection step obstructs the exact cancellation structure of correlated perturbations. We address these difficulties through an auxiliary unprojected dynamics, a shifted iterate that absorbs the current noise perturbation, and a high-probability bootstrap certifying projection inactivity. Combining this optimization analysis with a stability-based generalization argument yields the stated population risk bounds. To the best of our knowledge, this is the first optimization and population risk analysis of a correlated-noise mechanism for DP training beyond convex learning, in particular for neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to establish the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping. The bounds cover both non-private SGD and DP-SGD with Gaussian perturbations that interpolate between independent and temporally correlated noise. The analysis uses an auxiliary unprojected dynamics, a shifted iterate absorbing noise, and a high-probability bootstrap to certify projection inactivity, combined with a stability-based generalization argument. This extends prior full-batch GD and independent-noise results for KANs while yielding sharper specializations for fixed second layers.

Significance. If the central claims hold, the work would be significant for providing the first optimization and population-risk analysis of correlated-noise DP mechanisms beyond convex learning, specifically for neural networks. It addresses a setting substantially closer to practice than prior KAN theory by handling mini-batch SGD and correlated noise, which empirically improves privacy-utility tradeoffs. The manuscript ships a coherent new analysis route with explicit coverage of non-convex regimes.

major comments (1)
  1. [Technical core (auxiliary unprojected dynamics, shifted iterate, and bootstrap argument)] The high-probability bootstrap certifying projection inactivity (described in the technical core) may fail to control clipping effects under temporally correlated noise. In non-convex KAN landscapes with mini-batching, increased correlation can raise the likelihood that gradient norms exceed the clip threshold, breaking the one-step SGD cancellation structure and preventing the population risk bounds from holding.
minor comments (1)
  1. The abstract states that results cover and sharpen prior work by Wang et al. (2026) but does not quantify the improvement in the fixed-second-layer specialization or state the precise assumptions on the KAN architecture and noise correlation parameter.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading, positive assessment of the significance, and for raising this technical point on the bootstrap argument. We address it directly below.

point-by-point responses
  1. Referee: The high-probability bootstrap certifying projection inactivity (described in the technical core) may fail to control clipping effects under temporally correlated noise. In non-convex KAN landscapes with mini-batching, increased correlation can raise the likelihood that gradient norms exceed the clip threshold, breaking the one-step SGD cancellation structure and preventing the population risk bounds from holding.

    Authors: We appreciate the concern. The analysis is designed to handle precisely this issue: the auxiliary unprojected dynamics together with the shifted iterate (which absorbs the current Gaussian perturbation) restore a conditional centering property even under temporal correlation. The high-probability bootstrap then certifies projection inactivity via a union-bound argument whose failure probability is controlled uniformly in the correlation parameter by the sub-Gaussian tails of the noise; the mini-batch variance is absorbed into the same concentration bound under the KAN Lipschitz and smoothness assumptions stated in the paper. Consequently the one-step cancellation structure is preserved on the high-probability event, and the population-risk bounds continue to hold. We have added a clarifying paragraph in Section 3.2 and a supporting lemma (Lemma 4.3) in the appendix that makes the uniform control explicit. revision: partial
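
As a rough sketch of the kind of uniform control the rebuttal appeals to (not the paper's Lemma 4.3, whose statement is not reproduced here): if the shifted and original iterates differ only by the absorbed Gaussian term, a standard Gaussian-norm tail plus a union bound over the T steps keeps the gap small uniformly in the correlation parameter.

    % Illustrative bound; constants and the exact event are assumed, not quoted.
    % Gaussian norm tail (Laurent-Massart): for z ~ N(0, I_d),
    % Pr[ ||z|| >= sqrt(d) + sqrt(2u) ] <= exp(-u).
    \[
      \Pr\Bigl[\max_{t \le T}\|v_t - w_t\| \;\ge\; \eta\lambda\sigma
      \bigl(\sqrt{d} + \sqrt{2\log(T/\delta)}\bigr)\Bigr] \;\le\; \delta,
    \]
    % by a union bound over the T draws, using v_t - w_t = \eta\lambda z_{t-1}
    % from the shifted-iterate sketch above. If the iterates stay at least this
    % margin inside the radius-R ball, the projection never activates on the
    % complementary event, uniformly over \lambda \in [0, 1].

Since λ ≤ 1, the margin is bounded uniformly in the correlation parameter, which is the shape of the claim made in the response.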

Circularity Check

0 steps flagged

No significant circularity; auxiliary dynamics and bootstrap are independent of inputs

full rationale

The paper's core derivation introduces an auxiliary unprojected dynamics, a shifted iterate absorbing noise, and a high-probability bootstrap for projection inactivity to handle correlated noise in non-convex KAN training. These constructs are presented as new technical tools that restore cancellation and control clipping effects without reducing to fitted parameters or prior results by construction. The self-citation to Wang et al. (2026) only recovers prior full-batch and independent-noise cases as specializations, while the correlated-noise population risk bounds rely on the new stability-based generalization argument applied to the fresh optimization analysis. No equation or step equates a prediction to its own input or imports uniqueness via a self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The analysis rests on standard non-convex optimization assumptions plus new constructs for handling correlated noise and projection.

axioms (1)
  • domain assumption: Standard assumptions for non-convex SGD analysis and DP mechanisms (e.g., bounded gradients, Lipschitz continuity)
    Invoked implicitly to support the optimization analysis and stability argument in the non-convex regime.

pith-pipeline@v0.9.0 · 5575 in / 1232 out tokens · 46877 ms · 2026-05-14T21:34:44.874838+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 2 internal anchors

  1. [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.

  2. [2] Joel Daniel Andersson and Rasmus Pagh. A smooth binary mechanism for efficient private continual observation. Advances in Neural Information Processing Systems, 36:49133–49145, 2023.

  3. [3] Meenatchi Sundaram Muthu Selva Annamalai, Borja Balle, Jamie Hayes, Georgios Kaissis, and Emiliano De Cristofaro. The hitchhiker's guide to efficient, end-to-end, and tight DP auditing. arXiv preprint arXiv:2506.16666, 2025.

  4. [4] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.

  5. [5] Gilles Barthe and Federico Olmedo. Beyond differential privacy: Composition theorems and relational logic for f-divergences between probabilistic programs. In International Colloquium on Automata, Languages, and Programming, pages 49–60. Springer, 2013.

  6. [6] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017.

  7. [7] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32, 2019.

  8. [8] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.

  9. [9] Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep ReLU networks? In International Conference on Learning Representations, 2021.

  10. [10] Oleksandr Cherednichenko and Maria Poptsova. Kolmogorov–Arnold networks for genomic tasks. Briefings in Bioinformatics, 26(2):bbaf129, 2025.

  11. [11] Christopher A Choquette-Choo, Krishnamurthy Dj Dvijotham, Krishna Pillutla, Arun Ganesh, Thomas Steinke, and Abhradeep Guha Thakurta. Correlated noise provably beats independent noise for differentially private learning. In International Conference on Learning Representations, 2024.

  12. [12] Christopher A Choquette-Choo, Arun Ganesh, Saminul Haque, Thomas Steinke, and Abhradeep Thakurta. Near exact privacy amplification for matrix mechanisms. arXiv preprint arXiv:2410.06266, 2024.

  13. [13] Christopher A Choquette-Choo, Arun Ganesh, Ryan McKenna, H Brendan McMahan, John Rush, Abhradeep Guha Thakurta, and Zheng Xu. (Amplified) banded matrix factorization: A unified approach to private training. In Advances in Neural Information Processing Systems, volume 36, pages 74856–74889, 2023.

  14. [14] Christopher A. Choquette-Choo, Arun Ganesh, Thomas Steinke, and Abhradeep Guha Thakurta. Privacy amplification for matrix mechanisms. In International Conference on Learning Representations, 2024.

  15. [15] Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, and Abhradeep Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. In International Conference on Machine Learning. JMLR.org, 2023.

  16. [16] Sergey Denisov, H Brendan McMahan, John Rush, Adam Smith, and Abhradeep Guha Thakurta. Improved differential privacy for SGD via optimal private linear operators on adaptive streams. In Advances in Neural Information Processing Systems, volume 35, pages 5910–5924, 2022.

  17. [17] Meng Ding, Mingxi Lei, Shaopeng Fu, Shaowei Wang, Di Wang, and Jinhui Xu. Understanding private learning from feature perspective. arXiv preprint arXiv:2511.18006, 2025.

  18. [18] Cynthia Dwork. Differential privacy. In International Colloquium on Automata, Languages, and Programming, 2006.

  19. [19] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, pages 265–284, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

  20. [20] Seyed Mohammad Eshtehardian, Mohammad Hossein Yassaee, and Babak Khalaj. On the convergence of two-layer Kolmogorov-Arnold networks with first-layer training. In International Conference on Learning Representations, 2026.

  21. [21] Hendrik Fichtenberger, Monika Henzinger, and Jalaj Upadhyay. Constant matters: Fine-grained error bound on differentially private continual observation. In International Conference on Machine Learning, pages 10072–10092, 2023.

  22. [22] Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks. Journal of Machine Learning Research, 24(303):1–49, 2023.

  23. [23] Yihang Gao and Vincent YF Tan. On the convergence of (stochastic) gradient descent for Kolmogorov–Arnold networks. IEEE Transactions on Information Theory, 2025.

  24. [24] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

  25. [25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

  26. [26] Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In International Conference on Learning Representations, 2020.

  27. [27] Nikita P Kalinin and Christoph Lampert. Banded square root matrix factorization for differentially private model training. In Advances in Neural Information Processing Systems, volume 37, pages 17602–17655, 2024.

  28. [28] Nikita P Kalinin, Ryan McKenna, Rasmus Pagh, and Christoph H Lampert. DP-λCGD: Efficient noise correlation for differentially private model training. arXiv preprint arXiv:2601.22334, 2026.

  29. [29] Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, and Christoph H. Lampert. Back to square roots: An optimal bound on the matrix factorization error for multi-epoch differentially private SGD. In International Conference on Learning Representations, 2026.

  30. [30] Anastasiia Koloskova, Ryan McKenna, Zachary Charles, John Rush, and H Brendan McMahan. Gradient descent with linearly correlated noise: Theory and applications to differential privacy. In Advances in Neural Information Processing Systems, volume 36, pages 35761–35773, 2023.

  31. [31] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.

  32. [32] Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. In Advances in Neural Information Processing Systems, volume 35, pages 38557–38570, 2022.

  33. [33] Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width. Journal of Machine Learning Research, 27(34):1–35, 2026.

  34. [34] Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809–5819. PMLR, 2020.

  35. [35] Longlong Li, Yipeng Zhang, Guanghui Wang, and Kelin Xia. Kolmogorov–Arnold graph neural networks for molecular property prediction. Nature Machine Intelligence, 7(8):1346–1354, 2025.

  36. [36] Pengqi Li, Lizhong Ding, Jiarun Fu, Guoren Wang, Ye Yuan, et al. Generalization bounds for Kolmogorov-Arnold networks (KANs) and enhanced KANs with lower Lipschitz complexity. In Advances in Neural Information Processing Systems, 2025.

  37. [37] Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, and Yiming Ying. Optimal rates for generalization of gradient descent for deep ReLU classification. In Advances in Neural Information Processing Systems, 2026.

  38. [38] Wei Liu, Eleni Chatzi, and Zhilu Lai. On the rate of convergence of Kolmogorov-Arnold network regression estimators. arXiv preprint arXiv:2509.19830, 2025.

  39. [39] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks. In International Conference on Learning Representations, 2025.

  40. [40] Ryan McKenna. Scaling up the banded matrix factorization mechanism for differentially private ML. In International Conference on Learning Representations, 2025.

  41. [41] Hugh Brendan McMahan, Zheng Xu, and Yanxiang Zhang. A hassle-free algorithm for strong differential privacy in federated learning systems. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 842–865, 2024.

  42. [42] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

  43. [43] Mike Nguyen and Nicole Muecke. How many neurons do we need? A refined analysis for shallow networks trained with gradient descent. Journal of Statistical Planning and Inference, 233:106169, 2024.

  44. [44] Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems. arXiv preprint arXiv:1905.09870, 2019.

  45. [45] Atsushi Nitanda and Taiji Suzuki. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. In International Conference on Learning Representations, 2021.

  46. [46] Subhajit Patra, Sonali Panda, Bikram Keshari Parida, Mahima Arya, Kurt Jacobs, Denys I Bondar, and Abhijit Sen. Physics informed Kolmogorov-Arnold neural networks for dynamical analysis via efficient-KAN and wav-KAN. Journal of Machine Learning Research, 26(233):1–39, 2025.

  47. [47] Krishna Pillutla, Jalaj Upadhyay, Christopher A Choquette-Choo, Krishnamurthy Dvijotham, Arun Ganesh, Monika Henzinger, Jonathan Katz, Ryan McKenna, H Brendan McMahan, Keith Rush, et al. Correlated noise mechanisms for differentially private learning. arXiv preprint arXiv:2506.08201, 2025.

  48. [48] Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. In Advances in Neural Information Processing Systems, volume 34. PMLR, 2021.

  49. [49] Angelo Rodio, Zheng Chen, and Erik G Larsson. Optimizing privacy-utility trade-off in decentralized learning with generalized correlated noise. In 2025 IEEE Information Theory Workshop (ITW), pages 1–6. IEEE, 2025.

  50. [50] Jan Schuchardt and Nikita Kalinin. Sampling-free privacy accounting for matrix mechanisms under random allocation, 2026.

  51. [51] Zhongjie Shi, Puyu Wang, Chenyang Zhang, and Yuan Cao. Towards understanding generalization in DP-GD: A case study in training two-layer CNNs. In AAAI Conference on Artificial Intelligence, 2026.

  52. [52] Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, and George Em Karniadakis. A comprehensive and fair comparison between MLP and KAN representations for differential equations and operator networks. Computer Methods in Applied Mechanics and Engineering, 431:117290, 2024.

  53. [53] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013.

  54. [54] Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width. Journal of Machine Learning Research, 25(156):1–41, 2024.

  55. [55] Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. In International Conference on Learning Representations, 2025.

  56. [56] Cristian J Vaca-Rubio, Luis Blanco, Roberto Pereira, and Màrius Caus. Kolmogorov-Arnold networks (KANs) for time series analysis. In 2024 IEEE Globecom Workshops (GC Wkshps), pages 1–6. IEEE, 2024.

  57. [57] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

  58. [58] Puyu Wang, Yunwen Lei, Marius Kloft, and Yiming Ying. Optimal utility bounds for differentially private gradient descent in three-layer neural networks. In 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2025.

  59. [59] Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks. Neural Computation, 37(2):344–402, 2025.

  60. [60] Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on Kolmogorov-Arnold networks. In International Conference on Machine Learning, 2026.

  61. [61] Yixuan Wang, Jonathan W Siegel, Ziming Liu, and Thomas Y Hou. On the expressiveness and spectral bias of KANs. In International Conference on Learning Representations, 2025.

  62. [62] Yizheng Wang, Jia Sun, Jinshuai Bai, Cosmin Anitescu, Mohammad Sadegh Eshaghi, Xiaoying Zhuang, Timon Rabczuk, and Yinghua Liu. Kolmogorov–Arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on Kolmogorov–Arnold networks. Computer Methods in Applied Mechanics and Engineering, 433:117518, 2025.

  63. [63] Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

  64. [64] Ruichen Xu and Kexin Chen. Differential privacy in two-layer networks: How DP-SGD harms fairness and robustness. arXiv preprint arXiv:2603.04881, 2026.

  65. [65] Jiaming Zhang, Huanyi Xie, Meng Ding, Shaopeng Fu, Jinyan Liu, and Di Wang. Understanding the impact of differentially private training on memorization of long-tailed data. arXiv preprint arXiv:2602.03872, 2026.

  66. [66] Junyu Zhou, Puyu Wang, and Ding-Xuan Zhou. Generalization analysis with deep ReLU networks for metric and similarity learning. arXiv preprint arXiv:2405.06415, 2024.

  67. [67] Yuqing Zhu, Jinshuo Dong, and Yu-Xiang Wang. Optimal accounting of differential privacy via characteristic function. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

  68. [68] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109:467–492, 2020.
