pith. machine review for the scientific record.

arxiv: 2605.12648 · v1 · submitted 2026-05-12 · 💻 cs.LG · stat.ML

Recognition: no theorem link

Population Risk Bounds for Kolmogorov-Arnold Networks Trained by DP-SGD with Correlated Noise

Christoph Lampert, Jan Schuchardt, Junyu Zhou, Marius Kloft, Nikita Kalinin, Puyu Wang, Sophie Fellenz


Pith reviewed 2026-05-14 21:34 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords Kolmogorov-Arnold Networks · population risk bounds · DP-SGD · correlated noise · mini-batch SGD · gradient clipping · differential privacy · non-convex optimization

The pith

Kolmogorov-Arnold Networks receive population risk bounds under mini-batch DP-SGD with correlated noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes the first population risk bounds for Kolmogorov-Arnold Networks trained by mini-batch stochastic gradient descent with gradient clipping. These bounds hold for non-private SGD as well as for differentially private SGD using Gaussian noise that can be independent or temporally correlated. This setup matches practical training more closely than earlier work because it uses mini-batch updates instead of full-batch gradients and allows correlated noise, which often improves the privacy-utility tradeoff. The proof develops a new way to analyze optimization in non-convex problems with temporal noise dependence by using an auxiliary unprojected process and a shifted iterate. A stability argument then turns the optimization guarantee into a population risk bound.
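
To ground the training setup, here is a minimal sketch of one mini-batch DP-SGD step with per-example gradient clipping and temporally correlated Gaussian noise. This is an illustration, not the paper's code; the function name, the correlation parameter lam, the clip norm, and the noise scale are assumptions chosen for readability, and lam = 0 recovers the independent-noise case.

    import numpy as np

    def dp_sgd_step(w, per_example_grads, prev_noise, lr, clip, sigma, lam, rng):
        """One illustrative mini-batch DP-SGD step with lambda-correlated noise."""
        # Clip each per-example gradient to norm at most `clip`, then average.
        clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
                   for g in per_example_grads]
        avg_grad = np.mean(clipped, axis=0)

        # Fresh Gaussian draw; the injected perturbation is z_t - lam * z_{t-1},
        # which partially cancels the previous step's noise when lam > 0.
        z = sigma * clip / len(per_example_grads) * rng.standard_normal(w.shape)
        w_new = w - lr * (avg_grad + z - lam * prev_noise)
        return w_new, z  # pass z back in as prev_noise at the next step

A projected variant would additionally map w_new back onto a norm ball of radius R; the analysis argues that, with high probability, this projection never activates.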

Core claim

We establish the first population risk bounds for KANs trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as DP-SGD with Gaussian perturbations that interpolate between independent and temporally correlated noise. The results recover prior full-batch GD and independent-noise DP-GD results for KANs as special cases, while giving sharper bounds when the second layer is fixed. The technical core is a new analysis route using an auxiliary unprojected dynamics, a shifted iterate absorbing noise, and a high-probability bootstrap certifying projection inactivity to handle temporal dependence and projection in the correlated-noise case.

What carries the argument

An auxiliary unprojected dynamics, a shifted iterate that absorbs the current noise perturbation, and a high-probability bootstrap that certifies that the projection step remains inactive.
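
As a rough illustration of why the shifted iterate helps, consider an unprojected update whose injected perturbation is z_t − λz_{t−1}, as described in the figure caption below; the exact construction in the paper may differ, so the following is a sketch under that assumption.

    % Illustrative only; the notation is assumed, not quoted from the paper.
    % Unprojected correlated-noise update with clipped mini-batch gradient \tilde g_t:
    \[
      w_{t+1} = w_t - \eta\bigl(\tilde g_t + z_t - \lambda z_{t-1}\bigr),
      \qquad z_t \sim \mathcal{N}(0,\sigma^2 I)\ \text{i.i.d.}
    \]
    % Shifted iterate absorbing the current perturbation:
    \[
      v_t := w_t + \eta\lambda z_{t-1}
      \quad\Longrightarrow\quad
      v_{t+1} = v_t - \eta\,\tilde g_t - \eta(1-\lambda)\,z_t .
    \]
    % The shifted sequence is driven by independent noise of reduced scale
    % (1-\lambda)\sigma, restoring a conditional-centering structure; the
    % high-probability bootstrap then certifies that projection onto the
    % radius-R ball never activates, so projected and unprojected iterates agree.

Note that \tilde g_t is still evaluated at w_t rather than v_t; bounding that mismatch is part of what the auxiliary unprojected dynamics is for.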

If this is right

  • The bounds apply directly to mini-batch training used in practice.
  • They cover DP-SGD mechanisms with temporally correlated Gaussian noise.
  • Sharper specializations exist for KANs with a fixed second layer.
  • The analysis extends to cover the corresponding full-batch cases as well.
  • These are the first such bounds beyond convex learning for correlated-noise DP training of neural networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The technique for handling correlated noise could apply to other non-convex neural network training under DP.
  • Empirical validation could involve checking if the projection remains inactive in typical KAN training runs.
  • Practitioners might use these bounds to select noise correlation levels that balance privacy and accuracy.
  • The results suggest that correlated noise does not necessarily worsen the theoretical guarantees compared to independent noise.

Load-bearing premise

The high-probability bootstrap must successfully certify that the projection step is inactive so that the shifted iterate can absorb the noise without interference from clipping.

What would settle it

Training runs of KANs under the correlated noise model where the projection step activates frequently enough to violate the population risk bounds derived in the analysis.
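
A concrete form such a check could take, sketched here as a small monitor that counts how often per-example gradient norms exceed the clip threshold and how often the iterate would be pulled back by the projection. The class name, clip value, and radius are illustrative assumptions, not quantities from the paper.

    import numpy as np

    class ProjectionClipMonitor:
        """Tracks clip and projection activations along a training run (illustrative)."""

        def __init__(self, clip=1.0, radius=10.0):
            self.clip, self.radius = clip, radius
            self.steps = self.clip_hits = self.proj_hits = 0

        def update(self, per_example_grad_norms, w):
            self.steps += 1
            if np.any(np.asarray(per_example_grad_norms) > self.clip):
                self.clip_hits += 1
            if np.linalg.norm(w) > self.radius:
                self.proj_hits += 1

        def report(self):
            return {
                "steps": self.steps,
                "frac_steps_with_clipping": self.clip_hits / max(self.steps, 1),
                "frac_steps_projection_active": self.proj_hits / max(self.steps, 1),
            }

If the projection-active fraction stays essentially zero across typical correlated-noise runs, the bootstrap premise is consistent with practice; frequent activation would be the kind of evidence that undermines the bounds.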

Figures

Figures reproduced from arXiv: 2605.12648 by Christoph Lampert, Jan Schuchardt, Junyu Zhou, Marius Kloft, Nikita Kalinin, Puyu Wang, Sophie Fellenz.

Figure 1
Figure 1. CNN on CIFAR-10. Left: Moderate noise correlation improves the accuracy of DP-SGD over independent noise (λ = 0), especially for larger privacy budgets ϵ. However, the gain is not monotone in λ, and accuracy can drop when λ → 1. Right: Subtracting a λ-fraction of the previous noise partially cancels consecutive noise perturbations, slowing cumulative-noise growth and thus preserving accuracy.
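
The cancellation effect described in the right panel is easy to illustrate numerically. The following sketch (parameters arbitrary) compares the norm of the cumulative injected perturbation, i.e. the running sum of z_t − λz_{t−1}, for several values of λ; because the sum telescopes, it grows roughly like (1 − λ) times the independent-noise sum. This only illustrates the cancellation mechanism and ignores how σ must be recalibrated to λ to hold the privacy guarantee fixed.

    import numpy as np

    def cumulative_noise_norm(lam, steps=1000, dim=100, sigma=1.0, seed=0):
        """Norm of the cumulative injected perturbation sum_t (z_t - lam * z_{t-1})."""
        rng = np.random.default_rng(seed)
        z_prev = np.zeros(dim)
        total = np.zeros(dim)
        for _ in range(steps):
            z = sigma * rng.standard_normal(dim)
            total += z - lam * z_prev
            z_prev = z
        return np.linalg.norm(total)

    for lam in (0.0, 0.5, 0.9):
        print(f"lambda = {lam}: cumulative noise norm ~ {cumulative_noise_norm(lam):.1f}")
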
original abstract

We establish the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping, covering non-private SGD as well as differentially private SGD (DP-SGD) with Gaussian perturbations that interpolate between independent and temporally correlated noise. This setting is substantially closer to practice than prior KAN theory along two axes: training is by mini-batch SGD, the standard recipe for modern networks, rather than full-batch gradient descent (GD); and correlated-noise mechanisms have empirically shown a more favorable privacy-utility tradeoff than independent-noise mechanisms. Our results cover the corresponding full-batch GD and independent-noise DP-GD results for KANs by Wang et al. (2026), while yielding sharper fixed-second-layer specializations. The technical core is a new analysis route for correlated-noise DP training in the non-convex regime. Temporal dependence breaks the conditional-centering structure underlying standard one-step SGD arguments, and the projection step obstructs the exact cancellation structure of correlated perturbations. We address these difficulties through an auxiliary unprojected dynamics, a shifted iterate that absorbs the current noise perturbation, and a high-probability bootstrap certifying projection inactivity. Combining this optimization analysis with a stability-based generalization argument yields the stated population risk bounds. To the best of our knowledge, this is the first optimization and population risk analysis of a correlated-noise mechanism for DP training beyond convex learning, in particular for neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims to establish the first population risk bounds for Kolmogorov-Arnold Networks (KANs) trained by mini-batch SGD with gradient clipping. The bounds cover both non-private SGD and DP-SGD with Gaussian perturbations that interpolate between independent and temporally correlated noise. The analysis uses an auxiliary unprojected dynamics, a shifted iterate absorbing noise, and a high-probability bootstrap to certify projection inactivity, combined with a stability-based generalization argument. This extends prior full-batch GD and independent-noise results for KANs while yielding sharper specializations for fixed second layers.

Significance. If the central claims hold, the work would be significant for providing the first optimization and population-risk analysis of correlated-noise DP mechanisms beyond convex learning, specifically for neural networks. It addresses a setting substantially closer to practice than prior KAN theory by handling mini-batch SGD and correlated noise, which empirically improves privacy-utility tradeoffs. The manuscript ships a coherent new analysis route with explicit coverage of non-convex regimes.

major comments (1)
  1. [Technical core (auxiliary unprojected dynamics, shifted iterate, and bootstrap argument)] The high-probability bootstrap certifying projection inactivity (described in the technical core) may fail to control clipping effects under temporally correlated noise. In non-convex KAN landscapes with mini-batching, increased correlation can raise the likelihood that gradient norms exceed the clip threshold, breaking the one-step SGD cancellation structure and preventing the population risk bounds from holding.
minor comments (1)
  1. The abstract states that results cover and sharpen prior work by Wang et al. (2026) but does not quantify the improvement in the fixed-second-layer specialization or state the precise assumptions on the KAN architecture and noise correlation parameter.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading, positive assessment of the significance, and for raising this technical point on the bootstrap argument. We address it directly below.

point-by-point responses
  1. Referee: The high-probability bootstrap certifying projection inactivity (described in the technical core) may fail to control clipping effects under temporally correlated noise. In non-convex KAN landscapes with mini-batching, increased correlation can raise the likelihood that gradient norms exceed the clip threshold, breaking the one-step SGD cancellation structure and preventing the population risk bounds from holding.

    Authors: We appreciate the concern. The analysis is designed to handle precisely this issue: the auxiliary unprojected dynamics together with the shifted iterate (which absorbs the current Gaussian perturbation) restore a conditional centering property even under temporal correlation. The high-probability bootstrap then certifies projection inactivity via a union-bound argument whose failure probability is controlled uniformly in the correlation parameter by the sub-Gaussian tails of the noise; the mini-batch variance is absorbed into the same concentration bound under the KAN Lipschitz and smoothness assumptions stated in the paper. Consequently the one-step cancellation structure is preserved on the high-probability event, and the population-risk bounds continue to hold. We have added a clarifying paragraph in Section 3.2 and a supporting lemma (Lemma 4.3) in the appendix that makes the uniform control explicit. revision: partial
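
As a rough sketch of the kind of uniform control the rebuttal appeals to (not the paper's Lemma 4.3, whose statement is not reproduced here): if the shifted and original iterates differ only by the absorbed Gaussian term, a standard Gaussian-norm tail plus a union bound over the T steps keeps the gap small uniformly in the correlation parameter.

    % Illustrative bound; constants and the exact event are assumed, not quoted.
    % Gaussian norm tail (Laurent-Massart): for z ~ N(0, I_d),
    % Pr[ ||z|| >= sqrt(d) + sqrt(2u) ] <= exp(-u).
    \[
      \Pr\Bigl[\max_{t \le T}\|v_t - w_t\| \;\ge\; \eta\lambda\sigma
      \bigl(\sqrt{d} + \sqrt{2\log(T/\delta)}\bigr)\Bigr] \;\le\; \delta,
    \]
    % by a union bound over the T draws, using v_t - w_t = \eta\lambda z_{t-1}
    % from the shifted-iterate sketch above. If the iterates stay at least this
    % margin inside the radius-R ball, the projection never activates on the
    % complementary event, uniformly over \lambda \in [0, 1].

Since λ ≤ 1, the margin is bounded uniformly in the correlation parameter, which is the shape of the claim made in the response.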

Circularity Check

0 steps flagged

No significant circularity; auxiliary dynamics and bootstrap are independent of inputs

full rationale

The paper's core derivation introduces an auxiliary unprojected dynamics, a shifted iterate absorbing noise, and a high-probability bootstrap for projection inactivity to handle correlated noise in non-convex KAN training. These constructs are presented as new technical tools that restore cancellation and control clipping effects without reducing to fitted parameters or prior results by construction. The self-citation to Wang et al. (2026) only recovers prior full-batch and independent-noise cases as specializations, while the correlated-noise population risk bounds rely on the new stability-based generalization argument applied to the fresh optimization analysis. No equation or step equates a prediction to its own input or imports uniqueness via a self-citation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The analysis rests on standard non-convex optimization assumptions plus new constructs for handling correlated noise and projection.

axioms (1)
  • domain assumption: Standard assumptions for non-convex SGD analysis and DP mechanisms (e.g., bounded gradients, Lipschitz continuity)
    Invoked implicitly to support the optimization analysis and stability argument in the non-convex regime.

pith-pipeline@v0.9.0 · 5575 in / 1232 out tokens · 46877 ms · 2026-05-14T21:34:44.874838+00:00 · methodology


Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 2 internal anchors

  1. [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pages 242–252. PMLR, 2019.

  2. [2] Joel Daniel Andersson and Rasmus Pagh. A smooth binary mechanism for efficient private continual observation. Advances in Neural Information Processing Systems, 36:49133–49145, 2023.

  3. [3] Meenatchi Sundaram Muthu Selva Annamalai, Borja Balle, Jamie Hayes, Georgios Kaissis, and Emiliano De Cristofaro. The hitchhiker's guide to efficient, end-to-end, and tight DP auditing. arXiv preprint arXiv:2506.16666, 2025.

  4. [4] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.

  5. [5] Gilles Barthe and Federico Olmedo. Beyond differential privacy: Composition theorems and relational logic for f-divergences between probabilistic programs. In International Colloquium on Automata, Languages, and Programming, pages 49–60. Springer, 2013.

  6. [6] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, volume 30, 2017.

  7. [7] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32, 2019.

  8. [8] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, volume 32, 2019.

  9. [9] Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep ReLU networks? In International Conference on Learning Representations, 2021.

  10. [10] Oleksandr Cherednichenko and Maria Poptsova. Kolmogorov–Arnold networks for genomic tasks. Briefings in Bioinformatics, 26(2):bbaf129, 2025.

  11. [11] Christopher A Choquette-Choo, Krishnamurthy Dj Dvijotham, Krishna Pillutla, Arun Ganesh, Thomas Steinke, and Abhradeep Guha Thakurta. Correlated noise provably beats independent noise for differentially private learning. In International Conference on Learning Representations, 2024.

  12. [12] Christopher A Choquette-Choo, Arun Ganesh, Saminul Haque, Thomas Steinke, and Abhradeep Thakurta. Near exact privacy amplification for matrix mechanisms. arXiv preprint arXiv:2410.06266, 2024.

  13. [13] Christopher A Choquette-Choo, Arun Ganesh, Ryan McKenna, H Brendan McMahan, John Rush, Abhradeep Guha Thakurta, and Zheng Xu. (Amplified) banded matrix factorization: A unified approach to private training. In Advances in Neural Information Processing Systems, volume 36, pages 74856–74889, 2023.

  14. [14] Christopher A. Choquette-Choo, Arun Ganesh, Thomas Steinke, and Abhradeep Guha Thakurta. Privacy amplification for matrix mechanisms. In International Conference on Learning Representations, 2024.

  15. [15] Christopher A. Choquette-Choo, H. Brendan McMahan, Keith Rush, and Abhradeep Thakurta. Multi-epoch matrix factorization mechanisms for private machine learning. In International Conference on Machine Learning. JMLR.org, 2023.

  16. [16] Sergey Denisov, H Brendan McMahan, John Rush, Adam Smith, and Abhradeep Guha Thakurta. Improved differential privacy for SGD via optimal private linear operators on adaptive streams. In Advances in Neural Information Processing Systems, volume 35, pages 5910–5924, 2022.

  17. [17] Meng Ding, Mingxi Lei, Shaopeng Fu, Shaowei Wang, Di Wang, and Jinhui Xu. Understanding private learning from feature perspective. arXiv preprint arXiv:2511.18006, 2025.

  18. [18] Cynthia Dwork. Differential privacy. In International Colloquium on Automata, Languages, and Programming, 2006.

  19. [19] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Shai Halevi and Tal Rabin, editors, Theory of Cryptography, pages 265–284, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.

  20. [20] Seyed Mohammad Eshtehardian, Mohammad Hossein Yassaee, and Babak Khalaj. On the convergence of two-layer Kolmogorov-Arnold networks with first-layer training. In International Conference on Learning Representations, 2026.

  21. [21] Hendrik Fichtenberger, Monika Henzinger, and Jalaj Upadhyay. Constant matters: Fine-grained error bound on differentially private continual observation. In International Conference on Machine Learning, pages 10072–10092, 2023.

  22. [22] Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks. Journal of Machine Learning Research, 24(303):1–49, 2023.

  23. [23] Yihang Gao and Vincent YF Tan. On the convergence of (stochastic) gradient descent for Kolmogorov–Arnold networks. IEEE Transactions on Information Theory, 2025.

  24. [24] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13–30, 1963.

  25. [25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.

  26. [26] Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In International Conference on Learning Representations, 2020.

  27. [27] Nikita P Kalinin and Christoph Lampert. Banded square root matrix factorization for differentially private model training. In Advances in Neural Information Processing Systems, volume 37, pages 17602–17655, 2024.

  28. [28] Nikita P Kalinin, Ryan McKenna, Rasmus Pagh, and Christoph H Lampert. DP-λCGD: Efficient noise correlation for differentially private model training. arXiv preprint arXiv:2601.22334, 2026.

  29. [29] Nikita P. Kalinin, Ryan McKenna, Jalaj Upadhyay, and Christoph H. Lampert. Back to square roots: An optimal bound on the matrix factorization error for multi-epoch differentially private SGD. In International Conference on Learning Representations, 2026.

  30. [30] Anastasiia Koloskova, Ryan McKenna, Zachary Charles, John Rush, and H Brendan McMahan. Gradient descent with linearly correlated noise: Theory and applications to differential privacy. In Advances in Neural Information Processing Systems, volume 36, pages 35761–35773, 2023.

  31. [31] Beatrice Laurent and Pascal Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, pages 1302–1338, 2000.

  32. [32] Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. In Advances in Neural Information Processing Systems, volume 35, pages 38557–38570, 2022.

  33. [33] Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width. Journal of Machine Learning Research, 27(34):1–35, 2026.

  34. [34] Yunwen Lei and Yiming Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, pages 5809–5819. PMLR, 2020.

  35. [35] Longlong Li, Yipeng Zhang, Guanghui Wang, and Kelin Xia. Kolmogorov–Arnold graph neural networks for molecular property prediction. Nature Machine Intelligence, 7(8):1346–1354, 2025.

  36. [36] Pengqi Li, Lizhong Ding, Jiarun Fu, Guoren Wang, Ye Yuan, et al. Generalization bounds for Kolmogorov-Arnold networks (KANs) and enhanced KANs with lower Lipschitz complexity. In Advances in Neural Information Processing Systems, 2025.

  37. [37] Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, and Yiming Ying. Optimal rates for generalization of gradient descent for deep ReLU classification. In Advances in Neural Information Processing Systems, 2026.

  38. [38] Wei Liu, Eleni Chatzi, and Zhilu Lai. On the rate of convergence of Kolmogorov-Arnold network regression estimators. arXiv preprint arXiv:2509.19830, 2025.

  39. [39] Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljačić, Thomas Y Hou, and Max Tegmark. KAN: Kolmogorov-Arnold networks. In International Conference on Learning Representations, 2025.

  40. [40] Ryan McKenna. Scaling up the banded matrix factorization mechanism for differentially private ML. In International Conference on Learning Representations, 2025.

  41. [41] Hugh Brendan McMahan, Zheng Xu, and Yanxiang Zhang. A hassle-free algorithm for strong differential privacy in federated learning systems. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 842–865, 2024.

  42. [42] Ilya Mironov. Rényi differential privacy. In 2017 IEEE 30th Computer Security Foundations Symposium (CSF), pages 263–275. IEEE, 2017.

  43. [43] Mike Nguyen and Nicole Muecke. How many neurons do we need? A refined analysis for shallow networks trained with gradient descent. Journal of Statistical Planning and Inference, 233:106169, 2024.

  44. [44] Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems. arXiv preprint arXiv:1905.09870, 2019.

  45. [45] Atsushi Nitanda and Taiji Suzuki. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. In International Conference on Learning Representations, 2021.

  46. [46] Subhajit Patra, Sonali Panda, Bikram Keshari Parida, Mahima Arya, Kurt Jacobs, Denys I Bondar, and Abhijit Sen. Physics informed Kolmogorov-Arnold neural networks for dynamical analysis via efficient-KAN and wav-KAN. Journal of Machine Learning Research, 26(233):1–39, 2025.

  47. [47] Krishna Pillutla, Jalaj Upadhyay, Christopher A Choquette-Choo, Krishnamurthy Dvijotham, Arun Ganesh, Monika Henzinger, Jonathan Katz, Ryan McKenna, H Brendan McMahan, Keith Rush, et al. Correlated noise mechanisms for differentially private learning. arXiv preprint arXiv:2506.08201, 2025.

  48. [48] Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. In Advances in Neural Information Processing Systems, volume 34. PMLR, 2021.

  49. [49] Angelo Rodio, Zheng Chen, and Erik G Larsson. Optimizing privacy-utility trade-off in decentralized learning with generalized correlated noise. In 2025 IEEE Information Theory Workshop (ITW), pages 1–6. IEEE, 2025.

  50. [50] Jan Schuchardt and Nikita Kalinin. Sampling-free privacy accounting for matrix mechanisms under random allocation, 2026.

  51. [51] Zhongjie Shi, Puyu Wang, Chenyang Zhang, and Yuan Cao. Towards understanding generalization in DP-GD: A case study in training two-layer CNNs. In AAAI Conference on Artificial Intelligence, 2026.

  52. [52] Khemraj Shukla, Juan Diego Toscano, Zhicheng Wang, Zongren Zou, and George Em Karniadakis. A comprehensive and fair comparison between MLP and KAN representations for differential equations and operator networks. Computer Methods in Applied Mechanics and Engineering, 431:117290, 2024.

  53. [53] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing, pages 245–248. IEEE, 2013.

  54. [54] Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width. Journal of Machine Learning Research, 25(156):1–41, 2024.

  55. [55] Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. In International Conference on Learning Representations, 2025.

  56. [56] Cristian J Vaca-Rubio, Luis Blanco, Roberto Pereira, and Màrius Caus. Kolmogorov-Arnold networks (KANs) for time series analysis. In 2024 IEEE Globecom Workshops (GC Wkshps), pages 1–6. IEEE, 2024.

  57. [57] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.

  58. [58] Puyu Wang, Yunwen Lei, Marius Kloft, and Yiming Ying. Optimal utility bounds for differentially private gradient descent in three-layer neural networks. In 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), pages 1–8. IEEE, 2025.

  59. [59] Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks. Neural Computation, 37(2):344–402, 2025.

  60. [60] Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on Kolmogorov-Arnold networks. In International Conference on Machine Learning, 2026.

  61. [61] Yixuan Wang, Jonathan W Siegel, Ziming Liu, and Thomas Y Hou. On the expressiveness and spectral bias of KANs. In International Conference on Learning Representations, 2025.

  62. [62] Yizheng Wang, Jia Sun, Jinshuai Bai, Cosmin Anitescu, Mohammad Sadegh Eshaghi, Xiaoying Zhuang, Timon Rabczuk, and Yinghua Liu. Kolmogorov–Arnold-informed neural network: A physics-informed deep learning framework for solving forward and inverse problems based on Kolmogorov–Arnold networks. Computer Methods in Applied Mechanics and Engineering, 433:117518, 2025.

  63. [63] Yu-Xiang Wang, Borja Balle, and Shiva Prasad Kasiviswanathan. Subsampled Rényi differential privacy and analytical moments accountant. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.

  64. [64] Ruichen Xu and Kexin Chen. Differential privacy in two-layer networks: How DP-SGD harms fairness and robustness. arXiv preprint arXiv:2603.04881, 2026.

  65. [65] Jiaming Zhang, Huanyi Xie, Meng Ding, Shaopeng Fu, Jinyan Liu, and Di Wang. Understanding the impact of differentially private training on memorization of long-tailed data. arXiv preprint arXiv:2602.03872, 2026.

  66. [66] Junyu Zhou, Puyu Wang, and Ding-Xuan Zhou. Generalization analysis with deep ReLU networks for metric and similarity learning. arXiv preprint arXiv:2405.06415, 2024.

  67. [67] Yuqing Zhu, Jinshuo Dong, and Yu-Xiang Wang. Optimal accounting of differential privacy via characteristic function. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2022.

  68. [68] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109:467–492, 2020.
