pith. machine review for the scientific record.

arxiv: 2605.07097 · v1 · submitted 2026-05-08 · 📊 stat.ML · cs.LG · cs.NE · math.LO · math.ST · stat.TH

Recognition: no theorem link

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

Anastasis Kratsios, Bum Jun Kim, Gregory Cousins, Haitz Sáez de Ocáriz Borde, Simone Brugiapaglia

Pith reviewed 2026-05-11 01:02 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · cs.NE · math.LO · math.ST · stat.TH
keywords neural networks · o-minimal structures · PAC learning · sample complexity · agnostic learning · feedforward networks · transformers · graph neural networks

The pith

Every fixed feedforward neural network with o-minimal definable layers has finite sample complexity in the agnostic PAC setting, even with unbounded parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that every fixed finite feedforward neural network architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC learning model. This holds regardless of the number of parameters. It applies to many modern architectures including MLPs, CNNs, GNNs, and transformers with fixed sequence lengths, as well as their typical components. A sympathetic reader cares because this makes finite-sample learnability a standard consequence of using tame, definable functions in feedforward computation rather than relying on specific design choices. The work suggests repositioning learnability as a baseline and shifting architectural comparison toward inductive biases, symmetries and geometric priors, scalability, and optimization behaviour.

Core claim

We show that, in a precise sense, a broad class of feedforward neural networks learn (have finite sample complexity) in the PAC model: every fixed finite feedforward architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting, even with unbounded parameters. This covers standard fixed-size MLPs, CNNs, GNNs, and transformers with fixed sequence length, together with the operations and layers typically used in such architectures, including linear projections, residual connections, attention mechanisms, pooling layers, normalization layers, and admissible positional encodings. Hence, distribution-free learnability for modern non-recurrent architectures is not an exceptional property of particular activations or architecture-specific VC arguments, but a consequence of tame feedforward computation.
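For orientation, the chain behind this claim can be written in its generic textbook form; the inequality below is the standard agnostic PAC bound in terms of pseudo-dimension and is offered as a sketch of the bridge, not as the paper's own statement or constants.

```latex
% Sketch of the standard chain, in generic form (not the paper's exact theorem or constants):
% definable layers => definable parameterized family => finite pseudo-dimension => finite sample complexity.
\[
  (x,\theta) \mapsto f_\theta(x) \ \text{definable}
  \;\Longrightarrow\;
  d := \mathrm{Pdim}\bigl(\{ f_\theta : \theta \in \mathbb{R}^{p} \}\bigr) < \infty
  \;\Longrightarrow\;
  m(\varepsilon,\delta) \;\le\; C\,\frac{d + \log(1/\delta)}{\varepsilon^{2}} ,
\]
% where C is an absolute constant and the loss is assumed bounded. The first implication uses
% closure of o-minimal structures under composition, addition, and projection; the second uses
% the model-theoretic finiteness theorem for definable families; the third is the usual
% pseudo-dimension-to-sample-complexity theorem. Nothing here depends on the number or size of
% the parameters, which is the point of the claim.
```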

What carries the argument

O-minimal definability of layer functions and its preservation under standard network operations, which guarantees finite sample complexity for the resulting hypothesis class.

Load-bearing premise

That definability in an o-minimal structure for each layer is preserved through all the compositions and operations in the full architecture, and that such definability ensures finite sample complexity in the agnostic PAC model.

What would settle it

Exhibiting a fixed feedforward architecture with o-minimal definable layers that nevertheless lacks finite sample complexity in the agnostic PAC setting.

Figures

Figures reproduced from arXiv: 2605.07097 by Anastasis Kratsios, Bum Jun Kim, Gregory Cousins, Haitz Sáez de Ocáriz Borde, Simone Brugiapaglia.

Figure 1
Figure 1. G-NN; cf. Definition 2.4 (left): The structure of a general non-recursive (feedforward) neural network (G-NN). Parameterizable/trainable computations (gates) chosen from the gate dictionary G (2.1) are executed in the order specified by a DAG. The “standardized” MLP special case (right): The standard multilayer perceptron (MLP) is recovered as a special case. The MLP model arises when the computational DAG… view at source ↗
read the original abstract

We show that, in a precise sense, a broad class of feedforward neural networks learn (have finite sample complexity) in the PAC model: every fixed finite feedforward architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting, even with unbounded parameters. This covers standard fixed-size MLPs, CNNs, GNNs, and transformers with fixed sequence length, together with the operations and layers typically used in such architectures, including linear projections, residual connections, attention mechanisms, pooling layers, normalization layers, and admissible positional encodings. Hence, distribution-free learnability for modern non-recurrent architectures is not an exceptional property of particular activations or architecture-specific VC arguments, but a consequence of tame feedforward computation. Our results reposition finite-sample PAC learnability as a baseline rather than a differentiator: they shift the focus of architectural comparison toward inductive biases, symmetries and geometric priors, scalability, and optimization behaviour.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proves that every fixed finite feedforward neural network whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC model, even with unbounded parameters. This covers standard MLPs, CNNs, GNNs, and fixed-sequence-length transformers using operations including linear projections, residual connections, attention, pooling, normalization, and admissible positional encodings. The argument proceeds by showing that the parameterized input-to-output map is definable in a suitable o-minimal expansion of the reals (such as the semialgebraic or R_exp structure), which implies that the function class has finite VC-dimension or pseudo-dimension by model-theoretic results, and hence finite agnostic PAC sample complexity.

Significance. If the central theorem holds, the result supplies a unified, architecture-independent explanation for the PAC learnability of modern feedforward networks, deriving finite sample complexity directly from the tameness of definable computation rather than case-by-case VC calculations. It repositions learnability as a baseline property of fixed finite architectures and shifts emphasis to inductive biases, symmetries, optimization, and scalability. The approach leverages standard facts from o-minimality theory and PAC learning without introducing new parameters or bounds.

major comments (2)
  1. [§3] §3 (Main Theorem): The preservation of definability under composition, finite sums, and projections is invoked to conclude that the full network map remains definable; while this follows from the general properties of o-minimal structures, the manuscript should explicitly verify that the specific operations listed (e.g., softmax in attention and layer-norm) are definable in the chosen expansion before applying the VC-finiteness theorem.
  2. [§4.2] §4.2 (Examples): The treatment of GNNs and transformers assumes fixed graph or sequence size to keep the architecture finite; the finite-sample-complexity claim would be strengthened by a brief remark confirming that this fixed-size restriction is maintained throughout the definability argument.
minor comments (3)
  1. [Abstract] Abstract: The qualifier 'in a precise sense' should be expanded to name the model-theoretic theorem (finite VC-dimension for definable families) that bridges o-minimality to PAC sample complexity.
  2. [Introduction] Introduction: Include a short reference to the foundational results on o-minimal structures (e.g., van den Dries or Pillay-Steinhorn) and to the standard PAC theorem relating pseudo-dimension to agnostic sample complexity.
  3. [Notation] Notation: Ensure consistent use of 'definable family' versus 'parameterized function class' when moving between model theory and learning-theoretic statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the supportive summary and recommendation of minor revision. The comments are helpful for improving clarity. We address each major comment below.

read point-by-point responses
  1. Referee: [§3] §3 (Main Theorem): The preservation of definability under composition, finite sums, and projections is invoked to conclude that the full network map remains definable; while this follows from the general properties of o-minimal structures, the manuscript should explicitly verify that the specific operations listed (e.g., softmax in attention and layer-norm) are definable in the chosen expansion before applying the VC-finiteness theorem.

    Authors: We agree that adding explicit verification improves self-containment. In the revised §3 we will insert a short paragraph confirming that the listed operations (softmax via exponentials and normalization, layer-norm via arithmetic means/variances and divisions, attention via the same, etc.) are definable in the semialgebraic structure or in R_exp, as appropriate, prior to invoking the model-theoretic VC/pseudo-dimension results; a formula-level sketch follows these responses. This uses only standard facts already implicit in the paper. revision: yes

  2. Referee: [§4.2] §4.2 (Examples): The treatment of GNNs and transformers assumes fixed graph or sequence size to keep the architecture finite; the finite-sample-complexity claim would be strengthened by a brief remark confirming that this fixed-size restriction is maintained throughout the definability argument.

    Authors: We thank the referee for the suggestion. The fixed-size restriction is part of the problem statement (fixed finite architecture) and is preserved by construction in the definability argument. We will add one clarifying sentence in §4.2 stating that the fixed graph or sequence length is maintained throughout the composition, so that the overall input-to-output map remains a finite definable function in the chosen o-minimal structure. revision: yes
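As a companion to the first response, here is a minimal formula-level sketch, not taken from the manuscript, of why softmax and layer normalization sit inside the relevant o-minimal structures: both are built from field operations, square roots of positive quantities, and (for softmax) the exponential.

```latex
% Sketch only; the revised manuscript's own verification may differ in detail.
\[
  \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}},
  \qquad
  \mathrm{LayerNorm}(x)_i = \gamma_i\,\frac{x_i - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta_i,
  \quad
  \mu = \tfrac{1}{n}\sum_{j} x_j,\;\;
  \sigma^{2} = \tfrac{1}{n}\sum_{j} (x_j - \mu)^{2}.
\]
% softmax uses exp and a strictly positive denominator, so it is definable in R_exp; LayerNorm
% uses only arithmetic and the square root of a positive quantity, so it is already semialgebraic.
% Attention composes these with linear maps, and definability is preserved under composition.
```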

Circularity Check

0 steps flagged

No circularity; derivation relies on external model-theoretic facts

full rationale

The paper establishes that definability of each layer operation in an o-minimal structure is preserved under the finite compositions, sums, and projections that define a fixed feedforward architecture. It then invokes the standard, externally established theorem that any definable family in an o-minimal structure has finite VC-dimension (or pseudo-dimension). This directly yields finite agnostic PAC sample complexity without parameter bounds. No equation or step equates the target conclusion to a fitted quantity, a self-referential definition, or a load-bearing self-citation whose validity depends on the present work. The argument is therefore self-contained against independent benchmarks from model theory and statistical learning theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the domain assumption that o-minimality implies finite sample complexity for the induced function classes, together with standard facts from statistical learning theory; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Definable sets and functions in an o-minimal structure yield function classes with finite sample complexity in the agnostic PAC model
    Invoked to conclude that any fixed architecture built from such layers inherits finite sample complexity.
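A standard formulation of the single axiom in the ledger, paraphrased from the model-theoretic literature (cf. Laskowski, ref. [56]) rather than quoted from the paper:

```latex
% Paraphrase of the external input; not the paper's own statement.
\[
  \mathcal{M} \ \text{o-minimal},\ \varphi(x;y)\ \text{a formula}
  \;\Longrightarrow\;
  \mathrm{VC}\bigl(\{\, \varphi(\mathcal{M}^{n}; b) : b \in \mathcal{M}^{m} \,\}\bigr) < \infty .
\]
% With b ranging over the network parameters, this yields finite VC/pseudo-dimension for the
% hypothesis class of any fixed definable architecture; combined with the standard
% pseudo-dimension-to-sample-complexity theorem, it gives the finite agnostic sample
% complexity invoked above.
```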

pith-pipeline@v0.9.0 · 5503 in / 1338 out tokens · 66377 ms · 2026-05-11T01:02:57.713216+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Open problem: Are all VC-classes CPAC learnable?

    Sushant Agarwal, Nivasini Ananthakrishnan, Shai Ben-David, Tosca Lechner, and Ruth Urner. Open problem: Are all VC-classes CPAC learnable? In Proceedings of the 34th Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 4636–4641. PMLR, 2021

  2. [2]

    Scale-sensitive dimensions, uniform convergence, and learnability

    Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997

  3. [3]

    OptNet: Differentiable optimization as a layer in neural networks

    Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 136–145. PMLR, 2017

  4. [4]

    Neural Network Learning: Theoretical Foundations

    Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999

  5. [5]

    Optimal learners for realizable regression: PAC learning and online learning

    Idan Attias, Steve Hanneke, Alkis Kalavasis, Amin Karbasi, and Grigoris Velegkas. Optimal learners for realizable regression: PAC learning and online learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  6. [6]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  7. [7]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015

  8. [8]

    Deep equilibrium models

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, volume 32, 2019

  9. [9]

    Digraphs: Theory, Algorithms and Applications

    Jørgen Bang-Jensen and Gregory Z Gutin. Digraphs: Theory, Algorithms and Applications. Springer Science & Business Media, 2008

  10. [10]

    P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. IEEE Trans. Inform. Theory, 43(5):1721–1724, 1997

  11. [11]

    Vapnik-Chervonenkis dimension of neural nets

    Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1188–1192. MIT Press, 2nd edition, 2003

  12. [12]

    Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

    Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20:63:1–63:17, 2019

  13. [13]

    Tameness in geometry and arithmetic: beyond o-minimality

    Gal Binyamini and Dmitry Novikov. Tameness in geometry and arithmetic: beyond o-minimality. In International Congress of Mathematicians, pages 1440–1461. European Mathematical Society-EMS-Publishing House GmbH, 2023

  14. [14]

    Sharply o-minimal structures and sharp cellular decomposition

    Gal Binyamini, Dmitri Novikov, and Benny Zack. Sharply o-minimal structures and sharp cellular decomposition. arXiv preprint arXiv:2209.10972, 2022

  15. [15]

    Wilkie’s conjecture for Pfaffian structures

    Gal Binyamini, Dmitry Novikov, and Benny Zak. Wilkie’s conjecture for Pfaffian structures. Annals of Mathematics, 199(2):795–821, 2024

  16. [16]

    Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989

  17. [17]

    A fine-grained characterization of PAC learnability

    Marco Bressan, Nataly Brukhim, Nicolò Cesa-Bianchi, Emmanuel Esposito, Yishay Mansour, Shay Moran, and Maximilian Thiessen. A fine-grained characterization of PAC learnability. In COLT, Proceedings of Machine Learning Research, pages 641–676. PMLR, 2025

  18. [18]

    Geometric deep learning: Going beyond Euclidean data

    Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag., 34:18–42, 2016

  19. [19]

    Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021

  20. [20]

    Model theory and machine learning

    Hunter Chase and James Freitag. Model theory and machine learning. Bulletin of Symbolic Logic, 25(3):319–332, 2019

  21. [21]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, pages 6572–6583, 2018

  22. [22]

    Transformers in Uniform TC0

    David Chiang. Transformers in Uniform TC0. Transactions on Machine Learning Research, 2025

  23. [23]

    Probabilistic circuits: A unifying framework for tractable probabilistic models

    Y Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic models. UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf, 6, 2020

  24. [24]

    Fast and accurate deep network learning by Exponential Linear Units (ELUs)

    Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by Exponential Linear Units (ELUs). In Proceedings of the International Conference on Learning Representations (ICLR), 2016

  25. [25]

    An Introduction to O-Minimal Geometry

    Michel Coste. An Introduction to O-Minimal Geometry. Institut de Recherche Mathématique de Rennes (IRMAR), Rennes, France, 1999. Lecture notes

  26. [26]

    Mathematical foundations of geometric deep learning

    Haitz Sáez de Ocáriz Borde and Michael Bronstein. Mathematical foundations of geometric deep learning. arXiv preprint arXiv:2508.02723, 2025

  27. [27]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report. CoRR, abs/2412.19437, 2024

  28. [28]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019

  29. [29]

    Find a witness or shatter: the landscape of computable PAC learning

    Valentino Delle Rose, Alexander Kozachinskiy, Cristóbal Rojas, and Tomasz Steifer. Find a witness or shatter: the landscape of computable PAC learning. In Proceedings of the 36th Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 511–524. PMLR, 2023

  30. [30]

    VC dimension of graph neural networks with Pfaffian activation functions

    Giuseppe Alessio D’Inverno, Monica Bianchini, and Franco Scarselli. VC dimension of graph neural networks with Pfaffian activation functions. Neural Networks, 182:106924, 2025

  31. [31]

    Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, a...

  32. [32]

    Inductive biases and variable creation in self-attention mechanisms

    Benjamin L. Edelman, Surbhi Goel, Sham M. Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In ICML, Proceedings of Machine Learning Research, pages 5793–

  33. [33]

    Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990

  34. [34]

    Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position

    Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980

  35. [35]

    Complexity of computations with Pfaffian and Noetherian functions

    Andrei Gabrielov and Nicolai Vorobjov. Complexity of computations with Pfaffian and Noetherian functions. In Normal Forms, Bifurcations and Finiteness Problems in Differential Equations, volume 137 of NATO Sci. Ser. II Math. Phys. Chem., pages 211–250. Kluwer Acad. Publ., Dordrecht, 2004

  36. [36]

    Generalization and representational limits of graph neural networks

    Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3419–3430. PMLR, 2020

  37. [37]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-Yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 11398–11442. PMLR, 2023

  38. [38]

    Deep sparse rectifier neural networks

    Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323. PMLR, 2011

  39. [39]

    Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers

    Paul W. Goldberg and Mark R. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18:131–148, 1995

  40. [40]

    Gemma 4 model card

    Google. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4, 2026

  41. [41]

    On the hardness of robust classification

    Pascale Gourdeau, Varun Kanade, Marta Kwiatkowska, and James Worrell. On the hardness of robust classification. Journal of Machine Learning Research, 22(273):1–29, 2021

  42. [42]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  43. [43]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

  44. [44]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415, 2016

  45. [45]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997

  46. [46]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456. PMLR, 2015

  47. [47]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  48. [48]

    Boolean Function Complexity

    Stasys Jukna. Boolean Function Complexity, volume 27 of Algorithms and Combinatorics. Springer, Heidelberg, 2012. Advances and frontiers

  49. [49]

    Tropical circuit complexity

    Stasys Jukna. Tropical circuit complexity. Limits of Pure Dynamic Programming/by Stasys Jukna, 2, 2023

  50. [50]

    Scaling laws for neural language models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020

  51. [51]

    Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks

    Marek Karpinski and Angus Macintyre. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54(1):169–176, 1997

  52. [52]

    Efficient distribution-free learning of probabilistic concepts

    Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. J. Comput. Syst. Sci., 48(3):464–497, 1994

  53. [53]

    A. G. Khovanskiĭ. Fewnomials, volume 88 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1991. Translated from the Russian by Smilka Zdravkovska

  54. [54]

    Quantifying the limits of AI reasoning: Systematic neural network representations of algorithms

    Anastasis Kratsios, Dennis Zvigelsky, and Bradd Hart. Quantifying the limits of AI reasoning: Systematic neural network representations of algorithms. arXiv preprint arXiv:2508.18526, 2025

  55. [55]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012

  56. [56]

    Vapnik-Chervonenkis classes of definable sets

    Michael C. Laskowski. Vapnik-Chervonenkis classes of definable sets. Journal of the London Mathematical Society, 45(2):377–384, 1992

  57. [57]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  58. [58]

    Certifiable Boolean reasoning is universal

    Wenhao Li, Anastasis Kratsios, Hrad Ghoukasian, and Dennis Zvigelsky. Certifiable Boolean reasoning is universal. arXiv preprint arXiv:2602.05120, 2026

  59. [59]

    Higher-order transformer derivative estimates for explicit pathwise learning guarantees

    Yannick Limmer, Anastasis Kratsios, Xuwei Yang, Raeid Saqur, and Blanka Horvath. Higher-order transformer derivative estimates for explicit pathwise learning guarantees. arXiv preprint arXiv:2405.16563, 2024

  60. [60]

    Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm

    Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988

  61. [61]

    KAN: Kolmogorov-Arnold Networks

    Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, and Max Tegmark. KAN: Kolmogorov-Arnold Networks. In ICLR, 2025

  62. [62]

    Algebraic Circuits

    Antonio Lloris Ruiz, Encarnación Castillo Morales, Luis Parrilla Roure, and Antonio García Ríos. Algebraic Circuits, volume 66 of Intelligent Systems Reference Library. Springer, Heidelberg, 2014

  63. [63]

    Rectifier nonlinearities improve neural network acoustic models

    Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML) Workshop on Deep Learning for Audio, Speech and Language Processing, 2013

  64. [64]

    Angus Macintyre and Eduardo D. Sontag. Finiteness results for sigmoidal “neural” networks. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 325–334, 1993

  65. [65]

    WL meet VC

    Christopher Morris, Floris Geerts, Jan Tönshoff, and Martin Grohe. WL meet VC. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 25275–25302. PMLR, 2023

  66. [66]

    Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, 2010

  67. [67]

    Mathematical Theory of Deep Learning

    Philipp Petersen and Jakob Zech. Mathematical Theory of Deep Learning. arXiv preprint arXiv:2407.18384, 2024

  68. [68]

    Definable sets in ordered structures

    Anand Pillay and Charles Steinhorn. Definable sets in ordered structures. I. Transactions of the American Mathematical Society, 295(2):565–592, 1986

  69. [69]

    Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. In International Conference on Learning Representations Workshop, 2018

  70. [70]

    Hierarchical models of object recognition in cortex

    Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999

  71. [71]

    Elucidating graph neural networks, transformers, and graph transformers

    Haitz Sáez De Ocáriz Borde. Elucidating graph neural networks, transformers, and graph transformers. ResearchGate, 2024

  72. [72]

    Understanding Machine Learning: From Theory to Algorithms

    Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014

  73. [73]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  74. [74]

    Bounds on the number of examples needed for learning functions

    Hans Ulrich Simon. Bounds on the number of examples needed for learning functions. SIAM Journal on Computing, 26(3):751–763, 1997

  75. [75]

    The Pfaffian closure of an o-minimal structure

    Patrick Speissegger. The Pfaffian closure of an o-minimal structure. Journal für die Reine und Angewandte Mathematik, 1999(508):189–211, 1999

  76. [76]

    Training very deep networks

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2377–2385, Cambridge, MA, USA, 2015. MIT Press

  77. [77]

    Highway Networks

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015

  78. [78]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  79. [79]

    Fourier features let networks learn high frequency functions in low dimensional domains

    Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems, volume 33, pages 7537–7547, 2020

  80. [80]

    Gemma 3 Technical Report

    Gemma Team. Gemma 3 technical report. CoRR, abs/2503.19786, 2025

Showing first 80 references.