pith. machine review for the scientific record.

arxiv: 2605.07097 · v1 · submitted 2026-05-08 · 📊 stat.ML · cs.LG · cs.NE · math.LO · math.ST · stat.TH

Recognition: no theorem link

Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

Anastasis Kratsios, Bum Jun Kim, Gregory Cousins, Haitz Sáez de Ocáriz Borde, Simone Brugiapaglia

Pith reviewed 2026-05-11 01:02 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · cs.NE · math.LO · math.ST · stat.TH
keywords neural networks · o-minimal structures · PAC learning · sample complexity · agnostic learning · feedforward networks · transformers · graph neural networks

The pith

Every fixed feedforward neural network with o-minimal definable layers has finite sample complexity in the agnostic PAC setting, even with unbounded parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that every fixed finite feedforward neural network architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC learning model. This holds regardless of the number of parameters. It applies to many modern architectures including MLPs, CNNs, GNNs, and transformers with fixed sequence lengths, as well as their typical components. A sympathetic reader cares because this makes finite-sample learnability a standard consequence of using tame, definable functions in feedforward computation rather than relying on specific design choices. The work suggests repositioning learnability as a baseline and shifting architectural comparison toward inductive biases, symmetries and geometric priors, scalability, and optimization behaviour.

Core claim

We show that, in a precise sense, a broad class of feedforward neural networks learn (have finite sample complexity) in the PAC model: every fixed finite feedforward architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting, even with unbounded parameters. This covers standard fixed-size MLPs, CNNs, GNNs, and transformers with fixed sequence length, together with the operations and layers typically used in such architectures, including linear projections, residual connections, attention mechanisms, pooling layers, normalization layers, and admissible positional encodings. Hence, distribution-free learnability for modern non-recurrent architectures is not an exceptional property of particular activations or architecture-specific VC arguments, but a consequence of tame feedforward computation.
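For orientation, the chain behind this claim can be written in its generic textbook form; the inequality below is the standard agnostic PAC bound in terms of pseudo-dimension and is offered as a sketch of the bridge, not as the paper's own statement or constants.

```latex
% Sketch of the standard chain, in generic form (not the paper's exact theorem or constants):
% definable layers => definable parameterized family => finite pseudo-dimension => finite sample complexity.
\[
  (x,\theta) \mapsto f_\theta(x) \ \text{definable}
  \;\Longrightarrow\;
  d := \mathrm{Pdim}\bigl(\{ f_\theta : \theta \in \mathbb{R}^{p} \}\bigr) < \infty
  \;\Longrightarrow\;
  m(\varepsilon,\delta) \;\le\; C\,\frac{d + \log(1/\delta)}{\varepsilon^{2}} ,
\]
% where C is an absolute constant and the loss is assumed bounded. The first implication uses
% closure of o-minimal structures under composition, addition, and projection; the second uses
% the model-theoretic finiteness theorem for definable families; the third is the usual
% pseudo-dimension-to-sample-complexity theorem. Nothing here depends on the number or size of
% the parameters, which is the point of the claim.
```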

What carries the argument

O-minimal definability of layer functions and its preservation under standard network operations, which guarantees finite sample complexity for the resulting hypothesis class.

Load-bearing premise

That definability in an o-minimal structure for each layer is preserved through all the compositions and operations in the full architecture, and that such definability ensures finite sample complexity in the agnostic PAC model.

What would settle it

Exhibiting a fixed feedforward architecture with o-minimal definable layers that nevertheless lacks finite sample complexity in the agnostic PAC setting.

Figures

Figures reproduced from arXiv: 2605.07097 by Anastasis Kratsios, Bum Jun Kim, Gregory Cousins, Haitz Sáez de Ocáriz Borde, Simone Brugiapaglia.

Figure 1
Figure 1. G-NN; cf. Definition 2.4 (left): The structure of a general non-recursive (feedforward) neural network (G-NN). Parameterizable/trainable computations (gates) chosen from the gate dictionary G (2.1) are executed in the order specified by a DAG. The “standardized” MLP special case (right): The standard multilayer perceptron (MLP) is recovered as a special case. The MLP model arises when the computational DAG… view at source ↗
read the original abstract

We show that, in a precise sense, a broad class of feedforward neural networks learn (have finite sample complexity) in the PAC model: every fixed finite feedforward architecture whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting, even with unbounded parameters. This covers standard fixed-size MLPs, CNNs, GNNs, and transformers with fixed sequence length, together with the operations and layers typically used in such architectures, including linear projections, residual connections, attention mechanisms, pooling layers, normalization layers, and admissible positional encodings. Hence, distribution-free learnability for modern non-recurrent architectures is not an exceptional property of particular activations or architecture-specific VC arguments, but a consequence of tame feedforward computation. Our results reposition finite-sample PAC learnability as a baseline rather than a differentiator: they shift the focus of architectural comparison toward inductive biases, symmetries and geometric priors, scalability, and optimization behaviour.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proves that every fixed finite feedforward neural network whose layers are definable in an o-minimal structure has finite sample complexity in the agnostic PAC model, even with unbounded parameters. This covers standard MLPs, CNNs, GNNs, and fixed-sequence-length transformers using operations including linear projections, residual connections, attention, pooling, normalization, and admissible positional encodings. The argument proceeds by showing that the parameterized input-to-output map is definable in a suitable o-minimal expansion of the reals (such as the semialgebraic or R_exp structure), which implies that the function class has finite VC-dimension or pseudo-dimension by model-theoretic results, and hence finite agnostic PAC sample complexity.

Significance. If the central theorem holds, the result supplies a unified, architecture-independent explanation for the PAC learnability of modern feedforward networks, deriving finite sample complexity directly from the tameness of definable computation rather than case-by-case VC calculations. It repositions learnability as a baseline property of fixed finite architectures and shifts emphasis to inductive biases, symmetries, optimization, and scalability. The approach leverages standard facts from o-minimality theory and PAC learning without introducing new parameters or bounds.

major comments (2)
  1. [§3] §3 (Main Theorem): The preservation of definability under composition, finite sums, and projections is invoked to conclude that the full network map remains definable; while this follows from the general properties of o-minimal structures, the manuscript should explicitly verify that the specific operations listed (e.g., softmax in attention and layer-norm) are definable in the chosen expansion before applying the VC-finiteness theorem.
  2. [§4.2] §4.2 (Examples): The treatment of GNNs and transformers assumes fixed graph or sequence size to keep the architecture finite; the finite-sample-complexity claim would be strengthened by a brief remark confirming that this fixed-size restriction is maintained throughout the definability argument.
minor comments (3)
  1. [Abstract] Abstract: The qualifier 'in a precise sense' should be expanded to name the model-theoretic theorem (finite VC-dimension for definable families) that bridges o-minimality to PAC sample complexity.
  2. [Introduction] Introduction: Include a short reference to the foundational results on o-minimal structures (e.g., van den Dries or Pillay-Steinhorn) and to the standard PAC theorem relating pseudo-dimension to agnostic sample complexity.
  3. [Notation] Notation: Ensure consistent use of 'definable family' versus 'parameterized function class' when moving between model theory and learning-theoretic statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the supportive summary and recommendation of minor revision. The comments are helpful for improving clarity. We address each major comment below.

read point-by-point responses
  1. Referee: [§3] §3 (Main Theorem): The preservation of definability under composition, finite sums, and projections is invoked to conclude that the full network map remains definable; while this follows from the general properties of o-minimal structures, the manuscript should explicitly verify that the specific operations listed (e.g., softmax in attention and layer-norm) are definable in the chosen expansion before applying the VC-finiteness theorem.

    Authors: We agree that adding explicit verification improves self-containment. In the revised §3 we will insert a short paragraph confirming that the listed operations (softmax via exponentials and normalization, layer-norm via arithmetic means/variances and divisions, attention via the same, etc.) are definable in the semialgebraic structure or in R_exp, as appropriate, prior to invoking the model-theoretic VC/pseudo-dimension results; a formula-level sketch follows these responses. This uses only standard facts already implicit in the paper. revision: yes

  2. Referee: [§4.2] §4.2 (Examples): The treatment of GNNs and transformers assumes fixed graph or sequence size to keep the architecture finite; the finite-sample-complexity claim would be strengthened by a brief remark confirming that this fixed-size restriction is maintained throughout the definability argument.

    Authors: We thank the referee for the suggestion. The fixed-size restriction is part of the problem statement (fixed finite architecture) and is preserved by construction in the definability argument. We will add one clarifying sentence in §4.2 stating that the fixed graph or sequence length is maintained throughout the composition, so that the overall input-to-output map remains a finite definable function in the chosen o-minimal structure. revision: yes
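As a companion to the first response, here is a minimal formula-level sketch, not taken from the manuscript, of why softmax and layer normalization sit inside the relevant o-minimal structures: both are built from field operations, square roots of positive quantities, and (for softmax) the exponential.

```latex
% Sketch only; the revised manuscript's own verification may differ in detail.
\[
  \mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{n} e^{z_j}},
  \qquad
  \mathrm{LayerNorm}(x)_i = \gamma_i\,\frac{x_i - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta_i,
  \quad
  \mu = \tfrac{1}{n}\sum_{j} x_j,\;\;
  \sigma^{2} = \tfrac{1}{n}\sum_{j} (x_j - \mu)^{2}.
\]
% softmax uses exp and a strictly positive denominator, so it is definable in R_exp; LayerNorm
% uses only arithmetic and the square root of a positive quantity, so it is already semialgebraic.
% Attention composes these with linear maps, and definability is preserved under composition.
```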

Circularity Check

0 steps flagged

No circularity; derivation relies on external model-theoretic facts

full rationale

The paper establishes that definability of each layer operation in an o-minimal structure is preserved under the finite compositions, sums, and projections that define a fixed feedforward architecture. It then invokes the standard, externally established theorem that any definable family in an o-minimal structure has finite VC-dimension (or pseudo-dimension). This directly yields finite agnostic PAC sample complexity without parameter bounds. No equation or step equates the target conclusion to a fitted quantity, a self-referential definition, or a load-bearing self-citation whose validity depends on the present work. The argument is therefore self-contained against independent benchmarks from model theory and statistical learning theory.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The claim rests on the domain assumption that o-minimality implies finite sample complexity for the induced function classes, together with standard facts from statistical learning theory; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption Definable sets and functions in an o-minimal structure yield function classes with finite sample complexity in the agnostic PAC model
    Invoked to conclude that any fixed architecture built from such layers inherits finite sample complexity.
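A standard formulation of the single axiom in the ledger, paraphrased from the model-theoretic literature (cf. Laskowski, ref. [56]) rather than quoted from the paper:

```latex
% Paraphrase of the external input; not the paper's own statement.
\[
  \mathcal{M} \ \text{o-minimal},\ \varphi(x;y)\ \text{a formula}
  \;\Longrightarrow\;
  \mathrm{VC}\bigl(\{\, \varphi(\mathcal{M}^{n}; b) : b \in \mathcal{M}^{m} \,\}\bigr) < \infty .
\]
% With b ranging over the network parameters, this yields finite VC/pseudo-dimension for the
% hypothesis class of any fixed definable architecture; combined with the standard
% pseudo-dimension-to-sample-complexity theorem, it gives the finite agnostic sample
% complexity invoked above.
```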

pith-pipeline@v0.9.0 · 5503 in / 1338 out tokens · 66377 ms · 2026-05-11T01:02:57.713216+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

96 extracted references · 17 canonical work pages · 8 internal anchors

  1. [1]

    Open problem: Are all VC-classes CPAC learnable?

    Sushant Agarwal, Nivasini Ananthakrishnan, Shai Ben-David, Tosca Lechner, and Ruth Urner. Open problem: Are all VC-classes CPAC learnable? In Proceedings of the 34th Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 4636–4641. PMLR, 2021

  2. [2]

    Scale-sensitive dimensions, uniform convergence, and learnability

    Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997

  3. [3]

    OptNet: Differentiable optimization as a layer in neural networks

    Brandon Amos and J. Zico Kolter. OptNet: Differentiable optimization as a layer in neural networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 136–145. PMLR, 2017

  4. [4]

    Neural Network Learning: Theoretical Foundations

    Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, Cambridge, 1999

  5. [5]

    Optimal learners for realizable regression: PAC learning and online learning

    Idan Attias, Steve Hanneke, Alkis Kalavasis, Amin Karbasi, and Grigoris Velegkas. Optimal learners for realizable regression: PAC learning and online learning. In Thirty-seventh Conference on Neural Information Processing Systems, 2023

  6. [6]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016

  7. [7]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In ICLR, 2015

  8. [8]

    Deep equilibrium models

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. Deep equilibrium models. In Advances in Neural Information Processing Systems, volume 32, 2019

  9. [9]

    Digraphs: Theory, Algorithms and Applications

    Jørgen Bang-Jensen and Gregory Z Gutin. Digraphs: Theory, Algorithms and Applications. Springer Science & Business Media, 2008

  10. [10]

    P. L. Bartlett, S. R. Kulkarni, and S. E. Posner. Covering numbers for real-valued function classes. IEEE Trans. Inform. Theory, 43(5):1721–1724, 1997

  11. [11]

    Vapnik-Chervonenkis dimension of neural nets

    Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. In Michael A. Arbib, editor, The Handbook of Brain Theory and Neural Networks, pages 1188–1192. MIT Press, 2nd edition, 2003

  12. [12]

    Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

    Peter L. Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. J. Mach. Learn. Res., 20:63:1–63:17, 2019

  13. [13]

    Tameness in geometry and arithmetic: beyond o-minimality

    Gal Binyamini and Dmitry Novikov. Tameness in geometry and arithmetic: beyond o-minimality. In International Congress of Mathematicians, pages 1440–1461. European Mathematical Society-EMS-Publishing House GmbH, 2023

  14. [14]

    Sharply o-minimal structures and sharp cellular decomposition

    Gal Binyamini, Dmitri Novikov, and Benny Zack. Sharply o-minimal structures and sharp cellular decomposition. arXiv preprint arXiv:2209.10972, 2022

  15. [15]

    Wilkie’s conjecture for Pfaffian structures

    Gal Binyamini, Dmitry Novikov, and Benny Zak. Wilkie’s conjecture for Pfaffian structures. Annals of Mathematics, 199(2):795–821, 2024

  16. [16]

    Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989

  17. [17]

    A fine-grained characterization of PAC learnability

    Marco Bressan, Nataly Brukhim, Nicolò Cesa-Bianchi, Emmanuel Esposito, Yishay Mansour, Shay Moran, and Maximilian Thiessen. A fine-grained characterization of PAC learnability. In COLT, Proceedings of Machine Learning Research, pages 641–676. PMLR, 2025

  18. [18]

    Geometric deep learning: Going beyond Euclidean data

    Michael M. Bronstein, Joan Bruna, Yann LeCun, Arthur Szlam, and Pierre Vandergheynst. Geometric deep learning: Going beyond Euclidean data. IEEE Signal Process. Mag., 34:18–42, 2016

  19. [19]

    Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges

    Michael M Bronstein, Joan Bruna, Taco Cohen, and Petar Veličković. Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. arXiv preprint arXiv:2104.13478, 2021

  20. [20]

    Model theory and machine learning

    Hunter Chase and James Freitag. Model theory and machine learning. Bulletin of Symbolic Logic, 25(3):319–332, 2019

  21. [21]

    Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems, volume 31, pages 6572–6583, 2018

  22. [22]

    Transformers in Uniform TC0

    David Chiang. Transformers in Uniform TC0. Transactions on Machine Learning Research, 2025

  23. [23]

    Probabilistic circuits: A unifying framework for tractable probabilistic models

    Y Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic circuits: A unifying framework for tractable probabilistic models. UCLA. URL: http://starai.cs.ucla.edu/papers/ProbCirc20.pdf, 6, 2020

  24. [24]

    Fast and accurate deep network learning by Exponential Linear Units (ELUs)

    Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by Exponential Linear Units (ELUs). In Proceedings of the International Conference on Learning Representations (ICLR), 2016

  25. [25]

    An Introduction to O-Minimal Geometry

    Michel Coste. An Introduction to O-Minimal Geometry. Institut de Recherche Mathématique de Rennes (IRMAR), Rennes, France, 1999. Lecture notes

  26. [26]

    Mathematical foundations of geometric deep learning

    Haitz Sáez de Ocáriz Borde and Michael Bronstein. Mathematical foundations of geometric deep learning. arXiv preprint arXiv:2508.02723, 2025

  27. [27]

    DeepSeek-V3 Technical Report

    DeepSeek-AI. DeepSeek-V3 technical report. CoRR, abs/2412.19437, 2024

  28. [28]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations, 2019

  29. [29]

    Find a witness or shatter: the landscape of computable PAC learning

    Valentino Delle Rose, Alexander Kozachinskiy, Cristóbal Rojas, and Tomasz Steifer. Find a witness or shatter: the landscape of computable PAC learning. In Proceedings of the 36th Conference on Learning Theory, volume 195 of Proceedings of Machine Learning Research, pages 511–524. PMLR, 2023

  30. [30]

    VC dimension of graph neural networks with Pfaffian activation functions

    Giuseppe Alessio D’Inverno, Monica Bianchini, and Franco Scarselli. VC dimension of graph neural networks with Pfaffian activation functions. Neural Networks, 182:106924, 2025

  31. [31]

    Nan Du, Yanping Huang, Andrew M. Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten Bosma, Zongwei Zhou, Tao Wang, Yu Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc V Le, Yonghui Wu, Zhifeng Chen, a...

  32. [32]

    Inductive biases and variable creation in self-attention mechanisms

    Benjamin L. Edelman, Surbhi Goel, Sham M. Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In ICML, Proceedings of Machine Learning Research, pages 5793–

  33. [33]

    Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990

  34. [34]

    Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position

    Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, 1980

  35. [35]

    Complexity of computations with Pfaffian and Noetherian functions

    Andrei Gabrielov and Nicolai Vorobjov. Complexity of computations with Pfaffian and Noetherian functions. In Normal Forms, Bifurcations and Finiteness Problems in Differential Equations, volume 137 of NATO Sci. Ser. II Math. Phys. Chem., pages 211–250. Kluwer Acad. Publ., Dordrecht, 2004

  36. [36]

    Generalization and representational limits of graph neural networks

    Vikas Garg, Stefanie Jegelka, and Tommi Jaakkola. Generalization and representational limits of graph neural networks. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3419–3430. PMLR, 2020

  37. [37]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-Yong Sohn, Kangwook Lee, Jason D. Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 11398–11442. PMLR, 2023

  38. [38]

    Deep sparse rectifier neural networks

    Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323. PMLR, 2011

  39. [39]

    Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers

    Paul W. Goldberg and Mark R. Jerrum. Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers. Machine Learning, 18:131–148, 1995

  40. [40]

    Gemma 4 model card

    Google. Gemma 4 model card. https://ai.google.dev/gemma/docs/core/model_card_4, 2026

  41. [41]

    On the hardness of robust classification

    Pascale Gourdeau, Varun Kanade, Marta Kwiatkowska, and James Worrell. On the hardness of robust classification. Journal of Machine Learning Research, 22(273):1–29, 2021

  42. [42]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  43. [43]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

  44. [44]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415, 2016

  45. [45]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997

  46. [46]

    Batch normalization: Accelerating deep network training by reducing internal covariate shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456. PMLR, 2015

  47. [47]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  48. [48]

    Boolean Function Complexity

    Stasys Jukna. Boolean Function Complexity, volume 27 of Algorithms and Combinatorics. Springer, Heidelberg, 2012. Advances and frontiers

  49. [49]

    Tropical circuit complexity

    Stasys Jukna. Tropical circuit complexity. Limits of Pure Dynamic Programming/by Stasys Jukna, 2, 2023

  50. [50]

    Scaling laws for neural language models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models, 2020

  51. [51]

    Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks

    Marek Karpinski and Angus Macintyre. Polynomial bounds for VC dimension of sigmoidal and general Pfaffian neural networks. Journal of Computer and System Sciences, 54(1):169–176, 1997

  52. [52]

    Efficient distribution-free learning of probabilistic concepts

    Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. J. Comput. Syst. Sci., 48(3):464–497, 1994

  53. [53]

    A. G. Khovanskiĭ. Fewnomials, volume 88 of Translations of Mathematical Monographs. American Mathematical Society, Providence, RI, 1991. Translated from the Russian by Smilka Zdravkovska

  54. [54]

    Quantifying the limits of AI reasoning: Systematic neural network representations of algorithms

    Anastasis Kratsios, Dennis Zvigelsky, and Bradd Hart. Quantifying the limits of AI reasoning: Systematic neural network representations of algorithms. arXiv preprint arXiv:2508.18526, 2025

  55. [55]

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, pages 1106–1114, 2012

  56. [56]

    Vapnik-Chervonenkis classes of definable sets

    Michael C. Laskowski. Vapnik-Chervonenkis classes of definable sets. Journal of the London Mathematical Society, 45(2):377–384, 1992

  57. [57]

    Gradient-based learning applied to document recognition

    Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998

  58. [58]

    Certifiable Boolean reasoning is universal

    Wenhao Li, Anastasis Kratsios, Hrad Ghoukasian, and Dennis Zvigelsky. Certifiable Boolean reasoning is universal. arXiv preprint arXiv:2602.05120, 2026

  59. [59]

    Higher-order transformer derivative estimates for explicit pathwise learning guarantees

    Yannick Limmer, Anastasis Kratsios, Xuwei Yang, Raeid Saqur, and Blanka Horvath. Higher-order transformer derivative estimates for explicit pathwise learning guarantees. arXiv preprint arXiv:2405.16563, 2024

  60. [60]

    Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm

    Nick Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1988

  61. [61]

    KAN: Kolmogorov-Arnold Networks

    Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljacic, Thomas Y. Hou, and Max Tegmark. KAN: Kolmogorov-Arnold Networks. In ICLR, 2025

  62. [62]

    Algebraic Circuits

    Antonio Lloris Ruiz, Encarnación Castillo Morales, Luis Parrilla Roure, and Antonio García Ríos. Algebraic Circuits, volume 66 of Intelligent Systems Reference Library. Springer, Heidelberg, 2014

  63. [63]

    Rectifier nonlinearities improve neural network acoustic models

    Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th International Conference on Machine Learning (ICML) Workshop on Deep Learning for Audio, Speech and Language Processing, 2013

  64. [64]

    Angus Macintyre and Eduardo D. Sontag. Finiteness results for sigmoidal “neural” networks. In Proceedings of the Twenty-Fifth Annual ACM Symposium on Theory of Computing, pages 325–334, 1993

  65. [65]

    WL meet VC

    Christopher Morris, Floris Geerts, Jan Tönshoff, and Martin Grohe. WL meet VC. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 25275–25302. PMLR, 2023

  66. [66]

    Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning, pages 807–814, 2010

  67. [67]

    Mathematical Theory of Deep Learning

    Philipp Petersen and Jakob Zech. Mathematical Theory of Deep Learning. arXiv preprint arXiv:2407.18384, 2024

  68. [68]

    Definable sets in ordered structures

    Anand Pillay and Charles Steinhorn. Definable sets in ordered structures. I. Transactions of the American Mathematical Society, 295(2):565–592, 1986

  69. [69]

    Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. In International Conference on Learning Representations Workshop, 2018

  70. [70]

    Hierarchical models of object recognition in cortex

    Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999

  71. [71]

    Elucidating graph neural networks, transformers, and graph transformers

    Haitz Sáez De Ocáriz Borde. Elucidating graph neural networks, transformers, and graph transformers. ResearchGate, 2024

  72. [72]

    Understanding Machine Learning: From Theory to Algorithms

    Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014

  73. [73]

    GLU Variants Improve Transformer

    Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020

  74. [74]

    Bounds on the number of examples needed for learning functions

    Hans Ulrich Simon. Bounds on the number of examples needed for learning functions. SIAM Journal on Computing, 26(3):751–763, 1997

  75. [75]

    The Pfaffian closure of an o-minimal structure

    Patrick Speissegger. The Pfaffian closure of an o-minimal structure. Journal für die Reine und Angewandte Mathematik, 1999(508):189–211, 1999

  76. [76]

    Training very deep networks

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Training very deep networks. In Proceedings of the 29th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, pages 2377–2385, Cambridge, MA, USA, 2015. MIT Press

  77. [77]

    Highway Networks

    Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015

  78. [78]

    Roformer: Enhanced transformer with rotary position embedding

    Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  79. [79]

    Fourier features let networks learn high frequency functions in low dimensional domains

    Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. In Advances in Neural Information Processing Systems, volume 33, pages 7537–7547, 2020

  80. [80]

    Gemma 3 Technical Report

    Gemma Team. Gemma 3 technical report. CoRR, abs/2503.19786, 2025

Showing first 80 references.