Recognition: 4 theorem links · Lean Theorem
Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
Pith reviewed 2026-05-08 19:31 UTC · model grok-4.3
The pith
Floating-point networks with automatic differentiation can exactly represent arbitrary floating-point function values and their gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given any floating-point function ϕ, arbitrary target function values can be represented by a floating-point network f, and arbitrary target gradients by D^AD(ϕ ∘ f). The result extends to multiple functions ϕ1 through ϕn, where the D^AD(ϕi ∘ f) simultaneously represent arbitrary gradients while f represents the target values, under mild conditions. These statements hold for practical activation functions such as ReLU, ELU, GeLU, Swish, Sigmoid, and tanh.
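A hedged symbolic paraphrase makes the quantifier order explicit: for every finite choice of targets there exists a network, not the other way around. The notation below (finite input set X, floating-point set 𝔽, target maps f* and g*) is ours rather than the paper's, and the exact hypotheses are those of the paper's theorems.

```latex
% Hedged paraphrase of the two claims; X, \mathbb{F}, f^*, g^* are our notation.
% Single-loss case:
\[
  \forall\, f^{*}\!:X\to\mathbb{F},\ g^{*}\!:X\to\mathbb{F}^{d}
  \ \ \exists\ \text{FP network } f:\quad
  f(x)=f^{*}(x),\qquad
  D^{\mathtt{AD}}(\phi\circ f)(x)=g^{*}(x)
  \quad\text{for all } x\in X.
\]
% Multi-loss extension (under the paper's ``mild conditions''):
\[
  \exists\ f:\quad
  f(x)=f^{*}(x),\qquad
  D^{\mathtt{AD}}(\phi_i\circ f)(x)=g^{*}_i(x)
  \quad\text{for all } i\in\{1,\dots,n\},\ x\in X.
\]
```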
What carries the argument
The composition of a floating-point network f with automatic differentiation D^AD applied to a loss ϕ ∘ f, which encodes both exact target values in the network output and exact gradients in the differentiated loss.
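As a deliberately tiny illustration of this machinery, the sketch below uses PyTorch's reverse-mode autograd as a stand-in for D^AD: a one-neuron float32 "network" f composed with a trivial loss ϕ yields a value and an input gradient that are exact, because every intermediate happens to be representable. This is our toy example, not the paper's construction, which handles arbitrary finite target sets.

```python
# Minimal sketch (ours, not the paper's construction): f is a one-neuron float32
# network, phi a trivial loss, and autograd plays the role of D^AD(phi ∘ f).
import torch

x = torch.tensor([0.5], requires_grad=True)        # float32 input
w, b = torch.tensor([3.0]), torch.tensor([1.75])   # float32 weights
f = torch.relu(w * x + b)                           # f(x) = relu(1.5 + 1.75) = 3.25, exact
phi = f.sum()                                       # phi(z) = z, so phi ∘ f = f
phi.backward()                                      # reverse-mode AD of phi ∘ f w.r.t. x

print(f.item())       # 3.25 : every intermediate is exactly representable in float32
print(x.grad.item())  # 3.0  : the AD gradient equals w on the active ReLU branch, exactly
```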
If this is right
- Any finite collection of input-output pairs and corresponding gradient vectors can be realized exactly by some network under floating-point arithmetic.
- Gradient-based optimization can in principle reach zero training loss with exact gradient matches without residual discretization error.
- Multiple loss functions can be optimized simultaneously with independent gradient control while sharing the same network weights.
- Standard activations already suffice; no special activation is required for the representation property.
Where Pith is reading between the lines
- The main barrier to exact representation in practice is likely the choice of architecture or the optimization procedure rather than floating-point rounding itself.
- This property could be used to construct networks that serve as exact oracles for testing gradient accuracy of other implementations.
- The multi-loss version suggests a route to multi-task learning where each task's gradient can be prescribed independently.
Load-bearing premise
That the set of floating-point numbers, the rounding rules, and the network architecture together provide enough degrees of freedom to hit any desired finite collection of values and gradients.
What would settle it
Exhibit a concrete ϕ, a finite set of input points, target values, and target gradients such that no floating-point network f of any size produces exactly those values at the inputs while D^AD(ϕ ∘ f) produces the target gradients.
Original abstract
Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $\phi$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(\phi\circ f)$, respectively. We further extend this result: given $\phi_1,\dots,\phi_n$, $D^\mathtt{AD}(\phi_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that, under floating-point arithmetic, there exist neural networks f (with practical activations such as ReLU, ELU, GELU, Swish, Sigmoid, and tanh) such that for any given floating-point function ϕ, the network f can realize arbitrary target values while D^AD(ϕ ∘ f) realizes arbitrary target gradients. It further claims that, for multiple functions ϕ1,…,ϕn, a single f can simultaneously realize arbitrary target values while each D^AD(ϕi ∘ f) realizes arbitrary target gradients, provided mild conditions hold.
Significance. If the stated results hold with explicit hypotheses, the work would be significant: it supplies an exact-representation theorem (rather than an approximation result) that directly accounts for finite floating-point sets and the mechanics of automatic differentiation. This strengthens the theoretical basis for understanding expressivity in deployed networks, and the explicit coverage of common activations adds to its practical relevance.
major comments (1)
- [Abstract] The multi-function extension is asserted to hold 'under mild conditions,' yet no explicit statement of those conditions (e.g., requirements on the support of the target gradient vectors, rounding-mode assumptions, minimum network depth, or restrictions on the floating-point domain) appears. Because the single-ϕ case is stated unconditionally while the multi-ϕ case is the more useful extension, the precise scope of the central claim cannot be assessed.
minor comments (1)
- [Abstract] The operator D^AD is used without a brief inline definition or a reference to its precise implementation (forward-mode, reverse-mode, or specific rounding behavior), which would aid readers unfamiliar with the exact AD formulation employed.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the need to clarify the scope of the multi-function result in the abstract. We address the comment below and will incorporate the suggested change in the revised manuscript.
Point-by-point responses
Referee: [Abstract] The multi-function extension is asserted to hold 'under mild conditions,' yet no explicit statement of those conditions (e.g., requirements on the support of the target gradient vectors, rounding-mode assumptions, minimum network depth, or restrictions on the floating-point domain) appears. Because the single-ϕ case is stated unconditionally while the multi-ϕ case is the more useful extension, the precise scope of the central claim cannot be assessed.
Authors: We agree that the abstract should explicitly indicate the conditions under which the multi-ϕ result holds, so that readers can immediately assess its scope. The mild conditions are stated in the body of the paper (specifically, in the hypotheses of the theorem establishing the multi-function case): the target gradient vectors must lie in the representable floating-point set for the chosen precision and rounding mode, the network must have sufficient depth to realize the required piecewise-linear or smooth mappings for the listed activations, and the floating-point domain must be finite and closed under the arithmetic operations used by the network and the automatic-differentiation procedure. We will revise the abstract to include a concise parenthetical statement of these conditions, thereby making the distinction between the unconditional single-ϕ claim and the conditional multi-ϕ claim transparent.
Revision: yes
Circularity Check
No circularity: theoretical representation result derived from first principles on FP arithmetic
Full rationale
The paper claims to prove that floating-point networks f with AD-computed gradients D^AD(ϕ ∘ f) can realize arbitrary target values and gradients for a given loss ϕ, and extends this to simultaneous representation for multiple ϕi under mild conditions. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the abstract and structure present direct mathematical arguments on finite FP sets and practical activations (ReLU etc.) without importing uniqueness theorems or ansatzes from prior author work. The unspecified mild conditions affect scope but do not make the derivation tautological. The result is self-contained, does not lean on external FP benchmarks, and does not rename known empirical patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Floating-point arithmetic operates over a finite subset of the reals, with round-off errors in machine operations.
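A minimal numpy sketch of this domain assumption (ours, for illustration): float32 is a finite set with a machine epsilon and a largest value, and because every operation rounds, real-arithmetic identities such as associativity of addition can fail.

```python
# Minimal illustration of the ledger's axiom: float32 is finite and operations round.
import numpy as np

big, one = np.float32(2.0**24), np.float32(1.0)
print((big + one) + one)      # 16777216.0 : each +1 is lost to round-to-nearest-even
print(big + (one + one))      # 16777218.0 : so floating-point addition is not associative

info = np.finfo(np.float32)
print(info.eps, info.max)     # machine epsilon and the largest finite float32 value
```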
Lean theorems connected to this paper
- Foundation.LogicAsFunctionalEquation / Cost.FunctionalEquation (washburn_uniqueness_aczel), tag: unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "given a floating-point function ϕ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network f and D^AD(ϕ∘f), respectively"
- Foundation.BranchSelection (interactionDefect / coupling combiner) (branch_selection), tag: unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "Theorem 3.2 ... D^AD(ϕ∘f, x) changes accordingly ... by exploiting the non-associativity of floating-point multiplication"
- Foundation.AlphaCoordinateFixation (parameter-free derivations) (alpha_pin_under_high_calibration), tag: unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "for any L ≥ 2^{q+1} + 2p + 11, there exists an L-layer σ network f such that f(x) = f*(x) and D_{f,x}(y) = g*(x,y)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.