Recognition: 4 theorem links · Lean Theorem
Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
Pith reviewed 2026-05-08 19:31 UTC · model grok-4.3
The pith
Floating-point networks with automatic differentiation can exactly represent arbitrary floating-point function values and their gradients.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Given any floating-point function ϕ, arbitrary target function values can be represented by a floating-point network f, and arbitrary target gradients by D^AD(ϕ ∘ f). The result extends to multiple functions ϕ1 through ϕn, where the D^AD(ϕi ∘ f) simultaneously represent arbitrary gradients while f represents the target values, under mild conditions. These statements hold for practical activation functions such as ReLU, ELU, GeLU, Swish, Sigmoid, and tanh.
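A hedged symbolic paraphrase makes the quantifier order explicit: for every finite choice of targets there exists a network, not the other way around. The notation below (finite input set X, floating-point set 𝔽, target maps f* and g*) is ours rather than the paper's, and the exact hypotheses are those of the paper's theorems.

```latex
% Hedged paraphrase of the two claims; X, \mathbb{F}, f^*, g^* are our notation.
% Single-loss case:
\[
  \forall\, f^{*}\!:X\to\mathbb{F},\ g^{*}\!:X\to\mathbb{F}^{d}
  \ \ \exists\ \text{FP network } f:\quad
  f(x)=f^{*}(x),\qquad
  D^{\mathtt{AD}}(\phi\circ f)(x)=g^{*}(x)
  \quad\text{for all } x\in X.
\]
% Multi-loss extension (under the paper's ``mild conditions''):
\[
  \exists\ f:\quad
  f(x)=f^{*}(x),\qquad
  D^{\mathtt{AD}}(\phi_i\circ f)(x)=g^{*}_i(x)
  \quad\text{for all } i\in\{1,\dots,n\},\ x\in X.
\]
```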
What carries the argument
The composition of a floating-point network f with automatic differentiation D^AD applied to a loss ϕ ∘ f, which encodes both exact target values in the network output and exact gradients in the differentiated loss.
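As a deliberately tiny illustration of this machinery, the sketch below uses PyTorch's reverse-mode autograd as a stand-in for D^AD: a one-neuron float32 "network" f composed with a trivial loss ϕ yields a value and an input gradient that are exact, because every intermediate happens to be representable. This is our toy example, not the paper's construction, which handles arbitrary finite target sets.

```python
# Minimal sketch (ours, not the paper's construction): f is a one-neuron float32
# network, phi a trivial loss, and autograd plays the role of D^AD(phi ∘ f).
import torch

x = torch.tensor([0.5], requires_grad=True)        # float32 input
w, b = torch.tensor([3.0]), torch.tensor([1.75])   # float32 weights
f = torch.relu(w * x + b)                           # f(x) = relu(1.5 + 1.75) = 3.25, exact
phi = f.sum()                                       # phi(z) = z, so phi ∘ f = f
phi.backward()                                      # reverse-mode AD of phi ∘ f w.r.t. x

print(f.item())       # 3.25 : every intermediate is exactly representable in float32
print(x.grad.item())  # 3.0  : the AD gradient equals w on the active ReLU branch, exactly
```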
If this is right
- Any finite collection of input-output pairs and corresponding gradient vectors can be realized exactly by some network under floating-point arithmetic.
- Gradient-based optimization can in principle reach zero training loss with exact gradient matches without residual discretization error.
- Multiple loss functions can be optimized simultaneously with independent gradient control while sharing the same network weights.
- Standard activations already suffice; no special activation is required for the representation property.
Where Pith is reading between the lines
- The main barrier to exact representation in practice is likely the choice of architecture or the optimization procedure rather than floating-point rounding itself.
- This property could be used to construct networks that serve as exact oracles for testing gradient accuracy of other implementations.
- The multi-loss version suggests a route to multi-task learning where each task's gradient can be prescribed independently.
Load-bearing premise
That the set of floating-point numbers, the rounding rules, and the network architecture together provide enough degrees of freedom to hit any desired finite collection of values and gradients.
What would settle it
Exhibit a concrete ϕ, a finite set of input points, target values, and target gradients such that no floating-point network f of any size produces exactly those values at the inputs while D^AD(ϕ ∘ f) produces the target gradients.
Original abstract
Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $\phi$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(\phi\circ f)$, respectively. We further extend this result: given $\phi_1,\dots,\phi_n$, $D^\mathtt{AD}(\phi_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that, under floating-point arithmetic, there exist neural networks f (with practical activations such as ReLU, ELU, GELU, Swish, Sigmoid, and tanh) such that for any given floating-point function ϕ, the network f can realize arbitrary target values while D^AD(ϕ ∘ f) realizes arbitrary target gradients. It further claims that, for multiple functions ϕ1,…,ϕn, a single f can simultaneously realize arbitrary target values while each D^AD(ϕi ∘ f) realizes arbitrary target gradients, provided mild conditions hold.
Significance. If the stated results hold with explicit hypotheses, the work would be significant: it supplies an exact-representation theorem (rather than an approximation result) that directly accounts for finite floating-point sets and the mechanics of automatic differentiation. This strengthens the theoretical basis for understanding expressivity in deployed networks, and the explicit coverage of common activations adds to its practical relevance.
major comments (1)
- [Abstract] The multi-function extension is asserted to hold 'under mild conditions,' yet no explicit statement of those conditions (e.g., requirements on the support of the target gradient vectors, rounding-mode assumptions, minimum network depth, or restrictions on the floating-point domain) appears. Because the single-ϕ case is stated unconditionally while the multi-ϕ case is the more useful extension, the precise scope of the central claim cannot be assessed.
minor comments (1)
- [Abstract] The operator D^AD is used without a brief inline definition or a reference to its precise implementation (forward-mode, reverse-mode, or specific rounding behavior), which would aid readers unfamiliar with the exact AD formulation employed.
Simulated Author's Rebuttal
We thank the referee for the careful review and for identifying the need to clarify the scope of the multi-function result in the abstract. We address the comment below and will incorporate the suggested change in the revised manuscript.
Point-by-point responses
Referee: [Abstract] The multi-function extension is asserted to hold 'under mild conditions,' yet no explicit statement of those conditions (e.g., requirements on the support of the target gradient vectors, rounding-mode assumptions, minimum network depth, or restrictions on the floating-point domain) appears. Because the single-ϕ case is stated unconditionally while the multi-ϕ case is the more useful extension, the precise scope of the central claim cannot be assessed.
Authors: We agree that the abstract should explicitly indicate the conditions under which the multi-ϕ result holds, so that readers can immediately assess its scope. The mild conditions are stated in the body of the paper (specifically, in the hypotheses of the theorem establishing the multi-function case): the target gradient vectors must lie in the representable floating-point set for the chosen precision and rounding mode, the network must have sufficient depth to realize the required piecewise-linear or smooth mappings for the listed activations, and the floating-point domain must be finite and closed under the arithmetic operations used by the network and the automatic-differentiation procedure. We will revise the abstract to include a concise parenthetical statement of these conditions, thereby making the distinction between the unconditional single-ϕ claim and the conditional multi-ϕ claim transparent.
Revision: yes
Circularity Check
No circularity: theoretical representation result derived from first principles on FP arithmetic
Full rationale
The paper claims to prove that floating-point networks f with AD-computed gradients D^AD(ϕ ∘ f) can realize arbitrary target values and gradients for a given loss ϕ, and extends this to simultaneous representation for multiple ϕi under mild conditions. No load-bearing step reduces to a self-definition, a fitted parameter renamed as a prediction, or a self-citation chain; the abstract and structure present direct mathematical arguments on finite FP sets and practical activations (ReLU etc.) without importing uniqueness theorems or ansatzes from prior author work. The unspecified mild conditions affect scope but do not make the derivation tautological. The result is self-contained, does not lean on external FP benchmarks, and does not rename known empirical patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Floating-point arithmetic operates over a finite subset of the reals, with round-off errors in machine operations.
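A minimal numpy sketch of this domain assumption (ours, for illustration): float32 is a finite set with a machine epsilon and a largest value, and because every operation rounds, real-arithmetic identities such as associativity of addition can fail.

```python
# Minimal illustration of the ledger's axiom: float32 is finite and operations round.
import numpy as np

big, one = np.float32(2.0**24), np.float32(1.0)
print((big + one) + one)      # 16777216.0 : each +1 is lost to round-to-nearest-even
print(big + (one + one))      # 16777218.0 : so floating-point addition is not associative

info = np.finfo(np.float32)
print(info.eps, info.max)     # machine epsilon and the largest finite float32 value
```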
Lean theorems connected to this paper
- Foundation.LogicAsFunctionalEquation / Cost.FunctionalEquation (washburn_uniqueness_aczel), tag: unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "given a floating-point function ϕ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network f and D^AD(ϕ∘f), respectively"
- Foundation.BranchSelection (interactionDefect / coupling combiner) (branch_selection), tag: unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "Theorem 3.2 ... D^AD(ϕ∘f, x) changes accordingly ... by exploiting the non-associativity of floating-point multiplication"
- Foundation.AlphaCoordinateFixation (parameter-free derivations) (alpha_pin_under_high_calibration), tag: unclear
  Unclear relation between the paper passage and the cited Recognition theorem.
  Passage: "for any L ≥ 2^{q+1} + 2p + 11, there exists an L-layer σ network f such that f(x) = f*(x) and D_{f,x}(y) = g*(x,y)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.