pith. machine review for the scientific record.

arxiv: 2605.01702 · v1 · submitted 2026-05-03 · 💻 cs.LG

Recognition: 4 Lean theorem links

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 19:31 UTC · model grok-4.3

classification 💻 cs.LG
keywords: neural networks · automatic differentiation · floating-point arithmetic · function representation · gradients · ReLU · loss functions

The pith

Floating-point networks with automatic differentiation can exactly represent arbitrary floating-point function values and their gradients.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that for any chosen floating-point loss function ϕ, there exists a floating-point neural network f such that the network outputs match any prescribed target values while the automatic differentiation of ϕ composed with f matches any prescribed gradient values. This holds even though all operations use finite floating-point numbers and rounding. The result applies directly to common activations including ReLU, ELU, GeLU, Swish, sigmoid, and tanh. A further extension shows that multiple loss functions can be handled simultaneously, with each producing its own arbitrary gradient via AD while the shared network f still matches the target values, provided mild conditions are met.
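As a toy illustration of what exact representation means here (an editorial sketch, not the paper's construction: the one-neuron network `f`, the loss `phi`, and the dual-number AD below are all chosen for this example), a forward-mode pass in float64 can hit both a target value and a target gradient with equality, because every intermediate happens to be representable:

```python
from dataclasses import dataclass

@dataclass
class Dual:
    # forward-mode AD pair: (value, derivative), both ordinary float64
    v: float
    d: float
    def __add__(self, o): return Dual(self.v + o.v, self.d + o.d)
    def __mul__(self, o): return Dual(self.v * o.v, self.v * o.d + self.d * o.v)

def const(c): return Dual(c, 0.0)

def relu(x):
    return Dual(max(x.v, 0.0), x.d if x.v > 0 else 0.0)

def f(x):                        # tiny "network": f(x) = 1 + 2*relu(x)
    return const(1.0) + const(2.0) * relu(x)

def phi(t):                      # a loss, phi(t) = 0.5 * t^2
    return const(0.5) * t * t

out = phi(f(Dual(1.0, 1.0)))     # evaluate phi∘f and its AD derivative at x = 1.0
print(out.v, out.d)              # 4.5 6.0 — both exact in float64
```

Every product and sum above lands on a small dyadic rational, so float64 rounding is the identity at each step; the paper's result is that such exact hits are achievable for arbitrary float targets, not just convenient ones.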

Core claim

Given any floating-point function ϕ, arbitrary function values can be represented by a floating-point network f and arbitrary gradients by D^AD(ϕ ∘ f). The result extends to multiple functions ϕ1 through ϕn where D^AD(ϕi ∘ f) simultaneously represent arbitrary gradients while f represents the target values, under mild conditions. These statements hold for the listed practical activation functions.

What carries the argument

The composition of a floating-point network f with automatic differentiation D^AD applied to a loss ϕ ∘ f, which encodes both exact target values in the network output and exact gradients in the differentiated loss.

If this is right

  • Any finite collection of input-output pairs and corresponding gradient vectors can be realized exactly by some network under floating-point arithmetic.
  • Gradient-based optimization can in principle reach zero training loss, with gradients matched exactly and no residual discretization error.
  • Multiple loss functions can be optimized simultaneously with independent gradient control while sharing the same network weights.
  • Standard activations already suffice; no special activation is required for the representation property.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The main barrier to exact representation in practice is likely the choice of architecture or the optimization procedure rather than floating-point rounding itself.
  • This property could be used to construct networks that serve as exact oracles for testing gradient accuracy of other implementations.
  • The multi-loss version suggests a route to multi-task learning where each task's gradient can be prescribed independently.
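One way the oracle idea could look in practice (a sketch under the assumption that exactly-known networks exist; the reference network here is hand-built for illustration, not drawn from the paper): fix a network whose input gradient is known exactly, then check a candidate gradient implementation against it with `==` rather than a tolerance. Central finite differences with a power-of-two step keep the probe arithmetic itself exact for this particular f:

```python
def f(x):
    # reference network with exactly known behavior:
    # f(x) = 1 + 2*relu(x), so df/dx = 2.0 exactly for x > 0
    return 1.0 + 2.0 * max(x, 0.0)

EXACT_GRAD_AT_1 = 2.0

def candidate_grad(g, x, h=2.0 ** -20):
    # implementation under test: central finite differences; a
    # power-of-two h keeps x ± h, the products, and the final
    # division exact in float64 for this particular f
    return (g(x + h) - g(x - h)) / (2.0 * h)

got = candidate_grad(f, 1.0)
print(got == EXACT_GRAD_AT_1)    # True: no tolerance needed against an exact oracle
```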

Load-bearing premise

That the set of floating-point numbers, the rounding rules, and the network architecture together provide enough degrees of freedom to hit any desired finite collection of values and gradients.

What would settle it

Exhibit a concrete ϕ, a finite set of input points, target values, and target gradients such that no floating-point network f of any size produces exactly those values at the inputs while D^AD(ϕ ∘ f) produces the target gradients.
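A falsification attempt of this shape can at least be prototyped as a brute-force search in a toy float format (purely illustrative: the 3-bit-significand format, the one-neuron architecture, and the simplified AD rule for ϕ = identity are all assumptions, not the paper's setting). Note the paper's claim quantifies over networks of any size, so a gap found for one tiny architecture would not yet be a counterexample:

```python
import itertools

def build_format():
    # toy floats: sign * m * 2^(e-2) with normalized 3-bit significand m
    vals = {0.0}
    for s in (1, -1):
        for m in range(4, 8):
            for e in range(-2, 3):
                vals.add(s * m * 2.0 ** (e - 2))
    return sorted(vals)

F = build_format()

def rnd(x):
    # round-to-nearest onto F (ties toward smaller magnitude, for simplicity)
    return min(F, key=lambda v: (abs(v - x), abs(v)))

def forward_and_grad(w, b, x):
    pre = rnd(rnd(w * x) + b)          # affine layer, rounded after each op
    val = pre if pre > 0 else 0.0      # ReLU
    grad = w if pre > 0 else 0.0       # simplified AD rule for phi = identity
    return val, grad

x0 = 1.0
attain = {forward_and_grad(w, b, x0) for w, b in itertools.product(F, F)}
targets = set(itertools.product(F, F))
print(f"|F| = {len(F)}, attainable (value, gradient) pairs: "
      f"{len(attain & targets)} of {len(targets)}")
```

Scaling this enumeration past toy formats and one-neuron nets is infeasible, which is why the question calls for a proof rather than a search.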

read the original abstract

Theoretical studies show that for any differentiable function on a compact domain, there exists a neural network that approximates both the function values and gradients. However, such a result cannot be used in practice since it assumes real parameters and exact internal operations. In contrast, real implementations only use a finite subset of reals and machine operations with round-off errors. In this work, we investigate whether a similar result holds for neural networks under floating-point arithmetic, when the gradient with respect to the input is computed by the automatic differentiation algorithm $D^\mathtt{AD}$. We first show that given a floating-point function $\phi$ (e.g., a loss function), arbitrary function values and gradients can be represented by a floating-point network $f$ and $D^\mathtt{AD}(\phi\circ f)$, respectively. We further extend this result: given $\phi_1,\dots,\phi_n$, $D^\mathtt{AD}(\phi_i\circ f)$ can simultaneously represent arbitrary gradients while $f$ represents the target values, under mild conditions. Our results hold for practical activation functions, e.g., $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Sigmoid}$, and $\mathrm{tanh}$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that, under floating-point arithmetic, there exist neural networks f (with practical activations such as ReLU, ELU, GELU, Swish, Sigmoid, and tanh) such that for any given floating-point function ϕ, the network f can realize arbitrary target values while D^AD(ϕ ∘ f) realizes arbitrary target gradients. It further claims that, for multiple functions ϕ1,…,ϕn, a single f can simultaneously realize arbitrary target values while each D^AD(ϕi ∘ f) realizes arbitrary target gradients, provided mild conditions hold.

Significance. If the stated results hold with explicit hypotheses, the work would be significant: it supplies an exact-representation theorem (rather than an approximation result) that directly accounts for finite floating-point sets and the mechanics of automatic differentiation. This strengthens the theoretical basis for understanding expressivity in deployed networks, and the explicit coverage of common activations broadens its practical reach.

major comments (1)
  1. [Abstract] The multi-function extension is asserted to hold 'under mild conditions,' yet no explicit statement of those conditions (e.g., requirements on the support of the target gradient vectors, rounding-mode assumptions, minimum network depth, or restrictions on the floating-point domain) appears. Because the single-ϕ case is stated unconditionally while the multi-ϕ case is the more useful extension, the precise scope of the central claim cannot be assessed.
minor comments (1)
  1. [Abstract] The operator D^AD is used without a brief inline definition or reference to its precise implementation (forward-mode, reverse-mode, or specific rounding behavior), which would aid readers unfamiliar with the exact AD formulation employed.
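To ground the terminology the comment asks about, here is one concrete reading of D^AD (a reverse-mode sketch; the paper's operator may be forward-mode or differ in rounding details). Every stored value below is an ordinary float64, so each primitive and each gradient accumulation is itself a rounded machine operation:

```python
class Var:
    def __init__(self, val, parents=()):
        self.val = val          # stored as float64, i.e. already rounded
        self.parents = parents  # (parent, local_derivative) pairs
        self.grad = 0.0
    def __add__(self, o): return Var(self.val + o.val, ((self, 1.0), (o, 1.0)))
    def __mul__(self, o): return Var(self.val * o.val, ((self, o.val), (o, self.val)))

def relu(x):
    return Var(max(x.val, 0.0), ((x, 1.0 if x.val > 0 else 0.0),))

def backward(out):
    # propagate in reverse topological order so each node is processed once
    topo, seen = [], set()
    def visit(v):
        if id(v) not in seen:
            seen.add(id(v))
            for p, _ in v.parents:
                visit(p)
            topo.append(v)
    visit(out)
    out.grad = 1.0
    for v in reversed(topo):
        for p, d in v.parents:
            p.grad += v.grad * d   # each accumulation is itself rounded float64

x, w, b = Var(1.0), Var(2.0), Var(1.0)
y = relu(w * x + b)     # f(x) = relu(2x + 1)
loss = y * y            # phi(t) = t^2
backward(loss)
print(x.grad)           # analytic: d/dx (2x+1)^2 at x = 1 is 12.0, exact here
```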

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the need to clarify the scope of the multi-function result in the abstract. We address the comment below and will incorporate the suggested change in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] The multi-function extension is asserted to hold 'under mild conditions,' yet no explicit statement of those conditions (e.g., requirements on the support of the target gradient vectors, rounding-mode assumptions, minimum network depth, or restrictions on the floating-point domain) appears. Because the single-ϕ case is stated unconditionally while the multi-ϕ case is the more useful extension, the precise scope of the central claim cannot be assessed.

    Authors: We agree that the abstract should explicitly indicate the conditions under which the multi-ϕ result holds, so that readers can immediately assess its scope. The mild conditions are stated in the body of the paper (specifically, in the hypotheses of the theorem establishing the multi-function case): the target gradient vectors must lie in the representable floating-point set for the chosen precision and rounding mode, the network must have sufficient depth to realize the required piecewise-linear or smooth mappings for the listed activations, and the floating-point domain must be finite and closed under the arithmetic operations used by the network and the automatic-differentiation procedure. We will revise the abstract to include a concise parenthetical statement of these conditions, thereby making the distinction between the unconditional single-ϕ claim and the conditional multi-ϕ claim transparent. revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical representation result derived from first principles on FP arithmetic

full rationale

The paper claims to prove that floating-point networks f with AD-computed gradients D^AD(ϕ ∘ f) can realize arbitrary target values and gradients for given loss ϕ, and extends to simultaneous representation for multiple ϕ_i under mild conditions. No load-bearing step reduces to a self-definition, fitted parameter renamed as prediction, or self-citation chain; the abstract and structure present direct mathematical arguments on finite FP sets and practical activations (ReLU etc.) without importing uniqueness theorems or ansatzes from prior author work. The unspecified mild conditions affect scope but do not make the derivation tautological. The result is self-contained against external FP benchmarks and does not rename known empirical patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review provides no explicit free parameters or invented entities; the central claim rests on standard domain assumptions about floating-point arithmetic.

axioms (1)
  • domain assumption: Floating-point arithmetic operates over a finite subset of the reals, with round-off errors in machine operations.
    Explicitly contrasted in the abstract with the ideal real-parameter setting of prior approximation theorems.
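Both halves of the axiom are directly observable in any IEEE-754 environment (standard float64 behavior, independent of the paper):

```python
import math, sys

# Round-off: the mathematically exact sum 0.1 + 0.2 is not representable
print(0.1 + 0.2 == 0.3)                  # False

# Finiteness: floats have discrete neighbors and a largest element
print(math.nextafter(1.0, 2.0) - 1.0)    # 2**-52, the gap just above 1.0
print(sys.float_info.max)                # largest finite float64
```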

pith-pipeline@v0.9.0 · 5528 in / 1126 out tokens · 71501 ms · 2026-05-08T19:31:51.472883+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

78 extracted references · 7 canonical work pages · 2 internal anchors
