Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations
Pith reviewed 2026-06-29 13:35 UTC · model grok-4.3
The pith
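Floating-point neural networks can represent every function between floating-point domains only if their first layer distinguishes every pair of distinct inputs; under mild conditions on the activation implementation, a suitable form of distinguishability is also sufficient.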
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A floating-point neural network achieves universal representability over floating-point domains if and only if its first layer distinguishes every pair of distinct inputs, with the sufficiency direction holding once the activation implementation meets mild conditions that allow distinctions to propagate through the network.
What carries the argument
The distinguishability framework, which requires that for every pair of distinct inputs there is at least one first-layer neuron whose activation output differs on that pair.
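The distinguishability requirement can be illustrated with a minimal Python sketch. It is not the paper's construction: the helper names `make_neuron` and `undistinguished_pairs`, the toy input set, and the use of `abs` as a stand-in even activation are all assumptions chosen for illustration. The check asks whether some neuron of the form act(w * x + b) outputs different values on each pair of distinct inputs.

```python
from itertools import combinations


def make_neuron(act, w, b):
    """Hypothetical first-layer neuron x -> act(w * x + b) in Python floats."""
    return lambda x: act(w * x + b)


def undistinguished_pairs(inputs, neurons):
    """Return every pair of distinct inputs on which all neurons agree."""
    return [
        (x, y)
        for x, y in combinations(inputs, 2)
        if all(n(x) == n(y) for n in neurons)
    ]


# Toy domain and toy activation (assumptions, not taken from the paper).
inputs = [-2.0, -1.0, 0.5, 1.0, 2.0]

# Without a bias, an even activation such as abs cannot separate x from -x.
no_bias = [make_neuron(abs, 1.0, 0.0), make_neuron(abs, 2.0, 0.0)]
print(undistinguished_pairs(inputs, no_bias))    # [(-2.0, 2.0), (-1.0, 1.0)]

# Adding one neuron with a nonzero bias separates the remaining pairs.
with_bias = no_bias + [make_neuron(abs, 1.0, 0.5)]
print(undistinguished_pairs(inputs, with_bias))  # []
```

An empty result means the first layer, taken as a whole, is injective on the chosen inputs. On a real floating-point format the same check would have to cover every representable input and use the actual activation implementation.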
If this is right
- Implementations of sigmoid, tanh, ReLU, ELU, SeLU, GeLU, Swish, Mish, and sin become universal representators under arbitrary reduction orders and bounded ulp errors.
- Correctly rounded cosine and certain other activations remain non-universal even under the generalized model.
- Universal representability fails for any activation implementation whose first layer cannot separate every pair of distinct inputs.
- Prior results limited to fixed left-to-right reduction and exact rounding are subsumed by the new framework.
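Why the reduction order matters can be seen in a few lines of Python: floating-point addition is not associative, so the same terms summed in a different order can give a different result. The sketch below uses IEEE 754 double precision (Python's `float`). The helper `sequential_sum` and the three terms are assumptions chosen for illustration, and only sequential orders are enumerated, not tree-shaped reductions.

```python
from functools import reduce
from itertools import permutations


def sequential_sum(values):
    """Add the values one by one, in the order given."""
    return reduce(lambda acc, v: acc + v, values)


# 1e16 + 1.0 is not representable in double precision and rounds back to 1e16,
# so the small term is lost unless the two large terms cancel first.
terms = [1e16, 1.0, -1e16]

results = {sequential_sum(order) for order in permutations(terms)}
print(sorted(results))  # [0.0, 1.0]
```

A model that fixes left-to-right reduction sees only one of these values, whereas a model with arbitrary reduction orders must account for all of them.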
Where Pith is reading between the lines
- Hardware or compiler changes to reduction order could turn a previously universal network non-universal for a given activation.
- Verifying distinguishability only on the first layer offers a practical test for universality without enumerating all possible target functions.
Load-bearing premise
The activation implementation satisfies additional mild conditions beyond bounded ulp error that let first-layer distinctions propagate to the full network.
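To make "bounded ulp error" concrete, the following sketch measures how far a computed value is from a reference value in units in the last place. The paper's exact definition may differ. Here the reference is itself a Python float standing in for the exact result, and the helper names `ulp_distance` and `within_ulps` are hypothetical. `math.ulp` and `math.nextafter` require Python 3.9 or later.

```python
import math


def ulp_distance(value, reference):
    """Distance between value and reference, in units of ulp(reference)."""
    return abs(value - reference) / math.ulp(reference)


def within_ulps(value, reference, k):
    """True if value lies within k ulps of reference."""
    return ulp_distance(value, reference) <= k


# The next double above 1.0 is exactly one ulp away from 1.0.
one_up = math.nextafter(1.0, math.inf)
print(ulp_distance(one_up, 1.0))                       # 1.0
print(within_ulps(one_up, 1.0, 1))                     # True

# A value four ulps away violates a 2-ulp bound.
print(within_ulps(1.0 + 4 * math.ulp(1.0), 1.0, 2))    # False
```

A correctly rounded implementation is the special case in which the returned float is always the one nearest to the exact result. The inexact implementations considered here are only required to stay within some fixed number of ulps.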
What would settle it
A concrete activation implementation that distinguishes all input pairs in the first layer yet fails to represent some target function on the floating-point domain when reduction order is arbitrary.
Original abstract
Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. We investigate when floating-point neural networks can represent arbitrary functions between floating-point domains exactly. To this end, we introduce a general distinguishability framework and show that the ability to distinguish every pair of distinct inputs in the first layer is necessary for universal representability. This characterization yields broad classes of activation implementations that are not universal representators, extending previous isolated counterexamples such as the correctly rounded cosine activation. We further prove that a suitable form of distinguishability is also sufficient for universal representability under mild conditions on the activation implementation. Using this framework, we establish universal representability results for a broad class of practical activation functions, including implementations of $\mathrm{Sigmoid}$, $\tanh$, $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{SeLU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Mish}$, and $\sin$, under significantly more realistic floating-point execution models than previously known.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a distinguishability framework for floating-point neural networks under arbitrary reduction orders and inexact activations with bounded ulp errors. It proves that first-layer distinguishability of distinct inputs is necessary for a network to exactly represent arbitrary functions between floating-point domains. It further shows that a suitable distinguishability property is sufficient for universal representability under mild conditions on the activation implementation, and applies the framework to establish universal representability for implementations of Sigmoid, tanh, ReLU, ELU, SeLU, GeLU, Swish, Mish, and sin.
Significance. If the results hold, the work substantially extends prior floating-point expressivity results (limited to fixed reduction orders and correctly rounded activations) by providing a general necessity-sufficiency characterization under more realistic execution models. The framework enables both negative classifications and positive results for a broad class of practical activations, which would be a meaningful advance in the theory of neural network expressivity.
Major comments (1)
- [Sufficiency theorem] Sufficiency theorem (framework section): the claim that distinguishability is sufficient for universal representability requires 'mild conditions' on the activation implementation to propagate through arbitrary reduction orders and inexact activations; the manuscript does not explicitly verify these conditions (e.g., continuity or error-bounded monotonicity surviving reduction) for sin, Mish, or GeLU, which is load-bearing for the universal-representability results listed for those activations.
Minor comments (1)
- [Abstract / Introduction] The abstract invokes 'mild conditions' without a precise statement; a brief enumeration of the conditions in the introduction or framework section would improve readability.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's report. We address the major comment regarding the sufficiency theorem below.
- Referee: [Sufficiency theorem] Sufficiency theorem (framework section): the claim that distinguishability is sufficient for universal representability requires 'mild conditions' on the activation implementation to propagate through arbitrary reduction orders and inexact activations; the manuscript does not explicitly verify these conditions (e.g., continuity or error-bounded monotonicity surviving reduction) for sin, Mish, or GeLU, which is load-bearing for the universal-representability results listed for those activations.
  Authors: The referee correctly identifies that the sufficiency result depends on mild conditions on the activation implementations. The manuscript applies the framework to sin, Mish, and GeLU based on the fact that their standard floating-point implementations satisfy the required properties (such as the error being bounded by a small number of ulps and preserving sufficient monotonicity or continuity for the distinguishability to hold under arbitrary reductions). However, we agree that explicit verification would strengthen the presentation. We will revise the manuscript to include a dedicated verification subsection or appendix detailing how the mild conditions are satisfied for each of these activations, including sin, Mish, and GeLU. This will make the universal representability claims fully rigorous and self-contained.
  Revision: yes
Circularity Check
No significant circularity; derivation self-contained
The paper defines a distinguishability framework, proves necessity of first-layer distinguishability directly from the floating-point execution model, and establishes sufficiency under separately stated mild conditions on activations. No quoted step reduces a central claim to a fitted parameter, self-definition, or load-bearing self-citation chain; the listed activation results follow from applying the framework rather than presupposing the target representability. The derivation remains independent of its own outputs.