arxiv: 2605.28704 · v1 · pith:PCW47THAnew · submitted 2026-05-27 · 💻 cs.LG

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

Pith reviewed 2026-06-29 13:35 UTC · model grok-4.3

classification 💻 cs.LG
keywords floating-point neural networks · expressive power · universal representability · activation functions · distinguishability · reduction orders · ulp error

The pith

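Floating-point neural networks can exactly represent every function between floating-point domains only if their first-layer activations distinguish every pair of distinct inputs; under mild conditions on the activation implementation, this distinguishability is also sufficient.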

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines when neural networks executed in finite-precision floating-point arithmetic, with arbitrary reduction orders and inexact activation implementations, can exactly represent every mapping from one floating-point set to another. It introduces a distinguishability framework and shows that the first layer must separate all distinct inputs for this universal representability to hold. The work further proves that a suitable form of distinguishability is also sufficient under mild conditions on the activation, and establishes universal representability for implementations of common functions including sigmoid, tanh, ReLU, ELU, SeLU, GeLU, Swish, Mish, and sin.

Core claim

A floating-point neural network achieves universal representability over floating-point domains if and only if its first layer distinguishes every pair of distinct inputs, with the sufficiency direction holding once the activation implementation meets mild conditions that allow distinctions to propagate through the network.
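
The necessity direction rests on a short argument that can be written out explicitly. The notation below is an assumption made for this sketch and is not taken from the paper: $\mathbb{F}$ denotes the finite set of floats, $\phi_\theta$ a single first-layer neuron with parameters $\theta$, and $\nu$ a full network whose evaluation is deterministic once the reduction order and activation implementation are fixed.

```latex
% Sketch under assumed notation (\mathbb{F}, \phi_\theta, \nu are illustrative, not the paper's symbols).
\begin{align*}
&\text{Suppose } x \neq x' \text{ in } \mathbb{F}^d \text{ and }
  \phi_\theta(x) = \phi_\theta(x') \text{ for every admissible } \theta. \\
&\text{A network reads its input only through its first layer, so }
  \nu(x) = \nu(x') \text{ for every network } \nu. \\
&\text{Hence no } \nu \text{ represents a target } f : \mathbb{F}^d \to \mathbb{F}
  \text{ with } f(x) \neq f(x'),
  \text{ and universal representability fails.}
\end{align*}
```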

What carries the argument

The distinguishability framework, which requires that for every pair of distinct inputs there is at least one first-layer neuron whose activation output differs on that pair.
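
As a rough illustration of what the distinguishability condition asks for, the following sketch checks a toy first layer over a small input set. Everything in it is an assumption made for illustration: the helper names `first_layer` and `undistinguished_pairs`, the five-element domain, the fixed weight and bias pairs, and the use of Python's binary64 floats with a plain ReLU. The paper's framework covers general floating-point formats, reduction orders, and inexact activations, none of which this toy models.

```python
from itertools import combinations

def relu(z):
    # Plain ReLU on Python floats (binary64); stands in for an activation implementation.
    return max(z, 0.0)

def first_layer(x, params, act):
    # Outputs of hypothetical first-layer neurons act(w * x + b), one per (w, b) pair.
    return tuple(act(w * x + b) for (w, b) in params)

def undistinguished_pairs(domain, params, act):
    # Pairs of distinct inputs on which every first-layer neuron produces the same output.
    return [
        (x, y)
        for x, y in combinations(domain, 2)
        if first_layer(x, params, act) == first_layer(y, params, act)
    ]

domain = [-2.0, -1.0, 0.0, 0.5, 1.0]      # toy input set, not a full float format
one_neuron = [(1.0, 0.0)]                 # relu(x): collapses all non-positive inputs
two_neurons = [(1.0, 0.0), (-1.0, 0.0)]   # relu(x) and relu(-x) together

print(undistinguished_pairs(domain, one_neuron, relu))
# [(-2.0, -1.0), (-2.0, 0.0), (-1.0, 0.0)]
print(undistinguished_pairs(domain, two_neurons, relu))
# []
```

With a single neuron, any target function that takes different values on -2.0 and -1.0 is out of reach, because the rest of the network never sees a difference between them; adding the second neuron removes every collision on this toy domain.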

If this is right

  • Implementations of sigmoid, tanh, ReLU, ELU, SeLU, GeLU, Swish, Mish, and sin become universal representators under arbitrary reduction orders and bounded ulp errors.
  • Correctly rounded cosine and certain other activations remain non-universal even under the generalized model.
  • Universal representability fails for any activation whose first layer cannot separate all distinct inputs.
  • Prior results limited to fixed left-to-right reduction and exact rounding are subsumed by the new framework.
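
The "bounded ulp errors" in the first point can be made concrete with a small measurement. The sketch below is only an illustration, assuming Python 3.9 or later (for `math.ulp` and `math.nextafter`); the helper `ulp_error` and the two sigmoid variants are hypothetical names introduced here, and neither variant is claimed to be the implementation the paper analyzes.

```python
import math

def ulp_error(approx: float, reference: float) -> float:
    # Distance between two floats, measured in units in the last place of the reference.
    return abs(approx - reference) / math.ulp(reference)

def naive_sigmoid(x: float) -> float:
    # Textbook formula; each operation rounds, so the result may be off by a few ulps.
    return 1.0 / (1.0 + math.exp(-x))

def tanh_based_sigmoid(x: float) -> float:
    # Mathematically identical formula, different rounding behaviour in floating point.
    return 0.5 * (1.0 + math.tanh(0.5 * x))

# One ulp above 1.0 is, by construction, exactly one ulp away from 1.0.
print(ulp_error(math.nextafter(1.0, 2.0), 1.0))  # 1.0

# Disagreement between the two implementations, in ulps, at a few sample points.
for x in (-3.0, -0.5, 0.25, 2.0):
    print(x, ulp_error(naive_sigmoid(x), tanh_based_sigmoid(x)))
```

Neither variant is a correctly rounded reference, so the loop reports how far two plausible implementations drift from each other rather than a true error bound.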

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Hardware or compiler changes to reduction order could turn a previously universal network non-universal for a given activation.
  • Verifying distinguishability only on the first layer offers a practical test for universality without enumerating all possible target functions.
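
The first extension above depends on floating-point addition not being associative, so the order in which a sum is reduced can change the result. The sketch below is a minimal demonstration in Python's binary64 floats. It models a reduction order only as a permutation followed by left-to-right accumulation, which is an assumption made for brevity (tree-shaped reductions are not covered), and the helper names are hypothetical.

```python
from itertools import permutations

def left_to_right_sum(values):
    # Accumulate in the given order, rounding after every addition.
    total = 0.0
    for v in values:
        total += v
    return total

def reachable_sums(values):
    # Every result obtainable by reordering the terms before accumulating.
    return {left_to_right_sum(p) for p in permutations(values)}

terms = [1e16, 1.0, -1e16]

print(left_to_right_sum([1e16, 1.0, -1e16]))   # 0.0  (the 1.0 is absorbed by 1e16)
print(left_to_right_sum([1e16, -1e16, 1.0]))   # 1.0  (the large terms cancel first)
print(sorted(reachable_sums(terms)))           # [0.0, 1.0]
```

The same three terms yield two different sums, which is why a result proved only for fixed left-to-right reduction does not automatically carry over to other execution orders.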

Load-bearing premise

The activation implementation satisfies additional mild conditions beyond bounded ulp error that let first-layer distinctions propagate to the full network.

What would settle it

A concrete activation implementation that distinguishes all input pairs in the first layer yet fails to represent some target function on the floating-point domain when reduction order is arbitrary.

Figures

Figures reproduced from arXiv: 2605.28704 by Geonho Hwang, Sejun Park, Wonyeol Lee, Yeachan Park.

Figure 1: Visualization of the conditions in Lemma
Figure 2: Visualization of the conditions in Lemma

Original abstract

Most existing expressivity theories for neural networks assume exact real arithmetic, whereas practical neural networks are executed under finite-precision floating-point arithmetic with implementation-dependent execution semantics. Recent works have begun studying the expressive power of floating-point neural networks, but existing results are limited to highly restricted activation functions and idealized assumptions such as fixed left-to-right reduction orders and correctly rounded activation implementations. In this work, we study the expressive power of floating-point neural networks under generalized floating-point execution semantics, including arbitrary reduction orders and inexact activation implementations with bounded ulp errors. We investigate when floating-point neural networks can represent arbitrary functions between floating-point domains exactly. To this end, we introduce a general distinguishability framework and show that the ability to distinguish every pair of distinct inputs in the first layer is necessary for universal representability. This characterization yields broad classes of activation implementations that are not universal representators, extending previous isolated counterexamples such as the correctly rounded cosine activation. We further prove that a suitable form of distinguishability is also sufficient for universal representability under mild conditions on the activation implementation. Using this framework, we establish universal representability results for a broad class of practical activation functions, including implementations of $\mathrm{Sigmoid}$, $\tanh$, $\mathrm{ReLU}$, $\mathrm{ELU}$, $\mathrm{SeLU}$, $\mathrm{GeLU}$, $\mathrm{Swish}$, $\mathrm{Mish}$, and $\sin$, under significantly more realistic floating-point execution models than previously known.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a distinguishability framework for floating-point neural networks under arbitrary reduction orders and inexact activations with bounded ulp errors. It proves that first-layer distinguishability of distinct inputs is necessary for a network to exactly represent arbitrary functions between floating-point domains. It further shows that a suitable distinguishability property is sufficient for universal representability under mild conditions on the activation implementation, and applies the framework to establish universal representability for implementations of Sigmoid, tanh, ReLU, ELU, SeLU, GeLU, Swish, Mish, and sin.

Significance. If the results hold, the work substantially extends prior floating-point expressivity results (limited to fixed reduction orders and correctly rounded activations) by providing a general necessity-sufficiency characterization under more realistic execution models. The framework enables both negative classifications and positive results for a broad class of practical activations, which would be a meaningful advance in the theory of neural network expressivity.

major comments (1)
  1. [Sufficiency theorem] Sufficiency theorem (framework section): the claim that distinguishability is sufficient for universal representability requires 'mild conditions' on the activation implementation to propagate through arbitrary reduction orders and inexact activations; the manuscript does not explicitly verify these conditions (e.g., continuity or error-bounded monotonicity surviving reduction) for sin, Mish, or GeLU, which is load-bearing for the universal-representability results listed for those activations.
minor comments (1)
  1. [Abstract / Introduction] The abstract invokes 'mild conditions' without a precise statement; a brief enumeration of the conditions in the introduction or framework section would improve readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address the major comment regarding the sufficiency theorem below.

Point-by-point responses
  1. Referee: [Sufficiency theorem] Sufficiency theorem (framework section): the claim that distinguishability is sufficient for universal representability requires 'mild conditions' on the activation implementation to propagate through arbitrary reduction orders and inexact activations; the manuscript does not explicitly verify these conditions (e.g., continuity or error-bounded monotonicity surviving reduction) for sin, Mish, or GeLU, which is load-bearing for the universal-representability results listed for those activations.

    Authors: The referee correctly identifies that the sufficiency result depends on mild conditions on the activation implementations. The manuscript applies the framework to sin, Mish, and GeLU based on the fact that their standard floating-point implementations satisfy the required properties (such as the error being bounded by a small number of ulps and preserving sufficient monotonicity or continuity for the distinguishability to hold under arbitrary reductions). However, we agree that explicit verification would strengthen the presentation. We will revise the manuscript to include a dedicated verification subsection or appendix detailing how the mild conditions are satisfied for each of these activations, including sin, Mish, and GeLU. This will make the universal representability claims fully rigorous and self-contained.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

Full rationale

The paper defines a distinguishability framework, proves necessity of first-layer distinguishability directly from the floating-point execution model, and establishes sufficiency under separately stated mild conditions on activations. No quoted step reduces a central claim to a fitted parameter, self-definition, or load-bearing self-citation chain; the listed activation results follow from applying the framework rather than presupposing the target representability. The derivation remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on standard mathematical properties of floating-point arithmetic and the assumption of bounded ulp errors in activation implementations; no free parameters are fitted and no new entities are postulated.
