Mathematical analysis of one-layer neural network with fixed biases, a new activation function and other observations
Pith reviewed 2026-05-10 17:50 UTC · model grok-4.3
The pith
One-hidden-layer network with fixed biases converges under gradient descent on L2 loss and shows spectral bias.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For both the continuous and the discrete versions of this one-layer model, the gradient-descent dynamics on the squared L2 loss converge to a global minimizer. The dynamics are governed by the spectrum of certain integral operators induced by the activation, which produces the observed spectral bias. The same operator analysis yields necessary conditions on the activation function and motivates the introduction of FReX, for which the paper also discusses convergence.
What carries the argument
A one-hidden-layer network with fixed biases and ReLU (or FReX) activation. Its training dynamics reduce to gradient flow on a loss whose Hessian spectrum encodes both convergence and frequency bias.
If this is right
- The parameters converge to values that globally minimize the L2 loss for any continuous target function.
- Lower-frequency Fourier modes of the target are recovered first during training.
- Activation functions must satisfy spectral conditions derived from the associated integral operators to guarantee convergence.
- The proposed FReX activation inherits the same convergence guarantees as ReLU.
Where Pith is reading between the lines
- The same operator-spectrum approach might be applied to other simple architectures to predict their bias toward smooth or low-frequency solutions.
- If spectral bias persists when biases are allowed to train, the result would strengthen the claim that the phenomenon is intrinsic to gradient descent rather than an artifact of fixed biases.
- Practical tests of FReX on low-dimensional regression tasks could check whether the theoretical convergence advantage translates to faster or more stable training.
Load-bearing premise
The entire analysis is carried out only for scalar input and output with all biases held fixed, which removes many degrees of freedom that are present in typical neural networks.
What would settle it
A numerical run of gradient descent on the exact one-dimensional model that either diverges or fails to learn low frequencies before high frequencies would falsify the convergence and spectral-bias claims.
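One way to run such a check is sketched below. It is a minimal illustration, not the paper's experiment. It assumes the model has the form f(x) = sum_j w_j ReLU(x - b_j) with only the outer weights w_j trained, uniformly spaced fixed biases on [0, 1), and a step size equal to the reciprocal of the largest Hessian eigenvalue. The function names, grid sizes, and target frequencies are hypothetical choices.

```python
import numpy as np

def relu_features(x, biases):
    # Phi[i, j] = ReLU(x_i - b_j); the biases b_j are fixed and never updated.
    return np.maximum(x[:, None] - biases[None, :], 0.0)

def gd_fit(x, y, biases, steps):
    """Gradient descent on L(w) = (1 / 2n) * ||Phi w - y||^2 over the outer
    weights only (an assumption of this sketch, not the paper's exact model).
    Returns the relative residual ||f - y|| / ||y|| and the loss history."""
    phi = relu_features(x, biases)
    n = len(x)
    # The loss is quadratic in w, so its gradient is Lipschitz with constant
    # lambda_max(Phi^T Phi / n); a step of 1 / lambda_max cannot increase it.
    lr = 1.0 / np.linalg.eigvalsh(phi.T @ phi / n)[-1]
    w = np.zeros(len(biases))
    losses = []
    for _ in range(steps):
        resid = phi @ w - y
        losses.append(0.5 * np.mean(resid ** 2))
        w -= lr * (phi.T @ resid) / n
    resid = phi @ w - y
    losses.append(0.5 * np.mean(resid ** 2))
    return np.linalg.norm(resid) / np.linalg.norm(y), np.array(losses)

x = np.linspace(0.0, 1.0, 256)                       # sample points (hypothetical)
biases = np.linspace(0.0, 1.0, 64, endpoint=False)   # fixed biases (hypothetical)

# Same budget of 2000 steps for a low-frequency and a high-frequency target.
rel_low, losses_low = gd_fit(x, np.sin(2 * np.pi * x), biases, 2000)
rel_high, losses_high = gd_fit(x, np.sin(16 * np.pi * x), biases, 2000)

print(f"relative residual, 1 cycle on [0, 1]:  {rel_low:.3f}")
print(f"relative residual, 8 cycles on [0, 1]: {rel_high:.3f}")
```

Under these assumptions the claims would be in doubt if either loss history increased, or if the high-frequency target were fitted at least as well as the low-frequency one after the same number of steps. Training each target separately keeps the comparison free of cross terms between the two frequencies.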
original abstract
We analyze a simple one-hidden-layer neural network with ReLU activation functions and fixed biases, with one-dimensional input and output. We study both continuous and discrete versions of the model, and we rigorously prove the convergence of the learning process with the $L^2$ squared loss function and the gradient descent procedure. We also prove the spectral bias property for this learning process. Several conclusions of this analysis are discussed; in particular, regarding the structure and properties that activation functions should possess, as well as the relationships between the spectrum of certain operators and the learning process. Based on this, we also propose an alternative activation function, the full-wave rectified exponential function (FReX), and we discuss the convergence of the gradient descent with this alternative activation function.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript analyzes a one-hidden-layer neural network with ReLU activations, fixed biases, and one-dimensional input/output. It rigorously proves convergence of both the continuous gradient flow and discrete gradient descent under squared L² loss, establishes the spectral bias property via the spectrum of an associated integral operator, and proposes the full-wave rectified exponential (FReX) activation function while discussing its convergence under the same training procedure.
Significance. If the derivations hold, the work supplies a concrete, fully-scoped mathematical treatment of gradient-descent dynamics and spectral bias for a deliberately simplified model. The explicit proofs for both continuous and discrete cases, together with the operator-theoretic framing of spectral bias and the analysis of a new activation, constitute a clear strength. Such results can serve as a reference point for understanding why spectral bias appears in practice and for guiding the design of activation functions, even though the setting is restricted to 1-D fixed-bias networks.
minor comments (3)
- [Model definition] The model definition (early sections) introduces the network with fixed biases but does not explicitly state the precise function space in which the weights live; adding a short sentence clarifying that the weights are real scalars (or vectors in the 1-D case) would remove any ambiguity for readers.
- [Discrete GD convergence] In the convergence proof for discrete gradient descent, the step-size restriction is stated in terms of a generic Lipschitz constant; an explicit upper bound derived from the network parameters would make the result more immediately usable.
- [FReX proposal] The FReX activation is defined and its convergence is discussed, yet no plot or numerical comparison with ReLU on a simple target function is provided; a single illustrative figure would strengthen the claim that FReX is a viable alternative.
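The step-size comment can be made concrete under one simplifying assumption. If only the outer weights are trained, the loss is quadratic in the weights, the Lipschitz constant of its gradient is the largest eigenvalue of the Hessian, and gradient descent decreases the loss for any step size below 2 divided by that eigenvalue. The sketch below computes this bound numerically and checks both sides of it. The bias grid, sample grid, and target are hypothetical, and the paper's own bound may take a different form.

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200)
biases = np.linspace(0.0, 1.0, 20, endpoint=False)   # fixed, hypothetical grid
phi = np.maximum(x[:, None] - biases[None, :], 0.0)  # Phi[i, j] = ReLU(x_i - b_j)
y = np.sin(2 * np.pi * x)                            # hypothetical target
n = len(x)

# The Hessian of L(w) = (1 / 2n) * ||Phi w - y||^2 is Phi^T Phi / n; its
# largest eigenvalue is the Lipschitz constant of the gradient.
lipschitz = np.linalg.eigvalsh(phi.T @ phi / n)[-1]
step_bound = 2.0 / lipschitz

def loss_after(lr, steps=200):
    # Plain gradient descent from w = 0 with a constant step size lr.
    w = np.zeros(phi.shape[1])
    for _ in range(steps):
        w -= lr * (phi.T @ (phi @ w - y)) / n
    return 0.5 * np.mean((phi @ w - y) ** 2)

loss_start = 0.5 * np.mean(y ** 2)
loss_stable = loss_after(0.95 * step_bound)    # just below the bound
loss_unstable = loss_after(1.05 * step_bound)  # just above the bound

print(f"step-size bound 2 / L: {step_bound:.4g}")
print(f"loss at start: {loss_start:.4g}")
print(f"loss below the bound: {loss_stable:.4g}")
print(f"loss above the bound: {loss_unstable:.4g}")
```

Because the bound depends only on the feature matrix, it is fixed once the biases and sample points are chosen, which is the kind of explicit dependence on network parameters the comment asks for.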
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our manuscript, including the rigorous proofs for gradient flow and discrete gradient descent convergence, the spectral bias analysis via the integral operator, and the proposal of the FReX activation function. The recommendation for minor revision is noted. No specific major comments were provided in the report.
Circularity Check
No significant circularity; derivations are self-contained mathematical proofs
full rationale
The paper restricts itself to a concrete 1D one-hidden-layer model with fixed biases and ReLU (or FReX). Central results are explicit proofs of convergence for continuous/discrete gradient descent under L2 loss and of spectral bias, obtained directly from the gradient-flow ODEs and the spectrum of the associated integral operator. No parameters are fitted to data and then relabeled as predictions, no self-definitional loops appear in the activation or loss definitions, and no load-bearing uniqueness theorem is imported from the authors' prior work. Any self-citations are peripheral and do not substitute for the derivations. The analysis therefore remains independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: ReLU satisfies the standard piecewise-linear properties used in convergence arguments for gradient descent
- domain assumption: The loss landscape for the squared L2 loss on this model permits global convergence of gradient descent under the stated conditions
invented entities (1)
- FReX (full-wave rectified exponential) activation function: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We rigorously prove the convergence of the learning process with the L2 squared loss function and the gradient descent procedure. We also prove the spectral bias property... propose... FReX... fundamental solution of 1/2(-d²/dx² +1)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · costAlphaLog_fourth_deriv_at_zero
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ReLU''(z)=δ(z)... FReX satisfies 1/2(-d²/dx² +1)FReX=δ
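The operator identity quoted in this passage can be checked by hand. As a sketch, assume FReX is the decaying solution $\phi(x) = e^{-|x|}$. This closed form is inferred here from the quoted equation and from the name "full-wave rectified exponential"; it is not taken from the paper's text. Derivatives are understood in the distributional sense.

```latex
\begin{align*}
\phi(x) &= e^{-|x|}, \qquad
\phi'(x) = -\operatorname{sgn}(x)\, e^{-|x|}, \\
\phi''(x) &= e^{-|x|} - 2\,\delta(x)
  \quad \text{(since } \phi' \text{ jumps by } -2 \text{ at } x = 0\text{)}, \\
\tfrac{1}{2}\Bigl(-\tfrac{d^2}{dx^2} + 1\Bigr)\phi(x)
  &= \tfrac{1}{2}\bigl(-e^{-|x|} + 2\,\delta(x) + e^{-|x|}\bigr)
   = \delta(x).
\end{align*}
```

This mirrors the quoted ReLU identity: ReLU is a fundamental solution of $d^2/dz^2$, while under the assumption above FReX plays the same role for $\tfrac{1}{2}(-d^2/dx^2 + 1)$.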
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.