Recognition: 2 theorem links · Lean Theorem
Searching for Activation Functions
Pith reviewed 2026-05-12 02:43 UTC · model grok-4.3
The pith
Automatic search discovers the Swish activation function, which works better than ReLU on deeper networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a combination of exhaustive and reinforcement learning-based search, the authors discover multiple activation functions and identify the best one, f(x) = x · sigmoid(βx), which they name Swish. Swish tends to work better than ReLU on deeper models across challenging datasets. Simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it straightforward for practitioners to adopt.
What carries the argument
The Swish activation function f(x) = x · sigmoid(βx), found by combining exhaustive search over simple expressions with reinforcement-learning-guided search over more complex candidates.
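A minimal numerical sketch of the function (NumPy, illustrative only, not taken from the paper's code): for large β Swish approaches ReLU, at β = 0 it reduces to x/2, and at β = 1 it is smooth with a small dip below zero around x ≈ -1.28.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def swish(x, beta=1.0):
    # f(x) = x * sigmoid(beta * x)
    return x * sigmoid(beta * x)

x = np.linspace(-5.0, 5.0, 11)
print(np.round(swish(x, beta=1.0), 3))   # smooth; slightly negative for small negative x
print(np.round(swish(x, beta=10.0), 3))  # large beta: close to ReLU
print(np.round(np.maximum(x, 0.0), 3))   # ReLU for comparison
```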
If this is right
- Deeper networks show larger relative gains from Swish than shallower ones.
- Swish can be dropped into any existing architecture in place of ReLU without other changes (a drop-in sketch follows this list).
- The same search procedure yields several other functions that also outperform ReLU in the reported experiments.
- Practitioners can adopt Swish immediately because it requires no new hyperparameters beyond the single scalar β.
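As a sketch of the drop-in claim above: Swish can be wrapped as a module and substituted for existing ReLU modules with no other changes. This assumes a PyTorch model whose nonlinearities are nn.ReLU submodules (purely functional relu calls would need a one-line edit); with β = 1 the same function ships in PyTorch as nn.SiLU, and whether β is fixed or trained is left as an option here rather than asserted as the paper's setting.

```python
import torch
import torch.nn as nn

class Swish(nn.Module):
    """f(x) = x * sigmoid(beta * x); beta may be fixed or trained."""
    def __init__(self, beta=1.0, trainable=False):
        super().__init__()
        if trainable:
            self.beta = nn.Parameter(torch.tensor(float(beta)))
        else:
            self.register_buffer("beta", torch.tensor(float(beta)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

def swap_relu_for_swish(module, beta=1.0, trainable=False):
    """Recursively replace every nn.ReLU submodule with a Swish module, in place."""
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, Swish(beta, trainable))
        else:
            swap_relu_for_swish(child, beta, trainable)
    return module

# Example: a small ReLU network converted to Swish with no other changes.
net = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4), nn.ReLU())
net = swap_relu_for_swish(net, beta=1.0, trainable=True)
```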
Where Pith is reading between the lines
- The search framework could be reused to discover task-specific activation functions rather than a single universal one.
- Because Swish is smooth and non-monotonic (it dips slightly below zero for small negative inputs), it may interact differently with gradient-based optimizers than ReLU does.
- Similar automated search might be applied to other low-level choices such as normalization layers or loss functions.
Load-bearing premise
The performance gains come from intrinsic properties of the discovered functions rather than from interactions with the specific model architectures, training schedules, or hyperparameter settings used in the tests.
What would settle it
A controlled study that swaps Swish into a broad collection of models while holding all other training details fixed and finds no consistent accuracy improvement would show the advantage is not general.
Original abstract
The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9\% for Mobile NASNet-A and 0.6\% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes leveraging automatic search techniques, including exhaustive and reinforcement learning-based methods, to discover new activation functions for deep neural networks. The best discovered function, named Swish and defined as f(x) = x · sigmoid(βx), is empirically shown to outperform the standard ReLU activation on deeper models across challenging datasets, with specific improvements such as 0.9% top-1 accuracy gain on ImageNet for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 upon simple replacement.
Significance. If the empirical results are robust to hyperparameter choices, this work is significant in providing a simple yet effective alternative to ReLU that practitioners can easily adopt. The automated search approach offers a systematic alternative to hand-designed activations and demonstrates concrete gains on standard benchmarks like ImageNet, which is a strength of the manuscript.
major comments (2)
- [§4] The experiments report accuracy improvements by replacing ReLU with Swish while retaining identical training schedules, optimizer, learning-rate schedule, batch size, and initialization chosen for ReLU. Without per-activation hyperparameter re-optimization or ablation on schedule sensitivity, it remains unclear whether the reported gains (0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 on ImageNet) are caused by the intrinsic properties of Swish or by interactions with ReLU-tuned protocols.
- [§3] The search space definition, number of trials, and statistical controls for both the exhaustive and RL-based searches are insufficiently detailed. This affects assessment of whether Swish was reliably identified as superior rather than selected post-hoc from a large set of candidates.
minor comments (3)
- The value of β used in the reported Swish experiments should be explicitly stated (fixed or learned) along with sensitivity analysis.
- Additional citations to prior parametric activation functions (e.g., PReLU) would strengthen the related-work discussion.
- Figures comparing activation functions would benefit from including their derivatives to illustrate effects on gradient flow.
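On the last point, the derivative such figures would plot has a simple closed form (standard calculus, with σ the logistic sigmoid; not quoted from the paper):

```latex
f(x) = x\,\sigma(\beta x), \qquad
f'(x) = \sigma(\beta x) + \beta x\,\sigma(\beta x)\bigl(1 - \sigma(\beta x)\bigr)
      = \beta f(x) + \sigma(\beta x)\bigl(1 - \beta f(x)\bigr).
```

The derivative is bounded and smooth everywhere, in contrast with the ReLU derivative's jump at zero, which is the gradient-flow contrast the comment asks the figures to illustrate.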
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions we will make to the manuscript.
Point-by-point responses
-
Referee: [§4] The experiments report accuracy improvements by replacing ReLU with Swish while retaining identical training schedules, optimizer, learning-rate schedule, batch size, and initialization chosen for ReLU. Without per-activation hyperparameter re-optimization or ablation on schedule sensitivity, it remains unclear whether the reported gains (0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2 on ImageNet) are caused by the intrinsic properties of Swish or by interactions with ReLU-tuned protocols.
Authors: We agree that the hyperparameters and training schedules were those originally tuned for ReLU, and no per-activation re-optimization or schedule ablation was performed. This design choice was made to evaluate Swish as a drop-in replacement that requires no additional tuning effort from practitioners. The consistent gains across two distinct architectures (Mobile NASNet-A and Inception-ResNet-v2) and multiple datasets provide supporting evidence that the improvements are not solely due to protocol interactions. Nevertheless, we acknowledge the limitation noted by the referee. In the revised manuscript we will add an explicit discussion of this point in Section 4 and include a limited ablation study on learning-rate sensitivity for Swish versus ReLU using a smaller proxy task. revision: partial
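A compressed illustration of the kind of learning-rate ablation proposed above, with a toy proxy task and tiny MLP standing in for the paper's benchmarks (nn.SiLU is PyTorch's built-in Swish with β = 1; the task, model, and step counts are placeholders, not the authors' setup):

```python
import torch
import torch.nn as nn

def run(act_cls, lr, steps=500, seed=0):
    """Train a tiny MLP on a synthetic XOR-style task; return held-out accuracy."""
    torch.manual_seed(seed)
    X = torch.randn(1024, 2)
    y = (X.prod(dim=1) > 0).long()  # label = 1 when both coordinates share a sign
    model = nn.Sequential(nn.Linear(2, 64), act_cls(), nn.Linear(64, 2))
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(X), y)
        opt.zero_grad(); loss.backward(); opt.step()
    X_test = torch.randn(1024, 2)
    y_test = (X_test.prod(dim=1) > 0).long()
    return (model(X_test).argmax(dim=1) == y_test).float().mean().item()

for lr in (1e-3, 1e-2, 1e-1):
    for name, act_cls in (("ReLU", nn.ReLU), ("Swish", nn.SiLU)):
        print(f"lr={lr:<7} {name:5s} acc={run(act_cls, lr):.3f}")
```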
-
Referee: [§3] The search space definition, number of trials, and statistical controls for both the exhaustive and RL-based searches are insufficiently detailed. This affects assessment of whether Swish was reliably identified as superior rather than selected post-hoc from a large set of candidates.
Authors: We thank the referee for highlighting the need for greater transparency in the search methodology. The original Section 3 described the overall approach at a high level but omitted precise specifications of the search space, exact trial counts, and statistical safeguards. In the revision we will expand Section 3 to enumerate the full set of unary and binary operations, the constants considered, the total number of functions evaluated in the exhaustive search, the RL training details (agent architecture, number of episodes, and reward formulation), and any repeated runs or variance metrics used to rank candidates. These additions will allow readers to evaluate the reliability of Swish's selection. revision: yes
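As an illustration of the search style the expanded Section 3 would document, the sketch below enumerates a toy space of candidate activations of the form binary(unary1(x), unary2(x)) and ranks them by a proxy score. The operation set, proxy task, and ranking here are placeholders; the paper's actual search space, trial counts, and RL controller are much larger and are not reproduced.

```python
import itertools
import torch
import torch.nn as nn

# Toy building blocks for candidate activations b(u1(x), u2(x)).
unary = {
    "x": lambda x: x,
    "sigmoid(x)": torch.sigmoid,
    "tanh(x)": torch.tanh,
    "relu(x)": torch.relu,
}
binary = {
    "mul": lambda a, b: a * b,
    "add": lambda a, b: a + b,
    "max": torch.maximum,
}

def make_activation(b, u1, u2):
    return lambda x: binary[b](unary[u1](x), unary[u2](x))

def proxy_score(act, steps=300, seed=0):
    """Train a tiny two-layer net on a synthetic task; return its training accuracy."""
    torch.manual_seed(seed)
    X = torch.randn(512, 2)
    y = ((X ** 2).sum(dim=1) > 1.0).long()  # ring-vs-centre labels
    w1 = torch.randn(2, 32, requires_grad=True)
    w2 = torch.randn(32, 2, requires_grad=True)
    opt = torch.optim.Adam([w1, w2], lr=1e-2)
    for _ in range(steps):
        logits = act(X @ w1) @ w2
        loss = nn.functional.cross_entropy(logits, y)
        opt.zero_grad(); loss.backward(); opt.step()
    return (logits.argmax(dim=1) == y).float().mean().item()

ranked = sorted(
    (proxy_score(make_activation(b, u1, u2)), f"{b}({u1}, {u2})")
    for b, u1, u2 in itertools.product(binary, unary, unary)
)
for score, name in ranked[-5:][::-1]:
    print(f"{score:.3f}  {name}")  # x * sigmoid(x), i.e. Swish with beta = 1, is one candidate
```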
Circularity Check
No circularity: empirical search discovery followed by independent benchmark evaluation
Full rationale
The paper's chain consists of (1) applying exhaustive and RL-based search over activation function spaces to identify candidates, (2) selecting the best performer f(x) = x · sigmoid(βx), and (3) measuring its accuracy when substituted into fixed architectures on held-out datasets such as ImageNet. None of these steps reduces a reported performance delta to a quantity defined by the same fitted parameters or search objective used to generate the candidate; the gains are direct empirical measurements on standard test sets under the paper's stated protocol. No self-citation is invoked as a uniqueness theorem or load-bearing premise, and no equation equates a prediction to its own input by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- β
axioms (1)
- domain assumption: The space of functions considered during search contains useful activation functions that transfer to held-out models and tasks.
Lean theorems connected to this paper
-
Cost.FunctionalEquation.washburn_uniqueness_aczel · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Our experiments show that the best discovered activation function, f(x) = x · sigmoid(βx), which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets.
-
Foundation.DAlembert.Inevitability.bilinear_family_forced · unclear
Unclear relation between the paper passage and the cited Recognition theorem.
Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 39 Pith papers
-
Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Neural Statistical Functions
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...
-
Selectivity and Shape in the Design of Forward-Forward Goodness Functions
Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO sparse MoE matches dense Transformer performance at 20% expert activation with a 3x hardware inference speedup.
-
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
-
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
Sparse MoE in FFN blocks redistributes computation to attention in small Transformers primarily due to architectural capacity reduction and partitioning, not learned router specialization.
-
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
MoMo uses Feature-Wise Linear Modulation and low-rank neural modulation to condition contrastive planning representations on user preferences while preserving inference efficiency and probability density ratios.
-
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
-
Competing nonlinearities, criticality, and order-to-chaos transition in deep networks
A statistical mixture of Tanh and Swish activations with critical mixing fraction p_c induces a continuous phase transition to scale-invariant signal propagation in deep networks while preserving smoothness.
-
Neural-network reconstruction of THz transmission spectra using electrically tunable AlGaN/GaN plasmonic-crystal analyzer
A feedforward neural network trained on synthetic data inverts voltage-dependent intensities from an electrically tunable AlGaN/GaN plasmonic analyzer to reconstruct THz spectra, achieving lower error than Tikhonov re...
-
Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics
EDL learns a transferable classification loss from unlimited synthetic data via evolutionary optimization and a ranking-consistency objective, serving as a competitive drop-in replacement for cross-entropy on CIFAR-10...
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
-
Four-dimensional QCD equation of state from a quasi-parton model with physics-informed neural networks
A PINN-trained quasi-parton model reproduces lattice cumulants at vanishing chemical potentials and supplies a consistent four-dimensional QCD equation of state at finite densities.
-
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
-
A Complex-Valued Continuous-Variable Quantum Approximation Optimization Algorithm (CCV-QAOA)
CCV-QAOA is a new complex-valued continuous-variable variant of QAOA that solves real and complex multivariate optimization problems via a variational framework.
-
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
Features in deep networks correspond to linear directions of centroids summarizing local functional behavior, enabling sparser and more effective feature dictionaries via sparse autoencoders applied to centroids rathe...
-
OTProf: estimating high-resolution profiles of optical turbulence ($C_n^2$) from reanalysis using deep learning
Deep learning model OTProf generates high-resolution C_n² profiles from ERA5 reanalysis data and outperforms the Hufnagel-Valley model for vertical structure and integrated parameters like Fried parameter r_0 in the N...
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices
A neural network predicts sensitive pseudospectra regions from matrix features to accelerate computation on structured non-normal banded matrices while preserving accuracy in identifying those regions.
-
Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.
-
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ cuts LLM training memory by up to 52% and speeds it up by 1.34x using tailored 4-bit activations and 8-bit gradients with special communication, matching baseline accuracy on LLaMA models.
-
GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories
GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks ...
-
Physics-informed neural networks for form-finding of unilateral membrane structures
PINNs with hard and soft boundary enforcement solve membrane form-finding PDEs to accuracy comparable with FEM, with hard-BC yielding smaller boundary errors.
-
ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.
-
YOLOv4: Optimal Speed and Accuracy of Object Detection
YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
-
Agentic Risk-Aware Set-Based Engineering Design
Multi-agent LLM system applies set-based design and Conditional Value-at-Risk to explore and risk-filter airfoil designs with human manager coordination.
-
GLU Variants Improve Transformer
Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
-
Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance
The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.
-
Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
Reference graph
Works this paper leans on
-
[1]
Learning activation functions to improve deep neural networks
Forest Agostinelli, Matthew Hoffman, Peter Sadowski, and Pierre Baldi. Learning activation functions to improve deep neural networks. arXiv preprint arXiv:1412.6830,
-
[2]
Reinforcement learning for architecture search by network transformation
Han Cai, Tianyao Chen, Weinan Zhang, Yong Yu, and Jun Wang. Reinforcement learning for architecture search by network transformation. arXiv preprint arXiv:1707.04873,
-
[3]
Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289,
-
[4]
Language Modeling with Gated Convolutional Networks
Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. arXiv preprint arXiv:1612.08083,
-
[5]
RL²: Fast Reinforcement Learning via Slow Reinforcement Learning
Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL²: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779,
-
[6]
Sigmoid-weighted linear units for neural network function approximation in reinforcement learning
Stefan Elfwing, Eiji Uchibe, and Kenji Doya. Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. arXiv preprint arXiv:1702.03118,
-
[7]
Model-agnostic meta-learning for fast adaptation of deep networks
Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400,
-
[8]
HyperNetworks
David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106,
-
[9]
Gaussian Error Linear Units (GELUs)
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition , pp. 770–778, 2016a. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–64...
-
[10]
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861,
-
[11]
What is the best multi-stage architecture for object recognition?
Kevin Jarrett, Koray Kavukcuoglu, Yann LeCun, et al. What is the best multi-stage architecture for object recognition? In 2009 IEEE 12th International Conference on Computer Vision,
-
[12]
Self-normalizing neural networks
Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. arXiv preprint arXiv:1706.02515,
-
[13]
Learnable pooling with context gating for video classification
Antoine Miech, Ivan Laptev, and Josef Sivic. Learnable pooling with context gating for video classification. arXiv preprint arXiv:1706.06905,
-
[14]
Flexible rectified linear units for improving convolutional neural networks
Suo Qiu and Bolun Cai. Flexible rectified linear units for improving convolutional neural networks. arXiv preprint arXiv:1706.08098,
-
[15]
Large-scale evolution of image classifiers
Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex Kurakin. Large-scale evolution of image classifiers. arXiv preprint arXiv:1703.01041,
-
[16]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
-
[17]
Highway Networks
Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387,
-
[18]
Learning to reinforcement learn
Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763,
-
[19]
Empirical evaluation of rectified activations in convolutional network
Bing Xu, Naiyan Wang, Tianqi Chen, and Mu Li. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853,
-
[20]
Practical network blocks design with q-learning
Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning. arXiv preprint arXiv:1708.05552,
-
[21]
Deep interest network for click-through rate prediction
Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Xingya Dai, Han Zhu, Junqi Jin, Han Li, and Kun Gai. Deep interest network for click-through rate prediction. arXiv preprint arXiv:1706.06978,
-
[22]
Learning transferable architectures for scalable image recognition
Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for scalable image recognition. arXiv preprint arXiv:1707.07012,
discussion (0)