Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
Searching for Activation Functions
Mixed citation behavior. Most common role is background (69%).
abstract
The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
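The formula in the abstract, $f(x) = x \cdot \text{sigmoid}(\beta x)$, can be sketched in a few lines of plain Python. This is only an illustrative scalar version, not the authors' implementation: the helper names `sigmoid`, `swish`, and `relu` and the default `beta = 1.0` are assumptions made here, since the abstract does not fix a value for $\beta$.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish as given in the abstract: x * sigmoid(beta * x).

    beta = 1.0 is an assumed default for illustration only.
    """
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified Linear Unit, for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Compare the two activations on a few sample inputs.
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

Two limiting cases follow directly from the formula: with $\beta = 0$ the sigmoid is constantly 1/2, so the function reduces to $x/2$, and as $\beta$ grows the sigmoid approaches a step function, so the output approaches ReLU. Unlike ReLU, the function is slightly negative for negative inputs rather than exactly zero.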
representative citing papers
Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
Evolved multi-channel activation functions that incorporate missingness and confidence scores improve classification performance on datasets with missing data.
A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.
DiffeoMorph learns distributed agent protocols to morph into complex 3D shapes from minimal initial conditions via equivariant GNNs and rotation-invariant Zernike loss.
KA-CRNNs learn pressure-dependent and collider-specific kinetic rate laws from data using Kolmogorov-Arnold activations inside a CRNN framework, outperforming interpolative methods by 2.88x in MSE on two proof-of-concept reactions.
Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.
Skala is a neural XC functional trained on wavefunction data that beats state-of-the-art hybrids on main-group chemistry benchmarks at semi-local computational cost.
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
A generative semantic communication system that sends compressed semantic information and uses diffusion models with spatially-adaptive normalizations to reconstruct high-quality, semantically consistent images even under severe channel noise.
EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.
DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
Sparse MoE FFNs redistribute computation from FFN to attention in small Transformers, driven mainly by architectural sparsity rather than learned expert specialization.
MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
citing papers explorer
Four-dimensional QCD equation of state from a quasi-parton model with physics-informed neural networks
A PINN-trained quasi-parton model reproduces lattice cumulants at vanishing chemical potentials and supplies a consistent four-dimensional QCD equation of state at finite densities.