Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
Searching for Activation Functions
Mixed citation behavior. Most common role is background (69%).
abstract
The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9\% for Mobile NASNet-A and 0.6\% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
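The Swish formula in the abstract, $f(x) = x \cdot \text{sigmoid}(\beta x)$, can be illustrated with a minimal scalar sketch in plain Python. This is not the authors' implementation: the function names `sigmoid`, `swish`, and `relu` and the default `beta = 1.0` are assumptions made for illustration only.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, branching on the sign of z to avoid overflow in exp."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish as given in the abstract: f(x) = x * sigmoid(beta * x).

    The default beta = 1.0 is an illustrative choice, not taken from the abstract.
    """
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified Linear Unit, included only for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print both activations side by side on a few sample inputs.
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

Two limiting cases follow directly from the formula: with `beta = 0` the sigmoid is constant at 0.5, so Swish reduces to the linear function `x / 2`, and as `beta` grows the sigmoid approaches a step function, so Swish approaches ReLU.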
representative citing papers
Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
Restricting layers to width 3 and using linking numbers shows ResNets and transformers match in topological power, exceed monotonic feedforward nets which exceed flows, but nonmonotonic activations match the top class.
A diffusion model-based nonparametric method for undirected graphical model selection with model selection consistency.
CoMetaPNS combines meta-learned neural surrogates with a continual Bayesian Gaussian Mixture Model to adapt cardiac electrophysiology simulations to new data while avoiding catastrophic forgetting.
The paper derives the first minimax-optimal excess population risk rates for gradient descent and stochastic gradient descent on over-parameterized DNNs by linking their dynamics to kernel methods under polynomial width scaling.
TriSearch is an RL framework that optimizes triangulations of polytopes using bistellar flips with a circuit-supported subtriangulation action representation, generalizing zero-shot to larger instances and outperforming prior samplers in 3D and 4D.
Floating-point neural networks achieve universal representability for practical activations like ReLU, sigmoid, and tanh under arbitrary reduction orders and bounded ulp errors in activations via a new distinguishability condition.
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.
SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
Evolved multi-channel activation functions that incorporate missingness and confidence scores improve classification performance on datasets with missing data.
A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.
DiffeoMorph learns distributed agent protocols to morph into complex 3D shapes from minimal initial conditions via equivariant GNNs and rotation-invariant Zernike loss.
KA-CRNNs learn pressure-dependent and collider-specific kinetic rate laws from data using Kolmogorov-Arnold activations inside a CRNN framework, outperforming interpolative methods by 2.88x in MSE on two proof-of-concept reactions.
Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.
Skala is a neural XC functional trained on wavefunction data that beats state-of-the-art hybrids on main-group chemistry benchmarks at semi-local computational cost.
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
A generative semantic communication system that sends compressed semantic information and uses diffusion models with spatially-adaptive normalizations to reconstruct high-quality, semantically consistent images even under severe channel noise.
EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.
citing papers explorer
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation
TF-MoE uses dynamic per-frame and per-mel-band expert selection in time and frequency dimensions to improve speech separation performance at comparable compute cost to prior models.
-
SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks
SaluNet replaces normalization layers with the SALU activation and reports competitive accuracies on CIFAR-10/100 and ImageNet-1K without normalization.
-
LALE: Lightweight-Transformer Architecture for Land-Cover Estimation
LALE introduces a bifurcated ConvMixer-transformer encoder with an all-MLP decoder for efficient semantic segmentation of remote sensing imagery, achieving near-baseline F1 scores with 4.5x fewer parameters on the ARAS400k benchmark.
-
Confidence-Adaptive SwiGLU for Mixture-of-Experts
κ-SwiGLU adapts SiLU gate sharpness in MoE Transformers as a learnable function of router logits, reporting improved mean CORE performance on FineWeb-Edu across 8-28 layer models with negligible added parameters and small overhead.
-
Selective Ambulance Dispatch Under Contextual Travel-Time Uncertainty
IDEAL is a selective dual ambulance dispatch framework that learns context-specific travel times via weakly supervised bilevel networks and models uncertainty with Burg-divergence perturbations to achieve better response-time and resource trade-offs than region-based or map-based baselines.
-
A Constant-Time Implementation Methodology for Activation Functions on Microcontrollers
A constant-time implementation methodology for activation functions on ARM Cortex-M4 microcontrollers using branchless selection, Padé approximations, dummy arithmetic, and cycle alignment to eliminate timing side channels while preserving accuracy.
-
Quantification of atmospheric carbon dioxide from the Geostationary Operational Environmental Satellite (GOES East)
A physics-guided neural network trained on collocated GOES-East and OCO-2/3 data estimates XCO2 and reproduces observed variability against held-out OCO and TCCON measurements.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices
A neural network predicts sensitive pseudospectra regions from matrix features to accelerate computation on structured non-normal banded matrices while preserving accuracy in identifying those regions.
-
Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.
-
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
-
GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories
GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks while running thousands of times faster than optimization solvers.
-
Physics-informed neural networks for form-finding of unilateral membrane structures
PINNs with hard and soft boundary enforcement solve membrane form-finding PDEs to accuracy comparable with FEM, with hard-BC yielding smaller boundary errors.
-
ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.
-
Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis
A three-stage self-supervised pipeline for data-efficient frame-level syllable detection in complex birdsong using a Residual MLP-RNN model.
-
Activation Function Design Sustains Plasticity in Continual Learning
Smooth-Leaky and Randomized Smooth-Leaky activations mitigate loss of plasticity in continual learning by targeting negative-branch shape and saturation behavior.
-
YOLOv4: Optimal Speed and Accuracy of Object Detection
YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
-
Deep Learning for CSI Feedback Based on Superimposed Coding
A multi-task neural network recovers superimposed downlink CSI and uplink data sequences in FDD massive MIMO, improving CSI estimation over standalone SC while maintaining similar UL-US detection across varying SNR and PPC.
-
Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization
Linear transformers perform in-context learning by mapping context distributions to response functions, achieving dimension-independent convergence rates under domain generalization with tradeoffs in data and feature regularities.
-
Deep Reinforcement Learning for Spacecraft Attitude Control During Atmospheric Re-Entry
Hybrid RL-PID controllers track angle of attack better and show greater robustness than PID alone within a defined operational envelope for re-entry attitude control.
-
A Surrogate Model for Proton Spectrum Prediction to Map Transitions in Laser-Ion Acceleration
A decoupled dual-branch surrogate model predicts proton spectra with R²=0.94 for cutoff energy and flux, median spectral R²=0.985, and reproduces TNSA-to-RIT/BOA regime transitions validated on 1D PIC simulations.
-
PowLU: An Activation Function for Stable Pre-Training of LLMs
PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.
-
Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines
RBMs with Gaussian weights rarely induce or easily learn distributions with strong higher-order interactions on visible units, except when the hidden-unit activation function is Exponential.
-
Agentic Risk-Aware Set-Based Engineering Design
Multi-agent LLM system applies set-based design and Conditional Value-at-Risk to explore and risk-filter airfoil designs with human manager coordination.
-
DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding
DemaFormer pairs energy-based modeling with a damped-EMA Transformer to localize video moments matching language queries and reports gains over baselines on four datasets.
-
GLU Variants Improve Transformer
Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
-
CNN-Based Classifier for Automated Identification of Magnetic States in Spin Dynamics Simulations
CNN classifies nine magnetic states from visualized atomistic spin dynamics simulation images using EfficientNetV1B0.
-
Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance
The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.
-
Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.
-
Graph neural network for colliding particles with an application to sea ice floe modeling
A graph neural network learns to simulate 1D sea ice floe collisions and trajectories using data assimilation on synthetic data.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
-
Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics
A comprehensive review of deep learning techniques for computational mechanics, including LSTM for constitutive modeling, PINNs for PDE solving, optimizers, and kernel methods.