Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, Quoc V. Le · 2017 · cs.NE · arXiv 1710.05941

Mixed citation behavior. Most common role is background (69%).

63 Pith papers citing it

abstract

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9\% for Mobile NASNet-A and 0.6\% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
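
As a minimal sketch of the formula in the abstract, the Python snippet below evaluates Swish, f(x) = x * sigmoid(beta * x), for scalar inputs and compares it with ReLU. The function names `sigmoid`, `swish`, and `relu`, and the default `beta=1.0`, are illustrative assumptions and are not taken from the authors' code.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so exp(z) cannot overflow
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish as given in the abstract: x * sigmoid(beta * x).

    beta = 1.0 is an assumed default chosen for this sketch.
    """
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified Linear Unit, for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Print both activations side by side on a few sample points.
    for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  swish={swish(x):.4f}")
```

With beta = 0 the function reduces to x / 2, and for large beta it approaches ReLU, so beta can be read as interpolating between a scaled linear map and ReLU. Unlike ReLU, Swish is smooth and takes small negative values for negative inputs.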

citation-role summary

background: 12 · method: 1

citation-polarity summary

background: 9 · unclear: 3 · use method: 1

representative citing papers

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

cs.LG · 2026-05-03 · unverdicted · novelty 8.0

Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.

Supervised Guidance Training for Infinite-Dimensional Diffusion Models

cs.LG · 2026-01-28 · conditional · novelty 8.0

Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

cs.LG · 2026-04-13 · unverdicted · novelty 7.0 · 2 refs

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.

Selectivity and Shape in the Design of Forward-Forward Goodness Functions

cs.LG · 2026-03-28 · unverdicted · novelty 7.0

Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.

SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning

cs.LG · 2026-03-20 · unverdicted · novelty 7.0

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.

Evolving Multi-Channel Confidence-Aware Activation Functions for Missing Data with Channel Propagation

cs.NE · 2026-02-14 · unverdicted · novelty 7.0

Evolved multi-channel activation functions that incorporate missingness and confidence scores improve classification performance on datasets with missing data.

Imposing Boundary Conditions on Neural Operators via Learned Function Extensions

cs.LG · 2026-02-04 · unverdicted · novelty 7.0

A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.

DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

cs.LG · 2025-12-18 · unverdicted · novelty 7.0

DiffeoMorph learns distributed agent protocols to morph into complex 3D shapes from minimal initial conditions via equivariant GNNs and rotation-invariant Zernike loss.

Kolmogorov-Arnold Chemical Reaction Neural Networks for learning pressure-dependent kinetic rate laws

physics.chem-ph · 2025-11-10 · unverdicted · novelty 7.0

KA-CRNNs learn pressure-dependent and collider-specific kinetic rate laws from data using Kolmogorov-Arnold activations inside a CRNN framework, outperforming interpolative methods by 2.88x in MSE on two proof-of-concept reactions.

Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies

stat.ML · 2025-09-24 · unverdicted · novelty 7.0

Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.

Accurate and scalable exchange-correlation with deep learning

physics.chem-ph · 2025-06-17 · unverdicted · novelty 7.0

Skala is a neural XC functional trained on wavefunction data that beats state-of-the-art hybrids on main-group chemistry benchmarks at semi-local computational cost.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

cs.AI · 2023-06-07 · unverdicted · novelty 7.0

A generative semantic communication system that sends compressed semantic information and uses diffusion models with spatially-adaptive normalizations to reconstruct high-quality, semantically consistent images even under severe channel noise.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

cs.LG · 2019-05-28 · accept · novelty 7.0

EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 3 refs

DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.

On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

math.OC · 2026-05-11 · unverdicted · novelty 6.0

Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

cs.LG · 2026-05-10 · conditional · novelty 6.0 · 2 refs

Sparse MoE FFNs redistribute computation from FFN to attention in small Transformers, driven mainly by architectural sparsity rather than learned expert specialization.

MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

cs.LG · 2026-05-08 · unverdicted · novelty 6.0 · 2 refs

MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.

What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.

On the Blessing of Pre-training in Weak-to-Strong Generalization

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.

citing papers explorer

Showing 26 of 26 citing papers after filters.

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

cs.LG · 2026-05-03 · unverdicted · none · ref 33

Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.

Supervised Guidance Training for Infinite-Dimensional Diffusion Models

cs.LG · 2026-01-28 · conditional · none · ref 5

Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · none · ref 89

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · none · ref 89

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · none · ref 27

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

cs.LG · 2026-04-13 · unverdicted · none · ref 8 · 2 links

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.

Selectivity and Shape in the Design of Forward-Forward Goodness Functions

cs.LG · 2026-03-28 · unverdicted · none · ref 20

Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.

SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning

cs.LG · 2026-03-20 · unverdicted · none · ref 20

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.

Imposing Boundary Conditions on Neural Operators via Learned Function Extensions

cs.LG · 2026-02-04 · unverdicted · none · ref 29

A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.

DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

cs.LG · 2025-12-18 · unverdicted · none · ref 72

DiffeoMorph learns distributed agent protocols to morph into complex 3D shapes from minimal initial conditions via equivariant GNNs and rotation-invariant Zernike loss.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · none · ref 87

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

cs.LG · 2019-05-28 · accept · none · ref 35

EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.

DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices

cs.LG · 2026-05-11 · unverdicted · none · ref 161 · 3 links

DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.

Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers

cs.LG · 2026-05-10 · conditional · none · ref 1 · 2 links

Sparse MoE FFNs redistribute computation from FFN to attention in small Transformers, driven mainly by architectural sparsity rather than learned expert specialization.

MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning

cs.LG · 2026-05-08 · unverdicted · none · ref 52 · 2 links

MoMo conditions contrastive representations and prediction operators on user preferences via FiLM and low-rank modulation to enable continuous modulation of plan safety while preserving inference efficiency.

What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies

cs.LG · 2026-05-08 · unverdicted · none · ref 13

MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.

On the Blessing of Pre-training in Weak-to-Strong Generalization

cs.LG · 2026-05-07 · unverdicted · none · ref 162

Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.

Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics

cs.LG · 2026-05-05 · unverdicted · none · ref 3

EDL learns a transferable classification loss from unlimited synthetic data via evolutionary optimization and a ranking-consistency objective, serving as a competitive drop-in replacement for cross-entropy on CIFAR-10 with ResNet models.

Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions

cs.LG · 2026-04-23 · unverdicted · none · ref 5

GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.

ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications

cs.LG · 2026-04-21 · unverdicted · none · ref 2

ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.

Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

cs.LG · 2025-11-15 · unverdicted · none · ref 44

A three-stage self-supervised pipeline for data-efficient frame-level syllable detection in complex birdsong using a Residual MLP-RNN model.

Activation Function Design Sustains Plasticity in Continual Learning

cs.LG · 2025-09-26 · unverdicted · none · ref 21

Smooth-Leaky and Randomized Smooth-Leaky activations mitigate loss of plasticity in continual learning by targeting negative-branch shape and saturation behavior.

GLU Variants Improve Transformer

cs.LG · 2020-02-12 · unverdicted · none · ref 7

Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

Graph neural network for colliding particles with an application to sea ice floe modeling

cs.LG · 2026-02-18 · unverdicted · none · ref 62

A graph neural network learns to simulate 1D sea ice floe collisions and trajectories using data assimilation on synthetic data.

Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics

cs.LG · 2022-12-18 · unverdicted · none · ref 38

A comprehensive review of deep learning techniques for computational mechanics, including LSTM for constitutive modeling, PINNs for PDE solving, optimizers, and kernel methods.

Efficient Learning of Deep State Space Models via Importance Smoothing

cs.LG · 2026-05-20 · unreviewed · ref 9
