Searching for Activation Functions
Citation behavior across citing papers is mixed; the most common role is background (60%).
abstract
The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
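For reference, the discovered activation is simple enough to state in a few lines. A minimal NumPy sketch of Swish exactly as defined in the abstract (the clipping constant is only a guard against overflow in exp, not part of the definition):

```python
import numpy as np

def swish(x, beta=1.0):
    """Swish from the paper: f(x) = x * sigmoid(beta * x).

    beta may be a constant or a trainable parameter; beta = 1 recovers
    the SiLU, and as beta -> infinity Swish approaches ReLU.
    """
    # Clip only to avoid overflow warnings; beyond +/-60 the sigmoid
    # saturates anyway.
    return x * (1.0 / (1.0 + np.exp(-np.clip(beta * x, -60.0, 60.0))))
```

With beta = 1 this is the form most frameworks ship under the name SiLU.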
fields
cs.LG 16 · cs.CL 4 · cs.CV 4 · cs.AI 2 · cond-mat.dis-nn 1 · cs.CE 1 · cs.RO 1 · math.NA 1 · math.OC 1 · nucl-th 1
roles
background 5
citing papers
-
Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients
Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.
-
KAN: Kolmogorov-Arnold Networks
KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.
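The KAN idea is concrete enough to sketch: replace each scalar weight with a learnable univariate function on the edge. The paper parameterizes edges with B-splines; the Gaussian basis below is a simplification to keep the sketch self-contained, so treat it as illustrative only:

```python
import numpy as np

def edge_fn(x, coeffs, centers, width=0.5):
    """Learnable univariate edge function phi(x) = sum_k c_k * basis_k(x).

    KAN proper uses B-spline bases; fixed Gaussian bumps are used here
    only to keep the sketch short.
    """
    x = np.asarray(x, dtype=float)
    basis = np.exp(-(((x[..., None] - centers) / width) ** 2))  # (..., K)
    return basis @ coeffs                                       # (..., n_out)

def kan_layer(x, coeffs, centers):
    """One KAN-style layer: output_j = sum_i phi_ij(x_i), i.e. sums of
    learnable univariate functions instead of linear weights + activation."""
    return sum(edge_fn(x[i], coeffs[i], centers) for i in range(len(x)))

rng = np.random.default_rng(0)
n_in, n_out, K = 3, 2, 8
centers = np.linspace(-2.0, 2.0, K)
coeffs = 0.1 * rng.standard_normal((n_in, K, n_out))
y = kan_layer(rng.standard_normal(n_in), coeffs, centers)  # shape (n_out,)
```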
-
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.
-
Neural Statistical Functions
Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.
-
Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining
Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.
-
The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts
The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.
-
Selectivity and Shape in the Design of Forward-Forward Goodness Functions
Shape- and peak-sensitive goodness functions for Forward-Forward deliver gains of up to 72 percentage points over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices
DECO matches dense model performance at 20% expert activation via ReLU-based routing with learnable scaling and the NormSiLU activation, plus a 3x real-hardware speedup.
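The summary's "ReLU-based routing with learnable scaling" admits a natural reading, sketched below: a ReLU router gives any expert with a non-positive logit an exactly-zero gate, so sparsity falls out of the gate itself. The routine and its names are hypothetical illustrations of that reading, not DECO's verified implementation:

```python
import numpy as np

def relu_route(x, W_router, scale):
    """Hypothetical ReLU router: gate_e = relu(w_e . x) * s_e.

    Experts with non-positive logits get an exactly-zero gate, so only
    the active set needs evaluating (speculative reading of the one-line
    summary above).
    """
    gates = np.maximum(W_router @ x, 0.0) * scale   # (n_experts,)
    active = np.flatnonzero(gates)                  # evaluate only these
    return gates, active
```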
-
On the global convergence of gradient descent for wide shallow models with bounded nonlinearities
Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.
-
Sparsity Moves Computation: How FFN Architecture Reshapes Attention in Small Transformers
Sparse MoE in FFN blocks redistributes computation to attention in small Transformers primarily due to architectural capacity reduction and partitioning, not learned router specialization.
-
MoMo: Conditioned Contrastive Representation Learning for Preference-Modulated Planning
MoMo uses Feature-Wise Linear Modulation and low-rank neural modulation to condition contrastive planning representations on user preferences while preserving inference efficiency and probability density ratios.
-
What If We Let Forecasting Forget? A Sparse Bottleneck for Cross-Variable Dependencies
MS-FLOW uses a capacity-limited sparse routing mechanism to model only critical inter-variable dependencies in time series data, achieving state-of-the-art accuracy on 12 benchmarks with fewer but more reliable connections.
-
On the Blessing of Pre-training in Weak-to-Strong Generalization
Pre-training provides a geometric warm start in a single-index model that enables weak-to-strong generalization up to a supervisor-limited bound, with empirical phase-transition evidence in LLMs.
-
Competing nonlinearities, criticality, and order-to-chaos transition in deep networks
A statistical mixture of Tanh and Swish activations with critical mixing fraction p_c induces a continuous phase transition to scale-invariant signal propagation in deep networks while preserving smoothness.
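One plausible concretization of such a statistical mixture, kept deliberately minimal: each unit is assigned Tanh with probability p and Swish otherwise, with the assignment fixed at initialization. The critical fraction p_c is a result of the cited analysis and is not derived here:

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):
    return x / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))

def mixed_activation(x, p, mask=None):
    """Per-unit Tanh/Swish mixture: tanh with probability p, Swish otherwise.

    A minimal sketch of one reading of the mixture; the paper's exact
    construction may differ.
    """
    if mask is None:
        mask = rng.random(x.shape[-1]) < p   # fixed per-unit assignment
    return np.where(mask, np.tanh(x), swish(x))
```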
-
Neural-network reconstruction of THz transmission spectra using electrically tunable AlGaN/GaN plasmonic-crystal analyzer
A feedforward neural network trained on synthetic data inverts voltage-dependent intensities from an electrically tunable AlGaN/GaN plasmonic analyzer to reconstruct THz spectra, achieving lower error than Tikhonov regularization and identifying most resonances correctly.
-
Distribution-Free Pretraining of Classification Losses via Evolutionary Dynamics
EDL learns a transferable classification loss from unlimited synthetic data via evolutionary optimization and a ranking-consistency objective, serving as a competitive drop-in replacement for cross-entropy on CIFAR-10 with ResNet models.
-
AGoQ: Activation and Gradient Quantization for Memory-Efficient Distributed Training of LLMs
AGoQ delivers up to 52% lower memory use and 1.34x faster training for 8B-32B LLaMA models by using near-4-bit adaptive activations and 8-bit gradients while preserving pretraining convergence and downstream accuracy.
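As background for what "near-4-bit activations" means mechanically, here is generic uniform symmetric fake quantization with a per-tensor scale; AGoQ's adaptive scheme is presumably more elaborate, so this is a baseline sketch, not the paper's method:

```python
import numpy as np

def fake_quant(x, bits=4):
    """Quantize-dequantize with a per-tensor max-abs scale.

    Baseline uniform scheme only; the cited paper's adaptive activation
    quantization is not reproduced here.
    """
    qmax = 2 ** (bits - 1) - 1                  # 7 for signed 4-bit
    scale = np.max(np.abs(x)) / qmax + 1e-12    # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale
```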
-
Four-dimensional QCD equation of state from a quasi-parton model with physics-informed neural networks
A PINN-trained quasi-parton model reproduces lattice cumulants at vanishing chemical potentials and supplies a consistent four-dimensional QCD equation of state at finite densities.
-
Geometric Monomial (GEM): a family of rational 2N-differentiable activation functions
GEM is a new family of C^{2N}-smooth rational activation functions with variants that achieve performance on par with or exceeding GELU on ResNet, GPT-2, and BERT benchmarks.
-
A Complex-Valued Continuous-Variable Quantum Approximation Optimization Algorithm (CCV-QAOA)
CCV-QAOA is a new complex-valued continuous-variable variant of QAOA that solves real and complex multivariate optimization problems via a variational framework.
-
OTProf: estimating high-resolution profiles of optical turbulence ($C_n^2$) from reanalysis using deep learning
Deep learning model OTProf generates high-resolution C_n² profiles from ERA5 reanalysis data and outperforms the Hufnagel-Valley model for vertical structure and integrated parameters like Fried parameter r_0 in the Netherlands.
-
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.
-
Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices
A neural network predicts sensitive pseudospectra regions from matrix features to accelerate computation on structured non-normal banded matrices while preserving accuracy in identifying those regions.
-
Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery
LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.
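The adapter mechanism in the summary has a standard shape, sketched here under assumptions: a zero-initialized linear map added residually around each frozen ViT block, so training starts from the pre-trained model exactly. `frozen_block` and the zero init are illustrative, not the paper's confirmed recipe:

```python
import numpy as np

def adapted_block(x, frozen_block, A):
    """Residual linear adapter: y = block(x) + x @ A.

    With A initialized to zeros the adapted model starts identical to the
    frozen backbone (a hypothetical sketch of the summary's mechanism).
    """
    return frozen_block(x) + x @ A
```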
-
Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions
BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.
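The classical construction behind the title is easy to state: the degree-n Bernstein polynomial of f on [0,1] is B_n(f; x) = sum_k f(k/n) C(n,k) x^k (1-x)^(n-k), which converges uniformly for continuous f and never increases the Lipschitz constant. The sketch shows that classical construction only; BerLU's specific activation is not reproduced:

```python
import numpy as np
from math import comb

def bernstein(f, n, x):
    """Degree-n Bernstein polynomial of f on [0, 1]:
    B_n(f; x) = sum_k f(k/n) * C(n, k) * x^k * (1 - x)^(n - k)."""
    k = np.arange(n + 1)
    coeffs = np.array([comb(n, i) for i in k]) * f(k / n)
    x = np.asarray(x, dtype=float)[..., None]
    return np.sum(coeffs * x ** k * (1.0 - x) ** (n - k), axis=-1)

# Smooth a ReLU-style kink at 0.5 (illustrative target, not BerLU itself):
y = bernstein(lambda t: np.maximum(0.0, 2.0 * t - 1.0), n=20,
              x=np.linspace(0.0, 1.0, 5))
```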
-
GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories
GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks while running thousands of times faster than optimization solvers.
-
Physics-informed neural networks for form-finding of unilateral membrane structures
PINNs with hard and soft boundary enforcement solve membrane form-finding PDEs to accuracy comparable with FEM, with hard-BC yielding smaller boundary errors.
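The hard-vs-soft distinction in the summary is a standard PINN pattern, sketched here on a unit square with hypothetical boundary data g; the membrane PDE itself is omitted:

```python
import numpy as np

def u_hard(x, y, net, g):
    """Hard enforcement: the ansatz u = g + d * net satisfies u = g on the
    boundary by construction, since d = x(1-x)y(1-y) vanishes there."""
    d = x * (1.0 - x) * y * (1.0 - y)
    return g(x, y) + d * net(x, y)

def soft_bc_loss(net, g, xb, yb):
    """Soft enforcement: add a mean-squared boundary residual to the loss
    instead of building the condition into the ansatz."""
    return np.mean((net(xb, yb) - g(xb, yb)) ** 2)
```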
-
ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications
ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.
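One way to anchor activation means near zero (a speculative reading; the actual ZC-Swish mechanism may differ) is to subtract Swish's expectation under a standard-normal input, estimated once by Monte Carlo:

```python
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-np.clip(beta * x, -60.0, 60.0)))

# E[swish(Z)] for Z ~ N(0, 1), estimated numerically (~0.21 for beta = 1).
_rng = np.random.default_rng(0)
SHIFT = swish(_rng.standard_normal(1_000_000)).mean()

def zero_centered_swish(x):
    """Hypothetical zero-centered Swish: plain Swish minus its mean under a
    standard-normal input, so activations start mean-centered."""
    return swish(x) - SHIFT
```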
-
YOLOv4: Optimal Speed and Accuracy of Object Detection
YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.
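Among the components listed, the Mish activation has a compact standard definition, x * tanh(softplus(x)); a minimal sketch:

```python
import numpy as np

def mish(x):
    """Mish activation used in YOLOv4: x * tanh(softplus(x)), with
    softplus(x) = log(1 + exp(x)) computed stably via logaddexp."""
    return x * np.tanh(np.logaddexp(0.0, x))
```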
-
Agentic Risk-Aware Set-Based Engineering Design
A multi-agent LLM system applies set-based design and Conditional Value-at-Risk to explore and risk-filter airfoil designs, with a human manager coordinating the agents.
-
GLU Variants Improve Transformer
Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
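The paper's variants share one template: the feed-forward sublayer's first transform becomes a gated product of two projections. A minimal sketch of the SwiGLU case (bias-free, as in the paper):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-np.clip(x, -60.0, 60.0)))

def ffn_swiglu(x, W, V, W2):
    """FFN_SwiGLU(x) = (Swish(x W) * (x V)) W2.

    Swapping the gate nonlinearity gives the other variants: sigmoid for
    the original GLU, GELU for GEGLU, identity for the bilinear form.
    """
    return (swish(x @ W) * (x @ V)) @ W2
```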
-
Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification
A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.
-
Cosmos World Foundation Model Platform for Physical AI
The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.
-
A Survey of Large Language Models
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.