Searching for Activation Functions

Prajit Ramachandran, Barret Zoph, Quoc V. Le · 2017 · cs.NE · arXiv 1710.05941

Mixed citation behavior. The most common role is background (69% of classified citations).

84 Pith papers citing it

abstract

The choice of activation functions in deep networks has a significant effect on the training dynamics and task performance. Currently, the most successful and widely-used activation function is the Rectified Linear Unit (ReLU). Although various hand-designed alternatives to ReLU have been proposed, none have managed to replace it due to inconsistent gains. In this work, we propose to leverage automatic search techniques to discover new activation functions. Using a combination of exhaustive and reinforcement learning-based search, we discover multiple novel activation functions. We verify the effectiveness of the searches by conducting an empirical evaluation with the best discovered activation function. Our experiments show that the best discovered activation function, $f(x) = x \cdot \text{sigmoid}(\beta x)$, which we name Swish, tends to work better than ReLU on deeper models across a number of challenging datasets. For example, simply replacing ReLUs with Swish units improves top-1 classification accuracy on ImageNet by 0.9% for Mobile NASNet-A and 0.6% for Inception-ResNet-v2. The simplicity of Swish and its similarity to ReLU make it easy for practitioners to replace ReLUs with Swish units in any neural network.
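The formula in the abstract, $f(x) = x \cdot \text{sigmoid}(\beta x)$, can be illustrated with a minimal sketch. It assumes scalar inputs and a default of β = 1, which is a choice made for this illustration and is not stated in the abstract. The helper names `sigmoid`, `swish`, and `relu` are hypothetical and do not come from the paper's code. The sketch also shows two limiting cases that follow directly from the formula: β = 0 gives x/2, and a large β approaches ReLU.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # z < 0, so e is in (0, 1) and cannot overflow
    return e / (1.0 + e)


def swish(x: float, beta: float = 1.0) -> float:
    """Swish as given in the abstract: f(x) = x * sigmoid(beta * x).

    The default beta = 1.0 is an assumption of this sketch.
    """
    return x * sigmoid(beta * x)


def relu(x: float) -> float:
    """Rectified Linear Unit, included only for comparison."""
    return max(0.0, x)


if __name__ == "__main__":
    # Compare Swish (beta = 1) with ReLU on a few sample inputs.
    for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
        print(f"x={x:+.1f}  swish={swish(x):+.4f}  relu={relu(x):+.4f}")
```

Unlike ReLU, this function is smooth and takes small negative values for negative inputs, which is visible in the printed comparison.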

citation-role summary

background: 12 · method: 1

citation-polarity summary

background: 9 · unclear: 3 · use method: 1

representative citing papers

Floating-Point Networks with Automatic Differentiation Can Represent Almost All Floating-Point Functions and Their Gradients

cs.LG · 2026-05-03 · unverdicted · novelty 8.0

Floating-point neural networks with automatic differentiation can represent arbitrary floating-point functions and their gradients under mild conditions.

Supervised Guidance Training for Infinite-Dimensional Diffusion Models

cs.LG · 2026-01-28 · conditional · novelty 8.0

Supervised Guidance Training enables conditioning of infinite-dimensional diffusion models via an extended Doob h-transform so that fine-tuned models accurately sample from posteriors in function space.

KAN: Kolmogorov-Arnold Networks

cs.LG · 2024-04-30 · conditional · novelty 8.0

KANs with learnable univariate spline activations on edges achieve better accuracy than MLPs with fewer parameters, faster scaling, and direct visualization for scientific discovery.

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

cs.LG · 2023-12-01 · unverdicted · novelty 8.0

Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

Low-dimensional topology of deep neural networks

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

Restricting layers to width 3 and using linking numbers shows ResNets and transformers match in topological power, exceed monotonic feedforward nets which exceed flows, but nonmonotonic activations match the top class.

Nonparametric undirected graphical model selection using diffusion models

stat.ME · 2026-06-07 · unverdicted · novelty 7.0

A diffusion model-based nonparametric method for undirected graphical model selection with model selection consistency.

CoMetaPNS: Continually Meta-learning Personalized Neural Surrogates for Cardiac Electrophysiology Simulations

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

CoMetaPNS combines meta-learned neural surrogates with a continual Bayesian Gaussian Mixture Model to adapt cardiac electrophysiology simulations to new data while avoiding catastrophic forgetting.

Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods

stat.ML · 2026-06-04 · unverdicted · novelty 7.0

The paper derives the first minimax-optimal excess population risk rates for gradient descent and stochastic gradient descent on over-parameterized DNNs by linking their dynamics to kernel methods under polynomial width scaling.

TriSearch: Learning to Optimize Triangulations via Bistellar Flips

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

TriSearch is an RL framework that optimizes triangulations of polytopes using bistellar flips with a circuit-supported subtriangulation action representation, generalizing zero-shot to larger instances and outperforming prior samplers in 3D and 4D.

Expressive Power of Floating-Point Neural Networks with Arbitrary Reduction Orders and Inexact Activation Implementations

cs.LG · 2026-05-27 · unverdicted · novelty 7.0

Floating-point neural networks achieve universal representability for practical activations like ReLU, sigmoid, and tanh under arbitrary reduction orders and bounded ulp errors in activations via a new distinguishability condition.

Neural Statistical Functions

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Neural statistical functions use prefix statistics to unify and directly predict statistical quantities over continuous ranges from pre-trained single-sample models without repeated sampling.

Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

cs.LG · 2026-04-13 · unverdicted · novelty 7.0 · 2 refs

The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and saliency maps.

Selectivity and Shape in the Design of Forward-Forward Goodness Functions

cs.LG · 2026-03-28 · unverdicted · novelty 7.0

Shape- and peak-sensitive goodness functions for Forward-Forward deliver up to 72pp gains over sum-of-squares, reaching 98.2% on MNIST and 89% on Fashion-MNIST.

SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning

cs.LG · 2026-03-20 · unverdicted · novelty 7.0

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.

Evolving Multi-Channel Confidence-Aware Activation Functions for Missing Data with Channel Propagation

cs.NE · 2026-02-14 · unverdicted · novelty 7.0

Evolved multi-channel activation functions that incorporate missingness and confidence scores improve classification performance on datasets with missing data.

Imposing Boundary Conditions on Neural Operators via Learned Function Extensions

cs.LG · 2026-02-04 · unverdicted · novelty 7.0

A framework learns boundary-to-domain pseudo-extensions to condition neural operators on complex BCs, achieving SOTA accuracy on 18 challenging PDE datasets without hyperparameter tuning.

DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

cs.LG · 2025-12-18 · unverdicted · novelty 7.0

DiffeoMorph learns distributed agent protocols to morph into complex 3D shapes from minimal initial conditions via equivariant GNNs and rotation-invariant Zernike loss.

Kolmogorov-Arnold Chemical Reaction Neural Networks for learning pressure-dependent kinetic rate laws

physics.chem-ph · 2025-11-10 · unverdicted · novelty 7.0

KA-CRNNs learn pressure-dependent and collider-specific kinetic rate laws from data using Kolmogorov-Arnold activations inside a CRNN framework, outperforming interpolative methods by 2.88x in MSE on two proof-of-concept reactions.

Diffusion and Flow-based Copulas: Forgetting and Remembering Dependencies

stat.ML · 2025-09-24 · unverdicted · novelty 7.0

Diffusion and flow processes forget dependencies to define valid copulas then learn to remember them for density estimation and sampling, outperforming prior copula methods on complex datasets.

Accurate and scalable exchange-correlation with deep learning

physics.chem-ph · 2025-06-17 · unverdicted · novelty 7.0

Skala is a neural XC functional trained on wavefunction data that beats state-of-the-art hybrids on main-group chemistry benchmarks at semi-local computational cost.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

Generative Semantic Communication: Diffusion Models Beyond Bit Recovery

cs.AI · 2023-06-07 · unverdicted · novelty 7.0

A generative semantic communication system that sends compressed semantic information and uses diffusion models with spatially-adaptive normalizations to reconstruct high-quality, semantically consistent images even under severe channel noise.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

cs.LG · 2019-05-28 · accept · novelty 7.0

EfficientNet scales network depth, width, and resolution uniformly via a compound coefficient to deliver state-of-the-art accuracy and efficiency on image classification.

citing papers explorer

Showing 34 of 84 citing papers.

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · none · ref 22 · internal anchor

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

TF-MoE: Time-Frequency Mixture-of-Experts for Efficient Speech Separation

cs.SD · 2026-06-28 · unverdicted · none · ref 41 · 2 links · internal anchor

TF-MoE uses dynamic per-frame and per-mel-band expert selection in time and frequency dimensions to improve speech separation performance at comparable compute cost to prior models.

SaluNet: Enabling Total Plasticity in Normalization-Free Deep Networks

cs.CV · 2026-06-01 · unverdicted · none · ref 14 · internal anchor

SaluNet replaces normalization layers with the SALU activation and reports competitive accuracies on CIFAR-10/100 and ImageNet-1K without normalization.

LALE: Lightweight-Transformer Architecture for Land-Cover Estimation

eess.IV · 2026-06-01 · unverdicted · none · ref 19 · internal anchor

LALE introduces a bifurcated ConvMixer-transformer encoder with an all-MLP decoder for efficient semantic segmentation of remote sensing imagery, achieving near-baseline F1 scores with 4.5x fewer parameters on the ARAS400k benchmark.

Confidence-Adaptive SwiGLU for Mixture-of-Experts

cs.LG · 2026-05-30 · unverdicted · none · ref 35 · internal anchor

κ-SwiGLU adapts SiLU gate sharpness in MoE Transformers as a learnable function of router logits, reporting improved mean CORE performance on FineWeb-Edu across 8-28 layer models with negligible added parameters and small overhead.

Selective Ambulance Dispatch Under Contextual Travel-Time Uncertainty

math.OC · 2026-05-22 · unverdicted · none · ref 15 · internal anchor

IDEAL is a selective dual ambulance dispatch framework that learns context-specific travel times via weakly supervised bilevel networks and models uncertainty with Burg-divergence perturbations to achieve better response-time and resource trade-offs than region-based or map-based baselines.

A Constant-Time Implementation Methodology for Activation Functions on Microcontrollers

cs.CR · 2026-05-21 · unverdicted · none · ref 11 · internal anchor

A constant-time implementation methodology for activation functions on ARM Cortex-M4 microcontrollers using branchless selection, Padé approximations, dummy arithmetic, and cycle alignment to eliminate timing side channels while preserving accuracy.

Quantification of atmospheric carbon dioxide from the Geostationary Operational Environmental Satellite (GOES East)

physics.ao-ph · 2026-05-17 · unverdicted · none · ref 4 · internal anchor

A physics-guided neural network trained on collocated GOES-East and OCO-2/3 data estimates XCO2 and reproduces observed variability against held-out OCO and TCCON measurements.

Mela: Test-Time Memory Consolidation based on Transformation Hypothesis

cs.CL · 2026-05-11 · unverdicted · none · ref 15 · internal anchor

Mela is a Transformer variant with a dual-frequency Hierarchical Memory Module and MemStack that performs test-time memory consolidation, outperforming baselines on long contexts.

Neural-Guided Domain Restriction to Accelerate Pseudospectra Computation for Structured Non-normal Banded Matrices

math.NA · 2026-05-06 · unverdicted · none · ref 26 · internal anchor

A neural network predicts sensitive pseudospectra regions from matrix features to accelerate computation on structured non-normal banded matrices while preserving accuracy in identifying those regions.

Sparsity Hurts: Simple Linear Adapter Can Boost Generalized Category Discovery

cs.CV · 2026-05-05 · unverdicted · none · ref 57 · internal anchor

LAGCD inserts residual linear adapters into each ViT block plus a distribution alignment loss to improve generalized category discovery by increasing model flexibility while reducing bias between seen and novel classes.

Universal Smoothness via Bernstein Polynomials: A Constructive Approximation Approach for Activation Functions

cs.AI · 2026-05-04 · unverdicted · none · ref 13 · internal anchor

BerLU constructs a C1-differentiable activation with Lipschitz constant 1 via Bernstein polynomial approximation, showing better performance and efficiency than baselines on image classification with ViTs and CNNs.

GCImOpt: Learning efficient goal-conditioned policies by imitating optimal trajectories

cs.RO · 2026-04-24 · unverdicted · none · ref 9 · internal anchor

GCImOpt trains compact goal-conditioned neural policies by imitating efficiently generated optimal trajectories, achieving high success rates and near-optimal performance on cart-pole, quadcopter, and robot arm tasks while running thousands of times faster than optimization solvers.

Physics-informed neural networks for form-finding of unilateral membrane structures

cs.CE · 2026-04-21 · unverdicted · none · ref 35 · internal anchor

PINNs with hard and soft boundary enforcement solve membrane form-finding PDEs to accuracy comparable with FEM, with hard-BC yielding smaller boundary errors.

ZC-Swish: Stabilizing Deep BN-Free Networks for Edge and Micro-Batch Applications

cs.LG · 2026-04-21 · unverdicted · none · ref 2 · internal anchor

ZC-Swish stabilizes deep BN-free networks by anchoring activation means near zero, preventing collapse at depths 16 and beyond where standard Swish fails.

Data-Efficient Self-Supervised Algorithms for Fine-Grained Birdsong Analysis

cs.LG · 2025-11-15 · unverdicted · none · ref 44 · internal anchor

A three-stage self-supervised pipeline for data-efficient frame-level syllable detection in complex birdsong using a Residual MLP-RNN model.

Activation Function Design Sustains Plasticity in Continual Learning

cs.LG · 2025-09-26 · unverdicted · none · ref 21 · internal anchor

Smooth-Leaky and Randomized Smooth-Leaky activations mitigate loss of plasticity in continual learning by targeting negative-branch shape and saturation behavior.

YOLOv4: Optimal Speed and Accuracy of Object Detection

cs.CV · 2020-04-23 · unverdicted · none · ref 59 · internal anchor

YOLOv4 achieves 43.5% AP (65.7% AP50) on MS COCO at ~65 FPS on Tesla V100 by integrating WRC, CSP, CmBN, SAT, Mish activation, Mosaic augmentation, DropBlock, and CIoU loss.

Deep Learning for CSI Feedback Based on Superimposed Coding

cs.NI · 2019-07-27 · unverdicted · none · ref 30 · internal anchor

A multi-task neural network recovers superimposed downlink CSI and uplink data sequences in FDD massive MIMO, improving CSI estimation over standalone SC while maintaining similar UL-US detection across varying SNR and PPC.

Ghost in the Kernel: In-Context Learning with Efficient Transformers via Domain Generalization

cs.LG · 2026-07-01 · unverdicted · none · ref 33 · internal anchor

Linear transformers perform in-context learning by mapping context distributions to response functions, achieving dimension-independent convergence rates under domain generalization with tradeoffs in data and feature regularities.

Deep Reinforcement Learning for Spacecraft Attitude Control During Atmospheric Re-Entry

cs.LG · 2026-06-30 · unverdicted · none · ref 157 · internal anchor

Hybrid RL-PID controllers track angle of attack better and show greater robustness than PID alone within a defined operational envelope for re-entry attitude control.

A Surrogate Model for Proton Spectrum Prediction to Map Transitions in Laser-Ion Acceleration

physics.plasm-ph · 2026-06-04 · unverdicted · none · ref 51 · internal anchor

A decoupled dual-branch surrogate model predicts proton spectra with R²=0.94 for cutoff energy and flux, median spectral R²=0.985, and reproduces TNSA-to-RIT/BOA regime transitions validated on 1D PIC simulations.

PowLU: An Activation Function for Stable Pre-Training of LLMs

cs.CL · 2026-05-25 · unverdicted · none · ref 16 · internal anchor

PowLU replaces SwiGLU with a rational-power activation to reduce outlier amplification and numerical instability during large-scale LLM pre-training while matching performance.

Activation Functions, Statistics and Learning of Higher-Order Interactions in Restricted Boltzmann Machines

cond-mat.dis-nn · 2026-05-18 · unverdicted · none · ref 11 · 2 links · internal anchor

RBMs with Gaussian weights rarely induce or easily learn distributions with strong higher-order interactions on visible units, except when the hidden-unit activation function is Exponential.

Agentic Risk-Aware Set-Based Engineering Design

cs.AI · 2026-04-17 · unverdicted · none · ref 53 · internal anchor

Multi-agent LLM system applies set-based design and Conditional Value-at-Risk to explore and risk-filter airfoil designs with human manager coordination.

DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding

cs.CV · 2023-12-05 · unverdicted · none · ref 33 · internal anchor

DemaFormer pairs energy-based modeling with a damped-EMA Transformer to localize video moments matching language queries and reports gains over baselines on four datasets.

GLU Variants Improve Transformer

cs.LG · 2020-02-12 · unverdicted · none · ref 7 · internal anchor

Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

CNN-Based Classifier for Automated Identification of Magnetic States in Spin Dynamics Simulations

cond-mat.mtrl-sci · 2026-05-21 · unverdicted · none · ref 64 · internal anchor

CNN classifies nine magnetic states from visualized atomistic spin dynamics simulation images using EfficientNetV1B0.

Deep Learning for Solving and Estimating Dynamic Models in Economics and Finance

econ.GN · 2026-05-14 · unverdicted · none · ref 10 · internal anchor

The paper surveys deep learning methods such as Deep Equilibrium Nets and Physics-Informed Neural Networks for solving and estimating high-dimensional dynamic stochastic models in economics and finance.

Developing a Strong Pre-Trained Base Model for Plant Leaf Disease Classification

cs.CV · 2026-05-02 · unverdicted · none · ref 146 · internal anchor

A DenseNet201 base model trained on a constructed plant leaf disease dataset outperforms baselines and enables faster, more robust transfer learning with less data than general models.

Graph neural network for colliding particles with an application to sea ice floe modeling

cs.LG · 2026-02-18 · unverdicted · none · ref 62 · internal anchor

A graph neural network learns to simulate 1D sea ice floe collisions and trajectories using data assimilation on synthetic data.

Cosmos World Foundation Model Platform for Physical AI

cs.CV · 2025-01-07 · unverdicted · none · ref 161 · internal anchor

The Cosmos platform supplies open-source pre-trained world models and supporting tools for building fine-tunable digital world simulations to train Physical AI.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · none · ref 285 · internal anchor

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

Deep learning applied to computational mechanics: A comprehensive review, state of the art, and the classics

cs.LG · 2022-12-18 · unverdicted · none · ref 38 · internal anchor

A comprehensive review of deep learning techniques for computational mechanics, including LSTM for constitutive modeling, PINNs for PDE solving, optimizers, and kernel methods.
