pith. machine review for the scientific record.

arxiv: 1606.08415 · v5 · submitted 2016-06-27 · 💻 cs.LG

Recognition: 2 theorem links · Lean Theorem

Gaussian Error Linear Units (GELUs)

Dan Hendrycks, Kevin Gimpel

Pith reviewed 2026-05-10 12:12 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation function · GELU · ReLU · ELU · neural network · deep learning · computer vision · natural language processing

The pith

The GELU activation xΦ(x) outperforms ReLU and ELU on computer vision, natural language processing, and speech tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes the Gaussian Error Linear Unit (GELU) as a new activation function for neural networks. GELU is defined as xΦ(x), where Φ(x) is the cumulative distribution function of the standard normal distribution, which has the effect of weighting inputs by their value rather than gating them by sign like the ReLU. The authors conduct experiments comparing GELU to ReLU and ELU across multiple domains and report consistent performance improvements with GELU. A sympathetic reader would care because activation functions are a basic building block of deep networks, and small changes here can affect overall model quality without requiring architectural redesign.
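
As a concrete sketch of the definition (an editorial illustration, not code from the paper): the exact form xΦ(x) and the two approximations the paper gives, 0.5·x·(1 + tanh(√(2/π)·(x + 0.044715·x³))) and x·σ(1.702·x), written with NumPy and SciPy.

    import numpy as np
    from scipy.stats import norm

    def gelu_exact(x):
        # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
        return x * norm.cdf(x)

    def gelu_tanh(x):
        # Tanh approximation reported in the paper.
        return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

    def gelu_sigmoid(x):
        # Sigmoid approximation reported in the paper: x * sigmoid(1.702 * x).
        return x / (1.0 + np.exp(-1.702 * x))

    x = np.linspace(-4.0, 4.0, 9)
    print(np.abs(gelu_exact(x) - gelu_tanh(x)).max())  # approximation error stays small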

Core claim

The GELU nonlinearity, given by xΦ(x) with Φ the standard Gaussian CDF, weights inputs by their value and yields better empirical performance than ReLU or ELU on the considered computer vision, natural language processing, and speech tasks.

What carries the argument

The GELU function xΦ(x), which multiplies each input by the probability that a standard normal random variable is less than or equal to that input.

If this is right

  • GELU can be used as a drop-in replacement for ReLU or ELU in existing neural network models (see the sketch after this list).
  • Performance gains are expected in vision, language, and speech applications when using GELU.
  • Training may converge to better solutions because inputs are scaled continuously rather than thresholded.
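
A minimal sketch of the drop-in swap from the first point above, assuming a PyTorch environment (torch.nn ships both ReLU and GELU modules); the architecture and names here are illustrative, not from the paper.

    import torch.nn as nn

    def make_mlp(activation: nn.Module) -> nn.Sequential:
        # Identical architecture either way; only the nonlinearity differs.
        return nn.Sequential(
            nn.Linear(784, 256),
            activation,
            nn.Linear(256, 10),
        )

    relu_model = make_mlp(nn.ReLU())
    gelu_model = make_mlp(nn.GELU())  # same parameter count, same interface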

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The weighting mechanism may provide smoother gradient flow during backpropagation compared to hard gating.
  • Similar activation functions could be derived using other probability distributions beyond the Gaussian (see the sketch after this list).
  • Adoption of GELU might reduce the need for careful initialization or normalization techniques in some models.
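
A sketch of the second point above (an editorial extension, not something the paper itself does): the x·F(x) construction evaluated with other CDFs, where the logistic CDF recovers the familiar x·sigmoid(x) form.

    import numpy as np
    from scipy.stats import norm, logistic, laplace

    def cdf_activation(x, dist):
        # Generic x * F(x) construction for any CDF F.
        return x * dist.cdf(x)

    x = np.linspace(-3.0, 3.0, 7)
    print(cdf_activation(x, norm))      # GELU: x * Phi(x)
    print(cdf_activation(x, logistic))  # x * sigmoid(x)
    print(cdf_activation(x, laplace))   # a Laplace-CDF variant, for comparison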

Load-bearing premise

The performance improvements seen on the specific tasks and models tested will continue to appear on other architectures, datasets, and training setups.

What would settle it

Running the same models with GELU on a new task or dataset and observing either no improvement or a degradation relative to ReLU.
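
A hedged sketch of that test, assuming the PyTorch model builder sketched earlier and standard DataLoaders for the new task (all names here are illustrative): train identically configured models that differ only in the activation and compare validation accuracy.

    import torch
    import torch.nn as nn

    def train_and_eval(model, train_loader, val_loader, epochs=10, lr=1e-3):
        # Same optimizer, schedule, and data for every activation,
        # so any accuracy gap is attributable to the nonlinearity.
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for xb, yb in train_loader:
                opt.zero_grad()
                loss_fn(model(xb), yb).backward()
                opt.step()
        correct = total = 0
        with torch.no_grad():
            for xb, yb in val_loader:
                correct += (model(xb).argmax(dim=1) == yb).sum().item()
                total += yb.numel()
        return correct / total

    # acc_gelu = train_and_eval(make_mlp(nn.GELU()), train_loader, val_loader)
    # acc_relu = train_and_eval(make_mlp(nn.ReLU()), train_loader, val_loader)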

read the original abstract

We propose the Gaussian Error Linear Unit (GELU), a high-performing neural network activation function. The GELU activation function is $x\Phi(x)$, where $\Phi(x)$ is the standard Gaussian cumulative distribution function. The GELU nonlinearity weights inputs by their value, rather than gates inputs by their sign as in ReLUs ($x\mathbf{1}_{x>0}$). We perform an empirical evaluation of the GELU nonlinearity against the ReLU and ELU activations and find performance improvements across all considered computer vision, natural language processing, and speech tasks.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces the Gaussian Error Linear Unit (GELU) activation function defined as xΦ(x), with Φ denoting the standard Gaussian cumulative distribution function. It highlights that GELU weights inputs according to their value, in contrast to ReLU which gates by sign. An empirical evaluation is performed comparing GELU to ReLU and ELU, with reported performance gains across computer vision, natural language processing, and speech tasks.

Significance. If the results are reliable, GELU offers a high-performing, parameter-free activation function with a probabilistic interpretation. This could lead to better neural network models in various fields. The direct definition from the Gaussian CDF and the broad empirical testing are positive aspects of the work.

major comments (1)
  1. The abstract and corresponding experimental sections report consistent improvements but omit key details such as the number of runs, statistical significance tests, hyperparameter search methodology, and precise baseline configurations. These omissions make it difficult to fully evaluate the strength of the empirical claims.
minor comments (3)
  1. The definition of Φ(x) as the Gaussian CDF should be stated explicitly in the introduction or methods section for readers unfamiliar with the notation.
  2. Consider adding a figure illustrating the GELU function alongside ReLU and ELU to visually support the textual description.
  3. Ensure that all acronyms are defined at first use, such as NLP if used in the text.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential significance of the GELU activation function. We address the major comment below and will revise the manuscript to strengthen the experimental reporting.

read point-by-point responses
  1. Referee: The abstract and corresponding experimental sections report consistent improvements but omit key details such as the number of runs, statistical significance tests, hyperparameter search methodology, and precise baseline configurations. These omissions make it difficult to fully evaluate the strength of the empirical claims.

    Authors: We agree that additional experimental details would improve the manuscript's clarity and allow readers to better assess the reliability of the reported gains. In the revised version, we will expand the relevant sections to specify: the number of runs (noting that computational constraints led to single runs for most large-scale experiments, consistent with practices in the field at the time of submission); that formal statistical significance tests were not conducted but improvements were consistent across diverse tasks; a description of the hyperparameter search process (including search ranges and selection criteria for learning rates, regularization, and other settings applied uniformly to GELU, ReLU, and ELU); and more precise baseline configurations, such as exact network architectures, initialization schemes, and training protocols. These additions will support rather than change the abstract. We believe this directly addresses the concern without overstating the original experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The GELU is defined directly as xΦ(x) using the standard Gaussian CDF with no fitted parameters, self-referential equations, or load-bearing self-citations. The central claim is an empirical observation of performance gains on specific tasks, supported by side-by-side experimental results rather than any internal derivation that reduces to its own inputs by construction. No steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests only on the standard definition of the Gaussian CDF and the empirical results; no free parameters are introduced, no new axioms are stated, and no new entities are postulated.

pith-pipeline@v0.9.0 · 5378 in / 987 out tokens · 41939 ms · 2026-05-10T12:12:07.972046+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Stability and Generalization in Looped Transformers

    cs.LG 2026-04 unverdicted novelty 8.0

    Looped transformers with recall and outer normalization produce reachable, input-dependent fixed points with stable gradients, enabling generalization, while those without recall cannot; a new internal recall variant ...

  2. Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    cs.LG 2023-12 unverdicted novelty 8.0

    Mamba is a linear-time sequence model using input-dependent selective SSMs that achieves SOTA results across modalities and matches twice-larger Transformers on language modeling with 5x higher inference throughput.

  3. Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo

    cond-mat.str-el 2026-05 conditional novelty 7.0

    PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.

  4. Graph Neural Networks with Triangle-Based Messages for the Multicut Problem

    cs.LG 2026-05 unverdicted novelty 7.0

    A triangle-message GNN for multicut outperforms heuristics in solution quality on graphs up to 200 nodes and finds optimal solutions faster than exact solvers for some cases.

  5. Pretraining Strategies and Scaling for ECG Foundation Models: A Systematic Study

    eess.SP 2026-05 unverdicted novelty 7.0

    Contrastive predictive coding pretraining combined with structured state space models yields the strongest ECG foundation models, with continued gains from scaling data to 11 million samples.

  6. Disentangled Sparse Representations for Concept-Separated Diffusion Unlearning

    cs.LG 2026-05 unverdicted novelty 7.0

    SAEParate disentangles sparse representations in diffusion models via contrastive clustering and nonlinear encoding to enable more precise concept unlearning with reduced side effects.

  7. Learning Less Is More: Premature Upper-Layer Attention Specialization Hurts Language Model Pretraining

    cs.CL 2026-05 unverdicted novelty 7.0

    Temporarily reducing the learning rate on upper-layer query and key projections during early GPT pretraining prevents premature attention specialization and improves model performance.

  8. From Holo Pockets to Electron Density: GPT-style Drug Design with Density

    cs.AI 2026-05 unverdicted novelty 7.0

    EDMolGPT generates drug-like molecules from low-resolution electron density point clouds of holo binding pockets and shows effectiveness across 101 biological targets.

  9. Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns

    cs.LG 2026-05 unverdicted novelty 7.0

    SWAP-Score evaluates neural networks without training by quantifying sample-wise activation patterns, achieving high correlation with true performance on CIFAR-10 for CNNs and GLUE for Transformers while enabling fast NAS.

  10. GPROF-IR: An Improved Single-Channel Infrared Precipitation Retrieval for Merged Satellite Precipitation Products

    physics.ao-ph 2026-05 unverdicted novelty 7.0

    GPROF-IR is a CNN-based retrieval that uses temporal context in geostationary IR observations to produce precipitation estimates with lower error than prior IR methods and climatological consistency with PMW retrieval...

  11. Every Feedforward Neural Network Definable in an o-Minimal Structure Has Finite Sample Complexity

    stat.ML 2026-05 unverdicted novelty 7.0

    Every fixed finite feedforward neural network definable in an o-minimal structure has finite sample complexity in the agnostic PAC setting.

  12. Path-Coupled Bellman Flows for Distributional Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Path-Coupled Bellman Flows use source-consistent Bellman-coupled paths and a lambda-parameterized control-variate to learn return distributions via flow matching, improving fidelity and stability over prior DRL approaches.

  13. Align3D-AD: Cross-Modal Feature Alignment and Dual-Prompt Learning for Zero-shot 3D Anomaly Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    Align3D-AD improves zero-shot 3D anomaly detection by cross-modal feature alignment from RGB guidance and dual-prompt contrastive alignment to capture complementary semantics.

  14. Bayesian Rain Field Reconstruction using Commercial Microwave Links and Diffusion Model Priors

    cs.LG 2026-05 unverdicted novelty 7.0

    Diffusion model priors enable training-free Bayesian sampling for more accurate rain field reconstruction from path-integrated commercial microwave link measurements than Gaussian process baselines.

  15. Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

    cs.LG 2026-05 unverdicted novelty 7.0

    MechaRule localizes agonist neurons in LLMs via contrastive hierarchical ablation to ground rule extraction in circuitry, recalling 96.8% of high-effect neurons and reducing task performance when suppressed.

  16. Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FAN achieves state-of-the-art offline RL performance on robotic tasks by anchoring flow policies and using single-sample noise-conditioned Q-learning, with proven convergence and reduced runtimes.

  17. Acceleration of horizontal numerical advection for atmospheric modeling through surrogate modeling with temporal coarse-graining

    physics.ao-ph 2026-04 conditional novelty 7.0

    A CNN surrogate with temporal coarse-graining accelerates 10-day advection simulations up to 92x while achieving r² of 0.60-0.98 against the baseline solver.

  18. Robust Model-Based Iteration for Passive Gamma Emission Tomography

    math.NA 2026-04 unverdicted novelty 7.0

    A safeguarded hybrid of Levenberg-Marquardt and learned operators achieves equivalent reconstruction quality for PGET in roughly one-third the iterations, with architecture-dependent robustness.

  19. Learning Neural Operator Surrogates for the Black Hole Accretion Code

    astro-ph.HE 2026-04 unverdicted novelty 7.0

    Physics-informed Fourier neural operators recover plasmoid formation in sparse SRRMHD vortex data where data-only models fail, and transformer operators approximate AMR jet evolution, marking first reported uses in th...

  20. PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

    q-bio.QM 2026-04 unverdicted novelty 7.0

    PhyloSDF generates novel 3D skull morphologies for Darwin's finches via phylogenetically-conditioned residual flow matching, achieving 88-129% of real intra-species variation from few specimens and enabling phylogenet...

  21. VitaminP: cross-modal learning enables whole-cell segmentation from routine histology

    cs.CV 2026-04 unverdicted novelty 7.0

    VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.

  22. A satellite foundation model for improved wealth monitoring

    cs.CY 2026-04 unverdicted novelty 7.0

    Tempov is a self-supervised satellite foundation model that predicts wealth levels and decadal changes at high resolution across Africa from Landsat imagery, outperforming baselines even with limited labels and genera...

  23. Latent Space Probing for Adult Content Detection in Video Generative Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Latent space probing on CogVideoX achieves 97.29% F1 for adult content detection on a new 11k-clip dataset with 4-6ms overhead.

  24. Critical role of phase-dependent properties in modeling photothermal sintering of LiCoO2 cathodes

    cond-mat.mtrl-sci 2026-04 unverdicted novelty 7.0

    Amorphous LiCoO2 absorbs light more strongly and reaches higher peak temperatures than crystalline LiCoO2 during photothermal sintering, so constant-property models overestimate safe operating windows.

  25. To See the Unseen: on the Generalization Ability of Transformers in Symbolic Reasoning

    cs.AI 2026-04 conditional novelty 7.0

    Unembedding collapse in transformers prevents distinguishing unseen tokens in symbolic reasoning, but targeted interventions restore generalization.

  26. Decoding Text Spans for Efficient and Accurate Named-Entity Recognition

    cs.CL 2026-04 unverdicted novelty 7.0

    SpanDec achieves competitive NER accuracy with improved efficiency by using a final-stage lightweight decoder for span representations and early candidate filtering to reduce redundant computation.

  27. Latent Fourier Transform

    cs.SD 2026-04 unverdicted novelty 7.0

    LatentFT uses latent-space Fourier transforms and frequency masking in diffusion autoencoders to enable timescale-specific manipulation of musical structure in generative models.

  28. Grokking of Diffusion Models: Case Study on Modular Addition

    cs.LG 2026-04 unverdicted novelty 7.0

    Diffusion models show grokking on modular addition by composing periodic operand representations in simple data regimes or by separating arithmetic computation from visual denoising across timesteps in varied regimes.

  29. Machine learning isotope shifts in molecular energy levels

    astro-ph.EP 2026-04 unverdicted novelty 7.0

    Neural network corrects residual errors in isotopologue energy extrapolations for CO2 (MAE reduction in >87% of levels vs Marvel) and transfers patterns to improve CO predictions in >93% of samples.

  30. Neural Dynamic GI: Random-Access Neural Compression for Temporal Lightmaps in Dynamic Lighting Environments

    cs.GR 2026-04 unverdicted novelty 7.0

    NDGI compresses temporal lightmaps via neural feature maps and lightweight networks, delivering high-quality dynamic global illumination with low storage and modest real-time decompression cost.

  31. The Linear Centroids Hypothesis: Features as Directions Learned by Local Experts

    cs.LG 2026-04 unverdicted novelty 7.0

    The Linear Centroids Hypothesis reframes network features as directions in centroid spaces of local affine experts, unifying interpretability methods and yielding sparser, more faithful dictionaries, circuits, and sal...

  32. Emulating Non-Differentiable Metrics via Knowledge-Guided Learning: Introducing the Minkowski Image Loss

    cs.LG 2026-04 unverdicted novelty 7.0

    A knowledge-guided framework produces a differentiable surrogate for Minkowski functionals on precipitation images via Lipschitz-constrained CNNs, validated on radar data but revealing a stability-versus-detail trade-...

  33. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  34. Transactional Attention: Semantic Sponsorship for KV-Cache Retention

    cs.CL 2026-04 unverdicted novelty 7.0

    Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

  35. Neighbourhood Transformer: Switchable Attention for Monophily-Aware Graph Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    Neighbourhood Transformers apply local self-attention for monophily-aware graph learning, guarantee expressiveness at least as strong as message-passing GNNs, and outperform prior methods on node classification across...

  36. IAT: Instance-As-Token Compression for Historical User Sequence Modeling in Industrial Recommender Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    IAT compresses each historical interaction instance into a unified embedding token via temporal-order or user-order schemes, allowing standard sequence models to learn long-range preferences with better performance an...

  37. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  38. Trilinear Compute-in-Memory Architecture for Energy-Efficient Transformer Acceleration

    cs.AR 2026-04 unverdicted novelty 7.0

    TrilinearCIM enables complete in-memory Transformer attention computation via DG-FeFET three-operand MAC without runtime NVM reprogramming, delivering up to 46.6% energy reduction and 20.4% latency improvement on BERT...

  39. Multi-Head Attention based interaction-aware architecture for Bangla Handwritten Character Recognition: Introducing a Primary Dataset

    cs.CV 2026-04 accept novelty 7.0

    A new balanced Bangla handwritten character dataset paired with a multi-head attention hybrid model using EfficientNetB3, ViT, and Conformer achieves high accuracy and strong generalization.

  40. Hybrid Fourier Neural Operator for Surrogate Modeling of Laser Processing with a Quantum-Circuit Mixer

    quant-ph 2026-04 unverdicted novelty 7.0

    HQ-LP-FNO replaces part of the spectral channel mixing in a 3D FNO with a mode-shared VQC, reducing parameters by 15.6% and phase-fraction MAE by 26% on laser-processing surrogates while remaining stable under calibra...

  41. Moshi: a speech-text foundation model for real-time dialogue

    eess.AS 2024-09 accept novelty 7.0

    Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.

  42. Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

    cs.LG 2024-05 unverdicted novelty 7.0

    Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

  43. Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation

    cs.CV 2023-10 unverdicted novelty 7.0

    A new shared video-image tokenizer enables large language models to surpass diffusion models on standard visual generation benchmarks.

  44. Segment Anything

    cs.CV 2023-04 unverdicted novelty 7.0

    A promptable model trained on 1B masks achieves competitive zero-shot segmentation performance across tasks and is released publicly with its dataset.

  45. Scalable Diffusion Models with Transformers

    cs.CV 2022-12 unverdicted novelty 7.0

    DiTs achieve SOTA FID of 2.27 on ImageNet 256x256 by scaling transformer-based latent diffusion models, with performance improving consistently as Gflops increase.

  46. DreamFusion: Text-to-3D using 2D Diffusion

    cs.CV 2022-09 accept novelty 7.0

    Optimizes a Neural Radiance Field via probability density distillation from a 2D diffusion model to produce text-conditioned 3D scenes viewable from any angle.

  47. A Generalist Agent

    cs.AI 2022-05 accept novelty 7.0

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  48. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  49. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  50. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    cs.CL 2019-10 accept novelty 7.0

    BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

  51. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    cs.CL 2019-09 accept novelty 7.0

    ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

  52. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  53. Generating Long Sequences with Sparse Transformers

    cs.LG 2019-04 unverdicted novelty 7.0

    Sparse Transformers factorize attention to handle sequences tens of thousands long, achieving new SOTA density modeling on Enwik8, CIFAR-10, and ImageNet-64.

  54. Searching for Activation Functions

    cs.NE 2017-10 conditional novelty 7.0

    Automated search discovers Swish activation f(x) = x * sigmoid(βx) that improves top-1 ImageNet accuracy over ReLU by 0.9% on Mobile NASNet-A and 0.6% on Inception-ResNet-v2.

  55. Q-Flow: Stable and Expressive Reinforcement Learning with Flow-Based Policy

    cs.LG 2026-05 unverdicted novelty 6.0

    Q-Flow enables stable optimization of expressive flow-based policies in RL by propagating terminal values along deterministic flow dynamics to intermediate states for gradient updates without solver unrolling.

  56. Contrastive Learning under Noisy Temporal Self-Supervision for Colonoscopy Videos

    cs.CV 2026-05 unverdicted novelty 6.0

    A noise-aware contrastive loss built on temporal self-supervision learns polyp tracklet representations from 27 videos that outperform prior self-supervised and supervised baselines and match foundation models on retr...

  57. Spatial Adapter: Structured Spatial Decomposition and Closed-Form Covariance for Frozen Predictors

    stat.ML 2026-05 unverdicted novelty 6.0

    The Spatial Adapter equips frozen predictors with a spatially regularized orthonormal basis for residuals and derives a closed-form low-rank-plus-noise covariance for spatial prediction and kriging.

  58. On the global convergence of gradient descent for wide shallow models with bounded nonlinearities

    math.OC 2026-05 unverdicted novelty 6.0

    Gradient descent on wide shallow models with bounded nonlinearities converges globally in the mean-field limit as non-global critical points are unstable under the dynamics.

  59. Adaptive Action Chunking via Multi-Chunk Q Value Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    ACH lets RL policies dynamically pick action chunk lengths by jointly estimating Q-values for all candidate lengths via a single Transformer pass.

  60. Hystar: Hypernetwork-driven Style-adaptive Retrieval via Dynamic SVD Modulation

    cs.CV 2026-05 unverdicted novelty 6.0

    Hystar adapts CLIP-like models to unseen query styles by generating per-input singular-value perturbations with a hypernetwork for attention layers and a new StyleNCE contrastive loss.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 143 Pith papers

  1. Jimmy Ba and Brendan Frey. Adaptive dropout for training deep neural networks. In Neural Information Processing Systems, 2013.

  2. Philip Bachman, Ouais Alsharif, and Doina Precup. Learning with pseudo-ensembles. In Neural Information Processing Systems, 2014.

  3. Amit Choudhury. A simple approximation to the area under standard normal curve. Mathematics and Statistics, 2014.

  4. Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). In International Conference on Learning Representations, 2016.

  5. Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. arXiv, 2015.

  6. Kevin Gimpel, Nathan Schneider, Brendan O'Connor, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah A. Smith. Part-of-speech tagging for Twitter: Annotation, features, and experiments. In Association for Computational Linguistics (ACL), 2011.

  7. Dan Hendrycks and Kevin Gimpel. Adjusting for dropout variance in batch normalization and weight initialization. arXiv, 2016.

  8. John Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences of the USA, 1982.

  9. Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv, 2011.

  10. Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations, 2015.

  11. Alex Krizhevsky. Learning multiple layers of features from tiny images, 2009.

  12. David Krueger, Tegan Maharaj, János Kramár, Mohammad Pezeshki, Nicolas Ballas, Nan Rosemary Ke, Anirudh Goyal, Yoshua Bengio, Hugo Larochelle, Aaron Courville, and Chris Pal. Zoneout: Regularizing RNNs by randomly preserving hidden activations. In Neural Information Processing Systems, 2016.

  13. Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with restarts. arXiv, 2016.

  14. Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In International Conference on Machine Learning, 2013.

  15. Warren S. McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 1943.

  16. Dmytro Mishkin and Jiri Matas. All you need is a good init. In International Conference on Learning Representations, 2016.

  17. Abdelrahman Mohamed, George E. Dahl, and Geoffrey E. Hinton. Acoustic modeling using deep belief networks. IEEE Transactions on Audio, Speech, and Language Processing, 2012.

  18. Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, 2010.

  19. Olutobi Owoputi, Brendan O'Connor, Chris Dyer, Kevin Gimpel, Nathan Schneider, and Noah A. Smith. Improved part-of-speech tagging for online conversational text with word clusters. In North American Chapter of the Association for Computational Linguistics (NAACL), 2013.

  20. Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Neural Information Processing Systems, 2016.

  21. Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In International Conference on Learning Representations, 2014.

  22. Anish Shah, Sameer Shinde, Eashan Kadam, Hena Shah, and Sandip Shingade. Deep residual networks with exponential linear unit. In Vision Net, 2016.

  23. Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 2014.

  24. Nitish Srivastava. Improving neural networks with dropout. University of Toronto, 2013.

  25. Sida I. Wang and Christopher D. Manning. Fast dropout training. In International Conference on Machine Learning, 2013.

  26. Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. British Machine Vision Conference, 2016.