Recurrent Neural Network Regularization

Wojciech Zaremba, Ilya Sutskever, Oriol Vinyals · 2014 · cs.NE · arXiv 1409.2329

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

open full Pith review browse 9 citing papers arXiv PDF

abstract

We present a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. Dropout, the most successful technique for regularizing neural networks, does not work well with RNNs and LSTMs. In this paper, we show how to correctly apply dropout to LSTMs, and show that it substantially reduces overfitting on a variety of tasks. These tasks include language modeling, speech recognition, image caption generation, and machine translation.

citation-role summary

baseline 1

citation-polarity summary

baseline 1

representative citing papers

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

cs.LG · 2017-01-23 · accept · novelty 8.0

A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.

Augmenting Self-attention with Persistent Memory

cs.LG · 2019-07-02 · unverdicted · novelty 7.0

Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.

SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition

cs.HC · 2026-05-07 · unverdicted · novelty 7.0

SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.

Pointer Sentinel Mixture Models

cs.CL · 2016-09-26 · conditional · novelty 7.0

Pointer sentinel-LSTM mixes context copying with softmax prediction to reach 70.9 perplexity on Penn Treebank using fewer parameters than standard LSTMs.

Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 6.0 · 2 refs

MarsTSC is a VLM agentic system with generator, reflector, and modifier roles that iteratively refines a knowledge bank to improve few-shot multimodal time series classification and produce human-readable explanations.

Adversarial Learning for Improved Onsets and Frames Music Transcription

cs.SD · 2019-06-20 · unverdicted · novelty 6.0

Adversarial training on time-frequency representations yields consistent gains in frame-level and note-level accuracy over the Onsets and Frames baseline for automatic music transcription.

Online Supervised Learning for Traffic Load Prediction in Framed-ALOHA Networks

cs.NI · 2019-07-25 · unverdicted · novelty 5.0

LSTM online predictor with MOM-based labeling estimates backlog in framed-ALOHA networks and adapts to changing statistics without prior traffic model knowledge.

Wind Estimation Using Quadcopter Motion: A Machine Learning Approach

eess.SP · 2019-07-11 · unverdicted · novelty 5.0

An LSTM neural network trained on simulated quadcopter states estimates turbulent wind velocities with lower mean and variance errors than a tilt-angle wind triangle method.

Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR

eess.AS · 2019-07-08 · unverdicted · novelty 4.0

KLD-based speaker adaptation of seq2seq ASR achieves 25% relative WER reduction, outperforming the 18.7% gain from conventional acoustic model adaptation.

citing papers explorer

Showing 9 of 9 citing papers.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer cs.LG · 2017-01-23 · accept · none · ref 44
A noisy top-k gated mixture-of-experts layer between LSTMs scales neural networks to 137B parameters with sub-linear compute, beating SOTA on language modeling and machine translation.
Augmenting Self-attention with Persistent Memory cs.LG · 2019-07-02 · unverdicted · none · ref 44 · internal anchor
Augmenting self-attention with persistent memory vectors allows removal of feed-forward layers from Transformers without degrading performance on character and word level language modeling benchmarks.
SIGMA-ASL: Sensor-Integrated Multimodal Dataset for Sign Language Recognition cs.HC · 2026-05-07 · unverdicted · none · ref 98
SIGMA-ASL is a multimodal dataset with 93,545 word-level ASL clips from Kinect RGB-D, mmWave radar, and dual IMUs, plus benchmarking protocols for single- and multi-modal recognition.
Pointer Sentinel Mixture Models cs.CL · 2016-09-26 · conditional · none · ref 13
Pointer sentinel-LSTM mixes context copying with softmax prediction to reach 70.9 perplexity on Penn Treebank using fewer parameters than standard LSTMs.
Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 58 · 2 links · internal anchor
MarsTSC is a VLM agentic system with generator, reflector, and modifier roles that iteratively refines a knowledge bank to improve few-shot multimodal time series classification and produce human-readable explanations.
Adversarial Learning for Improved Onsets and Frames Music Transcription cs.SD · 2019-06-20 · unverdicted · none · ref 52 · internal anchor
Adversarial training on time-frequency representations yields consistent gains in frame-level and note-level accuracy over the Onsets and Frames baseline for automatic music transcription.
Online Supervised Learning for Traffic Load Prediction in Framed-ALOHA Networks cs.NI · 2019-07-25 · unverdicted · none · ref 11 · internal anchor
LSTM online predictor with MOM-based labeling estimates backlog in framed-ALOHA networks and adapts to changing statistics without prior traffic model knowledge.
Wind Estimation Using Quadcopter Motion: A Machine Learning Approach eess.SP · 2019-07-11 · unverdicted · none · ref 20 · internal anchor
An LSTM neural network trained on simulated quadcopter states estimates turbulent wind velocities with lower mean and variance errors than a tilt-angle wind triangle method.
Listen, Attend, Spell and Adapt: Speaker Adapted Sequence-to-Sequence ASR eess.AS · 2019-07-08 · unverdicted · none · ref 40 · internal anchor
KLD-based speaker adaptation of seq2seq ASR achieves 25% relative WER reduction, outperforming the 18.7% gain from conventional acoustic model adaptation.

Recurrent Neural Network Regularization

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer