arxiv: 1708.02182 · v1 · pith:BGNOTKMUnew · submitted 2017-08-07 · 💻 cs.CL · cs.LG · cs.NE

Regularizing and Optimizing LSTM Language Models

classification 💻 cs.CL · cs.LG · cs.NE
keywords language · achieve · lstm · modeling · models · networks · neural · optimizing

Recurrent neural networks (RNNs), such as long short-term memory networks (LSTMs), serve as a fundamental building block for many sequence learning tasks, including machine translation, language modeling, and question answering. In this paper, we consider the specific problem of word-level language modeling and investigate strategies for regularizing and optimizing LSTM-based models. We propose the weight-dropped LSTM which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization. Further, we introduce NT-ASGD, a variant of the averaged stochastic gradient method, wherein the averaging trigger is determined using a non-monotonic condition as opposed to being tuned by the user. Using these and other regularization strategies, we achieve state-of-the-art word level perplexities on two data sets: 57.3 on Penn Treebank and 65.8 on WikiText-2. In exploring the effectiveness of a neural cache in conjunction with our proposed model, we achieve an even lower state-of-the-art perplexity of 52.8 on Penn Treebank and 52.0 on WikiText-2.
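
The two techniques named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `patience` parameter, and the rescaling of surviving weights (borrowed from the usual inverted-dropout convention) are assumptions made for illustration. `drop_connect` zeroes individual entries of a hidden-to-hidden weight matrix, and `should_start_averaging` expresses one reading of a non-monotonic trigger: switch to averaging once the newest validation score is worse than the best score seen more than `patience` checks ago.

```python
import random


def drop_connect(weights, drop_prob, rng, training=True):
    """Return a copy of `weights` with individual entries randomly zeroed.

    Unlike ordinary dropout, which masks activations, DropConnect masks
    the weights themselves. Rescaling survivors by 1 / (1 - drop_prob) is
    an assumption here, following the common inverted-dropout convention.
    """
    if not training or drop_prob == 0.0:
        return [row[:] for row in weights]
    keep = 1.0 - drop_prob
    return [
        [w / keep if rng.random() < keep else 0.0 for w in row]
        for row in weights
    ]


def should_start_averaging(val_history, new_val, patience):
    """Non-monotonic trigger sketch (hypothetical helper).

    `val_history` holds earlier validation scores (lower is better),
    oldest first. Averaging starts when `new_val` is worse than the best
    score recorded more than `patience` checks ago.
    """
    if len(val_history) <= patience:
        return False
    older_scores = val_history[: len(val_history) - patience]
    return new_val > min(older_scores)


if __name__ == "__main__":
    rng = random.Random(0)
    hidden_to_hidden = [[1.0, 2.0], [3.0, 4.0]]
    print(drop_connect(hidden_to_hidden, 0.5, rng))
    print(should_start_averaging([60.0, 58.0, 57.0, 57.2, 57.1], 57.3, 2))
```

In a recurrent model the masked matrix would typically be sampled once per forward pass and reused at every time step, so the same connections stay dropped across the whole sequence.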

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

    cs.LG · 2025-10 · unverdicted · novelty 7.0

    RACE Attention is a strictly linear-time attention mechanism that approximates softmax attention outputs using Gaussian projections and soft LSH to enable training on contexts up to 12 million tokens.

  2. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models

    cs.CL · 2019-08 · unverdicted · novelty 7.0

    FinBERT adapts BERT to the financial domain and outperforms prior state-of-the-art methods on financial sentiment analysis tasks.

  3. CTRL: A Conditional Transformer Language Model for Controllable Generation

    cs.CL · 2019-09 · unverdicted · novelty 6.0

    CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

  4. Kite: Automatic speech recognition for unmanned aerial vehicles

    cs.SD · 2019-07 · unverdicted · novelty 5.0

    Introduces a multimodal UAV command dataset and shows image-augmented RNN language models outperform text-only versions despite imperfect training associations.

  5. Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction

    cs.CL · 2019-07 · unverdicted · novelty 3.0

    A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.