Using the Output Embedding to Improve Language Models

Ofir Press, Lior Wolf · 2016 · cs.CL · arXiv 1608.05859

7 Pith papers cite this work. Polarity classification is still indexing.

abstract

We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model. We also offer a new method of regularizing the output embedding. Our methods lead to a significant reduction in perplexity, as we are able to show on a variety of neural network language models. Finally, we show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.
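The weight tying recommended in the abstract can be illustrated with a minimal sketch. This is a toy, not the authors' model: the class `TinyTiedLM`, its methods, and the sizes used are hypothetical names chosen for illustration, and the hidden state is simplified to be the input embedding of the current word (no recurrent or attention layers). It shows only that a tied model reuses one matrix for both the input lookup and the output scoring, so it stores half as many embedding parameters as the untied version.

```python
import math
import random


def make_embedding(vocab_size, hidden_size, seed=0):
    """Return a vocab_size x hidden_size matrix of small random values."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.1) for _ in range(hidden_size)]
            for _ in range(vocab_size)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyTiedLM:
    """Toy next-word model (hypothetical): hidden state = input embedding."""

    def __init__(self, vocab_size, hidden_size, tie_weights=True):
        self.tied = tie_weights
        self.input_embedding = make_embedding(vocab_size, hidden_size, seed=0)
        if tie_weights:
            # Weight tying: the output embedding is the very same matrix.
            self.output_embedding = self.input_embedding
        else:
            # Untied baseline: a second, independent matrix.
            self.output_embedding = make_embedding(vocab_size, hidden_size, seed=1)

    def num_embedding_parameters(self):
        rows = len(self.input_embedding)
        cols = len(self.input_embedding[0])
        return rows * cols if self.tied else 2 * rows * cols

    def next_word_probs(self, word_id):
        # Simplification: a real model would transform this hidden state first.
        hidden = self.input_embedding[word_id]
        # Score every vocabulary word by a dot product with its output row.
        logits = [sum(h * w for h, w in zip(hidden, row))
                  for row in self.output_embedding]
        return softmax(logits)


tied = TinyTiedLM(vocab_size=10, hidden_size=4, tie_weights=True)
untied = TinyTiedLM(vocab_size=10, hidden_size=4, tie_weights=False)
probs = tied.next_word_probs(3)
```

Because the tied model holds a single matrix object, any update to it during training would affect both the input lookup and the output scores, which is the setting the abstract's update-rule analysis refers to.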

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

The Falcon Series of Open Language Models

cs.CL · 2023-11-28 · conditional · novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

CTRL: A Conditional Transformer Language Model for Controllable Generation

cs.CL · 2019-09-11 · unverdicted · novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts

cs.CL · 2019-06-28 · conditional · novelty 6.0

Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.

Attention Is All You Need

cs.CL · 2017-06-12 · unverdicted · novelty 5.0

Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition

eess.AS · 2019-07-13 · unverdicted · novelty 4.0

Knowledge distillation from an external RNN language model to a seq2seq ASR model yields 9.3% CER on Chinese datasets, an 18.42% relative improvement over the baseline without test-time fusion components.

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

cs.LG · 2026-05-27 · unverdicted · novelty 3.0

Presents CosmicFish-HRM, a compact LM using hierarchical recurrent reasoning to adapt computation depth per input.

Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction

cs.CL · 2019-07-06 · unverdicted · novelty 3.0

A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.

citing papers explorer

Showing 7 of 7 citing papers.

The Falcon Series of Open Language Models · cs.CL · 2023-11-28 · conditional · none · ref 150 · internal anchor
CTRL: A Conditional Transformer Language Model for Controllable Generation · cs.CL · 2019-09-11 · unverdicted · none · ref 36 · internal anchor
Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts · cs.CL · 2019-06-28 · conditional · none · ref 20 · internal anchor
Attention Is All You Need · cs.CL · 2017-06-12 · unverdicted · none · ref 30
Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition · eess.AS · 2019-07-13 · unverdicted · none · ref 29 · internal anchor
CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models · cs.LG · 2026-05-27 · unverdicted · none · ref 17 · internal anchor
Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction · cs.CL · 2019-07-06 · unverdicted · none · ref 10 · internal anchor
