Using the Output Embedding to Improve Language Models
7 Pith papers cite this work. Polarity classification is still indexing.
abstract
We study the topmost weight matrix of neural network language models. We show that this matrix constitutes a valid word embedding. When training language models, we recommend tying the input embedding and this output embedding. We analyze the resulting update rules and show that the tied embedding evolves in a more similar way to the output embedding than to the input embedding in the untied model. We also offer a new method of regularizing the output embedding. Our methods lead to a significant reduction in perplexity, as we are able to show on a variety of neural network language models. Finally, we show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.
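The tying recommended in the abstract can be illustrated with a minimal sketch. It is a toy under stated assumptions, not the authors' implementation: the class `TinyTiedLM`, its methods, and the helper `untied_embedding_params` are hypothetical names, and the "network" between input and output is reduced to an identity so that only the shared matrix remains visible.

```python
import math
import random


class TinyTiedLM:
    """Toy language model with one shared (tied) embedding matrix (hypothetical)."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # A single V x d matrix serves as both input and output embedding.
        self.embedding = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                          for _ in range(vocab_size)]

    def input_vector(self, token_id):
        # Input side: look up the row for the token.
        return self.embedding[token_id]

    def logits(self, hidden):
        # Output side: score every word by a dot product with the same rows.
        return [sum(w * h for w, h in zip(row, hidden)) for row in self.embedding]

    def next_token_probs(self, token_id):
        # Assumption: the hidden state is just the input vector (no real network).
        hidden = self.input_vector(token_id)
        scores = self.logits(hidden)
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    def num_embedding_params(self):
        return len(self.embedding) * len(self.embedding[0])


def untied_embedding_params(vocab_size, dim):
    # An untied model stores separate input and output matrices.
    return 2 * vocab_size * dim
```

In this toy, tying halves the embedding parameter count (V x d instead of 2 x V x d). How much a full model shrinks depends on how large the embeddings are relative to the rest of the network.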
citation-role summary
roles: background 1
citation-polarity summary
polarities: background 1
citing papers explorer
- The Falcon Series of Open Language Models
  Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
- CTRL: A Conditional Transformer Language Model for Controllable Generation
  CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
- Widening the Representation Bottleneck in Neural Machine Translation with Lexical Shortcuts
  Gated lexical shortcut connections added to the transformer yield 0.9 BLEU average gains on five WMT directions while lowering the lexical content stored in hidden states.
- Attention Is All You Need
- Learn Spelling from Teachers: Transferring Knowledge from Language Models to Sequence-to-Sequence Speech Recognition
  Knowledge distillation from an external RNN language model to a seq2seq ASR model yields 9.3% CER on Chinese datasets, an 18.42% relative improvement over the baseline without test-time fusion components.
- CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models
  Presents CosmicFish-HRM, a compact LM using hierarchical recurrent reasoning to adapt computation depth per input.
- Applying a Pre-trained Language Model to Spanish Twitter Humor Prediction
  A Spanish Twitter language model trained from scratch with label smoothing placed 3rd and 2nd in the HAHA 2019 humor classification and regression tasks.