Recognition: unknown
Cloze-driven Pretraining of Self-attention Networks
read the original abstract
We present a new approach for pretraining a bi-directional transformer model that provides significant performance gains across a variety of language understanding problems. Our model solves a cloze-style word reconstruction task, where each word is ablated and must be predicted given the rest of the text. Experiments demonstrate large performance gains on GLUE and new state of the art results on NER as well as constituency parsing benchmarks, consistent with the concurrently introduced BERT model. We also present a detailed analysis of a number of factors that contribute to effective pretraining, including data domain and size, model capacity, and variations on the cloze objective.
This paper has not been read by Pith yet.
Forward citations
Cited by 4 Pith papers
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Multitask Prompted Training Enables Zero-Shot Task Generalization
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
-
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
-
RoBERTa: A Robustly Optimized BERT Pretraining Approach
With better hyperparameters, more data, and longer training, an unchanged BERT-Large architecture matches or exceeds XLNet and other successors on GLUE, SQuAD, and RACE.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.