Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Adam Roberts; Colin Raffel; Katherine Lee; Michael Matena; Noam Shazeer; Peter J. Liu; Sharan Narang; Wei Li; Yanqi Zhou

arXiv: 1910.10683 · v4 · submitted 2019-10-23 · cs.LG · cs.CL · stat.ML


Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3


keywords: transfer learning, text-to-text, transformer, pre-training, natural language processing, summarization, question answering, text classification


The pith

A single text-to-text transformer pre-trained on a large cleaned web corpus reaches state-of-the-art results on many NLP benchmarks when fine-tuned uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how far transfer learning can go in natural language processing by turning every task into the same text-to-text format. It compares pre-training goals, model sizes, data sources, and fine-tuning methods across dozens of tasks. The authors introduce a massive cleaned web dataset and find that scaling up the unified approach produces top performance on summarization, question answering, text classification, and related problems. This shows that one model and training recipe can handle a wide range of language tasks without custom setups for each one.

Core claim

By converting every text-based language problem into a text-to-text format and pre-training a transformer on the Colossal Clean Crawled Corpus with a denoising objective, the resulting model achieves state-of-the-art results on many benchmarks when fine-tuned on downstream tasks covering summarization, question answering, text classification, and more.

What carries the argument

The text-to-text framework that represents every input and output as plain text strings, allowing one transformer architecture and pre-training procedure to serve all tasks.
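A minimal sketch of what "every input and output as plain text strings" can look like in practice. The function names and task keys are hypothetical, the prefixes follow the style of the examples in the paper, and the 0.2 rounding for the similarity score is our reading of how the paper handles STS-B, so check both against the released code before relying on them.

```python
from typing import Dict


def to_text_input(task: str, fields: Dict[str, str]) -> str:
    """Render one example as a single input string with a task prefix.

    The task keys ("translate_en_de", "summarize", ...) are hypothetical
    names chosen for this sketch.
    """
    if task == "translate_en_de":
        return "translate English to German: " + fields["text"]
    if task == "summarize":
        return "summarize: " + fields["text"]
    if task == "cola":
        return "cola sentence: " + fields["sentence"]
    if task == "stsb":
        return "stsb sentence1: {} sentence2: {}".format(
            fields["sentence1"], fields["sentence2"]
        )
    raise ValueError("unknown task: " + task)


def score_to_text(score: float) -> str:
    """Turn a regression target into a string the model can generate.

    Assumption: round to the nearest 0.2, then print with one decimal.
    """
    return "{:.1f}".format(round(score * 5) / 5)


cola_input = to_text_input("cola", {"sentence": "The course is jumping well."})
stsb_target = score_to_text(2.57)
print(cola_input)   # cola sentence: The course is jumping well.
print(stsb_target)  # 2.6
```

Because both classification labels and regression scores become short strings, the same decoder and loss can be used for every task, which is the property the argument depends on.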

If this is right

One pre-trained model can be adapted to many tasks without designing separate architectures for each.
Larger model scale combined with cleaner and larger unlabeled data improves transfer performance across benchmarks.
Systematic comparison of pre-training objectives and data sources identifies which choices transfer most effectively.
Releasing the pre-trained models, new dataset, and code allows direct reuse and extension by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The uniform format may reduce the engineering effort needed to apply models to new language problems.
If the text-to-text approach works across many tasks, it could simplify evaluation and comparison of future models.
The success with web-scale cleaned data suggests that data quality and volume matter as much as model architecture for transfer.

Load-bearing premise

Converting every language task into a text-to-text generation problem preserves all necessary information for solving the original task.

What would settle it

A language task where even a very large text-to-text model, after fine-tuning, scores substantially below the best task-specific models on standard metrics.

Original abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

T5 unifies NLP tasks as text-to-text, runs controlled ablations on objectives and scale, and releases everything to back its SOTA claims.


The main thing to know is that this paper unifies a wide range of NLP tasks under a single text-to-text framework and backs it up with systematic experiments on objectives, model sizes, and data sources. What they do well is run controlled comparisons. They test several pre-training objectives on the same setup, compare encoder-decoder to other architectures, and introduce the Colossal Clean Crawled Corpus as a new unlabeled dataset. Combining these with larger models leads to state-of-the-art numbers on tasks like summarization and question answering. The decision to release the pre-trained models, code, and the C4 data makes the claims easier to check and build upon. The softer parts are around the data preparation. The cleaning heuristics for C4 are described but not deeply ablated, so it's hard to know how sensitive the results are to those choices. Also, while they show the text-to-text approach works, it's not always clear how much information is lost when forcing every task into generation format, though their results suggest it's minimal for the tasks they test. The SOTA claims are benchmark-specific and could shift with new test sets or different fine-tuning protocols. This is worth the time for anyone in NLP transfer learning or scaling laws. The empirical work is thorough enough and the releases add real value, so it deserves a serious referee.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces T5, a unified text-to-text transformer framework that reformulates all NLP tasks as sequence-to-sequence generation problems. It conducts a systematic empirical study comparing pre-training objectives (e.g., span corruption), model architectures (encoder-decoder vs. decoder-only), unlabeled datasets, and transfer methods across dozens of tasks. By scaling models up to 11B parameters and pre-training on the new Colossal Clean Crawled Corpus (C4), the authors report state-of-the-art results on benchmarks spanning summarization, question answering, text classification, and more, while releasing the models, code, and C4 dataset.

Significance. If the reported results hold under independent verification, the work is significant for establishing a simple, scalable, and unified approach to transfer learning that outperforms prior specialized methods. The thorough controlled ablations isolating the contributions of objective, architecture, and data, combined with the public release of artifacts, provide a strong foundation for future research and reproducibility in NLP.

major comments (2)

[§4.2, Table 7] §4.2 and Table 7: The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given known variance in fine-tuning large models, this weakens the strength of the cross-task superiority claims.
[§3.4] §3.4: The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when allowing each objective its own optimal hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.

minor comments (3)

[§2] The model size nomenclature (small, base, large, 3B, 11B) is introduced gradually; a single summary table early in §2 or §3 would improve readability.
[Figure 3] Figure 3 (scaling curves): The x-axis for parameter count is logarithmic but the tick labels and legend could be enlarged for clarity in print.
[Appendix A.3] Appendix A.3 on C4 cleaning heuristics is detailed, but a short paragraph in the main text summarizing the key filtering steps would help readers without requiring appendix consultation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address each major comment below.


Referee: [§4.2, Table 7] §4.2 and Table 7: The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given known variance in fine-tuning large models, this weakens the strength of the cross-task superiority claims.

Authors: We acknowledge that reporting standard deviations from multiple random seeds would provide stronger statistical support for the SOTA claims. Due to the prohibitive computational expense of repeated fine-tuning runs for models up to 11B parameters, we reported single-run results for the primary GLUE and SuperGLUE numbers. The observed gains are large in magnitude and consistent across dozens of tasks and model scales, which reduces the likelihood that they arise from random seed variance alone. In the revised manuscript we will add a brief discussion in §4.2 noting the single-run protocol and referencing prior studies on fine-tuning variance.
revision: partial

Referee: [§3.4] §3.4: The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when allowing each objective its own optimal hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.

Authors: We deliberately held compute budgets fixed across objectives to isolate the effect of the pre-training task itself rather than differences in training duration or hyperparameter optimization. This controlled design is standard for large-scale ablation studies. While we did not conduct per-objective hyperparameter sweeps or extended training, span corruption produced clear and consistent gains under the equal-compute regime. We will revise §3.4 to explicitly state this rationale and note that further per-objective optimization remains an interesting direction for future work.
revision: partial
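
The seed-variance reporting requested in the first major comment comes down to a small calculation: fine-tune the same checkpoint with several random seeds and report the mean and sample standard deviation of the resulting scores. The sketch below shows only that arithmetic. The scores are hypothetical placeholders, not results from the paper, and `summarize_runs` is a name invented here.

```python
import statistics
from typing import List, Tuple


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    # statistics.stdev uses the n-1 (sample) denominator.
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical benchmark scores from three fine-tuning seeds (NOT real data).
hypothetical_scores = [88.0, 89.0, 90.0]
mean, std = summarize_runs(hypothetical_scores)
print("{:.1f} +/- {:.1f}".format(mean, std))  # 89.0 +/- 1.0
```

A reported gap between two systems is easier to trust when it is clearly larger than a spread of this kind, which is the check a single run cannot provide.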

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained


The paper conducts a large-scale empirical exploration of transfer learning by reformulating NLP tasks as text-to-text problems, systematically ablating pre-training objectives, architectures, data sources, and scaling behaviors across dozens of benchmarks. All central claims (including SOTA results) derive from direct experimental measurements on the released C4 corpus and models rather than from any closed-form derivations, fitted parameters renamed as predictions, or self-citation chains. No equations or uniqueness theorems are invoked that reduce the reported outcomes to inputs by construction; the work is therefore independent and verifiable through the provided artifacts.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical comparisons rather than closed-form derivations. Free parameters include model scale choices, pre-training objective variants, and the specific data-cleaning rules used to construct the Colossal Clean Crawled Corpus. The work assumes standard transformer inductive biases and the effectiveness of transfer from unlabeled pre-training.

free parameters (3)

model scale (small to 11B parameters)
Different parameter counts are trained and compared; performance depends on these choices.
pre-training objective variants
Multiple objectives (e.g., span corruption) are selected and evaluated; results are sensitive to which are used.
C4 data-cleaning heuristics
Rules for filtering the crawled corpus are introduced and affect the pre-training data distribution.
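
For readers unfamiliar with the span-corruption objective named in the list above, here is a minimal sketch under stated assumptions. Chosen spans of the input are replaced by sentinel tokens, and the target is the sequence of dropped spans, each introduced by its sentinel and closed by one final sentinel. The spans are picked by hand here, whereas a real pipeline samples them randomly. The `<extra_id_N>` sentinel spelling and the function name `span_corrupt` are illustrative choices, not taken from the page above.

```python
from typing import List, Tuple


def span_corrupt(
    tokens: List[str], spans: List[Tuple[int, int]]
) -> Tuple[List[str], List[str]]:
    """Build (inputs, targets) for a span-corruption style denoising example.

    `spans` must be sorted, non-overlapping [start, end) index pairs.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        sentinel = "<extra_id_{}>".format(i)
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # one sentinel replaces the span
        targets.append(sentinel)             # target names the sentinel ...
        targets.extend(tokens[start:end])    # ... then the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep text after the last span
    targets.append("<extra_id_{}>".format(len(spans)))  # closing sentinel
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
# Thank you <extra_id_0> me to your party <extra_id_1> week .
print(" ".join(targets))
# <extra_id_0> for inviting <extra_id_1> last <extra_id_2>
```

The targets contain only the dropped tokens rather than the full sentence, so target sequences stay short relative to the input.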

axioms (1)

domain assumption: Pre-training on large unlabeled text followed by fine-tuning improves performance on downstream language tasks
Invoked as the foundation for all transfer experiments in the abstract.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL 2022-01 accept novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
Discovering Latent Knowledge in Language Models Without Supervision
cs.CL 2022-12 conditional novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
Show Your Work: Scratchpads for Intermediate Computation with Language Models
cs.LG 2021-11 unverdicted novelty 8.0

Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
cs.CL 2020-12 conditional novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
Generative Language Modeling for Automated Theorem Proving
cs.LG 2020-09 unverdicted novelty 8.0

GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
Measuring Massive Multitask Language Understanding
cs.CY 2020-09 accept novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
Dense Passage Retrieval for Open-Domain Question Answering
cs.CL 2020-04 accept novelty 8.0

Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.
REALM: Retrieval-Augmented Language Model Pre-Training
cs.CL 2020-02 accept novelty 8.0

REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
cs.SD 2026-05 unverdicted novelty 7.0

Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training align...
TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data
cs.LG 2026-05 unverdicted novelty 7.0

TabPFN-MT is a multitask in-context learner for tabular data that sets a new state-of-the-art on deep multitask learning for datasets under 1000 samples while reducing inference cost from O(T) to O(1) passes.
Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 7.0

Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
cs.LG 2026-05 unverdicted novelty 7.0

Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
SWAN: Semantic Watermarking with Abstract Meaning Representation
cs.CL 2026-05 unverdicted novelty 7.0

SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
cs.MM 2026-04 unverdicted novelty 7.0

AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
q-bio.QM 2026-04 unverdicted novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
Unlocking Prompt Infilling Capability for Diffusion Language Models
cs.CL 2026-04 unverdicted novelty 7.0

Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
cs.DC 2026-04 unverdicted novelty 7.0

Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
RELIC: Evaluating Complex Reasoning via the Recognition of Languages In-Context
cs.CL 2025-06 unverdicted novelty 7.0

RELIC benchmark reveals that advanced LLMs fail to scale reasoning compute with task difficulty in context-free language recognition and instead reduce reasoning tokens while shifting to guessing strategies.
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
cs.LG 2025-02 unverdicted novelty 7.0

Neurons exhibit concept-conditioned activation ranges forming Gaussian-like distributions with minimal overlap, and range-based interventions via NeuronLens outperform neuron-level masking in targeted manipulation wit...
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
cs.CV 2025-01 unverdicted novelty 7.0

PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
cs.CL 2024-02 unverdicted novelty 7.0

BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
cs.LG 2022-08 conditional novelty 7.0

LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
Flamingo: a Visual Language Model for Few-Shot Learning
cs.CV 2022-04 unverdicted novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
cs.LG 2021-01 accept novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
GraphCodeBERT: Pre-training Code Representations with Data Flow
cs.SE 2020-09 accept novelty 7.0

GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.
Learning to summarize from human feedback
cs.CL 2020-09 conditional novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
cs.CL 2020-05 accept novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
Unsupervised Cross-lingual Representation Learning at Scale
cs.CL 2019-11 conditional novelty 7.0

XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
cs.CL 2019-09 accept novelty 7.0

ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
cs.CL 2019-09 unverdicted novelty 7.0

Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
FOL2NS: Generating Natural Sentences from First-Order Logic
cs.CL 2026-05 unverdicted novelty 6.0

FOL2NS generates synthetic first-order logic formulas with varying quantifier depths and translates them into natural language sentences via a hybrid rule-driven and fine-tuned language model approach.
Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
cs.DC 2026-05 unverdicted novelty 6.0

Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-stalenes...
Block-Based Double Decoders
cs.LG 2026-05 unverdicted novelty 6.0

Block-based double decoders achieve full supervision in pretraining like decoder-only models and efficient inference like encoder-decoders through doubly-causal block-based attention masks, outperforming encoder-decod...
Theory-optimal Quantization Based on Flatness
cs.LG 2026-05 unverdicted novelty 6.0

The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...
The two clocks and the innovation window: When and how generative models learn rules
cs.LG 2026-05 unverdicted novelty 6.0

Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
Diversity in Large Language Models under Supervised Fine-Tuning
cs.LG 2026-04 unverdicted novelty 6.0

TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
cs.CL 2026-04 unverdicted novelty 6.0

A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
cs.LG 2026-04 unverdicted novelty 6.0

RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
cs.RO 2025-12 unverdicted novelty 6.0

mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation
cs.CL 2025-12 unverdicted novelty 6.0

Progress Ratio Embeddings use a trigonometric progress-ratio signal to deliver stable length control in transformers that generalizes to unseen target lengths.
SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
cs.CV 2025-10 unverdicted novelty 6.0

SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.
Should We Still Pretrain Encoders with Masked Language Modeling?
cs.CL 2025-07 accept novelty 6.0

Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improv...
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
cs.LG 2024-12 unverdicted novelty 6.0

FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
cs.CL 2024-11 unverdicted novelty 6.0

The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
cs.AI 2024-08 unverdicted novelty 6.0

A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
cs.AI 2024-08 conditional novelty 6.0

Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
The Falcon Series of Open Language Models
cs.CL 2023-11 conditional novelty 6.0

Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
Vision Transformers Need Registers
cs.CV 2023-09 unverdicted novelty 6.0

Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG 2023-09 accept novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Can AI-Generated Text be Reliably Detected?
cs.CL 2023-03 unverdicted novelty 6.0

Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between hum...
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
cs.LG 2023-03 unverdicted novelty 6.0

SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
cs.AI 2023-01 conditional novelty 6.0

The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
Atlas: Few-shot Learning with Retrieval Augmented Language Models
cs.CL 2022-08 unverdicted novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
Efficient Training of Language Models to Fill in the Middle
cs.CL 2022-07 unverdicted novelty 6.0

Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
Language Models (Mostly) Know What They Know
cs.CL 2022-07 unverdicted novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
cs.CV 2022-06 unverdicted novelty 6.0

Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
CoCa: Contrastive Captioners are Image-Text Foundation Models
cs.CV 2022-05 accept novelty 6.0

CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 84 Pith papers · 22 internal anchors

[1]

Memory-efﬁcient adaptive optimiza- tion for large-scale learning

Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning.arXiv preprint arXiv:1901.11150,

work page arXiv 1901
[2]

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multi- lingual neural machine translation in the wild: Findings and challenges.arXiv preprint arXiv:1907.05019,

work page Pith review arXiv 1907
[3]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Cloze-driven Pretraining of Self-attention Networks

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785,

work page Pith review arXiv 1903
[5]

Simple, scalable adaptation for neural machine translation

Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural machine translation. arXiv preprint arXiv:1909.08478,

work page arXiv 1909
[6]

SciBERT: A pretrained language model for scientific text

Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),

work page 2019
[7]

Findings of the 2014 workshop on statistical machine translation

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation,

work page 2014
[8]

Findings of the 2015 workshop on statistical machine translation

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, et al. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation,

work page 2015
[9]

Findings of the 2016 conference on machine translation

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation,

work page 2016
[10]

Generating Sentences from a Continuous Space

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349,

work page Pith review arXiv
[11]

SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity - multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,

work page Pith review arXiv 2017
[12]

Long Short-Term Memory-Networks for Machine Reading

Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733,

work page Pith review arXiv
[13]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

work page internal anchor Pith review arXiv 1905
[14]

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555,

work page internal anchor Pith review arXiv 2003
[15]

SentEval: An Evaluation Toolkit for Universal Sentence Representations

Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449,

work page Pith review arXiv
[16]

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364,

work page Pith review arXiv
[17]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Unified language model pre-training for natural language understanding and generation

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197,

work page arXiv 1905
[19]

Understanding Back-Translation at Scale

Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381,

work page Pith review arXiv
[20]

Learning Word Vectors for 157 Languages

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893,

work page Pith review arXiv
[21]

Generating Sequences With Recurrent Neural Networks

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850,

work page Pith review arXiv
[22]

Rethinking ImageNet Pre-training

Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. arXiv preprint arXiv:1811.08883,

work page Pith review arXiv
[23]

A hybrid neural network model for commonsense reasoning

Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning. arXiv preprint arXiv:1907.11983,

work page arXiv 1907
[24]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409,

work page internal anchor Pith review Pith/arXiv arXiv
[25]

Learning distributed representations of sentences from unlabelled data

Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483,

work page arXiv
[26]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv
[27]

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751,

work page Pith review arXiv 1902
[28]

Universal Language Model Fine-tuning for Text Classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,

work page Pith review arXiv
[29]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: Generating music with long-term structure. In Seventh International Conference on Learning Representations, 2018a. Yanping ...

work page Pith review arXiv
[30]

TinyBERT: Distilling BERT for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351,

work page arXiv 1909
[31]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

SpanBERT: Improving pre-training by representing and predicting spans

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529,

work page arXiv 1907
[33]

Exploring the Limits of Language Modeling

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410,

work page Pith review arXiv
[34]

CTRL: A Conditional Transformer Language Model for Controllable Generation

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019a. Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering and text classification via span extraction. arXiv preprint ...

work page internal anchor Pith review arXiv 1909
[35]

A surprisingly robust trick for winograd schema challenge

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for Winograd schema challenge. arXiv preprint arXiv:1905.06290,

work page arXiv 1905
[36]

Federated Optimization: Distributed Optimization Beyond the Datacenter

Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575,

work page Pith review arXiv
[37]

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492,

work page internal anchor Pith review arXiv
[38]

Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974,

work page Pith review arXiv
[39]

One weird trick for parallelizing convolutional neural networks

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997,

work page Pith review arXiv
[40]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959,

work page Pith review arXiv
[41]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226,

work page internal anchor Pith review arXiv
[42]

Cross-lingual Language Model Pretraining

Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291,

work page Pith review arXiv 1901
[43]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942,

work page internal anchor Pith review arXiv 1909
[44]

Generating Wikipedia by Summarizing Long Sequences

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198,

work page Pith review arXiv
[45]

SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders

Peter J. Liu, Yu-An Chung, and Jie Ren. SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders. arXiv preprint arXiv:1910.00998, 2019a. Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation learning using multi-task deep neural networks for se...

work page arXiv 1910
[46]

Multi-Task Deep Neural Networks for Natural Language Understanding

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019b. Yang Liu. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318,

work page Pith review arXiv 1901
[47]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019c. Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893,

work page internal anchor Pith review Pith/arXiv arXiv 1907
[48]

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,

work page Pith review arXiv
[49]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing system...

work page internal anchor Pith review Pith/arXiv arXiv
[50]

A Deep Reinforced Model for Abstractive Summarization

Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304,

work page Pith review arXiv
[51]

GloVe: Global vectors for word representation

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),

work page 2014
[52]

Matthew Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987,

work page Pith review arXiv 1903
[53]

Deep contextualized word representations

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,

work page Pith review arXiv
[54]

Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088,

work page Pith review arXiv
[55]

WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations

Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: 10,000 example pairs for evaluating context-sensitive representations. arXiv preprint arXiv:1808.09121,

work page Pith review arXiv
[56]

A Call for Clarity in Reporting BLEU Scores

Matt Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771,

work page Pith review arXiv
[57]

Resolving complex cases of definite pronouns: the Winograd schema challenge

Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics,

work page 2012
[58]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250,

work page internal anchor Pith review arXiv
[59]

Unsupervised Pretraining for Sequence to Sequence Learning

Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683,

work page Pith review arXiv
[60]

An Overview of Multi-Task Learning in Deep Neural Networks

Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098,

work page internal anchor Pith review arXiv
[61]

Transfer learning in natural language processing

Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18,

work page 2019
[62]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[63]

Get To The Point: Summarization with Pointer-Generator Networks

Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368,

work page Pith review arXiv
[64]

Neural Machine Translation of Rare Words with Subword Units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909,

work page internal anchor Pith review arXiv
[65]

Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600,

work page Pith review arXiv
[66]

Self-Attention with Relative Position Representations

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155,

work page Pith review arXiv
[67]

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235,

work page Pith review arXiv
[68]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

work page internal anchor Pith review Pith/arXiv arXiv
[69]

Recursive deep models for semantic compositionality over a sentiment treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing,

work page 2013
[70]

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450,

work page Pith review arXiv 1905
[71]

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079,

work page Pith review arXiv
[72]

A simple method for commonsense reasoning

Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847,

work page arXiv
[73]

NewsQA: A Machine Comprehension Dataset

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830,

work page Pith review arXiv
[74]

The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives

Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380,

work page arXiv 1909
[75]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

work page internal anchor Pith review arXiv
[76]

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, et al. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019a. Alex Wang, Y...

work page internal anchor Pith review arXiv 1905
[77]

Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426,

work page Pith review arXiv
[78]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,

work page internal anchor Pith review arXiv
[79]

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237,

work page internal anchor Pith review arXiv 1906
[80]

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541,

work page Pith review arXiv

Showing first 80 references.

[1] [1]

Memory-efﬁcient adaptive optimiza- tion for large-scale learning

Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning.arXiv preprint arXiv:1901.11150,

work page arXiv 1901

[2] [2]

Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multi- lingual neural machine translation in the wild: Findings and challenges.arXiv preprint arXiv:1907.05019,

work page Pith review arXiv 1907

[3] [3]

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Cloze-driven Pretraining of Self-attention Networks

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze- driven pretraining of self-attention networks.arXiv preprint arXiv:1903.07785,

work page Pith review arXiv 1903

[5] [5]

Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,

Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,

work page arXiv 1909

[6] [6]

SciBERT: A pretrained language model for scientific text

Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),

work page 2019

[7] [7]

Findings of the 2014 workshop on statistical machine translation

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Jo- hannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. InProceedings of the Ninth Workshop on Statistical Machine Translation,

work page 2014

[8] [8]

Findings of the 2015 workshop on statistical machine translation

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, et al. Findings of the 2015 workshop on statistical machine translation. InProceedings of the Tenth Workshop on Statistical Machine Translation,

work page 2015

[9] [9]

Findings of the 2016 conference on machine translation

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation. InProceedings of the First Conference on Machine Translation,

work page 2016

[10] [10]

Generating Sentences from a Continuous Space

Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space.arXiv preprint arXiv:1511.06349,

work page Pith review arXiv

[11] [11]

SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,

work page Pith review arXiv 2017

[12] [12]

Long Short-Term Memory-Networks for Machine Reading

Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading.arXiv preprint arXiv:1601.06733,

work page Pith review arXiv

[13] [13]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

work page internal anchor Pith review arXiv 1905

[14] [14]

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555,

work page internal anchor Pith review arXiv 2003

[15] [15]

SentEval: An Evaluation Toolkit for Universal Sentence Representations

Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449,

work page Pith review arXiv

[16] [16]

Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Super- vised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364,

work page Pith review arXiv

[17] [17]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre- training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Uniﬁed language model pre- training for natural language understanding and gen- eration

Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation.arXiv preprint arXiv:1905.03197,

work page arXiv 1905

[19] [19]

Understanding Back-Translation at Scale

59 Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li and Liu Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381,

work page Pith review arXiv

[20] [20]

Learning Word Vectors for 157 Languages

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages.arXiv preprint arXiv:1802.06893,

work page Pith review arXiv

[21] [21]

Generating Sequences With Recurrent Neural Networks

Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850,

work page Pith review arXiv

[22] [22]

Rethinking ImageNet Pre-training

Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training.arXiv preprint arXiv:1811.08883,

work page Pith review arXiv

[23] [23]

A hybrid neural network model for commonsense reasoning

Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning.arXiv preprint arXiv:1907.11983,

work page arXiv 1907

[24] [24]

Deep Learning Scaling is Predictable, Empirically

Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409,

work page internal anchor Pith review Pith/arXiv arXiv

[25] [25]

Learning distributed representations of sentences from unlabelled data.arXiv preprint arXiv:1602.03483,

Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data.arXiv preprint arXiv:1602.03483,

work page arXiv

[26] [26]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

work page internal anchor Pith review Pith/arXiv arXiv

[27] [27]

Parameter-Efficient Transfer Learning for NLP

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP.arXiv preprint arXiv:1902.00751,

work page Pith review arXiv 1902

[28] [28]

Universal Language Model Fine-tuning for Text Classification

Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classifi- cation. arXiv preprint arXiv:1801.06146,

work page Pith review arXiv

[29] [29]

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Dou- glas Eck. Music transformer: Generating music with long-term structure. InSeventh International Conference on Learning Representations, 2018a. 60 Exploring the Limits of Transfer Learning Yanping ...

work page Pith review arXiv

[30] [30]

Tinybert: Distilling bert for natural language understanding

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding.arXiv preprint arXiv:1909.10351,

work page arXiv 1909

[31]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,

[32]

SpanBERT: Improving pre-training by representing and predicting spans

Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529,

[33]

Exploring the Limits of Language Modeling

Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410,

[34]

CTRL: A Conditional Transformer Language Model for Controllable Generation

Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019a.

Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering and text classification via span extraction. arXiv preprint ...

[35]

A surprisingly robust trick for Winograd schema challenge

Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for Winograd schema challenge. arXiv preprint arXiv:1905.06290,

[36]

Federated Optimization: Distributed Optimization Beyond the Datacenter

Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575,

[37]

Federated Learning: Strategies for Improving Communication Efficiency

Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492,

[38]

Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974,

[39]

One weird trick for parallelizing convolutional neural networks

Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997,

[40]

Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959,

[41]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226,

[42]

Cross-lingual Language Model Pretraining

Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291,

[43]

ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942,

[44]

Generating Wikipedia by Summarizing Long Sequences

Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198,

[45]

SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders

Peter J. Liu, Yu-An Chung, and Jie Ren. SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders. arXiv preprint arXiv:1910.00998, 2019a.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation learning using multi-task deep neural networks for se...

[46]

Multi-Task Deep Neural Networks for Natural Language Understanding

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019b.

Yang Liu. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318,

[47]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019c.

Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893,

[48]

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,

[49]

Efficient Estimation of Word Representations in Vector Space

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing system...

[50]

A Deep Reinforced Model for Abstractive Summarization

Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304,

[51]

GloVe: Global vectors for word representation

Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),

[52]

Matthew Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987,

[53]

Deep contextualized word representations

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,

[54]

Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088,

[55]

WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations

Mohammad Taher Pilehvar and Jose Camacho-Collados. WIC: 10,000 example pairs for evaluating context-sensitive representations. arXiv preprint arXiv:1808.09121,

[56]

A Call for Clarity in Reporting BLEU Scores

Matt Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771,

[57]

Resolving complex cases of definite pronouns: the Winograd schema challenge

Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics,

[58]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250,

[59]

Unsupervised Pretraining for Sequence to Sequence Learning

Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683,

[60]

An Overview of Multi-Task Learning in Deep Neural Networks

Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098,

[61]

Transfer learning in natural language processing

Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18,

[62]

DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

[63]

Get To The Point: Summarization with Pointer-Generator Networks

Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368,

[64]

Neural Machine Translation of Rare Words with Subword Units

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909,

[65]

Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600,

[66]

Self-Attention with Relative Position Representations

Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155,

[67]

Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235,

[68]

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

[69]

Recursive deep models for semantic compositionality over a sentiment treebank

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing,

[70]

MASS: Masked Sequence to Sequence Pre-training for Language Generation

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450,

[71]

Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079,

[72]

A simple method for commonsense reasoning

Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847,

[73]

NewsQA: A Machine Comprehension Dataset

Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830,

[74]

The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives

Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380,

[75]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

[76]

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, et al. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019a.

Alex Wang, Y...

[77]

Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426,

[78]

Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,

[79]

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237,

[80]

Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541,