Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3
The pith
A single text-to-text transformer pre-trained on a large cleaned web corpus reaches state-of-the-art results on many NLP benchmarks when fine-tuned uniformly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting every text-based language problem into a text-to-text format and pre-training a transformer on the Colossal Clean Crawled Corpus with a denoising objective, the resulting model achieves state-of-the-art results on many benchmarks when fine-tuned on downstream tasks covering summarization, question answering, text classification, and more.
What carries the argument
The text-to-text framework that represents every input and output as plain text strings, allowing one transformer architecture and pre-training procedure to serve all tasks.
If this is right
- One pre-trained model can be adapted to many tasks without designing separate architectures for each.
- Larger model scale combined with cleaner and larger unlabeled data improves transfer performance across benchmarks.
- Systematic comparison of pre-training objectives and data sources identifies which choices transfer most effectively.
- Releasing the pre-trained models, new dataset, and code allows direct reuse and extension by others.
Where Pith is reading between the lines
- The uniform format may reduce the engineering effort needed to apply models to new language problems.
- If the text-to-text approach works across many tasks, it could simplify evaluation and comparison of future models.
- The success with web-scale cleaned data suggests that data quality and volume matter as much as model architecture for transfer.
Load-bearing premise
Converting every language task into a text-to-text generation problem preserves all necessary information for solving the original task.
What would settle it
A language task where even a very large text-to-text model, after fine-tuning, scores substantially below the best task-specific models on standard metrics.
read the original abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new ``Colossal Clean Crawled Corpus'', we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces T5, a unified text-to-text transformer framework that reformulates all NLP tasks as sequence-to-sequence generation problems. It conducts a systematic empirical study comparing pre-training objectives (e.g., span corruption), model architectures (encoder-decoder vs. decoder-only), unlabeled datasets, and transfer methods across dozens of tasks. By scaling models up to 11B parameters and pre-training on the new Colossal Clean Crawled Corpus (C4), the authors report state-of-the-art results on benchmarks spanning summarization, question answering, text classification, and more, while releasing the models, code, and C4 dataset.
Significance. If the reported results hold under independent verification, the work is significant for establishing a simple, scalable, and unified approach to transfer learning that outperforms prior specialized methods. The thorough controlled ablations isolating the contributions of objective, architecture, and data, combined with the public release of artifacts, provide a strong foundation for future research and reproducibility in NLP.
major comments (2)
- [§4.2, Table 7] §4.2 and Table 7: The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given known variance in fine-tuning large models, this weakens the strength of the cross-task superiority claims.
- [§3.4] §3.4: The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when allowing each objective its own optimal hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.
minor comments (3)
- [§2] The model size nomenclature (small, base, large, 3B, 11B) is introduced gradually; a single summary table early in §2 or §3 would improve readability.
- [Figure 3] Figure 3 (scaling curves): The x-axis for parameter count is logarithmic but the tick labels and legend could be enlarged for clarity in print.
- [Appendix A.3] Appendix A.3 on C4 cleaning heuristics is detailed, but a short paragraph in the main text summarizing the key filtering steps would help readers without requiring appendix consultation.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address each major comment below.
read point-by-point responses
-
Referee: [§4.2, Table 7] §4.2 and Table 7: The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given known variance in fine-tuning large models, this weakens the strength of the cross-task superiority claims.
Authors: We acknowledge that reporting standard deviations from multiple random seeds would provide stronger statistical support for the SOTA claims. Due to the prohibitive computational expense of repeated fine-tuning runs for models up to 11B parameters, we reported single-run results for the primary GLUE and SuperGLUE numbers. The observed gains are large in magnitude and consistent across dozens of tasks and model scales, which reduces the likelihood that they arise from random seed variance alone. In the revised manuscript we will add a brief discussion in §4.2 noting the single-run protocol and referencing prior studies on fine-tuning variance. revision: partial
-
Referee: [§3.4] §3.4: The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when allowing each objective its own optimal hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.
Authors: We deliberately held compute budgets fixed across objectives to isolate the effect of the pre-training task itself rather than differences in training duration or hyperparameter optimization. This controlled design is standard for large-scale ablation studies. While we did not conduct per-objective hyperparameter sweeps or extended training, span corruption produced clear and consistent gains under the equal-compute regime. We will revise §3.4 to explicitly state this rationale and note that further per-objective optimization remains an interesting direction for future work. revision: partial
Circularity Check
No significant circularity; empirical results are self-contained
full rationale
The paper conducts a large-scale empirical exploration of transfer learning by reformulating NLP tasks as text-to-text problems, systematically ablating pre-training objectives, architectures, data sources, and scaling behaviors across dozens of benchmarks. All central claims (including SOTA results) derive from direct experimental measurements on the released C4 corpus and models rather than from any closed-form derivations, fitted parameters renamed as predictions, or self-citation chains. No equations or uniqueness theorems are invoked that reduce the reported outcomes to inputs by construction; the work is therefore independent and verifiable through the provided artifacts.
Axiom & Free-Parameter Ledger
free parameters (3)
- model scale (small to 11B parameters)
- pre-training objective variants
- C4 data-cleaning heuristics
axioms (1)
- domain assumption Pre-training on large unlabeled text followed by fine-tuning improves performance on downstream language tasks
Forward citations
Cited by 60 Pith papers
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Discovering Latent Knowledge in Language Models Without Supervision
An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
-
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Generative Language Modeling for Automated Theorem Proving
GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
Dense Passage Retrieval for Open-Domain Question Answering
Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.
-
REALM: Retrieval-Augmented Language Model Pre-Training
REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.
-
Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators
Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training align...
-
TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data
TabPFN-MT is a multitask in-context learner for tabular data that sets a new state-of-the-art on deep multitask learning for datasets under 1000 samples while reducing inference cost from O(T) to O(1) passes.
-
Online Learning-to-Defer with Varying Experts
Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
SWAN: Semantic Watermarking with Abstract Meaning Representation
SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
RELIC: Evaluating Complex Reasoning via the Recognition of Languages In-Context
RELIC benchmark reveals that advanced LLMs fail to scale reasoning compute with task difficulty in context-free language recognition and instead reduce reasoning tokens while shifting to guessing strategies.
-
Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution
Neurons exhibit concept-conditioned activation ranges forming Gaussian-like distributions with minimal overlap, and range-based interventions via NeuronLens outperform neuron-level masking in targeted manipulation wit...
-
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.
-
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
GraphCodeBERT: Pre-training Code Representations with Data Flow
GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.
-
Learning to summarize from human feedback
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
-
Unsupervised Cross-lingual Representation Learning at Scale
XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
-
FOL2NS: Generating Natural Sentences from First-Order Logic
FOL2NS generates synthetic first-order logic formulas with varying quantifier depths and translates them into natural language sentences via a hybrid rule-driven and fine-tuned language model approach.
-
Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training
Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-stalenes...
-
Block-Based Double Decoders
Block-based double decoders achieve full supervision in pretraining like decoder-only models and efficient inference like encoder-decoders through doubly-causal block-based attention masks, outperforming encoder-decod...
-
Theory-optimal Quantization Based on Flatness
The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
-
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
-
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
-
Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation
Progress Ratio Embeddings use a trigonometric progress-ratio signal to deliver stable length control in transformers that generalizes to unseen target lengths.
-
SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.
-
Should We Still Pretrain Encoders with Masked Language Modeling?
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improv...
-
Flex Attention: A Programming Model for Generating Optimized Attention Kernels
FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.
-
How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP
The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
The Falcon Series of Open Language Models
Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
Can AI-Generated Text be Reliably Detected?
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between hum...
-
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.
-
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
-
Atlas: Few-shot Learning with Retrieval Augmented Language Models
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
-
Efficient Training of Language Models to Fill in the Middle
Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
-
CoCa: Contrastive Captioners are Image-Text Foundation Models
CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.
Reference graph
Works this paper leans on
-
[1]
Memory-efficient adaptive optimiza- tion for large-scale learning
Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning.arXiv preprint arXiv:1901.11150,
-
[2]
Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multi- lingual neural machine translation in the wild: Findings and challenges.arXiv preprint arXiv:1907.05019,
work page Pith review arXiv 1907
-
[3]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450,
work page internal anchor Pith review Pith/arXiv arXiv
-
[4]
Cloze-driven Pretraining of Self-attention Networks
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze- driven pretraining of self-attention networks.arXiv preprint arXiv:1903.07785,
work page Pith review arXiv 1903
-
[5]
Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,
Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,
-
[6]
SciBERT: A pretrained language model for scientific text
Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
work page 2019
-
[7]
Findings of the 2014 workshop on statistical machine translation
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Jo- hannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. InProceedings of the Ninth Workshop on Statistical Machine Translation,
work page 2014
-
[8]
Findings of the 2015 workshop on statistical machine translation
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, et al. Findings of the 2015 workshop on statistical machine translation. InProceedings of the Tenth Workshop on Statistical Machine Translation,
work page 2015
-
[9]
Findings of the 2016 conference on machine translation
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation. InProceedings of the First Conference on Machine Translation,
work page 2016
-
[10]
Generating Sentences from a Continuous Space
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space.arXiv preprint arXiv:1511.06349,
-
[11]
SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,
work page Pith review arXiv 2017
-
[12]
Long Short-Term Memory-Networks for Machine Reading
Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading.arXiv preprint arXiv:1601.06733,
-
[13]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,
work page internal anchor Pith review arXiv 1905
-
[14]
ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators
Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555,
work page internal anchor Pith review arXiv 2003
-
[15]
SentEval: An Evaluation Toolkit for Universal Sentence Representations
Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449,
-
[16]
Supervised Learning of Universal Sentence Representations from Natural Language Inference Data
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Super- vised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364,
-
[17]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre- training of deep bidirectional transformers for language understanding.arXiv preprint arXiv:1810.04805,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Unified language model pre- training for natural language understanding and gen- eration
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation.arXiv preprint arXiv:1905.03197,
-
[19]
Understanding Back-Translation at Scale
59 Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li and Liu Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381,
-
[20]
Learning Word Vectors for 157 Languages
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages.arXiv preprint arXiv:1802.06893,
-
[21]
Generating Sequences With Recurrent Neural Networks
Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850,
-
[22]
Rethinking ImageNet Pre-training
Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training.arXiv preprint arXiv:1811.08883,
-
[23]
A hybrid neural network model for commonsense reasoning
Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning.arXiv preprint arXiv:1907.11983,
-
[24]
Deep Learning Scaling is Predictable, Empirically
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409,
work page internal anchor Pith review Pith/arXiv arXiv
-
[25]
Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data.arXiv preprint arXiv:1602.03483,
-
[26]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Parameter-Efficient Transfer Learning for NLP
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP.arXiv preprint arXiv:1902.00751,
work page Pith review arXiv 1902
-
[28]
Universal Language Model Fine-tuning for Text Classification
Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classifi- cation. arXiv preprint arXiv:1801.06146,
-
[29]
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Dou- glas Eck. Music transformer: Generating music with long-term structure. InSeventh International Conference on Learning Representations, 2018a. 60 Exploring the Limits of Transfer Learning Yanping ...
-
[30]
Tinybert: Distilling bert for natural language understanding
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding.arXiv preprint arXiv:1909.10351,
-
[31]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551,
work page internal anchor Pith review Pith/arXiv arXiv
-
[32]
SpanBERT: Improving pre-training by representing and predicting spans
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans.arXiv preprint arXiv:1907.10529,
-
[33]
Exploring the Limits of Language Modeling
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling.arXiv preprint arXiv:1602.02410,
-
[34]
CTRL: A Conditional Transformer Language Model for Controllable Generation
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019a. NitishShirishKeskar, BryanMcCann, CaimingXiong, andRichardSocher. Unifyingquestion answering and text classification via span extraction.arXiv preprint ...
work page internal anchor Pith review arXiv 1909
-
[35]
A surprisingly robust trick for winograd schema challenge
Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for Winograd schema challenge.arXiv preprint arXiv:1905.06290,
-
[36]
Federated Optimization:Distributed Optimization Beyond the Datacenter
Jakub Konečn` y, Brendan McMahan, and Daniel Ramage. Federated optimization: Dis- tributed optimization beyond the datacenter.arXiv preprint arXiv:1511.03575,
-
[37]
Federated Learning: Strategies for Improving Communication Efficiency
Jakub Konečn` y, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492,
work page internal anchor Pith review arXiv
-
[38]
Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974,
-
[39]
One weird trick for parallelizing convolutional neural networks
Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks.arXiv preprint arXiv:1404.5997,
-
[40]
Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates
Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates.arXiv preprint arXiv:1804.10959,
-
[41]
Taku Kudo and John Richardson. SentencePiece: A simple and language independent sub- word tokenizer and detokenizer for neural text processing.arXiv preprint arXiv:1808.06226,
work page internal anchor Pith review arXiv
-
[42]
Cross-lingual Language Model Pretraining
Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining.arXiv preprint arXiv:1901.07291,
work page Pith review arXiv 1901
-
[43]
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representa- tions. arXiv preprint arXiv:1909.11942,
work page internal anchor Pith review arXiv 1909
-
[44]
Generating Wikipedia by Summarizing Long Sequences
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences.arXiv preprint arXiv:1801.10198,
-
[45]
Peter J. Liu, Yu-An Chung, and Jie Ren. SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders.arXiv preprint arXiv:1910.00998, 2019a. 62 Exploring the Limits of Transfer Learning Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Rep- resentation learning using multi-task deep neural networks for se...
-
[46]
Multi-Task Deep Neural Networks for Natural Language Understanding
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding.arXiv preprint arXiv:1901.11504, 2019b. Yang Liu. Fine-tune BERT for extractive summarization.arXiv preprint arXiv:1903.10318,
work page Pith review arXiv 1901
-
[47]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach.arXiv preprint arXiv:1907.11692, 2019c. Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893,
work page internal anchor Pith review Pith/arXiv arXiv 1907
-
[48]
The Natural Language Decathlon: Multitask Learning as Question Answering
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The nat- ural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,
-
[49]
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space.arXiv preprint arXiv:1301.3781, 2013a. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. InAdvances in neural information processing system...
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
A Deep Reinforced Model for Abstractive Summarization
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304,
-
[51]
GloVe: Global vectors for word representation
63 Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li and Liu Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),
work page 2014
-
[52]
Matthew Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? adapting pretrained representations to diverse tasks.arXiv preprint arXiv:1903.05987,
work page Pith review arXiv 1903
-
[53]
Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations.arXiv preprint arXiv:1802.05365,
-
[54]
Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Sup- plementary training on intermediate labeled-data tasks.arXiv preprint arXiv:1811.01088,
-
[55]
WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations
Mohammad Taher Pilehvar and Jose Camacho-Collados. WIC: 10,000 example pairs for evaluating context-sensitive representations.arXiv preprint arXiv:1808.09121,
-
[56]
A Call for Clarity in Reporting BLEU Scores
Matt Post. A call for clarity in reporting BLEU scores.arXiv preprint arXiv:1804.08771,
-
[57]
Resolving complex cases of definite pronouns: the Winograd schema challenge
Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: the Winograd schema challenge. InProceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics,
work page 2012
-
[58]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text.arXiv preprint arXiv:1606.05250,
work page internal anchor Pith review arXiv
-
[59]
Unsupervised Pretraining for Sequence to Sequence Learning
Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning.arXiv preprint arXiv:1611.02683,
-
[60]
An Overview of Multi-Task Learning in Deep Neural Networks
Sebastian Ruder. An overview of multi-task learning in deep neural networks.arXiv preprint arXiv:1706.05098,
work page internal anchor Pith review arXiv
-
[61]
Peters, Swabha Swayamdipta, and Thomas Wolf
64 Exploring the Limits of Transfer Learning Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18,
work page 2019
-
[62]
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108,
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[63]
Get To The Point: Summarization with Pointer-Generator Networks
Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks.arXiv preprint arXiv:1704.04368,
-
[64]
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units.arXiv preprint arXiv:1508.07909,
work page internal anchor Pith review arXiv
-
[65]
Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600,
-
[66]
Self-Attention with Relative Position Representations
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155,
-
[67]
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235,
-
[68]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,
work page internal anchor Pith review Pith/arXiv arXiv
-
[69]
Manning, Andrew Ng, and Christopher Potts
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. InProceedings of the 2013 conference on empirical methods in natural language processing,
work page 2013
-
[70]
MASS: Masked Sequence to Sequence Pre-training for Language Generation
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation.arXiv preprint arXiv:1905.02450,
work page Pith review arXiv 1905
-
[71]
Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079,
-
[72]
Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning.arXiv preprint arXiv:1806.02847,
-
[73]
NewsQA: A Machine Comprehension Dataset
Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset.arXiv preprint arXiv:1611.09830,
-
[74]
Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380,
-
[75]
Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,
work page internal anchor Pith review arXiv
-
[76]
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems
Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, et al. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019a. Alex Wang, Y...
work page internal anchor Pith review arXiv 1905
-
[77]
Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference.arXiv preprint arXiv:1704.05426,
-
[78]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,
work page internal anchor Pith review arXiv
-
[79]
Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding.arXiv preprint arXiv:1906.08237,
work page internal anchor Pith review arXiv 1906
-
[80]
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. QAnet: Combining local convolution with global self-attention for reading comprehension.arXiv preprint arXiv:1804.09541,
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.