Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3
The pith
A single text-to-text transformer pre-trained on a large cleaned web corpus reaches state-of-the-art results on many NLP benchmarks when fine-tuned uniformly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Converting every text-based language problem into a text-to-text format and pre-training a transformer on the Colossal Clean Crawled Corpus with a denoising objective yields a model that, once fine-tuned on downstream tasks, achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
What carries the argument
The text-to-text framework that represents every input and output as plain text strings, allowing one transformer architecture and pre-training procedure to serve all tasks.
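To make the uniform format concrete, here is a minimal sketch of how heterogeneous tasks reduce to plain string pairs under this framing; the task prefixes and label wordings below are illustrative stand-ins, not the paper's exact templates.
# Illustrative sketch: casting different NLP tasks as (input_text, target_text) string pairs.
def to_text_to_text(task, example):
    """Map one example from an arbitrary task to an (input_text, target_text) pair."""
    if task == "translation":
        return ("translate English to German: " + example["en"], example["de"])
    if task == "classification":
        # Class labels are emitted as literal label strings, not integer ids.
        return ("sentiment: " + example["sentence"], example["label"])
    if task == "summarization":
        return ("summarize: " + example["document"], example["summary"])
    raise ValueError("unknown task: " + task)

pairs = [
    to_text_to_text("translation", {"en": "That is good.", "de": "Das ist gut."}),
    to_text_to_text("classification", {"sentence": "A great movie.", "label": "positive"}),
]
for source, target in pairs:
    print(source, "->", target)
Because every task shares this string-to-string interface, one maximum-likelihood training loop and one decoding procedure can serve translation, classification, and summarization alike.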
If this is right
- One pre-trained model can be adapted to many tasks without designing separate architectures for each.
- Larger model scale combined with cleaner and larger unlabeled data improves transfer performance across benchmarks.
- Systematic comparison of pre-training objectives and data sources identifies which choices transfer most effectively.
- Releasing the pre-trained models, new dataset, and code allows direct reuse and extension by others.
Where Pith is reading between the lines
- The uniform format may reduce the engineering effort needed to apply models to new language problems.
- If the text-to-text approach works across many tasks, it could simplify evaluation and comparison of future models.
- The success with web-scale cleaned data suggests that data quality and volume matter as much as model architecture for transfer.
Load-bearing premise
Converting every language task into a text-to-text generation problem preserves all necessary information for solving the original task.
What would settle it
A language task where even a very large text-to-text model, after fine-tuning, scores substantially below the best task-specific models on standard metrics.
read the original abstract
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces T5, a unified text-to-text transformer framework that reformulates all NLP tasks as sequence-to-sequence generation problems. It conducts a systematic empirical study comparing pre-training objectives (e.g., span corruption), model architectures (encoder-decoder vs. decoder-only), unlabeled datasets, and transfer methods across dozens of tasks. By scaling models up to 11B parameters and pre-training on the new Colossal Clean Crawled Corpus (C4), the authors report state-of-the-art results on benchmarks spanning summarization, question answering, text classification, and more, while releasing the models, code, and C4 dataset.
Significance. If the reported results hold under independent verification, the work is significant for establishing a simple, scalable, and unified approach to transfer learning that outperforms prior specialized methods. The thorough controlled ablations isolating the contributions of objective, architecture, and data, combined with the public release of artifacts, provide a strong foundation for future research and reproducibility in NLP.
major comments (2)
- [§4.2, Table 7] The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results, with no standard deviations or significance tests across random seeds; given the known variance of fine-tuning large models, this weakens the cross-task superiority claims.
- [§3.4] The comparison of pre-training objectives holds compute budgets fixed, but the paper does not quantify whether the advantage of span corruption over alternatives (e.g., language modeling) persists when each objective is given its own hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.
minor comments (3)
- [§2] The model size nomenclature (small, base, large, 3B, 11B) is introduced gradually; a single summary table early in §2 or §3 would improve readability.
- [Figure 3] Figure 3 (scaling curves): The x-axis for parameter count is logarithmic but the tick labels and legend could be enlarged for clarity in print.
- [Appendix A.3] Appendix A.3 on C4 cleaning heuristics is detailed, but a short paragraph in the main text summarizing the key filtering steps would help readers without requiring appendix consultation.
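For readers who skip the appendix, the following is a minimal sketch of the kind of rule-based page filtering described for C4 in Appendix A.3; the thresholds and string checks are illustrative, and the released pipeline applies additional steps such as blocklist filtering, language identification, and span-level deduplication.
# Illustrative sketch of rule-based web-page cleaning in the spirit of C4's heuristics.
TERMINAL_PUNCT = ('.', '!', '?', '"')

def clean_page(text, min_words_per_line=5, min_lines=3):
    """Keep lines that look like natural-language sentences; drop thin or noisy pages."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(TERMINAL_PUNCT):
            continue  # drop menus, headers, and other boilerplate fragments
        if len(line.split()) < min_words_per_line:
            continue  # drop very short lines
        if "lorem ipsum" in line.lower() or "{" in line:
            continue  # drop placeholder text and source code
        kept.append(line)
    return "\n".join(kept) if len(kept) >= min_lines else None  # discard thin pages

page = (
    "Home | About | Contact\n"
    "Transfer learning has become central to modern NLP research.\n"
    "Pre-training on unlabeled text is followed by fine-tuning on a downstream task.\n"
    "The resulting models are evaluated on standard benchmarks.\n"
)
print(clean_page(page))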
Simulated Author's Rebuttal
We thank the referee for the positive assessment and recommendation for minor revision. We address each major comment below.
read point-by-point responses
-
Referee: [§4.2, Table 7] The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results, with no standard deviations or significance tests across random seeds; given the known variance of fine-tuning large models, this weakens the cross-task superiority claims.
Authors: We acknowledge that reporting standard deviations from multiple random seeds would provide stronger statistical support for the SOTA claims. Due to the prohibitive computational expense of repeated fine-tuning runs for models up to 11B parameters, we reported single-run results for the primary GLUE and SuperGLUE numbers. The observed gains are large in magnitude and consistent across dozens of tasks and model scales, which reduces the likelihood that they arise from random seed variance alone. In the revised manuscript we will add a brief discussion in §4.2 noting the single-run protocol and referencing prior studies on fine-tuning variance. revision: partial
-
Referee: [§3.4] The comparison of pre-training objectives holds compute budgets fixed, but the paper does not quantify whether the advantage of span corruption over alternatives (e.g., language modeling) persists when each objective is given its own hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.
Authors: We deliberately held compute budgets fixed across objectives to isolate the effect of the pre-training task itself rather than differences in training duration or hyperparameter optimization. This controlled design is standard for large-scale ablation studies. While we did not conduct per-objective hyperparameter sweeps or extended training, span corruption produced clear and consistent gains under the equal-compute regime. We will revise §3.4 to explicitly state this rationale and note that further per-objective optimization remains an interesting direction for future work. revision: partial
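To make the objective under discussion concrete, here is a minimal sketch of span-corruption input/target construction: contiguous token spans are replaced by sentinel markers in the input, and the target lists the dropped spans after their matching sentinels. Whitespace tokenization, the <X0>-style sentinel strings, and the fixed span length are simplifications of the paper's default setup (roughly 15% of tokens corrupted with a mean span length of 3 over a SentencePiece vocabulary).
# Illustrative sketch of the span-corruption denoising objective.
import random

def span_corrupt(tokens, corrupt_rate=0.15, span_len=3, seed=0):
    """Return (input_text, target_text) with random spans swapped for sentinels."""
    rng = random.Random(seed)
    n_spans = max(1, int(len(tokens) * corrupt_rate) // span_len)
    starts = sorted(rng.sample(range(len(tokens) - span_len), n_spans))
    inputs, targets, cursor, sid = [], [], 0, 0
    for start in starts:
        if start < cursor:
            continue  # skip spans that would overlap the previous one
        inputs.extend(tokens[cursor:start])
        inputs.append("<X%d>" % sid)            # sentinel replaces the span in the input
        targets.append("<X%d>" % sid)           # target repeats the sentinel...
        targets.extend(tokens[start:start + span_len])  # ...followed by the dropped tokens
        cursor, sid = start + span_len, sid + 1
    inputs.extend(tokens[cursor:])
    return " ".join(inputs), " ".join(targets)

src = "Thank you for inviting me to your party last week".split()
print(span_corrupt(src))
Under the equal-compute comparison in §3.4, this denoising setup is the objective the paper recommends as the default.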
Circularity Check
No significant circularity; empirical results are self-contained
full rationale
The paper conducts a large-scale empirical exploration of transfer learning by reformulating NLP tasks as text-to-text problems, systematically ablating pre-training objectives, architectures, data sources, and scaling behaviors across dozens of benchmarks. All central claims (including SOTA results) derive from direct experimental measurements on the released C4 corpus and models rather than from any closed-form derivations, fitted parameters renamed as predictions, or self-citation chains. No equations or uniqueness theorems are invoked that reduce the reported outcomes to inputs by construction; the work is therefore independent and verifiable through the provided artifacts.
Axiom & Free-Parameter Ledger
free parameters (3)
- model scale (small to 11B parameters)
- pre-training objective variants
- C4 data-cleaning heuristics
axioms (1)
- domain assumption Pre-training on large unlabeled text followed by fine-tuning improves performance on downstream language tasks
Forward citations
Cited by 41 Pith papers
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
Online Learning-to-Defer with Varying Experts
Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
-
Show Your Work: Scratchpads for Intermediate Computation with Language Models
Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.
-
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...
-
Measuring Massive Multitask Language Understanding
Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.
-
The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently
Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
SWAN: Semantic Watermarking with Abstract Meaning Representation
SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.
-
AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe
AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
Unlocking Prompt Infilling Capability for Diffusion Language Models
Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.
-
Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods
Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.
-
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
-
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.
-
Flamingo: a Visual Language Model for Few-Shot Learning
Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
-
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
-
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
-
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.
-
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.
-
The two clocks and the innovation window: When and how generative models learn rules
Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
-
Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation
RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...
-
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
Language Models (Mostly) Know What They Know
Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.
-
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.
-
ST-MoE: Designing Stable and Transferable Sparse Expert Models
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
-
Unsupervised Dense Information Retrieval with Contrastive Learning
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
-
Ethical and social risks of harm from Language Models
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job...
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
Linformer: Self-Attention with Linear Complexity
Linformer approximates self-attention with a low-rank projection to achieve O(n) time and space complexity while matching Transformer accuracy on standard NLP tasks.
-
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
CodeBERT pre-trains a bimodal model on code and text pairs plus unimodal data to achieve state-of-the-art results on natural language code search and code documentation generation.
-
HuggingFace's Transformers: State-of-the-art Natural Language Processing
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
-
Diversity in Large Language Models under Supervised Fine-Tuning
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
-
Uncertainty-Aware Transformers: Conformal Prediction for Language Models
CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy up to 4.09% and efficiency over baselines on models like BERT-tiny.
-
Voice Biomarkers for Depression and Anxiety
Deep learning models extract content-agnostic voice biomarkers for depression and anxiety from a ~65k-utterance proprietary dataset, achieving 71% sensitivity and specificity when combined with lexical features.
-
Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance ...
-
Gemma: Open Models Based on Gemini Research and Technology
Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
-
Gemma 2: Improving Open Language Models at a Practical Size
Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
Reference graph
Works this paper leans on
-
[1]
Memory-efficient adaptive optimization for large-scale learning
Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning.arXiv preprint arXiv:1901.11150,
-
[2]
Massively multilingual neural machine translation in the wild: Findings and challenges
Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multilingual neural machine translation in the wild: Findings and challenges. arXiv preprint arXiv:1907.05019,
-
[3]
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450,
-
[4]
Cloze-driven Pretraining of Self-attention Networks
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785,
-
[5]
Simple, scalable adaptation for neural machine translation
Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,
-
[6]
SciBERT: A pretrained language model for scientific text
Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
-
[7]
Findings of the 2014 workshop on statistical machine translation
Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation,
-
[8]
Findings of the 2015 workshop on statistical machine translation
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, et al. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation,
-
[9]
Findings of the 2016 conference on machine translation
Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation,
-
[10]
Generating sentences from a continuous space
Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space.arXiv preprint arXiv:1511.06349,
-
[11]
Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,
-
[12]
Long short-term memory-networks for machine reading
Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading.arXiv preprint arXiv:1601.06733,
-
[13]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,
-
[14]
ELECTRA: Pre-training text encoders as discriminators rather than generators
Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555,
-
[15]
SentEval: An evaluation toolkit for universal sentence representations
Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449,
-
[16]
Supervised learning of universal sentence representations from natural language inference data
Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364,
-
[17]
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,
-
[18]
Unified language model pre-training for natural language understanding and generation
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation.arXiv preprint arXiv:1905.03197,
-
[19]
Understanding back-translation at scale
Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381,
-
[20]
Learning word vectors for 157 languages
Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages.arXiv preprint arXiv:1802.06893,
-
[21]
Generating sequences with recurrent neural networks
Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850,
-
[22]
Rethinking ImageNet pre-training
Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training.arXiv preprint arXiv:1811.08883,
-
[23]
A hybrid neural network model for commonsense reasoning
Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning.arXiv preprint arXiv:1907.11983,
-
[24]
Deep Learning Scaling is Predictable, Empirically
Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically.arXiv preprint arXiv:1712.00409,
-
[25]
Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data.arXiv preprint arXiv:1602.03483,
-
[26]
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,
-
[27]
Parameter-Efficient Transfer Learning for NLP
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP.arXiv preprint arXiv:1902.00751,
-
[28]
Universal language model fine-tuning for text classification
Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,
-
[29]
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: Generating music with long-term structure. In Seventh International Conference on Learning Representations, 2018a. Yanping ...
-
[30]
TinyBERT: Distilling BERT for natural language understanding
Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding.arXiv preprint arXiv:1909.10351,
-
[31]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension.arXiv preprint arXiv:1705.03551,
-
[32]
SpanBERT: Improving pre-training by representing and predicting spans
Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans.arXiv preprint arXiv:1907.10529,
-
[33]
Exploring the Limits of Language Modeling
Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling.arXiv preprint arXiv:1602.02410,
-
[34]
Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019a. Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering and text classification via span extraction. arXiv preprint ...
-
[35]
A surprisingly robust trick for Winograd schema challenge
Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for Winograd schema challenge.arXiv preprint arXiv:1905.06290,
-
[36]
Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575,
-
[37]
Federated Learning: Strategies for Improving Communication Efficiency
Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492,
- [38]
-
[39]
One weird trick for parallelizing convolutional neural networks
Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks.arXiv preprint arXiv:1404.5997,
-
[40]
Subword regularization: Improving neural network translation models with multiple subword candidates
Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates.arXiv preprint arXiv:1804.10959,
-
[41]
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226,
-
[42]
Cross-lingual language model pretraining
Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining.arXiv preprint arXiv:1901.07291,
-
[43]
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942,
-
[44]
Generating Wikipedia by summarizing long sequences
Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences.arXiv preprint arXiv:1801.10198,
-
[45]
Peter J. Liu, Yu-An Chung, and Jie Ren. SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders. arXiv preprint arXiv:1910.00998, 2019a. Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation learning using multi-task deep neural networks for se...
-
[46]
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding.arXiv preprint arXiv:1901.11504, 2019b. Yang Liu. Fine-tune BERT for extractive summarization.arXiv preprint arXiv:1903.10318,
-
[47]
RoBERTa: A Robustly Optimized BERT Pretraining Approach
Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach.arXiv preprint arXiv:1907.11692, 2019c. Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893,
-
[48]
The Natural Language Decathlon: Multitask Learning as Question Answering
Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,
-
[49]
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing system...
-
[50]
A deep reinforced model for abstractive summarization
Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304,
-
[51]
GloVe: Global vectors for word representation
Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),
- [52]
-
[53]
Deep contextualized word representations
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations.arXiv preprint arXiv:1802.05365,
- [54]
-
[55]
Mohammad Taher Pilehvar and Jose Camacho-Collados. WIC: 10,000 example pairs for evaluating context-sensitive representations.arXiv preprint arXiv:1808.09121,
-
[56]
A call for clarity in reporting BLEU scores
Matt Post. A call for clarity in reporting BLEU scores.arXiv preprint arXiv:1804.08771,
-
[57]
Resolving complex cases of definite pronouns: the Winograd schema challenge
Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics,
-
[58]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text.arXiv preprint arXiv:1606.05250,
-
[59]
Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning.arXiv preprint arXiv:1611.02683,
-
[60]
An Overview of Multi-Task Learning in Deep Neural Networks
Sebastian Ruder. An overview of multi-task learning in deep neural networks.arXiv preprint arXiv:1706.05098,
-
[61]
Transfer learning in natural language processing
Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18,
-
[62]
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter.arXiv preprint arXiv:1910.01108,
-
[63]
Get To The Point: Summarization with Pointer-Generator Networks
Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks.arXiv preprint arXiv:1704.04368,
-
[64]
Neural Machine Translation of Rare Words with Subword Units
Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units.arXiv preprint arXiv:1508.07909,
- [65]
-
[66]
Self-attention with relative position representations
Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155,
-
[67]
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235,
-
[68]
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,
-
[69]
Recursive deep models for semantic compositionality over a sentiment treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing,
-
[70]
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation.arXiv preprint arXiv:1905.02450,
- [71]
-
[72]
A simple method for commonsense reasoning
Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning.arXiv preprint arXiv:1806.02847,
-
[73]
NewsQA: A machine comprehension dataset
Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset.arXiv preprint arXiv:1611.09830,
-
[74]
Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380,
-
[75]
Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,
-
[76]
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems,
Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, et al. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019a. Alex Wang, Y...
- [77]
-
[78]
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,
- [79]
- [80]
discussion (0)