
arxiv: 1910.10683 · v4 · submitted 2019-10-23 · 💻 cs.LG · cs.CL · stat.ML

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

Pith reviewed 2026-05-12 05:33 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords transfer learning · text-to-text · transformer · pre-training · natural language processing · summarization · question answering · text classification

The pith

A single text-to-text transformer pre-trained on a large cleaned web corpus reaches state-of-the-art results on many NLP benchmarks when fine-tuned uniformly.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how far transfer learning can go in natural language processing by turning every task into the same text-to-text format. It compares pre-training goals, model sizes, data sources, and fine-tuning methods across dozens of tasks. The authors introduce a massive cleaned web dataset and find that scaling up the unified approach produces top performance on summarization, question answering, text classification, and related problems. This shows that one model and training recipe can handle a wide range of language tasks without custom setups for each one.

Core claim

By converting every text-based language problem into a text-to-text format and pre-training a transformer on the Colossal Clean Crawled Corpus with a denoising objective, the resulting model achieves state-of-the-art results on many benchmarks when fine-tuned on downstream tasks covering summarization, question answering, text classification, and more.

What carries the argument

The text-to-text framework that represents every input and output as plain text strings, allowing one transformer architecture and pre-training procedure to serve all tasks.
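
As a rough illustration of the text-to-text idea, the sketch below casts three different task types into the same (source string, target string) shape. The task prefixes follow the style of the examples in the paper, but the helper names (`TextToTextExample`, `cast_classification`, and so on) and the sample sentences are hypothetical and exist only for this illustration; this is not the authors' preprocessing code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextToTextExample:
    """One training pair; both sides are plain strings."""
    source: str
    target: str


def cast_classification(prefix: str, text: str, label: str) -> TextToTextExample:
    # The class label is emitted as a literal word, not as a class index.
    return TextToTextExample(source=f"{prefix}: {text}", target=label)


def cast_translation(text: str, translation: str,
                     src: str = "English", tgt: str = "German") -> TextToTextExample:
    # The language pair is spelled out in the prefix (assumed wording).
    return TextToTextExample(
        source=f"translate {src} to {tgt}: {text}", target=translation)


def cast_summarization(document: str, summary: str) -> TextToTextExample:
    return TextToTextExample(source=f"summarize: {document}", target=summary)


# Illustrative inputs only; none of these strings come from a real data set.
examples = [
    cast_classification("cola sentence", "The course is jumping well.",
                        "not acceptable"),
    cast_translation("That is good.", "Das ist gut."),
    cast_summarization("A long news article would go here.",
                       "A short summary would go here."),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because every example has the same shape, a single sequence-to-sequence model, loss, and decoding procedure can be reused across tasks; only the prefix and the target vocabulary of words differ.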

If this is right

  • One pre-trained model can be adapted to many tasks without designing separate architectures for each.
  • Larger model scale combined with cleaner and larger unlabeled data improves transfer performance across benchmarks.
  • Systematic comparison of pre-training objectives and data sources identifies which choices transfer most effectively.
  • Releasing the pre-trained models, new dataset, and code allows direct reuse and extension by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniform format may reduce the engineering effort needed to apply models to new language problems.
  • If the text-to-text approach works across many tasks, it could simplify evaluation and comparison of future models.
  • The success with web-scale cleaned data suggests that data quality and volume matter as much as model architecture for transfer.

Load-bearing premise

Converting every language task into a text-to-text generation problem preserves all necessary information for solving the original task.

What would settle it

A language task where even a very large text-to-text model, after fine-tuning, scores substantially below the best task-specific models on standard metrics.

Original abstract

Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The effectiveness of transfer learning has given rise to a diversity of approaches, methodology, and practice. In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format. Our systematic study compares pre-training objectives, architectures, unlabeled data sets, transfer approaches, and other factors on dozens of language understanding tasks. By combining the insights from our exploration with scale and our new "Colossal Clean Crawled Corpus", we achieve state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more. To facilitate future work on transfer learning for NLP, we release our data set, pre-trained models, and code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces T5, a unified text-to-text transformer framework that reformulates all NLP tasks as sequence-to-sequence generation problems. It conducts a systematic empirical study comparing pre-training objectives (e.g., span corruption), model architectures (encoder-decoder vs. decoder-only), unlabeled datasets, and transfer methods across dozens of tasks. By scaling models up to 11B parameters and pre-training on the new Colossal Clean Crawled Corpus (C4), the authors report state-of-the-art results on benchmarks spanning summarization, question answering, text classification, and more, while releasing the models, code, and C4 dataset.
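
The span-corruption objective mentioned in the summary can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the spans are passed in explicitly rather than sampled at random, tokens are whitespace-split words rather than subword pieces, and the sentinel naming (`<extra_id_0>`, `<extra_id_1>`, ...) is assumed here for readability. The function name `span_corrupt` is hypothetical.

```python
from typing import List, Sequence, Tuple


def span_corrupt(tokens: Sequence[str],
                 spans: Sequence[Tuple[int, int]]) -> Tuple[List[str], List[str]]:
    """Replace each [start, end) span with a sentinel; return (inputs, targets).

    `spans` must be sorted by start index and must not overlap.
    """
    inputs: List[str] = []
    targets: List[str] = []
    cursor = 0
    for i, (start, end) in enumerate(spans):
        if not (cursor <= start < end <= len(tokens)):
            raise ValueError(f"invalid or overlapping span: {(start, end)}")
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])   # keep the uncorrupted text
        inputs.append(sentinel)               # one sentinel stands in for the span
        targets.append(sentinel)              # the target names the sentinel ...
        targets.extend(tokens[start:end])     # ... followed by the dropped tokens
        cursor = end
    inputs.extend(tokens[cursor:])
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week .".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The target contains only the dropped spans, so it is much shorter than the input; the model reconstructs the missing text rather than copying the whole sequence.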

Significance. If the reported results hold under independent verification, the work is significant for establishing a simple, scalable, and unified approach to transfer learning that outperforms prior specialized methods. The thorough controlled ablations isolating the contributions of objective, architecture, and data, combined with the public release of artifacts, provide a strong foundation for future research and reproducibility in NLP.

major comments (2)
  1. [§4.2, Table 7] The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given the known variance in fine-tuning large models, this weakens the cross-task superiority claims.
  2. [§3.4] The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when each objective is given its own hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.
minor comments (3)
  1. [§2] The model size nomenclature (small, base, large, 3B, 11B) is introduced gradually; a single summary table early in §2 or §3 would improve readability.
  2. [Figure 3] Scaling curves: the x-axis for parameter count is logarithmic, but the tick labels and legend could be enlarged for clarity in print.
  3. [Appendix A.3] The description of the C4 cleaning heuristics is detailed, but a short paragraph in the main text summarizing the key filtering steps would spare readers a trip to the appendix.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation for minor revision. We address each major comment below.

Point-by-point responses
  1. Referee: [§4.2, Table 7] The headline SOTA claims on GLUE and SuperGLUE rely on single-run fine-tuning results without reported standard deviations or statistical significance tests across multiple random seeds; given the known variance in fine-tuning large models, this weakens the cross-task superiority claims.

    Authors: We acknowledge that reporting standard deviations from multiple random seeds would provide stronger statistical support for the SOTA claims. Due to the prohibitive computational expense of repeated fine-tuning runs for models up to 11B parameters, we reported single-run results for the primary GLUE and SuperGLUE numbers. The observed gains are large in magnitude and consistent across dozens of tasks and model scales, which reduces the likelihood that they arise from random seed variance alone. In the revised manuscript we will add a brief discussion in §4.2 noting the single-run protocol and referencing prior studies on fine-tuning variance.

    Revision: partial

  2. Referee: [§3.4] The comparison of pre-training objectives uses fixed compute budgets, but the paper does not quantify whether the observed advantage of span corruption over alternatives (e.g., language modeling) persists when each objective is given its own hyperparameter search or longer training; this is load-bearing for the recommendation of the default objective.

    Authors: We deliberately held compute budgets fixed across objectives to isolate the effect of the pre-training task itself rather than differences in training duration or hyperparameter optimization. This controlled design is standard for large-scale ablation studies. While we did not conduct per-objective hyperparameter sweeps or extended training, span corruption produced clear and consistent gains under the equal-compute regime. We will revise §3.4 to explicitly state this rationale and note that further per-objective optimization remains an interesting direction for future work.

    Revision: partial
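
The multi-seed reporting raised in the first exchange could be summarized with a small helper like the one below. This is a sketch only: `summarize_runs` is a hypothetical name, and the three scores are made-up placeholders used to exercise the function, not numbers from the paper or from any real fine-tuning run.

```python
import statistics
from typing import Dict, Sequence


def summarize_runs(scores: Sequence[float]) -> Dict[str, float]:
    """Return run count, mean, and sample standard deviation of per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate spread")
    return {
        "n": len(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample (n - 1) standard deviation
    }


# Placeholder per-seed scores for one task (hypothetical values).
placeholder_scores = [88.1, 88.9, 88.5]
summary = summarize_runs(placeholder_scores)
print(f"{summary['mean']:.2f} +/- {summary['std']:.2f} (n={summary['n']})")
```

Reporting the mean together with the spread and the number of seeds lets a reader judge whether a gap between two systems is larger than the run-to-run noise.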

Circularity Check

0 steps flagged

No significant circularity; empirical results are self-contained

Full rationale

The paper conducts a large-scale empirical exploration of transfer learning by reformulating NLP tasks as text-to-text problems, systematically ablating pre-training objectives, architectures, data sources, and scaling behaviors across dozens of benchmarks. All central claims (including SOTA results) derive from direct experimental measurements on the released C4 corpus and models rather than from any closed-form derivations, fitted parameters renamed as predictions, or self-citation chains. No equations or uniqueness theorems are invoked that reduce the reported outcomes to inputs by construction; the work is therefore independent and verifiable through the provided artifacts.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical comparisons rather than closed-form derivations. Free parameters include model scale choices, pre-training objective variants, and the specific data-cleaning rules used to construct the Colossal Clean Crawled Corpus. The work assumes standard transformer inductive biases and the effectiveness of transfer from unlabeled pre-training.

free parameters (3)
  • model scale (small to 11B parameters)
    Different parameter counts are trained and compared; performance depends on these choices.
  • pre-training objective variants
    Multiple objectives (e.g., span corruption) are selected and evaluated; results are sensitive to which are used.
  • C4 data-cleaning heuristics
    Rules for filtering the crawled corpus are introduced and affect the pre-training data distribution.
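
To make the idea of data-cleaning heuristics concrete, the sketch below applies a few line-level and page-level rules of the kind the paper describes for C4: keep only lines that end in terminal punctuation, drop lines mentioning JavaScript, and reject pages containing placeholder text or curly braces. It is a simplified illustration, not the released pipeline; the function name `clean_page` is hypothetical, and the `min_words_per_line` default is an assumed value chosen for the example.

```python
from typing import List, Optional

# Characters a line must end with to be kept (assumed set for this sketch).
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def clean_page(text: str, min_words_per_line: int = 5) -> Optional[str]:
    """Return the cleaned page text, or None if the whole page is rejected."""
    # Page-level rejections: placeholder text and code-like pages.
    if "lorem ipsum" in text.lower() or "{" in text:
        return None

    kept: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line.endswith(TERMINAL_PUNCTUATION):
            continue  # menus, titles, and fragments rarely end a sentence
        if len(line.split()) < min_words_per_line:
            continue  # very short lines are usually boilerplate
        if "javascript" in line.lower():
            continue  # "please enable JavaScript" style warnings
        kept.append(line)

    return "\n".join(kept) if kept else None


page = "\n".join([
    "Home",
    "This is a complete sentence with enough words.",
    "Click here to enable Javascript in your browser.",
    "no terminal punctuation on this particular line",
])
print(clean_page(page))
```

Each rule is cheap and purely textual, which is why small changes to the thresholds can noticeably shift what ends up in the pre-training corpus.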
axioms (1)
  • domain assumption: Pre-training on large unlabeled text followed by fine-tuning improves performance on downstream language tasks
    Invoked as the foundation for all transfer experiments in the abstract.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. Online Learning-to-Defer with Varying Experts

    stat.ML 2026-05 unverdicted novelty 8.0

    Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

  3. Discovering Latent Knowledge in Language Models Without Supervision

    cs.CL 2022-12 conditional novelty 8.0

    An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...

  4. Show Your Work: Scratchpads for Intermediate Computation with Language Models

    cs.LG 2021-11 unverdicted novelty 8.0

    Training language models to generate intermediate computation steps on a scratchpad enables them to perform multi-step tasks such as long addition and arbitrary program execution that they otherwise fail at.

  5. The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    cs.CL 2020-12 conditional novelty 8.0

    The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl ...

  6. Generative Language Modeling for Automated Theorem Proving

    cs.LG 2020-09 unverdicted novelty 8.0

    GPT-f, a transformer-based prover for Metamath, generated new short proofs that were accepted into the main library—the first such contribution from a deep-learning system.

  7. Measuring Massive Multitask Language Understanding

    cs.CY 2020-09 accept novelty 8.0

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  8. Dense Passage Retrieval for Open-Domain Question Answering

    cs.CL 2020-04 accept novelty 8.0

    Dense dual-encoder retrievers outperform BM25 by 9-19% absolute in top-20 passage retrieval accuracy across open-domain QA datasets and enable new state-of-the-art end-to-end QA results.

  9. REALM: Retrieval-Augmented Language Model Pre-Training

    cs.CL 2020-02 accept novelty 8.0

    REALM augments language-model pre-training with an unsupervised retriever over Wikipedia documents and reports 4-16% absolute gains on open-domain QA benchmarks over prior implicit and explicit knowledge methods.

  10. Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

    cs.SD 2026-05 unverdicted novelty 7.0

    Live Music Diffusion Models adapt bidirectional diffusion for interactive music generation via KV caching and ARC-Forcing, recovering and exceeding discrete autoregressive efficiency while enabling post-training align...

  11. TabPFN-MT: A Natively Multitask In-Context Learner for Tabular Data

    cs.LG 2026-05 unverdicted novelty 7.0

    TabPFN-MT is a multitask in-context learner for tabular data that sets a new state-of-the-art on deep multitask learning for datasets under 1000 samples while reducing inference cost from O(T) to O(1) passes.

  12. Online Learning-to-Defer with Varying Experts

    stat.ML 2026-05 unverdicted novelty 7.0

    Presents the first online Learning-to-Defer algorithm achieving regret O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

  13. The Benefits of Temporal Correlations: SGD Learns k-Juntas from Random Walks Efficiently

    cs.LG 2026-05 unverdicted novelty 7.0

    Temporal correlations from lazy random walks enable efficient SGD learning of k-juntas via temporal-difference loss on ReLU networks, achieving linear sample complexity in d.

  14. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  15. SWAN: Semantic Watermarking with Abstract Meaning Representation

    cs.CL 2026-05 unverdicted novelty 7.0

    SWAN uses AMR to embed semantic watermarks that persist through paraphrases, matching SOTA detection on original text and improving AUC by 13.9 points on paraphrased RealNews data.

  16. AttentionBender: Manipulating Cross-Attention in Video Diffusion Transformers as a Creative Probe

    cs.MM 2026-04 unverdicted novelty 7.0

    AttentionBender applies 2D transforms to cross-attention maps in video diffusion transformers, producing distributed distortions and glitch aesthetics that reveal entangled attention mechanisms while serving as both a...

  17. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  18. Unlocking Prompt Infilling Capability for Diffusion Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

  19. Analyzing Reverse Address Translation Overheads in Multi-GPU Scale-Up Pods

    cs.DC 2026-04 unverdicted novelty 7.0

    Simulation study shows cold TLB misses in reverse address translation dominate latency for small collectives in multi-GPU pods, causing up to 1.4x degradation, while larger ones see diminishing returns.

  20. RELIC: Evaluating Complex Reasoning via the Recognition of Languages In-Context

    cs.CL 2025-06 unverdicted novelty 7.0

    RELIC benchmark reveals that advanced LLMs fail to scale reasoning compute with task difficulty in context-free language recognition and instead reduce reasoning tokens while shifting to guessing strategies.

  21. Neurons Speak in Ranges: Breaking Free from Discrete Neuronal Attribution

    cs.LG 2025-02 unverdicted novelty 7.0

    Neurons exhibit concept-conditioned activation ranges forming Gaussian-like distributions with minimal overlap, and range-based interventions via NeuronLens outperform neuron-level masking in targeted manipulation wit...

  22. PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

    cs.CV 2025-01 unverdicted novelty 7.0

    PromptGuard optimizes a universal safety soft prompt (and category-specific variants) in T2I embedding space to moderate NSFW inputs, achieving average unsafe ratios of 5.84-6.18% while being 3.8x faster than prior defenses.

  23. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    cs.CL 2024-02 unverdicted novelty 7.0

    BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

  24. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  25. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

    cs.LG 2022-08 conditional novelty 7.0

    LLM.int8() performs 8-bit inference for transformers up to 175B parameters with no accuracy loss by combining vector-wise quantization for most features with 16-bit mixed-precision handling of systematic outlier dimensions.

  26. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  27. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  28. GraphCodeBERT: Pre-training Code Representations with Data Flow

    cs.SE 2020-09 accept novelty 7.0

    GraphCodeBERT uses data flow graphs in pre-training to capture semantic code structure and reaches state-of-the-art results on code search, clone detection, translation, and refinement.

  29. Learning to summarize from human feedback

    cs.CL 2020-09 conditional novelty 7.0

    Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

  30. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  31. Unsupervised Cross-lingual Representation Learning at Scale

    cs.CL 2019-11 conditional novelty 7.0

    XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.

  32. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    cs.CL 2019-09 accept novelty 7.0

    ALBERT reduces BERT parameters via embedding factorization and layer sharing, adds inter-sentence coherence pretraining, and reaches SOTA on GLUE, RACE, and SQuAD with fewer parameters than BERT-large.

  33. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    cs.CL 2019-09 unverdicted novelty 7.0

    Intra-layer model parallelism in PyTorch enables training of 8.3B-parameter transformers, achieving SOTA perplexity of 10.8 on WikiText103 and 66.5% accuracy on LAMBADA.

  34. FOL2NS: Generating Natural Sentences from First-Order Logic

    cs.CL 2026-05 unverdicted novelty 6.0

    FOL2NS generates synthetic first-order logic formulas with varying quantifier depths and translates them into natural language sentences via a hybrid rule-driven and fine-tuned language model approach.

  35. Runtime-Orchestrated Second-Order Optimization for Scalable LLM Training

    cs.DC 2026-05 unverdicted novelty 6.0

    Asteria is a runtime system that enables second-order optimization for LLMs by dynamically distributing optimizer state across GPU, CPU, and NVMe while using asynchronous inverse-root computations and bounded-stalenes...

  36. Block-Based Double Decoders

    cs.LG 2026-05 unverdicted novelty 6.0

    Block-based double decoders achieve full supervision in pretraining like decoder-only models and efficient inference like encoder-decoders through doubly-causal block-based attention masks, outperforming encoder-decod...

  37. Theory-optimal Quantization Based on Flatness

    cs.LG 2026-05 unverdicted novelty 6.0

    The paper introduces the Flatness metric, derives a theory-optimal quantization solution, and presents BDQ that uses bidirectional diagonal transformations to reduce outlier impact, achieving under 1% drop at W4A4 on ...

  38. The two clocks and the innovation window: When and how generative models learn rules

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative models learn rules before memorizing data, creating an innovation window whose width depends on dataset size and rule complexity, observed in both diffusion and autoregressive architectures.

  39. Diversity in Large Language Models under Supervised Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 6.0

    TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.

  40. Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

    cs.CL 2026-04 unverdicted novelty 6.0

    A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

  41. Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

    cs.LG 2026-04 unverdicted novelty 6.0

    RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...

  42. mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    cs.RO 2025-12 unverdicted novelty 6.0

    mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.

  43. Progress Ratio Embeddings: An Impatience Signal for Robust Length Control in Neural Text Generation

    cs.CL 2025-12 unverdicted novelty 6.0

    Progress Ratio Embeddings use a trigonometric progress-ratio signal to deliver stable length control in transformers that generalizes to unseen target lengths.

  44. SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

    cs.CV 2025-10 unverdicted novelty 6.0

    SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.

  45. Should We Still Pretrain Encoders with Masked Language Modeling?

    cs.CL 2025-07 accept novelty 6.0

    Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improv...

  46. Flex Attention: A Programming Model for Generating Optimized Attention Kernels

    cs.LG 2024-12 unverdicted novelty 6.0

    FlexAttention supplies a compiler-driven interface that expresses common attention variants in a few lines of PyTorch and emits optimized kernels whose speed matches hand-written implementations.

  47. How Good is Your Wikipedia? Auditing Data Quality for Low-resource and Multilingual NLP

    cs.CL 2024-11 unverdicted novelty 6.0

    The study filters non-English Wikipedia, reveals quality problems, proposes a 4-level ranking, and shows filtered data matches or beats raw data in language modeling with largest gains for lower-quality editions.

  48. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    cs.AI 2024-08 unverdicted novelty 6.0

    A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

  49. Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models

    cs.AI 2024-08 conditional novelty 6.0

    Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.

  50. The Falcon Series of Open Language Models

    cs.CL 2023-11 conditional novelty 6.0

    Falcon-180B is a 180B-parameter open decoder-only model trained on 3.5 trillion tokens that approaches PaLM-2-Large performance at lower cost and is released with dataset extracts.

  51. Vision Transformers Need Registers

    cs.CV 2023-09 unverdicted novelty 6.0

    Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.

  52. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  53. Can AI-Generated Text be Reliably Detected?

    cs.CL 2023-03 unverdicted novelty 6.0

    Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between hum...

  54. SemDeDup: Data-efficient learning at web-scale through semantic deduplication

    cs.LG 2023-03 unverdicted novelty 6.0

    SemDeDup removes semantic duplicates from datasets like LAION using pre-trained embeddings, cutting data by 50% with minimal performance loss and efficiency gains on C4.

  55. The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

    cs.AI 2023-01 conditional novelty 6.0

    The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.

  56. Atlas: Few-shot Learning with Retrieval Augmented Language Models

    cs.CL 2022-08 unverdicted novelty 6.0

    Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

  57. Efficient Training of Language Models to Fill in the Middle

    cs.CL 2022-07 unverdicted novelty 6.0

    Autoregressive language models trained on data with middle spans relocated to the end learn infilling without degrading left-to-right perplexity or sampling quality.

  58. Language Models (Mostly) Know What They Know

    cs.CL 2022-07 unverdicted novelty 6.0

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  59. Scaling Autoregressive Models for Content-Rich Text-to-Image Generation

    cs.CV 2022-06 unverdicted novelty 6.0

    Scaling an autoregressive Transformer to 20B parameters for text-to-image generation using image token sequences achieves new SOTA zero-shot FID of 7.23 and fine-tuned FID of 3.22 on MS-COCO.

  60. CoCa: Contrastive Captioners are Image-Text Foundation Models

    cs.CV 2022-05 accept novelty 6.0

    CoCa unifies contrastive and generative pretraining in one image-text model to reach 86.3% zero-shot ImageNet accuracy and new state-of-the-art results on multiple downstream benchmarks.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 84 Pith papers · 22 internal anchors

  1. [1]

    Memory-efficient adaptive optimiza- tion for large-scale learning

    Rohan Anil, Vineet Gupta, Tomer Koren, and Yoram Singer. Memory-efficient adaptive optimization for large-scale learning.arXiv preprint arXiv:1901.11150,

  2. [2]

    Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges

    Naveen Arivazhagan, Ankur Bapna, Orhan Firat, Dmitry Lepikhin, Melvin Johnson, Maxim Krikun, Mia Xu Chen, Yuan Cao, George Foster, Colin Cherry, et al. Massively multi- lingual neural machine translation in the wild: Findings and challenges.arXiv preprint arXiv:1907.05019,

  3. [3]

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization.arXiv preprint arXiv:1607.06450,

  4. [4]

    Cloze-driven Pretraining of Self-attention Networks

    Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze- driven pretraining of self-attention networks.arXiv preprint arXiv:1903.07785,

  5. [5]

    Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,

    Ankur Bapna, Naveen Arivazhagan, and Orhan Firat. Simple, scalable adaptation for neural machine translation.arXiv preprint arXiv:1909.08478,

  6. [6]

    SciBERT: A pretrained language model for scientific text

    Iz Beltagy, Kyle Lo, and Arman Cohan. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),

  7. [7]

    Findings of the 2014 workshop on statistical machine translation

    Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, et al. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation,

  8. [8]

    Findings of the 2015 workshop on statistical machine translation

    Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, et al. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation,

  9. [9]

    Findings of the 2016 conference on machine translation

    Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, et al. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation,

  10. [10]

    Generating Sentences from a Continuous Space

    Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349,

  11. [11]

    SemEval-2017 Task 1: Semantic Textual Similarity - Multilingual and Cross-lingual Focused Evaluation

    Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. Semeval-2017 task 1: Semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055,

  12. [12]

    Long Short-Term Memory-Networks for Machine Reading

    Jianpeng Cheng, Li Dong, and Mirella Lapata. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733,

  13. [13]

    BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044,

  14. [14]

    ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

    Kevin Clark, Minh-Thang Luong, Quoc V Le, and Christopher D Manning. Electra: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555,

  15. [15]

    SentEval: An Evaluation Toolkit for Universal Sentence Representations

    Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. arXiv preprint arXiv:1803.05449,

  16. [16]

    Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

    Alexis Conneau, Douwe Kiela, Holger Schwenk, Loic Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. arXiv preprint arXiv:1705.02364,

  17. [17]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805,

  18. [18]

    Unified language model pre-training for natural language understanding and generation

    Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197,

  19. [19]

    Understanding Back-Translation at Scale

    Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381,

  20. [20]

    Learning Word Vectors for 157 Languages

    Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. Learning word vectors for 157 languages. arXiv preprint arXiv:1802.06893,

  21. [21]

    Generating Sequences With Recurrent Neural Networks

    Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850,

  22. [22]

    Rethinking ImageNet Pre-training

    Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking ImageNet pre-training. arXiv preprint arXiv:1811.08883,

  23. [23]

    A hybrid neural network model for commonsense reasoning

    Pengcheng He, Xiaodong Liu, Weizhu Chen, and Jianfeng Gao. A hybrid neural network model for commonsense reasoning. arXiv preprint arXiv:1907.11983,

  24. [24]

    Deep Learning Scaling is Predictable, Empirically

    Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md. Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409,

  25. [25]

    Learning distributed representations of sentences from unlabelled data

    Felix Hill, Kyunghyun Cho, and Anna Korhonen. Learning distributed representations of sentences from unlabelled data. arXiv preprint arXiv:1602.03483,

  26. [26]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531,

  27. [27]

    Parameter-Efficient Transfer Learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. arXiv preprint arXiv:1902.00751,

  28. [28]

    Universal Language Model Fine-tuning for Text Classification

    Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146,

  29. [29]

    GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

    Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: Generating music with long-term structure. In Seventh International Conference on Learning Representations, 2018a. Yanping ...

  30. [30]

    TinyBERT: Distilling BERT for natural language understanding

    Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. TinyBERT: Distilling BERT for natural language understanding. arXiv preprint arXiv:1909.10351,

  31. [31]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551,

  32. [32]

    SpanBERT: Improving pre-training by representing and predicting spans

    Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, and Omer Levy. SpanBERT: Improving pre-training by representing and predicting spans. arXiv preprint arXiv:1907.10529,

  33. [33]

    Exploring the Limits of Language Modeling

    Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410,

  34. [34]

    CTRL: A Conditional Transformer Language Model for Controllable Generation

    Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019a. Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. Unifying question answering and text classification via span extraction. arXiv preprint ...

  35. [35]

    A surprisingly robust trick for Winograd schema challenge

    Vid Kocijan, Ana-Maria Cretu, Oana-Maria Camburu, Yordan Yordanov, and Thomas Lukasiewicz. A surprisingly robust trick for Winograd schema challenge. arXiv preprint arXiv:1905.06290,

  36. [36]

    Federated Optimization: Distributed Optimization Beyond the Datacenter

    Jakub Konečný, Brendan McMahan, and Daniel Ramage. Federated optimization: Distributed optimization beyond the datacenter. arXiv preprint arXiv:1511.03575,

  37. [37]

    Federated Learning: Strategies for Improving Communication Efficiency

    Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492,

  38. [38]

    Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better ImageNet models transfer better? arXiv preprint arXiv:1805.08974,

  39. [39]

    One weird trick for parallelizing convolutional neural networks

    Alex Krizhevsky. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997,

  40. [40]

    Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates

    Taku Kudo. Subword regularization: Improving neural network translation models with multiple subword candidates. arXiv preprint arXiv:1804.10959,

  41. [41]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226,

  42. [42]

    Cross-lingual Language Model Pretraining

    Guillaume Lample and Alexis Conneau. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291,

  43. [43]

    ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

    Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942,

  44. [44]

    Generating Wikipedia by Summarizing Long Sequences

    Peter J. Liu, Mohammad Saleh, Etienne Pot, Ben Goodrich, Ryan Sepassi, Lukasz Kaiser, and Noam Shazeer. Generating Wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198,

  45. [45]

    SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders

    Peter J. Liu, Yu-An Chung, and Jie Ren. SummAE: Zero-shot abstractive text summarization using length-agnostic auto-encoders. arXiv preprint arXiv:1910.00998, 2019a. Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. Representation learning using multi-task deep neural networks for se...

  46. [46]

    Multi-Task Deep Neural Networks for Natural Language Understanding

    Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504, 2019b. Yang Liu. Fine-tune BERT for extractive summarization. arXiv preprint arXiv:1903.10318,

  47. [47]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692, 2019c. Lajanugen Logeswaran and Honglak Lee. An efficient framework for learning sentence representations. arXiv preprint arXiv:1803.02893,

  48. [48]

    The Natural Language Decathlon: Multitask Learning as Question Answering

    Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730,

  49. [49]

    Efficient Estimation of Word Representations in Vector Space

    Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013a. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing system...

  50. [50]

    A Deep Reinforced Model for Abstractive Summarization

    Romain Paulus, Caiming Xiong, and Richard Socher. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304,

  51. [51]

    GloVe: Global vectors for word representation

    Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP),

  52. [52]

    Matthew Peters, Sebastian Ruder, and Noah A. Smith. To tune or not to tune? Adapting pretrained representations to diverse tasks. arXiv preprint arXiv:1903.05987,

  53. [53]

    Deep contextualized word representations

    Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. arXiv preprint arXiv:1802.05365,

  54. [54]

    Jason Phang, Thibault Févry, and Samuel R. Bowman. Sentence encoders on STILTs: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088,

  55. [55]

    WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations

    Mohammad Taher Pilehvar and Jose Camacho-Collados. WiC: 10,000 example pairs for evaluating context-sensitive representations. arXiv preprint arXiv:1808.09121,

  56. [56]

    A Call for Clarity in Reporting BLEU Scores

    Matt Post. A call for clarity in reporting BLEU scores. arXiv preprint arXiv:1804.08771,

  57. [57]

    Resolving complex cases of definite pronouns: the Winograd schema challenge

    Altaf Rahman and Vincent Ng. Resolving complex cases of definite pronouns: the Winograd schema challenge. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics,

  58. [58]

    SQuAD: 100,000+ Questions for Machine Comprehension of Text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250,

  59. [59]

    Unsupervised Pretraining for Sequence to Sequence Learning

    Prajit Ramachandran, Peter J. Liu, and Quoc V. Le. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683,

  60. [60]

    An Overview of Multi-Task Learning in Deep Neural Networks

    Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098,

  61. [61]

    Transfer learning in natural language processing

    Sebastian Ruder, Matthew E. Peters, Swabha Swayamdipta, and Thomas Wolf. Transfer learning in natural language processing. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pages 15–18,

  62. [62]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108,

  63. [63]

    Get To The Point: Summarization with Pointer-Generator Networks

    Abigail See, Peter J. Liu, and Christopher D. Manning. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368,

  64. [64]

    Neural Machine Translation of Rare Words with Subword Units

    Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909,

  65. [65]

    Christopher J Shallue, Jaehoon Lee, Joe Antognini, Jascha Sohl-Dickstein, Roy Frostig, and George E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600,

  66. [66]

    Self-Attention with Relative Position Representations

    Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155,

  67. [67]

    Adafactor: Adaptive Learning Rates with Sublinear Memory Cost

    Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv preprint arXiv:1804.04235,

  68. [68]

    Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538,

  69. [69]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing,

  70. [70]

    MASS: Masked Sequence to Sequence Pre-training for Language Generation

    Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. MASS: Masked sequence to sequence pre-training for language generation. arXiv preprint arXiv:1905.02450,

  71. [71]

    Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. Learning general purpose distributed sentence representations via large scale multi-task learning. arXiv preprint arXiv:1804.00079,

  72. [72]

    A simple method for commonsense reasoning

    Trieu H. Trinh and Quoc V. Le. A simple method for commonsense reasoning. arXiv preprint arXiv:1806.02847,

  73. [73]

    NewsQA: A Machine Comprehension Dataset

    Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. NewsQA: A machine comprehension dataset. arXiv preprint arXiv:1611.09830,

  74. [74]

    The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives

    Elena Voita, Rico Sennrich, and Ivan Titov. The bottom-up evolution of representations in the transformer: A study with machine translation and language modeling objectives. arXiv preprint arXiv:1909.01380,

  75. [75]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461,

  76. [76]

    SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

    Alex Wang, Jan Hula, Patrick Xia, Raghavendra Pappagari, R. Thomas McCoy, Roma Patel, Najoung Kim, Ian Tenney, Yinghui Huang, Katherin Yu, et al. Can you tell me how to get past Sesame Street? Sentence-level pretraining beyond language modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019a. Alex Wang, Y...

  77. [77]

    Adina Williams, Nikita Nangia, and Samuel R. Bowman. A broad-coverage challenge corpus for sentence understanding through inference. arXiv preprint arXiv:1704.05426,

  78. [78]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144,

  79. [79]

    Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. arXiv preprint arXiv:1906.08237,

  80. [80]

    Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. QANet: Combining local convolution with global self-attention for reading comprehension. arXiv preprint arXiv:1804.09541,

Showing first 80 references.