hub

The Natural Language Decathlon: Multitask Learning as Question Answering

Bryan McCann, Nitish Shirish Keskar, Caiming Xiong, Richard Socher · 2018 · cs.CL · arXiv 1806.08730

17 Pith papers cite this work. Polarity classification is still indexing.

17 Pith papers citing it

open full Pith review browse 17 citing papers arXiv PDF

abstract

Deep learning has improved performance on many natural language processing (NLP) tasks individually. However, general NLP models cannot emerge within a paradigm that focuses on the particularities of a single metric, dataset, and task. We introduce the Natural Language Decathlon (decaNLP), a challenge that spans ten tasks: question answering, machine translation, summarization, natural language inference, sentiment analysis, semantic role labeling, zero-shot relation extraction, goal-oriented dialogue, semantic parsing, and commonsense pronoun resolution. We cast all tasks as question answering over a context. Furthermore, we present a new Multitask Question Answering Network (MQAN) jointly learns all tasks in decaNLP without any task-specific modules or parameters in the multitask setting. MQAN shows improvements in transfer learning for machine translation and named entity recognition, domain adaptation for sentiment analysis and natural language inference, and zero-shot capabilities for text classification. We demonstrate that the MQAN's multi-pointer-generator decoder is key to this success and performance further improves with an anti-curriculum training strategy. Though designed for decaNLP, MQAN also achieves state of the art results on the WikiSQL semantic parsing task in the single-task setting. We also release code for procuring and processing data, training and evaluating models, and reproducing all experiments for decaNLP.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

cs.CL · 2023-10-05 · conditional · novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

Holistic Evaluation of Language Models

cs.CL · 2022-11-16 · accept · novelty 7.0

HELM establishes a multi-metric evaluation covering 30 language models on 42 scenarios (16 core) to raise average scenario coverage from 17.9% to 96% under uniform conditions while releasing all prompts, completions, and a toolkit.

Multitask Prompted Training Enables Zero-Shot Task Generalization

cs.LG · 2021-10-15 · conditional · novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

cs.CL · 2021-04-18 · conditional · novelty 7.0

Presents the NATURAL INSTRUCTIONS meta-dataset and shows generative pre-trained language models achieve 19% better generalization to unseen tasks when using task instructions.

Learning to summarize from human feedback

cs.CL · 2020-09-02 · conditional · novelty 7.0

Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

cs.LG · 2019-10-23 · unverdicted · novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

cs.CL · 2018-04-20 · unverdicted · novelty 7.0

GLUE is a multi-task benchmark for general natural language understanding that includes a diagnostic test suite and finds limited gains from current multi-task learning methods over single-task training.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

cs.CL · 2023-04-13 · accept · novelty 6.0

AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.

The Flan Collection: Designing Data and Methods for Effective Instruction Tuning

cs.AI · 2023-01-31 · conditional · novelty 6.0

The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.

CTRL: A Conditional Transformer Language Model for Controllable Generation

cs.CL · 2019-09-11 · unverdicted · novelty 6.0

CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.

SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

cs.CL · 2019-05-02 · accept · novelty 6.0

SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.

Emergent Abilities of Large Language Models

cs.CL · 2022-06-15 · unverdicted · novelty 6.0

Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

A Survey of Large Language Models

cs.CL · 2023-03-31 · accept · novelty 3.0

This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.

High-Dimensional Statistics: Reflections on Progress and Open Problems

math.ST · 2026-05-06 · unverdicted · novelty 2.0 · 2 refs

This review synthesizes representative advances in high-dimensional statistics, highlights common themes and open problems, and points to key entry works.

citing papers explorer

Showing 17 of 17 citing papers.

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines cs.CL · 2023-10-05 · conditional · none · ref 34
DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Language Models are Few-Shot Learners cs.CL · 2020-05-28 · accept · none · ref 52
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Holistic Evaluation of Language Models cs.CL · 2022-11-16 · accept · none · ref 7 · internal anchor
HELM establishes a multi-metric evaluation covering 30 language models on 42 scenarios (16 core) to raise average scenario coverage from 17.9% to 96% under uniform conditions while releasing all prompts, completions, and a toolkit.
Multitask Prompted Training Enables Zero-Shot Task Generalization cs.LG · 2021-10-15 · conditional · none · ref 35 · internal anchor
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
Cross-Task Generalization via Natural Language Crowdsourcing Instructions cs.CL · 2021-04-18 · conditional · none · ref 13 · internal anchor
Presents the NATURAL INSTRUCTIONS meta-dataset and shows generative pre-trained language models achieve 19% better generalization to unseen tasks when using task instructions.
Learning to summarize from human feedback cs.CL · 2020-09-02 · conditional · none · ref 42 · internal anchor
Reinforcement learning on a reward model trained from human summary comparisons produces summaries humans prefer over supervised fine-tuning or human references on TL;DR and transfers to CNN/DM.
OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 23
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer cs.LG · 2019-10-23 · unverdicted · none · ref 48
T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colossal Clean Crawled Corpus.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding cs.CL · 2018-04-20 · unverdicted · none · ref 29
GLUE is a multi-task benchmark for general natural language understanding that includes a diagnostic test suite and finds limited gains from current multi-task learning methods over single-task training.
Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 72 · internal anchor
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models cs.CL · 2023-04-13 · accept · none · ref 65 · internal anchor
AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning cs.AI · 2023-01-31 · conditional · none · ref 36 · internal anchor
The Flan Collection demonstrates that task balancing, data enrichment, and mixed prompt training are critical to effective instruction tuning, yielding stronger Flan-T5 models released publicly.
CTRL: A Conditional Transformer Language Model for Controllable Generation cs.CL · 2019-09-11 · unverdicted · none · ref 31 · internal anchor
CTRL is a large conditional transformer language model that uses naturally occurring control codes to steer text generation style and content.
SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems cs.CL · 2019-05-02 · accept · none · ref 123 · internal anchor
SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.
Emergent Abilities of Large Language Models cs.CL · 2022-06-15 · unverdicted · none · ref 55
Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
A Survey of Large Language Models cs.CL · 2023-03-31 · accept · none · ref 125
This survey reviews the background, key techniques, and evaluation methods for large language models, emphasizing emergent abilities that appear at large scales.
High-Dimensional Statistics: Reflections on Progress and Open Problems math.ST · 2026-05-06 · unverdicted · none · ref 66 · 2 links · internal anchor
This review synthesizes representative advances in high-dimensional statistics, highlights common themes and open problems, and points to key entry works.

The Natural Language Decathlon: Multitask Learning as Question Answering

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer