arxiv: 2104.08773 · v4 · pith:QDJYZSFMnew · submitted 2021-04-18 · 💻 cs.CL · cs.AI · cs.CV · cs.LG

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, Hannaneh Hajishirzi

Pith reviewed 2026-05-18 01:53 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG

keywords cross-task generalization · natural language instructions · instruction following · language models · generalization to unseen tasks · meta learning · crowdsourced instructions · NLP tasks

The pith

Language models improve at new tasks by 19 percent when given human-written instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a collection of 61 different tasks drawn from existing NLP datasets, each paired with the original crowdsourced instructions that explain how to solve it. Models are trained on some of these tasks while seeing the instructions, then tested on the remaining tasks they have never encountered. The results show that including the instructions during training leads to better performance on the new tasks compared to training without them.

Core claim

Generative pre-trained language models that receive task instructions along with input data generate better outputs on tasks not seen during training, achieving a 19% improvement in generalization performance over models without access to instructions.
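
One way to read the 19% figure, assuming it is a relative gain over the no-instruction baseline (as the referee summary on this page describes it), is the ratio below. The symbols s_instr and s_base are introduced here for illustration only and stand for the unseen-task scores with and without instructions; they are not notation from the paper.

```latex
\[
\Delta_{\mathrm{rel}} \;=\; \frac{s_{\mathrm{instr}} - s_{\mathrm{base}}}{s_{\mathrm{base}}},
\qquad
\Delta_{\mathrm{rel}} = 0.19 \;\Longleftrightarrow\; s_{\mathrm{instr}} = 1.19\, s_{\mathrm{base}} .
\]
```

Under this reading the size of the gain in absolute points depends on the baseline score, which the abstract does not state.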

What carries the argument

A meta-dataset of 61 tasks with their human-authored instructions mapped to a unified schema, which is used to train models that encode the instruction text together with the input to produce the task output.
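
The encoding step described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `TaskInstance` fields, the `encode` helper, and the "Instruction:/Input:/Output:" template are all assumptions made here, and the paper's unified schema is richer than a single instruction string.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    # Hypothetical, simplified record; not the paper's actual schema.
    instruction: str
    input_text: str
    output_text: str


def encode(instance: TaskInstance, use_instruction: bool = True) -> str:
    """Build the source string a text-to-text model would read.

    With use_instruction=False this gives the instruction-free baseline input.
    """
    parts = []
    if use_instruction:
        parts.append(f"Instruction: {instance.instruction.strip()}")
    parts.append(f"Input: {instance.input_text.strip()}")
    parts.append("Output:")
    return "\n".join(parts)


# Made-up example instance, for illustration only.
example = TaskInstance(
    instruction="Answer the question using a span from the passage.",
    input_text="Passage: The sky is blue. Question: What color is the sky?",
    output_text="blue",
)
with_instr = encode(example)
without_instr = encode(example, use_instruction=False)
```

The model would be trained to generate `output_text` from either string; comparing the two conditions on held-out tasks is the contrast the paper reports.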

If this is right

Cross-task generalization can be directly measured by training on a subset of tasks and evaluating on completely held-out tasks.
Human-authored instructions provide a way to define tasks that transfers across different problems.
Models benefit from instructions specifically in the setting of unseen tasks rather than just seen ones.
Significant room remains for progress since current models fall short of an estimated upper bound performance.
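
The seen/unseen protocol in the first point can be sketched as a split over task names rather than over instances. This is an illustrative sketch only: the `split_tasks` helper, the placeholder task names, and the choice of 12 held-out tasks are assumptions made here, not the split used in the paper.

```python
import random


def split_tasks(task_names, n_unseen, seed=0):
    """Hold out whole tasks so no instance of an unseen task is trained on.

    Returns (seen, unseen) lists of task names.
    """
    if not 0 < n_unseen < len(task_names):
        raise ValueError("n_unseen must be between 1 and len(task_names) - 1")
    rng = random.Random(seed)
    shuffled = sorted(task_names)  # sort first so the split is reproducible
    rng.shuffle(shuffled)
    return shuffled[n_unseen:], shuffled[:n_unseen]


# Placeholder names for the 61 tasks; 12 held-out tasks is an arbitrary choice.
tasks = [f"task_{i:02d}" for i in range(61)]
seen, unseen = split_tasks(tasks, n_unseen=12)
```

A random split like this does not control for task families; a split that holds out whole categories would be a stricter test of the same idea.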

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such models could eventually acquire capabilities in new domains by reading descriptions instead of needing large amounts of labeled examples for each.
This points toward training approaches that treat instructions as a general interface for task specification.
Extending the set of tasks beyond the current 61 could reveal whether the gains hold for more diverse or real-world problems.

Load-bearing premise

The selected tasks are diverse and non-overlapping so that success on unseen ones reflects genuine understanding of the instructions rather than shared surface features or data overlap.

What would settle it

If replacing the actual instructions with irrelevant or shuffled text still produces the same performance gains on held-out tasks, the benefit would not be attributable to instruction understanding.
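
A control of this kind could be set up along the following lines. This is a sketch under stated assumptions: the `mismatch_instructions` helper and the three task names are hypothetical, and rotating instructions across tasks is just one simple way to guarantee that every task receives some other task's instruction.

```python
import random


def mismatch_instructions(instructions, seed=0):
    """Map each task name to a different task's instruction text.

    Rotating the sorted task list by a non-zero offset ensures that no task
    keeps its own instruction. Requires at least two tasks.
    """
    names = sorted(instructions)
    if len(names) < 2:
        raise ValueError("need at least two tasks to mismatch instructions")
    rng = random.Random(seed)
    offset = rng.randrange(1, len(names))
    return {
        name: instructions[names[(i + offset) % len(names)]]
        for i, name in enumerate(names)
    }


# Made-up instructions, for illustration only.
instructions = {
    "qa": "Answer the question.",
    "cls": "Pick a label.",
    "gen": "Write a question.",
}
control = mismatch_instructions(instructions, seed=3)
```

If a model given the `control` instructions scored about as well on held-out tasks as one given the real ones, the gain could not be credited to reading the instruction content.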

Original abstract

Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. Despite the success of the conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. Using this meta-dataset, we measure cross-task generalization by training models on seen tasks and measuring generalization to the remaining unseen ones. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models benefit from instructions when evaluated in terms of generalization to unseen tasks (19% better for models utilizing instructions). These models, however, are far behind an estimated performance upperbound indicating significant room for more progress in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

A private letter to a colleague.

The paper builds a new meta-dataset of 61 tasks with their original instructions and shows a 19% gain on unseen tasks when instructions are provided, but the evidence for why that gain occurs is still thin.

The main takeaway is that this work assembles NATURAL INSTRUCTIONS, a collection of 61 existing NLP tasks together with the human-written instructions that were used to create them, then tests whether models can use those instructions to handle tasks they never saw during training. They report that adding the instruction text improves performance on the held-out tasks by about 19% relative to the no-instruction baseline. That directional result is the clearest contribution here.

The dataset itself is new in its scale and in the way it re-uses real crowdsourcing instructions rather than writing fresh ones, and the seen-to-unseen split is a straightforward way to measure cross-task generalization that goes beyond standard multi-task setups. Those pieces give the paper a concrete resource that others can use to study instruction following.

The experiments are still preliminary. The abstract gives no model sizes, training details, or significance numbers, and there is no reported check on how similar the unseen tasks are to the training ones in category or instruction phrasing. If many held-out tasks fall into the same broad families with overlapping wording, the measured gain could come from surface pattern matching rather than from learning to follow instructions in general. That is the main soft spot right now.

This paper is for researchers working on making language models more flexible across tasks without task-specific labels. The dataset and the basic protocol are worth having in the literature even if later work needs to tighten the controls.

I would send it to peer review. The resource is fresh and the question is worth referee attention, provided the authors add the missing experimental details and some analysis of task overlap.

Referee Report

2 major / 2 minor

Summary. The paper introduces NATURAL INSTRUCTIONS, a meta-dataset of 61 distinct NLP tasks with crowdsourced human-authored instructions and 193k input-output instances. Models are trained on a subset of seen tasks and evaluated on held-out unseen tasks; the central claim is that generative pre-trained language models that encode the task instructions along with the input achieve 19% relative improvement in generalization to unseen tasks compared to instruction-free baselines.

Significance. If the result holds after addressing the evaluation concerns, the work would demonstrate that natural language instructions can support cross-task generalization beyond what is possible with standard supervised learning on individual datasets. The release of the dataset and the unified schema for instructions would be a concrete resource for future research on instruction-following models. The empirical comparison of instruction-aware versus instruction-free conditions on explicitly held-out tasks is a strength when properly controlled.

major comments (2)

[Evaluation] Evaluation setup: the claim that performance gains on held-out tasks can be attributed to instruction understanding rather than shared surface patterns requires evidence that the 61 tasks have no substantial overlap in category, output format, or instruction phrasing between train and test splits. No quantitative check (e.g., instruction embedding similarity or category-level leakage analysis) is reported, which is load-bearing for the central generalization claim.
[Results] Results section: the abstract states a 19% relative gain but supplies no details on exact model sizes, training hyperparameters, number of runs, statistical significance tests, or task similarity controls. This leaves the magnitude and robustness of the reported improvement difficult to assess from the provided evidence.
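
The overlap check requested in the first major comment could start from something as simple as the sketch below. It is a lexical stand-in only: the referee suggests instruction embedding similarity, whereas this uses Jaccard overlap of word sets so that it runs without any model. The helper names and the example instructions are assumptions made for illustration.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of the token sets of two instruction strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def nearest_seen(unseen_instruction, seen_instructions):
    """Return (task_name, similarity) for the most similar seen instruction."""
    return max(
        ((name, jaccard(unseen_instruction, text))
         for name, text in seen_instructions.items()),
        key=lambda pair: pair[1],
    )


# Made-up instructions, for illustration only.
seen_instructions = {
    "qa": "Answer the question using the passage.",
    "cls": "Choose the correct sentiment label.",
}
name, similarity = nearest_seen("Answer the question with a number.", seen_instructions)
```

Reporting the nearest-seen similarity for every held-out task, alongside the per-task gain, would show whether the improvement concentrates on tasks whose instructions closely resemble training ones.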

minor comments (2)

[Abstract] The estimated performance upper bound is mentioned in the abstract but not defined or computed in the main text; adding a brief description of how it is obtained would improve clarity.
[Dataset] Notation for the unified schema of instructions could be introduced earlier with an example to help readers follow the dataset construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to improve the clarity and rigor of the evaluation and results sections.

Referee: [Evaluation] Evaluation setup: the claim that performance gains on held-out tasks can be attributed to instruction understanding rather than shared surface patterns requires evidence that the 61 tasks have no substantial overlap in category, output format, or instruction phrasing between train and test splits. No quantitative check (e.g., instruction embedding similarity or category-level leakage analysis) is reported, which is load-bearing for the central generalization claim.

Authors: We agree that explicit evidence of limited overlap between seen and unseen tasks is important to support the attribution of gains to instruction understanding. The 61 tasks were selected from distinct existing NLP datasets with different objectives, output spaces, and instruction styles, which provides a natural separation. To strengthen this, we will add a quantitative analysis in the revised manuscript: we will compute average cosine similarities between instruction embeddings (using a sentence encoder) across train/test splits and include a category-level breakdown showing minimal leakage in task types or formats. This addition will directly address the concern. revision: yes

Referee: [Results] Results section: the abstract states a 19% relative gain but supplies no details on exact model sizes, training hyperparameters, number of runs, statistical significance tests, or task similarity controls. This leaves the magnitude and robustness of the reported improvement difficult to assess from the provided evidence.

Authors: We acknowledge that the abstract and results would benefit from more complete reporting. The manuscript already specifies the use of generative pre-trained models, but we will expand the results section and add an appendix detailing exact model sizes (T5 variants), training hyperparameters, number of runs with different random seeds, and statistical significance tests (e.g., paired t-tests across runs). The task similarity controls will be covered by the new analysis described in our response to the evaluation comment. These changes will make the reported 19% relative improvement easier to evaluate. revision: yes
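
The paired test mentioned in this response could be computed roughly as below. This is a sketch, not the authors' analysis: the `paired_t_statistic` helper is hypothetical, it returns only the t statistic and degrees of freedom (a p-value would need a t-distribution table or a statistics library), and it assumes each position in the two lists comes from the same random seed.

```python
import math
import statistics


def paired_t_statistic(with_instr, without_instr):
    """Paired t statistic and degrees of freedom for matched runs.

    Each position in the two lists must come from the same seed (or the
    same task), so the differences are meaningful.
    """
    if len(with_instr) != len(without_instr) or len(with_instr) < 2:
        raise ValueError("need two equally long lists with at least two runs")
    diffs = [a - b for a, b in zip(with_instr, without_instr)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

With only a handful of seeds the test has little power, so reporting the per-seed differences themselves alongside the statistic would be more informative.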

Circularity Check

0 steps flagged

No circularity: empirical cross-task evaluation is self-contained

The paper constructs a new meta-dataset of 61 tasks with crowdsourced instructions and reports an empirical result: models trained on seen tasks generalize better to held-out unseen tasks when instructions are provided (19% improvement). This is measured directly via standard train/test splits on task instances, with no equations, fitted parameters, or derivations that reduce the reported gain to a quantity defined by the same data or by self-citation. The central claim rests on the experimental contrast between instruction-aware and instruction-free conditions on explicitly held-out tasks rather than any definitional equivalence or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical performance difference being caused by instruction comprehension and on the assumption that the collected tasks form a representative sample of NLP problems with minimal unintended overlap.

axioms (2)

Domain assumption: Pre-trained generative language models can meaningfully incorporate task instructions to improve output on held-out tasks.
The modeling approach assumes current large language models possess sufficient capacity to use the provided instructions.

Domain assumption: The 61 tasks are sufficiently independent that training on a subset yields no direct knowledge of the held-out tasks.
Required for the unseen-task evaluation to measure genuine generalization.

pith-pipeline@v0.9.0 · 5763 in / 1399 out tokens · 53367 ms · 2026-05-18T01:53:07.442492+00:00 · methodology

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
cs.CL 2022-02 accept novelty 8.0

Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
cs.CL 2025-12 conditional novelty 7.0

PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...

Self-Rewarding Language Models
cs.CL 2024-01 conditional novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

LIMA: Less Is More for Alignment
cs.CL 2023-05 conditional novelty 7.0

Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.

VideoChat: Chat-Centric Video Understanding
cs.CV 2023-05 conditional novelty 7.0

VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

OPT: Open Pre-trained Transformer Language Models
cs.CL 2022-05 unverdicted novelty 7.0

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Multitask Prompted Training Enables Zero-Shot Task Generalization
cs.LG 2021-10 conditional novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL 2024-06 conditional novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

Lessons from the Trenches on Reproducible Evaluation of Language Models
cs.CL 2024-05 accept novelty 6.0

The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
cs.SE 2024-03 unverdicted novelty 6.0

LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
cs.CL 2024-02 conditional novelty 6.0

DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.

Simple synthetic data reduces sycophancy in large language models
cs.CL 2023-08 unverdicted novelty 6.0

Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

PandaGPT: One Model To Instruction-Follow Them All
cs.CL 2023-05 conditional novelty 6.0

A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.

ART: Automatic multi-step reasoning and tool-use for large language models
cs.CL 2023-03 unverdicted novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
