pith. machine review for the scientific record.

arxiv: 2104.08773 · v4 · pith:QDJYZSFMnew · submitted 2021-04-18 · 💻 cs.CL · cs.AI · cs.CV · cs.LG

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

Pith reviewed 2026-05-18 01:53 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV · cs.LG
keywords cross-task generalization · natural language instructions · instruction following · language models · generalization to unseen tasks · meta learning · crowdsourced instructions · NLP tasks

The pith

Language models generalize 19% better to unseen tasks when they are given human-written instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper creates a collection of 61 different tasks drawn from existing NLP datasets, each paired with the original crowdsourced instructions that explain how to solve it. Models are trained on some of these tasks while seeing the instructions, then tested on the remaining tasks they have never encountered. The results show that including the instructions during training leads to better performance on the new tasks compared to training without them.

Core claim

Generative pre-trained language models that receive task instructions along with input data generate better outputs on tasks not seen during training, achieving a 19% improvement in generalization performance over models without access to instructions.

What carries the argument

A meta-dataset of 61 tasks with their human-authored instructions mapped to a unified schema, which is used to train models that encode the instruction text together with the input to produce the task output.
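
The encoding step described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's schema or prompt template: the field names (`definition`, `positive_examples`) and the helpers `build_prompt` and `build_baseline_prompt` are hypothetical, chosen to show how instruction text and an instance input might be joined into one string for a generative model, alongside the instruction-free condition it is compared against.

```python
from dataclasses import dataclass, field


@dataclass
class Instruction:
    # Hypothetical fields; the paper's unified schema may differ.
    definition: str
    positive_examples: list = field(default_factory=list)  # (input, output) pairs


def build_prompt(instruction: Instruction, instance_input: str) -> str:
    """Join instruction text and the instance input into one model input."""
    parts = [f"Definition: {instruction.definition}"]
    for ex_in, ex_out in instruction.positive_examples:
        parts.append(f"Example input: {ex_in}\nExample output: {ex_out}")
    parts.append(f"Input: {instance_input}\nOutput:")
    return "\n\n".join(parts)


def build_baseline_prompt(instance_input: str) -> str:
    """Instruction-free condition: the model sees only the instance input."""
    return f"Input: {instance_input}\nOutput:"


# Toy instruction and instance, invented for illustration.
instr = Instruction(
    definition="Write a question whose answer is a span of the passage.",
    positive_examples=[("The sky is blue.", "What color is the sky?")],
)
prompt = build_prompt(instr, "Water boils at 100 degrees Celsius.")
print(prompt)
```

Keeping the two prompt builders side by side makes the experimental contrast explicit: the only difference between conditions is whether the instruction text precedes the input.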

If this is right

  • Cross-task generalization can be directly measured by training on a subset of tasks and evaluating on completely held-out tasks.
  • Human-authored instructions provide a way to define tasks that transfers across different problems.
  • Models benefit from instructions specifically in the setting of unseen tasks rather than just seen ones.
  • Significant room remains for progress since current models fall short of an estimated upper bound performance.
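
The first point, measuring generalization on completely held-out tasks, amounts to splitting at the task level rather than the instance level. The sketch below assumes hypothetical task identifiers and an arbitrary choice of 12 held-out tasks; the paper's actual split may be constructed differently (for example, by task category).

```python
import random


def split_tasks(task_names, n_unseen, seed=0):
    """Split at the task level so no held-out task contributes training instances."""
    rng = random.Random(seed)
    shuffled = sorted(task_names)
    rng.shuffle(shuffled)
    unseen = set(shuffled[:n_unseen])
    seen = set(shuffled[n_unseen:])
    return seen, unseen


# Hypothetical task identifiers; 61 matches the task count in the paper,
# but holding out 12 tasks is an arbitrary choice for illustration.
tasks = [f"task_{i:02d}" for i in range(61)]
seen, unseen = split_tasks(tasks, n_unseen=12)

# Three placeholder instances per task, then route each by its task's split.
instances = [{"task": t, "input": "...", "output": "..."} for t in tasks for _ in range(3)]
train = [x for x in instances if x["task"] in seen]
test = [x for x in instances if x["task"] in unseen]
print(len(seen), len(unseen), len(train), len(test))  # 49 12 147 36
```

Because instances are routed by task membership, no instance of a held-out task can leak into training, which is the property the unseen-task evaluation depends on.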

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such models could eventually acquire capabilities in new domains by reading descriptions instead of needing large amounts of labeled examples for each.
  • This points toward training approaches that treat instructions as a general interface for task specification.
  • Extending the set of tasks beyond the current 61 could reveal whether the gains hold for more diverse or real-world problems.

Load-bearing premise

The selected tasks are diverse and non-overlapping so that success on unseen ones reflects genuine understanding of the instructions rather than shared surface features or data overlap.

What would settle it

If replacing the actual instructions with irrelevant or shuffled text still produces the same performance gains on held-out tasks, the benefit would not be attributable to instruction understanding.
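
One way to build the shuffled-instruction control described above is to reassign instructions across tasks so that no task keeps its own. The following is a sketch of that idea only; the function name `shuffle_instructions` and the toy task names are hypothetical, and the paper does not report this experiment.

```python
import random


def shuffle_instructions(task_to_instruction, seed=0):
    """Reassign instructions so that no task keeps its own (a derangement)."""
    tasks = sorted(task_to_instruction)
    if len(tasks) < 2:
        raise ValueError("need at least two tasks to mismatch instructions")
    rng = random.Random(seed)
    order = tasks[:]
    rng.shuffle(order)
    # Rotating the shuffled order by one gives every task another task's text.
    donors = order[1:] + order[:1]
    return {t: task_to_instruction[d] for t, d in zip(order, donors)}


# Toy instructions, invented for illustration.
original = {
    "question_generation": "Write a question about the passage.",
    "answer_generation": "Answer the question using the passage.",
    "classification": "Label the sentence as positive or negative.",
}
control = shuffle_instructions(original)
for task, text in sorted(control.items()):
    print(task, "->", text)
```

Training and evaluating with `control` in place of `original` would then show whether the gain survives mismatched instructions. The mismatch guarantee holds at the task level; if two tasks happened to share identical instruction text, the swapped text could still match.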

Original abstract

Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. Despite the success of the conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. Using this meta-dataset, we measure cross-task generalization by training models on seen tasks and measuring generalization to the remaining unseen ones. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models benefit from instructions when evaluated in terms of generalization to unseen tasks (19% better for models utilizing instructions). These models, however, are far behind an estimated performance upperbound indicating significant room for more progress in this direction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces NATURAL INSTRUCTIONS, a meta-dataset of 61 distinct NLP tasks with crowdsourced human-authored instructions and 193k input-output instances. Models are trained on a subset of seen tasks and evaluated on held-out unseen tasks; the central claim is that generative pre-trained language models that encode the task instructions along with the input achieve 19% relative improvement in generalization to unseen tasks compared to instruction-free baselines.

Significance. If the result holds after addressing the evaluation concerns, the work would demonstrate that natural language instructions can support cross-task generalization beyond what is possible with standard supervised learning on individual datasets. The release of the dataset and the unified schema for instructions would be a concrete resource for future research on instruction-following models. The empirical comparison of instruction-aware versus instruction-free conditions on explicitly held-out tasks is a strength when properly controlled.

major comments (2)
  1. [Evaluation] Evaluation setup: the claim that performance gains on held-out tasks can be attributed to instruction understanding rather than shared surface patterns requires evidence that the 61 tasks have no substantial overlap in category, output format, or instruction phrasing between train and test splits. No quantitative check (e.g., instruction embedding similarity or category-level leakage analysis) is reported, which is load-bearing for the central generalization claim.
  2. [Results] Results section: the abstract states a 19% relative gain but supplies no details on exact model sizes, training hyperparameters, number of runs, statistical significance tests, or task similarity controls. This leaves the magnitude and robustness of the reported improvement difficult to assess from the provided evidence.
minor comments (2)
  1. [Abstract] The estimated performance upper bound is mentioned in the abstract but not defined or computed in the main text; adding a brief description of how it is obtained would improve clarity.
  2. [Dataset] Notation for the unified schema of instructions could be introduced earlier with an example to help readers follow the dataset construction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to improve the clarity and rigor of the evaluation and results sections.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation setup: the claim that performance gains on held-out tasks can be attributed to instruction understanding rather than shared surface patterns requires evidence that the 61 tasks have no substantial overlap in category, output format, or instruction phrasing between train and test splits. No quantitative check (e.g., instruction embedding similarity or category-level leakage analysis) is reported, which is load-bearing for the central generalization claim.

    Authors: We agree that explicit evidence of limited overlap between seen and unseen tasks is important to support the attribution of gains to instruction understanding. The 61 tasks were selected from distinct existing NLP datasets with different objectives, output spaces, and instruction styles, which provides a natural separation. To strengthen this, we will add a quantitative analysis in the revised manuscript: we will compute average cosine similarities between instruction embeddings (using a sentence encoder) across train/test splits and include a category-level breakdown showing minimal leakage in task types or formats. This addition will directly address the concern.

    revision: yes

  2. Referee: [Results] Results section: the abstract states a 19% relative gain but supplies no details on exact model sizes, training hyperparameters, number of runs, statistical significance tests, or task similarity controls. This leaves the magnitude and robustness of the reported improvement difficult to assess from the provided evidence.

    Authors: We acknowledge that the abstract and results would benefit from more complete reporting. The manuscript already specifies the use of generative pre-trained models, but we will expand the results section and add an appendix detailing exact model sizes (T5 variants), training hyperparameters, number of runs with different random seeds, and statistical significance tests (e.g., paired t-tests across runs). The task similarity controls will be covered by the new analysis described in our response to the evaluation comment. These changes will make the reported 19% relative improvement easier to evaluate.

    revision: yes
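
The cross-split similarity check proposed in the first response could look roughly like the sketch below. It is an assumption-laden stand-in: bag-of-words token counts replace the sentence encoder the rebuttal mentions, the helper names are hypothetical, and the toy instructions are invented for illustration.

```python
import math
from collections import Counter


def bow_vector(text):
    """Lowercased token counts; a crude stand-in for a sentence encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[tok] * b[tok] for tok in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def mean_cross_split_similarity(train_instructions, test_instructions):
    """Average cosine similarity over all (train, test) instruction pairs."""
    sims = [
        cosine(bow_vector(tr), bow_vector(te))
        for tr in train_instructions
        for te in test_instructions
    ]
    if not sims:
        raise ValueError("both splits must contain at least one instruction")
    return sum(sims) / len(sims)


# Toy instructions, invented for illustration.
train_instr = ["Write a question about the passage.", "Answer the question using the passage."]
test_instr = ["Label the sentence as positive or negative."]
score = mean_cross_split_similarity(train_instr, test_instr)
print(round(score, 3))
```

A high average would suggest that held-out instructions resemble training instructions in surface wording, which is the leakage the referee asks the authors to rule out. Swapping `bow_vector` for real sentence embeddings changes only the vector construction, not the averaging.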

Circularity Check

0 steps flagged

No circularity: empirical cross-task evaluation is self-contained

Full rationale

The paper constructs a new meta-dataset of 61 tasks with crowdsourced instructions and reports an empirical result: models trained on seen tasks generalize better to held-out unseen tasks when instructions are provided (19% improvement). This is measured directly via standard train/test splits on task instances, with no equations, fitted parameters, or derivations that reduce the reported gain to a quantity defined by the same data or by self-citation. The central claim rests on the experimental contrast between instruction-aware and instruction-free conditions on explicitly held-out tasks rather than any definitional equivalence or load-bearing self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the empirical performance difference being caused by instruction comprehension and on the assumption that the collected tasks form a representative sample of NLP problems with minimal unintended overlap.

axioms (2)
  • domain assumption Pre-trained generative language models can meaningfully incorporate task instructions to improve output on held-out tasks.
    The modeling approach assumes current large language models possess sufficient capacity to use the provided instructions.
  • domain assumption The 61 tasks are sufficiently independent that training on a subset yields no direct knowledge of the held-out tasks.
    Required for the unseen-task evaluation to measure genuine generalization.

pith-pipeline@v0.9.0 · 5763 in / 1399 out tokens · 53367 ms · 2026-05-18T01:53:07.442492+00:00 · methodology

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

    cs.CL 2022-02 accept novelty 8.0

    Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.

  2. PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data

    cs.CL 2025-12 conditional novelty 7.0

    PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...

  3. Self-Rewarding Language Models

    cs.CL 2024-01 conditional novelty 7.0

    Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

  4. LIMA: Less Is More for Alignment

    cs.CL 2023-05 conditional novelty 7.0

    Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.

  5. VideoChat: Chat-Centric Video Understanding

    cs.CV 2023-05 conditional novelty 7.0

    VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.

  6. OPT: Open Pre-trained Transformer Language Models

    cs.CL 2022-05 unverdicted novelty 7.0

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  7. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  8. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  9. Lessons from the Trenches on Reproducible Evaluation of Language Models

    cs.CL 2024-05 accept novelty 6.0

    The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.

  10. LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    cs.SE 2024-03 unverdicted novelty 6.0

    LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.

  11. Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

    cs.CL 2024-02 conditional novelty 6.0

    DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.

  12. Simple synthetic data reduces sycophancy in large language models

    cs.CL 2023-08 unverdicted novelty 6.0

    Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.

  13. PandaGPT: One Model To Instruction-Follow Them All

    cs.CL 2023-05 conditional novelty 6.0

    A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.

  14. ART: Automatic multi-step reasoning and tool-use for large language models

    cs.CL 2023-03 unverdicted novelty 6.0

    ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

  15. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  16. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
