arxiv: 2110.08207 · v3 · submitted 2021-10-15 · 💻 cs.LG · cs.CL

Recognition: 2 theorem links

· Lean Theorem

Multitask Prompted Training Enables Zero-Shot Task Generalization

Abheesht Sharma, Albert Webson, Alexander M. Rush, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Arun Raja, Canwen Xu, Colin Raffel, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Jason Alan Fries, Jonathan Chang, Jos Rozen, Leo Gao, Lintang Sutawika, Manan Dey, Matteo Manica, Mike Tian-Jian Jiang, M Saiful Bari, Nihal Nayak, Rachel Bawden, Ryan Teehan, Shanya Sharma Sharma, Sheng Shen, Stella Biderman, Stephen H. Bach, Taewoon Kim, Tali Bers, Teven Le Scao, Thibault Fevry, Thomas Wang, Thomas Wolf, Trishala Neeraj, Urmish Thakker, Victor Sanh, Zaid Alyafeai, Zheng Xin Yong

Pith reviewed 2026-05-14 17:54 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords zero-shot generalization · multitask learning · prompted training · language models · task generalization · encoder-decoder models · BIG-bench

The pith

Converting many supervised datasets into prompted forms and fine-tuning a language model on the mixture produces strong zero-shot performance on held-out tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether zero-shot generalization can be induced directly through explicit multitask learning instead of relying on the implicit effects of language model pretraining. To do so, the authors build a system that turns arbitrary supervised datasets into multiple human-readable prompted versions, then fine-tune a pretrained encoder-decoder model on this broad collection. The resulting model shows competitive zero-shot results on standard benchmarks, frequently exceeding the performance of models many times its size, and also delivers solid results on a slice of the BIG-bench suite. A reader would care because the work suggests that careful curation of prompted tasks can substitute for scale in achieving flexible task handling.

Core claim

By mapping a large collection of supervised natural-language tasks into diverse prompted formats and fine-tuning a pretrained encoder-decoder model on the resulting multitask mixture, the model achieves strong zero-shot performance on completely held-out tasks, often surpassing models up to 16 times larger on standard datasets and up to 6 times larger on a subset of BIG-bench tasks.

What carries the argument

A system for converting arbitrary supervised datasets into multiple human-readable prompted forms that are then mixed together for multitask fine-tuning of a pretrained encoder-decoder model.
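
The conversion step described above can be illustrated with a minimal Python sketch. It assumes a dataset example with `premise`, `hypothesis`, and `label` fields and two hypothetical templates whose wording is invented here for illustration; none of the names or templates below come from the paper or from promptsource.

```python
from string import Template

# Hypothetical prompt templates for one NLI-style dataset (illustrative only).
# Each entry pairs an input template with the answer words that verbalize
# the integer labels 0, 1, 2.
NLI_TEMPLATES = [
    (Template('$premise\nQuestion: Does this imply that "$hypothesis"? Yes, no, or maybe?'),
     ["Yes", "Maybe", "No"]),
    (Template('Suppose $premise Can we infer that "$hypothesis"?'),
     ["Yes", "Maybe", "No"]),
]


def to_prompted(example, templates):
    """Render one labeled example as (input_text, target_text) pairs, one per template."""
    pairs = []
    for input_template, answer_words in templates:
        input_text = input_template.substitute(
            premise=example["premise"], hypothesis=example["hypothesis"]
        )
        # The target is plain text, so every task becomes text-to-text.
        target_text = answer_words[example["label"]]
        pairs.append((input_text, target_text))
    return pairs


# A made-up example, not drawn from any real dataset.
example = {
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is outside.",
    "label": 0,
}
prompted = to_prompted(example, NLI_TEMPLATES)
for input_text, target_text in prompted:
    print(repr(input_text), "->", target_text)
```

In this sketch each labeled example yields one input/target pair per template, so the same supervised record appears in the training mixture under several different wordings.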

If this is right

Zero-shot task performance no longer requires either enormous model scale or task-specific fine-tuning once a broad prompted multitask mixture is available.
New tasks can be tackled zero-shot simply by supplying a suitable prompt, without additional training data for that task.
The diversity of prompted tasks in the training mixture becomes a controllable lever for improving generalization, independent of raw parameter count.
Performance gains observed on standard benchmarks and on BIG-bench subsets indicate that the approach transfers across many different task types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the prompted-multitask signal truly teaches task abstraction, then similar results should appear when the same procedure is applied to smaller base models or to non-English task collections.
The method raises the possibility that many existing supervised datasets can be reused as training material for generalist models rather than being discarded after single-task use.
A natural next measurement would be whether the same mixture also improves few-shot performance or reduces the number of in-context examples needed at inference time.

Load-bearing premise

That turning supervised datasets into prompted forms supplies a training signal for genuine task generalization rather than for prompt-specific patterns or dataset artifacts.

What would settle it

A controlled test in which the same model is fine-tuned on the identical tasks but without the prompted formatting and then evaluated zero-shot on the same held-out tasks, checking whether performance collapses to chance levels.

Original abstract

Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Explicit multitask prompted fine-tuning on T5 produces measurable zero-shot gains on held-out tasks, but the setup leaves room for the model to exploit shared prompt formats rather than learn the tasks themselves.

The paper's core result is that fine-tuning T5 on a large mixture of prompted supervised datasets yields zero-shot performance on held-out tasks that beats much larger models on several benchmarks. The authors convert many datasets into multiple prompt variants, train on the combined set, and evaluate on tasks never seen during fine-tuning. Explicit prompting at that scale is the main new piece relative to earlier observations of implicit multitask learning in pretraining.

Referee Report

3 major / 2 minor

Summary. The paper claims that converting a large collection of supervised NLP datasets into prompted forms using diverse human-readable templates, then fine-tuning a pretrained encoder-decoder model (T5) on the resulting multitask mixture, induces strong zero-shot generalization to completely held-out tasks. The resulting model often outperforms models up to 16x larger on standard benchmarks and shows competitive results on a subset of BIG-bench tasks.

Significance. If the results hold after addressing evaluation details, the work demonstrates that explicit multitask prompted training can produce zero-shot capabilities at modest scale, providing a practical alternative to relying solely on pretraining scale and offering a reproducible recipe for improving task generalization.

major comments (3)

[Evaluation] Evaluation section: the central claim of genuine task generalization (rather than surface-format following) rests on held-out tasks, yet the manuscript does not report a control experiment using evaluation prompts drawn from a disjoint syntactic or generative distribution while preserving task semantics. Without this, performance could be explained by shared prompt patterns across the mixture.
[§4] §4 (Experiments) and Table 1: baseline comparisons lack full details on prompt selection procedure for the larger models, statistical significance testing across prompt variations or random seeds, and exact data exclusion rules for the training/held-out split; these omissions make it impossible to verify that the reported outperformance (e.g., vs. 16x larger models) is robust.
[Method] Method section: the description of how tasks are mapped to prompts and how the multitask mixture is constructed does not specify the proportion of each task type or whether any filtering was applied to avoid format leakage, which is load-bearing for interpreting the zero-shot results as evidence of task understanding.
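
One way to report the robustness that the second major comment asks for is to summarize a held-out task's accuracy across its prompt variants rather than quoting a single number. The sketch below assumes a hypothetical list of per-prompt accuracies; the values are made up for illustration and are not results from the paper.

```python
from statistics import median, quantiles


def summarize_across_prompts(accuracies):
    """Return median, interquartile range, and extremes of per-prompt accuracies."""
    # Quartiles computed with the inclusive method, treating the list as the
    # full population of prompts that were evaluated.
    q1, _, q3 = quantiles(accuracies, n=4, method="inclusive")
    return {
        "median": median(accuracies),
        "iqr": q3 - q1,
        "min": min(accuracies),
        "max": max(accuracies),
    }


# Hypothetical per-prompt accuracies for one held-out task.
per_prompt_accuracy = [0.50, 0.60, 0.70, 0.80, 0.90]
summary = summarize_across_prompts(per_prompt_accuracy)
print(summary)
```

A wide interquartile range relative to the gap between two models would suggest that the comparison depends on prompt choice as much as on the models themselves.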

minor comments (2)

[Abstract] The abstract and §5 reference the public release of models and prompts; ensure the final version includes precise commit hashes or version numbers for reproducibility.
[Notation] Notation for 'prompt' vs. 'template' is used interchangeably in places; a brief glossary or consistent definition would improve clarity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.

Referee: [Evaluation] Evaluation section: the central claim of genuine task generalization (rather than surface-format following) rests on held-out tasks, yet the manuscript does not report a control experiment using evaluation prompts drawn from a disjoint syntactic or generative distribution while preserving task semantics. Without this, performance could be explained by shared prompt patterns across the mixture.

Authors: We agree that distinguishing format following from task generalization is important. Our evaluation uses completely held-out tasks with prompts drawn from promptsource that were never encountered during training, and the diversity of templates across the mixture was intended to promote generalization beyond surface patterns. However, we acknowledge that an explicit control experiment with syntactically disjoint prompts (while preserving semantics) would provide stronger evidence. We will add a discussion of this limitation and propose such a control as future work in the revised manuscript. revision: partial
Referee: [§4] §4 (Experiments) and Table 1: baseline comparisons lack full details on prompt selection procedure for the larger models, statistical significance testing across prompt variations or random seeds, and exact data exclusion rules for the training/held-out split; these omissions make it impossible to verify that the reported outperformance (e.g., vs. 16x larger models) is robust.

Authors: We will expand §4 and Table 1 in the revision to include: (1) the precise prompt selection procedure for baselines (following the recommendations in the original papers for each model), (2) results with statistical significance across multiple prompt variations and random seeds where computationally feasible, and (3) the exact criteria used for the training/held-out split to confirm no overlap. These details will allow readers to better assess the robustness of the reported gains. revision: yes
Referee: [Method] Method section: the description of how tasks are mapped to prompts and how the multitask mixture is constructed does not specify the proportion of each task type or whether any filtering was applied to avoid format leakage, which is load-bearing for interpreting the zero-shot results as evidence of task understanding.

Authors: We will revise the Method section to specify the exact proportions of each task type in the mixture (proportional to the number of examples per dataset) and to clarify that filtering was applied only to enforce task-level disjointness between training and held-out sets, with no additional format-based filtering. We maintain that the strong zero-shot results on novel tasks with unseen prompts support task understanding rather than format memorization, but the added details will make this interpretation more transparent. revision: yes
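
The mixture construction described in the last response can be sketched as follows, assuming sampling proportional to dataset size and a task-level disjointness check. The dataset names, sizes, and task labels are hypothetical placeholders, not the paper's actual mixture.

```python
def mixture_proportions(examples_per_dataset):
    """Sampling weight of each dataset, proportional to its number of examples."""
    total = sum(examples_per_dataset.values())
    return {name: count / total for name, count in examples_per_dataset.items()}


def assert_task_disjoint(train_tasks, heldout_tasks):
    """Raise if any task type appears in both the training and held-out splits."""
    overlap = set(train_tasks) & set(heldout_tasks)
    if overlap:
        raise ValueError(f"tasks appear in both splits: {sorted(overlap)}")


# Hypothetical training datasets with their sizes and task types.
train_sizes = {"dataset_a": 6000, "dataset_b": 3000, "dataset_c": 1000}
dataset_task = {
    "dataset_a": "question answering",
    "dataset_b": "summarization",
    "dataset_c": "sentiment",
}
# Hypothetical held-out task types used only for zero-shot evaluation.
heldout_tasks = {"natural language inference", "coreference resolution"}

assert_task_disjoint(dataset_task.values(), heldout_tasks)
proportions = mixture_proportions(train_sizes)
print(proportions)
```

Note that this check enforces disjointness only at the level of task labels; it says nothing about whether prompt wording is shared across the two splits, which is the format-leakage concern the referee raises.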

Circularity Check

0 steps flagged

No circularity: zero-shot results are measured on explicitly held-out external tasks and benchmarks

The paper's derivation consists of converting supervised datasets to prompted forms, fine-tuning a pretrained model on the resulting multitask mixture, and then reporting performance on held-out tasks from standard datasets and BIG-bench. These evaluation tasks are disjoint from the training mixture by construction, and performance is measured against external benchmarks rather than any fitted parameter, self-referential metric, or prior result from the same authors. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed generalization to the inputs by definition. Self-citations are absent from the load-bearing steps; citations to prior work (e.g., T5, prompt tuning) supply the base model but do not substitute for the empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical outcome of prompted multitask fine-tuning; no new mathematical axioms, physical constants, or invented entities are introduced beyond standard transformer training assumptions.

axioms (1)

domain assumption Natural language prompts can be used to unify diverse supervised tasks into a single training mixture without destructive interference
This assumption underpins the decision to mix prompted datasets for zero-shot transfer.

pith-pipeline@v0.9.0 · 5709 in / 1157 out tokens · 22149 ms · 2026-05-14T17:54:56.684710+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear
We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear
The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16× its size.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
cs.CL 2022-01 accept novelty 9.0

Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
Instruction Tuning with GPT-4
cs.CL 2023-04 unverdicted novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
Editing Models with Task Arithmetic
cs.LG 2022-12 accept novelty 8.0

Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
cs.LG 2026-05 unverdicted novelty 7.0

Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
cs.LG 2026-05 unverdicted novelty 7.0

AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
Self-Rewarding Language Models
cs.CL 2024-01 conditional novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
C-Pack: Packed Resources For General Chinese Embeddings
cs.CL 2023-09 accept novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
QLoRA: Efficient Finetuning of Quantized LLMs
cs.LG 2023-05 conditional novelty 7.0

QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
Understanding and Accelerating the Training of Masked Diffusion Language Models
cs.LG 2026-05 conditional novelty 6.0

Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
Understanding the Mechanism of Altruism in Large Language Models
econ.GN 2026-04 unverdicted novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
cs.CV 2026-04 unverdicted novelty 6.0

RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.
Towards an AI co-scientist
cs.AI 2025-02 unverdicted novelty 6.0

A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL 2024-06 conditional novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
cs.CL 2023-08 conditional novelty 6.0

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
Gorilla: Large Language Model Connected with Massive APIs
cs.CL 2023-05 conditional novelty 6.0

Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
A General Language Assistant as a Laboratory for Alignment
cs.CL 2021-12 conditional novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
cs.CL 2025-02 unverdicted novelty 5.0

SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Text Style Transfer with Machine Translation for Graphic Designs
cs.CL 2026-04 unverdicted novelty 4.0

Custom tag methods with NMT and LLMs for word alignment in text style transfer perform no better than standard attention-based alignment from NMT models.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
cs.CV 2024-02 unverdicted novelty 2.0

The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.

Reference graph

Works this paper leans on

75 extracted references · 75 canonical work pages · cited by 22 Pith papers · 21 internal anchors

[1]

Cloze-driven Pretraining of Self-attention Networks

Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785,

work page internal anchor Pith review Pith/arXiv arXiv 1903
[2]

Available: https://doi.org/10.1162/tacl_a_00449

doi: 10.1162/tacl_a_00338. URL https://doi.org/10.1162/tacl_a_00338. Qiang Ning Ben Zhou, Daniel Khashabi and Dan Roth. “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In EMNLP,

work page internal anchor Pith review doi:10.1162/tacl
[3]

Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623,

work page 2021
[4]

Semantic parsing on Freebase from question-answer pairs

Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October

work page 2013
[6]

URL https://arxiv.org/abs/2108.07258. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel 11 Published as a conference paper at ICLR 2022 Ziegler, ...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[7]

Rich Caruana

URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Rich Caruana. Multitask learning. Mach. Learn., 28(1):41–75,

work page 2020
[8]

Caruana, Multitask Learning. Machine Learning 28, 41–75 (1997), doi:10.1023/A:1007379606734

doi: 10.1023/A:1007379606734. URL https://doi.org/10.1023/A:1007379606734. Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium, Octobe...

work page doi:10.1023/a: 2018
[9]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Association for Computational Linguistics. doi: 10.18653/v1/D18-1241. URL https://aclanthology.org/D18-1241. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/ 1905
[10]

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

URL http://arxiv.org/abs/1905.10044. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1,

work page internal anchor Pith review Pith/arXiv arXiv 1905
[11]

A unified architecture for natural language processing: deep neural networks with multitask learning

Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International ...

work page 2008
[12]

URL https://doi.org/10.1145/1390156.1390177

doi: 10.1145/1390156.1390177. URL https://doi.org/10.1145/1390156.1390177. Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer,

work page doi:10.1145/1390156.1390177
[13]

Liu, Ana Marasovic, Noah A

Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. arXiv:1908.05803v2,

work page arXiv 1908
[14]

doi: 10.18653/v1/W18-5102

Association for Computational Linguistics. doi: 10.18653/v1/W18-5102. URL https://www.aclweb.org/anthology/W18-5102. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for...

work page doi:10.18653/v1/w18-5102 2019
[15]

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan

URL https://doi.org/10.5281/zenodo.5371628. Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics,

work page doi:10.5281/zenodo.5371628 2022
[16]

Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237,

work page arXiv 1911
[17]

Twitter sentiment classification using distant supervision

Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009,

work page 2009
[18]

A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks

URL http://arxiv.org/abs/1611.01587. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi

URL https://aclanthology.org/H01-1069. Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In arXiv:1909.00277v2,

work page arXiv 1909
[20]

Matt Gardner Johannes Welbl, Nelson F. Liu. Crowdsourcing multiple choice science questions. arXiv:1707.06209v1,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. triviaqa: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv e-prints, art. arXiv:1705.03551,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Unifiedqa: Crossing format boundaries with a single QA system. CoRR, abs/2005.00700, 2020a

Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single QA system. CoRR, abs/2005.00700, 2020a. URL https://arxiv.org/abs/2005.00700. Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UNIFIEDQA: C...

work page doi:10.18653/v1/2020 2005
[23]

What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers

Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Dong Hyeon Jeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, et al. What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers. arXiv preprint arXiv:2109.04650,

work page arXiv
[24]

Quantifying the carbon emissions of machine learning

Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700,

work page arXiv 1910
[25]

RACE: Large-scale ReAding Comprehension Dataset From Examinations

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Neural Text Generation from Structured Data with Application to the Biography Domain

Rémi Lebret, David Grangier, and Michael Auli. Generating text from structured data with application to the biography domain. CoRR, abs/1603.07771,

work page internal anchor Pith review Pith/arXiv arXiv 2022
[27]

URL http://arxiv.org/abs/1603.07771. Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499,

work page arXiv
[29]

The Power of Scale for Parameter-Efficient Prompt Tuning

URL https://arxiv.org/abs/2104.08691. Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Learning question classifiers

Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics,

work page 2002
[31]

Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren

URL https://aclanthology.org/C02-1150. Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November

work page 2020
[32]

doi: 10.18653/v1/2020.findings-emnlp.165

Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165. URL https://aclanthology.org/2020.findings-emnlp.165. Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. Reasoning over paragraph effects in situations. In MRQA@EMNLP,

work page 2020
[33]

Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv preprint arXiv:2106.13353,

Robert L Logan, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv preprint arXiv:2106.13353,

work page arXiv
[35]

URL http://arxiv.org/abs/1806.08730. R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. CoRR, abs/1902.01007,

work page internal anchor Pith review Pith/arXiv arXiv 1902
[36]

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

URL http://arxiv.org/abs/1902.01007. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP,

[38]

URL https://arxiv.org/abs/2104.08773. Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, November

[39]

Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization

Association for Computational Linguistics. Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. ArXiv, abs/1808.08745,

[40]

Carbon Emissions and Large Neural Network Training

David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350,

[42]

URL https://arxiv.org/abs/2105.11447. Mohammad Taher Pilehvar and José Camacho-Collados. WiC: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121,

[43]

URL http://arxiv.org/abs/1808.09121. Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–8...

[44]

Association for Computational Linguistics. doi: 10.18653/v1/D18-1007. URL https://aclanthology.org/D18-1007. Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. Intermediate-task transfer learning with pretrained language models: When and why does it work? InP...

[45]

Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.467. URL https://aclanthology.org/2020.acl-main.467. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,

[46]

SQuAD: 100,000+ Questions for Machine Comprehension of Text

15 Published as a conference paper at ICLR 2022 Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, art. arXiv:1606.05250,

[48]

URL https://arxiv.org/abs/2102.07350. Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online, November

[49]

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.437. URL https://aclanthology.org/2020.emnlp-main.437. Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series,

[50]

Getting closer to AI complete question answering: A set of prerequisite real tasks

Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. Getting closer to AI complete question answering: A set of prerequisite real tasks. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational...

[51]

URL https://aaai.org/ojs/index.php/AAAI/article/view/6398. Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, June

[52]

Association for Computational Linguistics. Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,

[53]

doi: 10.18653/v1/d15-1044. URL http://dx.doi.org/10.18653/v1/D15-1044. Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. In Meeting of the Association for Computational Linguistics (ACL),

[55]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

URL http://arxiv.org/abs/1907.10641. Timo Schick and Hinrich Schütze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online, April

[56]

Association for Computational Linguistics. URL https://aclanthology.org/2021.eacl-main.20. Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green AI. Communications of the ACM, 63(12):54–63,

[58]

Get To The Point: Summarization with Pointer-Generator Networks

URL http://arxiv.org/abs/1704.04368. Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR,

[60]

URL http://arxiv.org/abs/1908.09203. Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650,

[61]

DREAM: A challenge dataset and models for dialogue-based reading comprehension

Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. DREAM: A challenge dataset and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics,

[62]

DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension

URL https://arxiv.org/abs/1902.00164v1. Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. QuaRTz: An open-domain dataset of qualitative relationship questions. EMNLP, 2019. Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. QuaRel: A dataset and models for answering questions about qualitative relationships. CoRR, abs...

[63]

QuaRel: A Dataset and Models for Answering Questions about Qualitative Relationships

URL http://arxiv.org/abs/1811.08048. Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926, Online, November

[64]

Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.635. URL https://aclanthology.org/2020.emnlp-main.635. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. SuperGLUE: A stickier benchmark for general-purpose language understanding systems. CoRR, abs/1905....

[65]

URL https://arxiv.org/abs/2104.14690. Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471,

[66]

URL https://arxiv.org/abs/2109.01247. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners,

[67]

ANLIzing the adversarial natural language inference dataset

Adina Williams, Tristan Thrush, and Douwe Kiela. ANLIzing the adversarial natural language inference dataset. arXiv preprint arXiv:2010.12729,

[68]

CrossFit: A few-shot learning challenge for cross-task generalization in NLP

Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. CrossFit: A few-shot learning challenge for cross-task generalization in NLP. arXiv preprint arXiv:2104.08835,

[69]

URL https://arxiv.org/abs/2104.08835. Yi Yang, Wen-tau Yih, and Christopher Meek. WikiQA: A Challenge Dataset for Open-Domain Question Answering. Association for Computational Linguistics, pages 2013–2018,

[70]

doi: 10.18653/v1/D15-1237. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,

[71]

ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885,

[72]

Character-level convolutional networks for text classification

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015a. Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015b. Yu...

[73]

Gender bias in coreference resolution: Evaluation and debiasing methods

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisi...

[74]

Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. URL https://aclanthology.org/N18-2003. Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models,

[76]

Adapting language models for zero-shot learning by meta-tuning on dataset and prompt collections. arXiv preprint arXiv:2104.04670,

URL https://arxiv.org/abs/2104.04670. A CONTRIBUTIONS AND PROJECT STRUCTURE This research was conducted under the BigScience project for open research,4 a year-long initiative targeting the study of large models and datasets. The goal of the project is to research language models in a public environment outside large technology companies. The project has...

[77]

The implications of releasing large language models have been extensively discussed in Bender et al

and a publicly available model, T5+LM (Lester et al., 2021). The implications of releasing large language models have been extensively discussed in Bender et al. (2021); Bommasani et al. (2021); Solaiman et al. (2019) among others. We expect replicating our work to be within the capabilities of dozens of organizations worldwide, the main barrier being fi...

[78]

helicopter

(also called AX-g under SuperGLUE) and CrowS-Pairs (Nangia et al., 2020). WinoGender Schemas are minimal pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias. We use the version from Poliak et al. (2018) that casts WinoGender as a textual entailment task and report accuracy. Cr...

[79]

question answering

consists of mostly straightforward decisions that reflect well-known tasks in the literature: sentiment analysis, topic classification, paraphrase identification, natural language inference, word sense disambiguation, coreference resolution, summarization, and structure-to-text generation. The main difficulty lies in the fact that a large collection of dat...

[80]

attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering

define a commonsense task as an “attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering.” Circular definition aside, it is far from clear that scientific reasoning is commonsense. Among Brown et al. (2020)’s selection, ARC exemplifies how evaluation of scientific kn...

[81]

commonsense inference

that training on a paraphrase dataset (QQP) before training on an NLI dataset (RTE) actually hurts performance compared to training on the entailment task only. Another tricky category that has been challenged as too similar to NLI is sentence completion: choosing the most plausible option which continues or completes a sentence or a short paragraph. SW A...

[82]

Paris is the capital of France

8https://github.com/openai/gpt-2/issues/131 Task Dataset T0 Train T0+ Train T0++ Train Eval Coreference Resolution super_glue/wsc.fixed ✓ ✓ Coreference Resolution winogrande/winogrande_xl ✓ Natural Language Inference super_glue/cb ✓ Natural Language Inference super_glue/rte ✓ Natural Language Inference anli ✓...

[83]

{{sent_more}}

T0 (p = 5.7) T0 (3B) T0 T0+ T0++ Task Dataset Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Coref. WSC 54.09 57.69 52.40 56.25 60.00 63.46 65.10 64.42 61.45 64.42 62.24 64.42 70.29 69.71 Wino. (XL) 50.65 50.71 58.11 57.22 59.35 58.80 50.97 50.51 59.94 60.46 62.54 61.72 66.42 66.54 NLI ANLI R1 32.89 32.85 39.02 40.05 41.28 43.20 33....

[84]

{{answer.aliases|choice}}

{{input}} Target Template: {{output | map(attribute="answer") | list | choice}} {% endif %} Prompt not for the original task intended by the dataset authors Input Template: {% if output %} {{input}} Target Template: {{output | map(attribute="answer") | list | choice}} {% endif %} 1.5.4 TRIVIA QA UNFILTERED Dataset from Joshi et al. (2017). Used in evaluat...

[85]

{{answer}}

Answer by yes or no. Document: {{passage}} Question: {{question}}? Target Template: {% if label != -1 %} {{answer_choices[label]}} {% endif %} Answer Choices Template: No ||| Yes Prompt from Schick and Schütze (2021) Input Template: Based on the following passage, {{ question }}? {{ passage }} Target Templ...
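To make the first yes/no template above concrete, here is a minimal Python sketch of how its three parts (input, target, answer choices) could be applied to a single example. It is not the paper's tooling: the original templates are Jinja, whereas this sketch uses plain string formatting, and the helper names `parse_answer_choices` and `render_example`, the sample passage, and the convention of returning `None` for unlabeled examples are all assumptions made for illustration.

```python
# Minimal sketch with hypothetical helper names; plain Python stands in for Jinja.

ANSWER_CHOICES_TEMPLATE = "No ||| Yes"  # copied from the template text above


def parse_answer_choices(template):
    """Split an answer-choices string on the '|||' separator."""
    return [choice.strip() for choice in template.split("|||")]


def render_example(passage, question, label):
    """Return (input_text, target_text) for one example.

    The target is None when label == -1, mirroring the
    `{% if label != -1 %}` guard in the target template
    (returning None is this sketch's own convention).
    """
    input_text = (
        f"Answer by yes or no. Document: {passage} Question: {question}?"
    )
    if label == -1:
        return input_text, None
    choices = parse_answer_choices(ANSWER_CHOICES_TEMPLATE)
    return input_text, choices[label]


# Illustrative example; the passage and question are made up for this sketch.
example_input, example_target = render_example(
    passage="Paris is the capital of France.",
    question="is Paris the capital of France",
    label=1,
)
print(example_input)
print(example_target)
```

The label indexes directly into the parsed choice list, so the order of the choices in the `|||`-separated string decides which word each label maps to: here label 0 gives "No" and label 1 gives "Yes".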
