Multitask Prompted Training Enables Zero-Shot Task Generalization
Pith reviewed 2026-05-14 17:54 UTC · model grok-4.3
The pith
Converting many supervised datasets into prompted forms and fine-tuning a language model on the mixture produces strong zero-shot performance on held-out tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By mapping a large collection of supervised natural-language tasks into diverse prompted formats and fine-tuning a pretrained encoder-decoder model on the resulting multitask mixture, the model achieves strong zero-shot performance on completely held-out tasks, often surpassing models up to 16 times larger on standard datasets and up to 6 times larger on a subset of BIG-bench tasks.
What carries the argument
A system for converting arbitrary supervised datasets into multiple human-readable prompted forms that are then mixed together for multitask fine-tuning of a pretrained encoder-decoder model.
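As a minimal sketch of what "multiple prompted forms" could look like, the snippet below renders one natural language inference example through two templates. The template wording, the label mapping, and the `render` helper are illustrative assumptions; they are not the templates or the API released in promptsource.

```python
from string import Template

# Hypothetical templates for a natural language inference example.
# The wording is illustrative only and is not taken from promptsource.
TEMPLATES = [
    Template('Suppose "$premise" Can we infer that "$hypothesis"? Yes, no, or maybe?'),
    Template('$premise\nQuestion: does this imply that "$hypothesis"? Yes, no, or maybe?'),
]

# Assumed label order: 0 = entailment, 1 = neutral, 2 = contradiction.
LABELS = ["Yes", "Maybe", "No"]


def render(example: dict) -> list:
    """Return one (input_text, target_text) pair per template."""
    target = LABELS[example["label"]]
    return [
        (t.substitute(premise=example["premise"], hypothesis=example["hypothesis"]), target)
        for t in TEMPLATES
    ]


example = {
    "premise": "A dog runs across the park.",
    "hypothesis": "An animal is outside.",
    "label": 0,
}
pairs = render(example)
for text, target in pairs:
    print(repr(text), "->", target)
```

Each supervised example becomes several text-to-text pairs that share a target but differ in wording, which is the property the multitask mixture relies on.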
If this is right
- Zero-shot task performance no longer requires either enormous model scale or task-specific fine-tuning once a broad prompted multitask mixture is available.
- New tasks can be tackled zero-shot simply by supplying a suitable prompt, without additional training data for that task.
- The diversity of prompted tasks in the training mixture becomes a controllable lever for improving generalization, independent of raw parameter count.
- Performance gains observed on standard benchmarks and on BIG-bench subsets indicate that the approach transfers across many different task types.
Where Pith is reading between the lines
- If the prompted-multitask signal truly teaches task abstraction, then similar results should appear when the same procedure is applied to smaller base models or to non-English task collections.
- The method raises the possibility that many existing supervised datasets can be reused as training material for generalist models rather than being discarded after single-task use.
- A natural next measurement would be whether the same mixture also improves few-shot performance or reduces the number of in-context examples needed at inference time.
Load-bearing premise
That turning supervised datasets into prompted forms supplies a training signal for genuine task generalization rather than for prompt-specific patterns or dataset artifacts.
What would settle it
A controlled test in which the same model is fine-tuned on the identical tasks but without the prompted formatting and then evaluated zero-shot on the same held-out tasks, checking whether performance collapses to chance levels.
Original abstract
Large language models have recently been shown to attain reasonable zero-shot generalization on a diverse set of tasks (Brown et al., 2020). It has been hypothesized that this is a consequence of implicit multitask learning in language models' pretraining (Radford et al., 2019). Can zero-shot generalization instead be directly induced by explicit multitask learning? To test this question at scale, we develop a system for easily mapping any natural language tasks into a human-readable prompted form. We convert a large set of supervised datasets, each with multiple prompts with diverse wording. These prompted datasets allow for benchmarking the ability of a model to perform completely held-out tasks. We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks. The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16x its size. Further, our approach attains strong performance on a subset of tasks from the BIG-bench benchmark, outperforming models up to 6x its size. All trained models are available at https://github.com/bigscience-workshop/t-zero and all prompts are available at https://github.com/bigscience-workshop/promptsource.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that converting a large collection of supervised NLP datasets into prompted forms using diverse human-readable templates, then fine-tuning a pretrained encoder-decoder model (T5) on the resulting multitask mixture, induces strong zero-shot generalization to completely held-out tasks. The resulting model often outperforms models up to 16x larger on standard benchmarks and shows competitive results on a subset of BIG-bench tasks.
Significance. If the results hold after addressing evaluation details, the work demonstrates that explicit multitask prompted training can produce zero-shot capabilities at modest scale, providing a practical alternative to relying solely on pretraining scale and offering a reproducible recipe for improving task generalization.
major comments (3)
- [Evaluation] Evaluation section: the central claim of genuine task generalization (rather than surface-format following) rests on held-out tasks, yet the manuscript does not report a control experiment using evaluation prompts drawn from a disjoint syntactic or generative distribution while preserving task semantics. Without this, performance could be explained by shared prompt patterns across the mixture.
- [§4] §4 (Experiments) and Table 1: baseline comparisons lack full details on prompt selection procedure for the larger models, statistical significance testing across prompt variations or random seeds, and exact data exclusion rules for the training/held-out split; these omissions make it impossible to verify that the reported outperformance (e.g., vs. 16x larger models) is robust.
- [Method] Method section: the description of how tasks are mapped to prompts and how the multitask mixture is constructed does not specify the proportion of each task type or whether any filtering was applied to avoid format leakage, which is load-bearing for interpreting the zero-shot results as evidence of task understanding.
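One way to make the robustness concern in the second major comment concrete is to report the spread across prompts instead of a single number. The sketch below summarizes per-prompt accuracy with a median and an interquartile range. The prompt names and accuracy values are made-up placeholders, and `summarize` is a hypothetical helper, not code from the paper.

```python
import statistics


def summarize(acc_by_prompt: dict) -> dict:
    """Summarize accuracy across prompt variants for one held-out dataset."""
    scores = sorted(acc_by_prompt.values())
    # Quartiles computed with the "inclusive" method (data treated as the full population).
    q1, _, q3 = statistics.quantiles(scores, n=4, method="inclusive")
    return {
        "median": statistics.median(scores),
        "iqr": q3 - q1,
        "min": scores[0],
        "max": scores[-1],
    }


# Placeholder numbers for illustration only; they are not results from the paper.
acc = {
    "prompt_a": 0.62,
    "prompt_b": 0.58,
    "prompt_c": 0.71,
    "prompt_d": 0.40,
    "prompt_e": 0.66,
}
summary = summarize(acc)
print(summary)
```

A wide interquartile range relative to the gap between two models would suggest that the comparison depends on prompt choice.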
minor comments (2)
- [Abstract] The abstract and §5 reference the public release of models and prompts; ensure the final version includes precise commit hashes or version numbers for reproducibility.
- [Notation] The terms 'prompt' and 'template' are used interchangeably in places; a brief glossary or consistent definition would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.
-
Referee: [Evaluation] Evaluation section: the central claim of genuine task generalization (rather than surface-format following) rests on held-out tasks, yet the manuscript does not report a control experiment using evaluation prompts drawn from a disjoint syntactic or generative distribution while preserving task semantics. Without this, performance could be explained by shared prompt patterns across the mixture.
Authors: We agree that distinguishing format following from task generalization is important. Our evaluation uses completely held-out tasks with prompts drawn from promptsource that were never encountered during training, and the diversity of templates across the mixture was intended to promote generalization beyond surface patterns. However, we acknowledge that an explicit control experiment with syntactically disjoint prompts (while preserving semantics) would provide stronger evidence. We will add a discussion of this limitation and propose such a control as future work in the revised manuscript. revision: partial
-
Referee: [§4] §4 (Experiments) and Table 1: baseline comparisons lack full details on prompt selection procedure for the larger models, statistical significance testing across prompt variations or random seeds, and exact data exclusion rules for the training/held-out split; these omissions make it impossible to verify that the reported outperformance (e.g., vs. 16x larger models) is robust.
Authors: We will expand §4 and Table 1 in the revision to include: (1) the precise prompt selection procedure for baselines (following the recommendations in the original papers for each model), (2) results with statistical significance across multiple prompt variations and random seeds where computationally feasible, and (3) the exact criteria used for the training/held-out split to confirm no overlap. These details will allow readers to better assess the robustness of the reported gains. revision: yes
-
Referee: [Method] Method section: the description of how tasks are mapped to prompts and how the multitask mixture is constructed does not specify the proportion of each task type or whether any filtering was applied to avoid format leakage, which is load-bearing for interpreting the zero-shot results as evidence of task understanding.
Authors: We will revise the Method section to specify the exact proportions of each task type in the mixture (proportional to the number of examples per dataset) and to clarify that filtering was applied only to enforce task-level disjointness between training and held-out sets, with no additional format-based filtering. We maintain that the strong zero-shot results on novel tasks with unseen prompts support task understanding rather than format memorization, but the added details will make this interpretation more transparent. revision: yes
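The example-proportional mixing mentioned in the last response can be sketched as follows. The dataset names and sizes are invented for illustration, and the optional `cap` argument is an assumption added to show how a very large dataset could be kept from dominating; the response above does not state that a cap is used.

```python
def mixture_weights(sizes: dict, cap=None) -> dict:
    """Sampling probability per dataset, proportional to its (optionally capped) size."""
    effective = {
        name: (min(n, cap) if cap is not None else n)
        for name, n in sizes.items()
    }
    total = sum(effective.values())
    return {name: n / total for name, n in effective.items()}


# Hypothetical dataset sizes (number of training examples).
sizes = {
    "qa_dataset": 80_000,
    "summarization_dataset": 15_000,
    "sentiment_dataset": 5_000,
}

weights = mixture_weights(sizes)                     # purely proportional
capped = mixture_weights(sizes, cap=10_000)          # hypothetical cap at 10,000 examples
print(weights)
print(capped)
```

Without a cap, the largest dataset receives most of the sampling mass, which is why the exact proportions matter when interpreting zero-shot results.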
Circularity Check
No circularity: zero-shot results are measured on explicitly held-out external tasks and benchmarks
The paper's derivation consists of converting supervised datasets to prompted forms, fine-tuning a pretrained model on the resulting multitask mixture, and then reporting performance on held-out tasks from standard datasets and BIG-bench. These evaluation tasks are disjoint from the training mixture by construction, and performance is measured against external benchmarks rather than any fitted parameter, self-referential metric, or prior result from the same authors. No equations, uniqueness theorems, or ansatzes are invoked that reduce the claimed generalization to the inputs by definition. Self-citations are absent from the load-bearing steps; citations to prior work (e.g., T5, prompt tuning) supply the base model but do not substitute for the empirical result.
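The claim that evaluation tasks are disjoint from the training mixture by construction can be checked mechanically. The sketch below flags any training dataset whose task category is also held out. The dataset names, the task mapping, and the `leaked_datasets` helper are hypothetical and are not drawn from the paper's actual split.

```python
def leaked_datasets(train_datasets, held_out_tasks, dataset_task):
    """Return training datasets whose task category is supposed to be held out."""
    return sorted(d for d in train_datasets if dataset_task[d] in held_out_tasks)


# Hypothetical mapping from dataset name to task category.
DATASET_TASK = {
    "dataset_a": "sentiment",
    "dataset_b": "summarization",
    "dataset_c": "nli",
    "dataset_d": "nli",
}

HELD_OUT_TASKS = {"nli"}

clean = leaked_datasets(["dataset_a", "dataset_b"], HELD_OUT_TASKS, DATASET_TASK)
leaky = leaked_datasets(["dataset_a", "dataset_c"], HELD_OUT_TASKS, DATASET_TASK)
print(clean, leaky)
```

An empty result means the split is disjoint at the task level; it says nothing about overlap in prompt wording, which is the separate concern raised in the referee report.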
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Natural language prompts can be used to unify diverse supervised tasks into a single training mixture without destructive interference.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · "We fine-tune a pretrained encoder-decoder model (Raffel et al., 2020; Lester et al., 2021) on this multitask mixture covering a wide variety of tasks."
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear · "The model attains strong zero-shot performance on several standard datasets, often outperforming models up to 16× its size."
Forward citations
Cited by 22 Pith papers
-
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
-
Instruction Tuning with GPT-4
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
-
Editing Models with Task Arithmetic
Task vectors from weight differences allow arithmetic operations to edit pre-trained models, improving multiple tasks simultaneously and enabling analogical inference on unseen tasks.
-
Accelerating Zeroth-Order Spectral Optimization with Partial Orthogonalization from Power Iteration
Partial orthogonalization from power iteration accelerates zeroth-order Muon by 1.5x-4x on LLM fine-tuning tasks while maintaining competitive accuracy.
-
From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework
AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
-
QLoRA: Efficient Finetuning of Quantized LLMs
QLoRA finetunes 4-bit quantized LLMs via LoRA adapters to match full-precision performance while using far less memory, enabling 65B-scale training on single GPUs and producing Guanaco models near ChatGPT level.
-
Understanding and Accelerating the Training of Masked Diffusion Language Models
Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
RemoteShield: Enable Robust Multimodal Large Language Models for Earth Observation
RemoteShield improves robustness of Earth observation MLLMs by training on semantic equivalence clusters of clean and perturbed inputs via preference learning to maintain consistent reasoning under noise.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
-
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
-
Gorilla: Large Language Model Connected with Massive APIs
Gorilla is a fine-tuned LLM that surpasses GPT-4 in accurate API call generation and uses retrieval to handle documentation updates.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
A General Language Assistant as a Laboratory for Alignment
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
-
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
-
Galactica: A Large Language Model for Science
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
-
Text Style Transfer with Machine Translation for Graphic Designs
Custom tag methods with NMT and LLMs for word alignment in text style transfer perform no better than standard attention-based alignment from NMT models.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
-
Sora: A Review on Background, Technology, Limitations, and Opportunities of Large Vision Models
The paper reviews the background, technology, applications, limitations, and future directions of OpenAI's Sora text-to-video generative model based on public information.
Reference graph
Works this paper leans on
-
[1]
Cloze-driven Pretraining of Self-attention Networks
Alexei Baevski, Sergey Edunov, Yinhan Liu, Luke Zettlemoyer, and Michael Auli. Cloze-driven pretraining of self-attention networks. arXiv preprint arXiv:1903.07785,
-
[2]
Available: https://doi.org/10.1162/tacl_a_00449
doi: 10.1162/tacl_a_00338. URL https://doi.org/10.1162/tacl_a_00338. Ben Zhou, Daniel Khashabi, Qiang Ning, and Dan Roth. “going on a vacation” takes longer than “going for a walk”: A study of temporal commonsense understanding. In EMNLP,
-
[3]
Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623,
-
[4]
Semantic parsing on Freebase from question-answer pairs
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA, October
-
[6]
URL https://arxiv.org/abs/2108.07258. Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel 11 Published as a conference paper at ICLR 2022 Ziegler, ...
-
[7]
URL https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. Rich Caruana. Multitask learning. Mach. Learn., 28(1):41–75,
-
[8]
Caruana, Multitask Learning. Machine Learning 28, 41–75 (1997), doi:10.1023/A:1007379606734
doi: 10.1023/A:1007379606734. URL https://doi.org/10.1023/A:1007379606734. Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. QuAC: Question answering in context. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2174–2184, Brussels, Belgium, Octobe...
-
[9]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
Association for Computational Linguistics. doi: 10.18653/v1/D18-1241. URL https://aclanthology.org/D18-1241. Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. Boolq: Exploring the surprising difficulty of natural yes/no questions. CoRR, abs/1905.10044,
-
[10]
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
URL http://arxiv.org/abs/1905.10044. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv:1803.05457v1,
-
[11]
A unified architecture for natural language processing: deep neural networks with multitask learning
Ronan Collobert and Jason Weston. A unified architecture for natural language processing: deep neural networks with multitask learning. In William W. Cohen, Andrew McCallum, and Sam T. Roweis, editors, Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, volume 307 of ACM International ...
-
[12]
URL https://doi.org/10.1145/1390156.1390177
doi: 10.1145/1390156.1390177. URL https://doi.org/10.1145/1390156.1390177. Ido Dagan, Oren Glickman, and Bernardo Magnini. The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop, pages 177–190. Springer,
-
[13]
Pradeep Dasigi, Nelson F. Liu, Ana Marasovic, Noah A. Smith, and Matt Gardner. Quoref: A reading comprehension dataset with questions requiring coreferential reasoning. arXiv:1908.05803v2,
-
[14]
Association for Computational Linguistics. doi: 10.18653/v1/W18-5102. URL https://www.aclweb.org/anthology/W18-5102. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for...
-
[15]
URL https://doi.org/10.5281/zenodo.5371628. Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. The third pascal recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1–9. Association for Computational Linguistics,
-
[16]
Samsum corpus: A human-annotated dialogue dataset for abstractive summarization
Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237,
-
[17]
Twitter sentiment classification using distant supervision
Alec Go, Richa Bhayani, and Lei Huang. Twitter sentiment classification using distant supervision. CS224N project report, Stanford, 1(12):2009,
-
[18]
A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks
URL http://arxiv.org/abs/1611.01587. Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. Teaching machines to read and comprehend. In Advances in neural information processing systems, pages 1693–1701,
-
[19]
Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi
URL https://aclanthology.org/H01-1069. Lifu Huang, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Cosmos qa: Machine reading comprehension with contextual commonsense reasoning. In arXiv:1909.00277v2,
-
[20]
Johannes Welbl, Nelson F. Liu, and Matt Gardner. Crowdsourcing multiple choice science questions. arXiv:1707.06209v1,
-
[21]
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. triviaqa: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. arXiv e-prints, art. arXiv:1705.03551,
-
[22]
Unifiedqa: Crossing format boundaries with a single QA system. CoRR, abs/2005.00700, 2020a
Daniel Khashabi, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. Unifiedqa: Crossing format boundaries with a single QA system. CoRR, abs/2005.00700, 2020a. URL https://arxiv.org/abs/2005.00700. Daniel Khashabi, Sewon Min, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Clark, and Hannaneh Hajishirzi. UNIFIEDQA: C...
-
[23]
Boseop Kim, HyoungSeok Kim, Sang-Woo Lee, Gichang Lee, Donghyun Kwak, Dong Hyeon Jeon, Sunghyun Park, Sungju Kim, Seonhoon Kim, Dongpil Seo, et al. What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers. arXiv preprint arXiv:2109.04650,
-
[24]
Quantifying the carbon emissions of machine learning
Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700,
-
[25]
RACE: Large-scale ReAding Comprehension Dataset From Examinations
Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. Race: Large-scale reading comprehension dataset from examinations. arXiv preprint arXiv:1704.04683,
-
[26]
Neural Text Generation from Structured Data with Application to the Biography Domain
Rémi Lebret, David Grangier, and Michael Auli. Generating text from structured data with application to the biography domain. CoRR, abs/1603.07771,
- [27]
-
[29]
The Power of Scale for Parameter-Efficient Prompt Tuning
URL https://arxiv.org/abs/2104.08691. Hector Levesque, Ernest Davis, and Leora Morgenstern. The winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning,
-
[30]
Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics,
-
[31]
URL https://aclanthology.org/C02-1150. Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November
-
[32]
doi: 10.18653/v1/2020.findings-emnlp.165
Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.165. URL https://aclanthology.org/2020.findings-emnlp.165. Kevin Lin, Oyvind Tafjord, Peter Clark, and Matt Gardner. Reasoning over paragraph effects in situations. In MRQA@EMNLP,
-
[33]
Robert L Logan, Ivana Balažević, Eric Wallace, Fabio Petroni, Sameer Singh, and Sebastian Riedel. Cutting down on prompts and parameters: Simple few-shot learning with language models. arXiv preprint arXiv:2106.13353,
-
[35]
URL http://arxiv.org/abs/1806.08730. R. Thomas McCoy, Ellie Pavlick, and Tal Linzen. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. CoRR, abs/1902.01007,
-
[36]
Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
URL http://arxiv.org/abs/1902.01007. Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In EMNLP,
-
[38]
Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R
URL https://arxiv.org/abs/2104.08773. Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, Online, November
-
[39]
Association for Computational Linguistics. Shashi Narayan, Shay B. Cohen, and Mirella Lapata. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. ArXiv, abs/1808.08745,
-
[40]
Carbon Emissions and Large Neural Network Training
David Patterson, Joseph Gonzalez, Quoc Le, Chen Liang, Lluis-Miquel Munguia, Daniel Rothchild, David So, Maud Texier, and Jeff Dean. Carbon emissions and large neural network training. arXiv preprint arXiv:2104.10350,
-
[42]
Mohammad Taher Pilehvar and José Camacho-Collados
URL https://arxiv.org/abs/2105.11447. Mohammad Taher Pilehvar and José Camacho-Collados. Wic: 10,000 example pairs for evaluating context-sensitive representations. CoRR, abs/1808.09121,
-
[43]
URL http://arxiv.org/abs/1808.09121. Adam Poliak, Aparajita Haldar, Rachel Rudinger, J. Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. Collecting diverse natural language inference problems for sentence representation evaluation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 67–8...
-
[44]
Association for Computational Linguistics. doi: 10.18653/v1/D18-1007. URL https://aclanthology.org/D18-1007. Yada Pruksachatkun, Jason Phang, Haokun Liu, Phu Mon Htut, Xiaoyi Zhang, Richard Yuanzhe Pang, Clara Vania, Katharina Kann, and Samuel R. Bowman. Intermediate-task transfer learning with pretrained language models: When and why does it work? InP...
-
[45]
doi: 10.18653/v1/2020.acl-main.467
Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.467. URL https://aclanthology.org/2020.acl-main.467. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9,
-
[46]
SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, art. arXiv:1606.05250,
-
[48]
Adam Roberts, Colin Raffel, and Noam Shazeer
URL https://arxiv.org/abs/2102.07350. Adam Roberts, Colin Raffel, and Noam Shazeer. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online, November
-
[49]
doi: 10.18653/v1/2020.emnlp-main.437
Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.437. URL https://aclanthology.org/2020.emnlp-main.437. Melissa Roemmele, Cosmin Adrian Bejan, and Andrew S Gordon. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series,
-
[50]
Getting closer to AI complete question answering: A set of prerequisite real tasks
Anna Rogers, Olga Kovaleva, Matthew Downey, and Anna Rumshisky. Getting closer to AI complete question answering: A set of prerequisite real tasks. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, The Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, The Tenth AAAI Symposium on Educational...
-
[51]
Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme
URL https://aaai.org/ojs/index.php/AAAI/article/view/6398. Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. Gender bias in coreference resolution. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, June
-
[52]
Association for Computational Linguistics. Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing,
-
[53]
URL http://dx.doi.org/10.18653/v1/D15-1044
doi: 10.18653/v1/d15-1044. URL http://dx.doi.org/10.18653/v1/D15-1044. Amrita Saha, Rahul Aralikatte, Mitesh M. Khapra, and Karthik Sankaranarayanan. DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. In Meeting of the Association for Computational Linguistics (ACL),
-
[55]
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
URL http://arxiv. org/abs/1907.10641. Timo Schick and Hinrich Sch ¨utze. Exploiting cloze-questions for few-shot text classification and natural language inference. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 255–269, Online, April
-
[56]
URL https://aclanthology.org/2021.eacl-main.20
Association for Computational Linguistics. URL https://aclanthology.org/2021.eacl-main.20. Roy Schwartz, Jesse Dodge, Noah A Smith, and Oren Etzioni. Green ai. Communications of the ACM, 63(12):54–63,
-
[58]
Get To The Point: Summarization with Pointer-Generator Networks
URL http://arxiv.org/abs/1704.04368. Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In International Conference on Machine Learning, pages 4596–4604. PMLR,
-
[60]
Emma Strubell, Ananya Ganesh, and Andrew McCallum
URL http://arxiv.org/abs/1908.09203. Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and policy considerations for deep learning in NLP. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 3645–3650,
-
[61]
DREAM: A challenge dataset and models for dialogue-based reading comprehension
16 Published as a conference paper at ICLR 2022 Kai Sun, Dian Yu, Jianshu Chen, Dong Yu, Yejin Choi, and Claire Cardie. DREAM: A challenge dataset and models for dialogue-based reading comprehension. Transactions of the Association for Computational Linguistics,
-
[62]
DREAM: A Challenge Dataset and Models for Dialogue-Based Reading Comprehension
URL https://arxiv.org/abs/1902.00164v1. Oyvind Tafjord, Matt Gardner, Kevin Lin, and Peter Clark. QuaRTz: An open-domain dataset of qualitative relationship questions. EMNLP, 2019. Oyvind Tafjord, Peter Clark, Matt Gardner, Wen-tau Yih, and Ashish Sabharwal. Quarel: A dataset and models for answering questions about qualitative relationships. CoRR, abs...
-
[63]
QuaRel: A Dataset and Models for Answering Questions about Qualitative Relationships
URL http://arxiv.org/abs/1811.08048. Tu Vu, Tong Wang, Tsendsuren Munkhdalai, Alessandro Sordoni, Adam Trischler, Andrew Mattarella-Micke, Subhransu Maji, and Mohit Iyyer. Exploring and predicting transferability across NLP tasks. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7882–7926, Online, November
-
[64]
doi: 10.18653/v1/2020.emnlp-main.635
Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.635. URL https://aclanthology.org/2020.emnlp-main.635. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. CoRR, abs/1905....
-
[65]
Alex Warstadt, Amanpreet Singh, and Samuel R Bowman
URL https://arxiv.org/abs/2104.14690. Alex Warstadt, Amanpreet Singh, and Samuel R Bowman. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471,
-
[66]
Jason Wei, Maarten Bosma, Vincent Y
URL https://arxiv.org/abs/2109.01247. Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V. Le. Finetuned language models are zero-shot learners,
-
[67]
Anlizing the adversarial natural language inference dataset
Adina Williams, Tristan Thrush, and Douwe Kiela. Anlizing the adversarial natural language inference dataset. arXiv preprint arXiv:2010.12729,
-
[68]
Crossfit: A few-shot learning challenge for cross-task generalization in nlp
Qinyuan Ye, Bill Yuchen Lin, and Xiang Ren. Crossfit: A few-shot learning challenge for cross-task generalization in nlp. arXiv preprint arXiv:2104.08835,
-
[69]
URL https://arxiv.org/abs/2104.08835. Yang Yi, Yih Wen-tau, and Christopher Meek. WikiQA: A Challenge Dataset for Open-Domain Question Answering. Association for Computational Linguistics, page 2013–2018,
-
[70]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi
doi: 10.18653/v1/D15-1237. Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics,
-
[71]
ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension
Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. Record: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885,
-
[72]
Character-level convolutional networks for text classification
Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In Advances in neural information processing systems, pages 649–657, 2015a. Xiang Zhang, Junbo Jake Zhao, and Yann LeCun. Character-level convolutional networks for text classification. In NIPS, 2015b. Yu...
-
[73]
Gender bias in coreference resolution: Evaluation and debiasing methods
Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 15–20, New Orleans, Louisi...
-
[74]
Association for Computational Linguistics. doi: 10.18653/v1/N18-2003. URL https://aclanthology.org/N18-2003. Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. Calibrate before use: Improving few-shot performance of language models,
-
[76]
URL https://arxiv.org/abs/2104.04670.
A CONTRIBUTIONS AND PROJECT STRUCTURE
This research was conducted under the BigScience project for open research,4 a year-long initiative targeting the study of large models and datasets. The goal of the project is to research language models in a public environment outside large technology companies. The project has...
-
[77]
and a publicly available model, T5+LM (Lester et al., 2021). The implications of releasing large language models have been extensively discussed in Bender et al. (2021); Bommasani et al. (2021); Solaiman et al. (2019) among others. We expect replicating our work to be within the capabilities of dozens of organizations worldwide, the main barrier being fi...
-
[78]
(also called AX-g under SuperGLUE) and CrowS-Pairs (Nangia et al., 2020). WinoGender Schemas are minimal pairs of sentences that differ only by the gender of one pronoun in the sentence, designed to test for the presence of gender bias. We use the version from Poliak et al. (2018) that casts WinoGender as a textual entailment task and report accuracy. Cr...
-
[79]
consists of mostly straightforward decisions that reflect well-known tasks in the literature: sentiment analysis, topic classification, paraphrase identification, natural language inference, word sense disambiguation, coreference resolution, summarization, and structure-to-text generation. The main difficulty lies in the fact that a large collection of dat...
-
[80]
define a commonsense task as an “attempt to capture physical or scientific reasoning, as distinct from sentence completion, reading comprehension, or broad knowledge question answering.” Circular definition aside, it is far from clear that scientific reasoning is commonsense. Among Brown et al. (2020)’s selection, ARC exemplifies how evaluation of scientific kn...
-
[81]
that training on a paraphrase dataset (QQP) before training on an NLI dataset (RTE) actually hurts performance compared to training on the entailment task only. Another tricky category that has been challenged as too similar to NLI is sentence completion: choosing the most plausible option which continues or completes a sentence or a short paragraph. SW A...
-
[82]
Paris is the capital of France
8 https://github.com/openai/gpt-2/issues/131
Task Dataset T0 Train T0+ Train T0++ Train Eval
Coreference Resolution super glue/wsc.fixed ✓ ✓
Coreference Resolution winogrande/winogrande xl ✓
Natural Language Inference super glue/cb ✓
Natural Language Inference super glue/rte ✓
Natural Language Inference anli ✓...
-
[83]
T0 (p = 5.7) T0 (3B) T0 T0+ T0++
Task Dataset Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med. Mean Med.
Coref. WSC 54.09 57.69 52.40 56.25 60.00 63.46 65.10 64.42 61.45 64.42 62.24 64.42 70.29 69.71
Wino. (XL) 50.65 50.71 58.11 57.22 59.35 58.80 50.97 50.51 59.94 60.46 62.54 61.72 66.42 66.54
NLI ANLI R1 32.89 32.85 39.02 40.05 41.28 43.20 33....
-
[84]
{{input}}
Target Template:
{{output | map(attribute="answer") | list | choice}} {% endif %}
Prompt not for the original task intended by the dataset authors
Input Template:
{% if output %} {{input}}
Target Template:
{{output | map(attribute="answer") | list | choice}} {% endif %}
1.5.4 TRIVIA QA UNFILTERED
Dataset from Joshi et al. (2017). Used in evaluat...
-
[85]
Answer by yes or no. Document: {{passage}} Question: {{question}}?
Target Template:
{% if label != -1 %} {{answer_choices[label]}} {% endif %}
Answer Choices Template:
No ||| Yes
Prompt from Schick and Schütze (2021)
Input Template:
Based on the following passage, {{ question }}? {{ passage }}
Target Templ...
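The target and answer-choices templates quoted in the snippet above can be illustrated with a minimal sketch in plain Python. This is not the paper's templating code: the helper names `parse_answer_choices` and `render_target` are hypothetical, and returning an empty string when `label` is -1 is an assumption about how an unlabeled example should be treated.

```python
from typing import List


def parse_answer_choices(template: str) -> List[str]:
    """Split an answer-choices string such as "No ||| Yes" into a list."""
    return [choice.strip() for choice in template.split("|||")]


def render_target(label: int, answer_choices: List[str]) -> str:
    """Mirror `{% if label != -1 %} {{answer_choices[label]}} {% endif %}`.

    Assumption: label == -1 marks an example without a gold label, so the
    target is left empty.
    """
    if label == -1:
        return ""
    return answer_choices[label]


choices = parse_answer_choices("No ||| Yes")
print(render_target(1, choices))  # label 1 selects the second choice
print(render_target(0, choices))  # label 0 selects the first choice
```

The explicit `label != -1` guard matters in this sketch because Python would otherwise treat index -1 as "last element" and silently return a choice for an unlabeled example.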