Cross-Task Generalization via Natural Language Crowdsourcing Instructions
Pith reviewed 2026-05-18 01:53 UTC · model grok-4.3
The pith
Language models generalize 19% better to unseen tasks when given human-written instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generative pre-trained language models that receive task instructions along with input data generate better outputs on tasks not seen during training, achieving a 19% improvement in generalization performance over models without access to instructions.
What carries the argument
A meta-dataset of 61 tasks with their human-authored instructions mapped to a unified schema, which is used to train models that encode the instruction text together with the input to produce the task output.
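A minimal sketch of what "encoding the instruction text together with the input" could look like for a text-to-text model. The `TaskInstance` class, its field names, and the `Instruction:` / `Input:` / `Output:` layout are illustrative assumptions, not the paper's actual schema or prompt format.

```python
from dataclasses import dataclass


@dataclass
class TaskInstance:
    # Hypothetical field names; the paper's unified schema is not reproduced here.
    instruction: str  # human-authored task instruction
    input_text: str   # the instance input
    output_text: str  # the expected task output (the generation target)


def build_source(instance: TaskInstance, use_instruction: bool = True) -> str:
    """Build the text a generative model would read for one instance.

    With use_instruction=False this yields the instruction-free baseline input.
    """
    parts = []
    if use_instruction:
        parts.append(f"Instruction: {instance.instruction.strip()}")
    parts.append(f"Input: {instance.input_text.strip()}")
    parts.append("Output:")
    return "\n".join(parts)
```

Keeping the instruction-aware and instruction-free variants behind one flag makes the two conditions differ only in the presence of the instruction text, which is the contrast the claim depends on.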
If this is right
- Cross-task generalization can be directly measured by training on a subset of tasks and evaluating on completely held-out tasks.
- Human-authored instructions provide a way to define tasks that transfers across different problems.
- Models benefit from instructions specifically in the setting of unseen tasks rather than just seen ones.
- Significant room remains for progress since current models fall short of an estimated upper bound performance.
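The first point above rests on splitting at the level of tasks, not instances. A rough sketch of such a split follows; the function names, the seed, and the number of held-out tasks are hypothetical choices for illustration and are not taken from the paper.

```python
import random


def split_tasks(task_names, num_unseen, seed=0):
    """Partition task names into (seen, unseen) with no task in both sets."""
    rng = random.Random(seed)
    shuffled = sorted(task_names)  # sort first so the split depends only on the seed
    rng.shuffle(shuffled)
    unseen = sorted(shuffled[:num_unseen])
    seen = sorted(shuffled[num_unseen:])
    return seen, unseen


def instances_for(tasks, instances_by_task):
    """Collect every instance belonging to the given tasks."""
    return [example for task in tasks for example in instances_by_task[task]]
```

Training data would then come from `instances_for(seen, ...)` and evaluation data from `instances_for(unseen, ...)`, so no instance of a held-out task is ever used for training.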
Where Pith is reading between the lines
- Such models could eventually acquire capabilities in new domains by reading descriptions instead of needing large amounts of labeled examples for each.
- This points toward training approaches that treat instructions as a general interface for task specification.
- Extending the set of tasks beyond the current 61 could reveal whether the gains hold for more diverse or real-world problems.
Load-bearing premise
The selected tasks are diverse and non-overlapping so that success on unseen ones reflects genuine understanding of the instructions rather than shared surface features or data overlap.
What would settle it
If replacing the actual instructions with irrelevant or shuffled text still produces the same performance gains on held-out tasks, the benefit would not be attributable to instruction understanding.
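One way to set up this control is to give every task an instruction that belongs to a different task and then rerun the same evaluation. The sketch below is an assumption about how such a mismatch could be built; it uses a cyclic rotation with a random nonzero offset, which guarantees a mismatch for every task but is not a uniformly random permutation.

```python
import random


def shuffle_instructions(instruction_by_task, seed=0):
    """Return a mapping in which each task receives another task's instruction."""
    tasks = sorted(instruction_by_task)
    if len(tasks) < 2:
        raise ValueError("need at least two tasks to mismatch instructions")
    rng = random.Random(seed)
    # A nonzero rotation offset means no task is paired with itself.
    offset = rng.randrange(1, len(tasks))
    rotated = tasks[offset:] + tasks[:offset]
    return {task: instruction_by_task[src] for task, src in zip(tasks, rotated)}
```

If held-out performance with the mismatched mapping stays close to performance with the true instructions, the gain would be hard to attribute to instruction content.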
Original abstract
Humans (e.g., crowdworkers) have a remarkable ability in solving different tasks, by simply reading textual instructions that define them and looking at a few examples. Despite the success of the conventional supervised learning on individual datasets, such models often struggle with generalization across tasks (e.g., a question-answering system cannot solve classification tasks). A long-standing challenge in AI is to build a model that learns a new task by understanding the human-readable instructions that define it. To study this, we introduce NATURAL INSTRUCTIONS, a dataset of 61 distinct tasks, their human-authored instructions, and 193k task instances (input-output pairs). The instructions are obtained from crowdsourcing instructions used to create existing NLP datasets and mapped to a unified schema. Using this meta-dataset, we measure cross-task generalization by training models on seen tasks and measuring generalization to the remaining unseen ones. We adopt generative pre-trained language models to encode task-specific instructions along with input and generate task output. Our results indicate that models benefit from instructions when evaluated in terms of generalization to unseen tasks (19% better for models utilizing instructions). These models, however, are far behind an estimated performance upperbound indicating significant room for more progress in this direction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces NATURAL INSTRUCTIONS, a meta-dataset of 61 distinct NLP tasks with crowdsourced human-authored instructions and 193k input-output instances. Models are trained on a subset of seen tasks and evaluated on held-out unseen tasks; the central claim is that generative pre-trained language models that encode the task instructions along with the input achieve a 19% relative improvement in generalization to unseen tasks compared to instruction-free baselines.
Significance. If the result holds after addressing the evaluation concerns, the work would demonstrate that natural language instructions can support cross-task generalization beyond what is possible with standard supervised learning on individual datasets. The release of the dataset and the unified schema for instructions would be a concrete resource for future research on instruction-following models. The empirical comparison of instruction-aware versus instruction-free conditions on explicitly held-out tasks is a strength when properly controlled.
Major comments (2)
- [Evaluation] Evaluation setup: the claim that performance gains on held-out tasks can be attributed to instruction understanding rather than shared surface patterns requires evidence that the 61 tasks have no substantial overlap in category, output format, or instruction phrasing between train and test splits. No quantitative check (e.g., instruction embedding similarity or category-level leakage analysis) is reported, which is load-bearing for the central generalization claim.
- [Results] Results section: the abstract states a 19% relative gain but supplies no details on exact model sizes, training hyperparameters, number of runs, statistical significance tests, or task similarity controls. This leaves the magnitude and robustness of the reported improvement difficult to assess from the provided evidence.
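The overlap check requested in the Evaluation comment could be prototyped as follows. This is only a sketch: it uses bag-of-words vectors as a dependency-free stand-in for a sentence encoder, and the function names are hypothetical. With real embeddings, only `bow_vector` would need to change.

```python
import math
import re
from collections import Counter


def bow_vector(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def max_train_similarity(train_instructions, test_instructions):
    """For each unseen task, the highest similarity to any seen-task instruction."""
    train_vecs = [bow_vector(text) for text in train_instructions.values()]
    scores = {}
    for name, text in test_instructions.items():
        vec = bow_vector(text)
        scores[name] = max((cosine(vec, tv) for tv in train_vecs), default=0.0)
    return scores
```

An unseen task whose score is close to 1 has a near-duplicate instruction among the seen tasks and would be a candidate for the leakage the comment worries about.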
Minor comments (2)
- [Abstract] The estimated performance upper bound is mentioned in the abstract but not defined or computed in the main text; adding a brief description of how it is obtained would improve clarity.
- [Dataset] Notation for the unified schema of instructions could be introduced earlier with an example to help readers follow the dataset construction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to improve the clarity and rigor of the evaluation and results sections.
- Referee: [Evaluation] Evaluation setup: the claim that performance gains on held-out tasks can be attributed to instruction understanding rather than shared surface patterns requires evidence that the 61 tasks have no substantial overlap in category, output format, or instruction phrasing between train and test splits. No quantitative check (e.g., instruction embedding similarity or category-level leakage analysis) is reported, which is load-bearing for the central generalization claim.
  Authors: We agree that explicit evidence of limited overlap between seen and unseen tasks is important to support the attribution of gains to instruction understanding. The 61 tasks were selected from distinct existing NLP datasets with different objectives, output spaces, and instruction styles, which provides a natural separation. To strengthen this, we will add a quantitative analysis in the revised manuscript: we will compute average cosine similarities between instruction embeddings (using a sentence encoder) across train/test splits and include a category-level breakdown showing minimal leakage in task types or formats. This addition will directly address the concern.
  Revision: yes
- Referee: [Results] Results section: the abstract states a 19% relative gain but supplies no details on exact model sizes, training hyperparameters, number of runs, statistical significance tests, or task similarity controls. This leaves the magnitude and robustness of the reported improvement difficult to assess from the provided evidence.
  Authors: We acknowledge that the abstract and results would benefit from more complete reporting. The manuscript already specifies the use of generative pre-trained models, but we will expand the results section and add an appendix detailing exact model sizes (T5 variants), training hyperparameters, number of runs with different random seeds, and statistical significance tests (e.g., paired t-tests across runs). The task similarity controls will be covered by the new analysis described in our response to the evaluation comment. These changes will make the reported 19% relative improvement easier to evaluate.
  Revision: yes
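The paired t-test mentioned in the response can be sketched with the standard library alone. The function below is an illustrative assumption about the setup: it expects one score per run (or per held-out task) under each condition, matched by position, and returns only the t statistic and degrees of freedom. Obtaining a p-value additionally requires the Student t distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics


def paired_t_statistic(with_instructions, without_instructions):
    """Paired t statistic and degrees of freedom for matched score lists."""
    if len(with_instructions) != len(without_instructions):
        raise ValueError("score lists must be the same length")
    diffs = [a - b for a, b in zip(with_instructions, without_instructions)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return t_stat, n - 1
```

Pairing matters here because both conditions are evaluated on the same held-out tasks and seeds, so the test operates on per-pair differences instead of treating the two score lists as independent samples.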
Circularity Check
No circularity: empirical cross-task evaluation is self-contained
The paper constructs a new meta-dataset of 61 tasks with crowdsourced instructions and reports an empirical result: models trained on seen tasks generalize better to held-out unseen tasks when instructions are provided (19% improvement). This is measured directly via standard train/test splits on task instances, with no equations, fitted parameters, or derivations that reduce the reported gain to a quantity defined by the same data or by self-citation. The central claim rests on the experimental contrast between instruction-aware and instruction-free conditions on explicitly held-out tasks rather than any definitional equivalence or load-bearing self-reference.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: Pre-trained generative language models can meaningfully incorporate task instructions to improve output on held-out tasks.
- Domain assumption: The 61 tasks are sufficiently independent that training on a subset yields no direct knowledge of the held-out tasks.
Forward citations
Cited by 16 Pith papers
- Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
  Randomly replacing labels in in-context demonstrations barely hurts performance, showing that label space, input distribution, and sequence format drive in-context learning more than ground-truth labels.
- PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
  PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with...
- Self-Rewarding Language Models
  Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
- LIMA: Less Is More for Alignment
  Fine-tuning a 65B model on 1,000 high-quality examples produces output that humans rate as good as or better than GPT-4 in 43% of cases, indicating most capabilities come from pretraining.
- VideoChat: Chat-Centric Video Understanding
  VideoChat integrates video models and LLMs via a learnable interface for chat-based spatiotemporal and causal video reasoning, trained on a new video-centric instruction dataset.
- OPT: Open Pre-trained Transformer Language Models
  OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
- Multitask Prompted Training Enables Zero-Shot Task Generalization
  Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
- MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
  MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
- Lessons from the Trenches on Reproducible Evaluation of Language Models
  The paper compiles practical lessons on reproducible LM evaluation and introduces the lm-eval library to mitigate common methodological problems in NLP.
- LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
  LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
- Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive
  DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
- Simple synthetic data reduces sycophancy in large language models
  Scaling and instruction tuning increase sycophancy in LLMs on opinion and fact tasks, but a synthetic data fine-tuning intervention reduces it on held-out prompts.
- PandaGPT: One Model To Instruction-Follow Them All
  A single model trained only on image-text pairs gains instruction-following ability across images, video, and audio by routing all modalities through ImageBind's shared embedding space into Vicuna.
- ART: Automatic multi-step reasoning and tool-use for large language models
  ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
- Large Language Models: A Survey
  The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.