pith. machine review for the scientific record.

arxiv: 2202.12837 · v2 · submitted 2022-02-25 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 09:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: in-context learning · demonstrations · label replacement · large language models · few-shot prompting · prompt format · classification tasks · GPT-3

The pith

Randomly replacing labels in in-context demonstrations barely hurts performance on classification and multiple-choice tasks across many models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that ground-truth demonstrations are not required for in-context learning in large language models: randomly replacing the labels in the demonstrations barely hurts performance on classification and multiple-choice tasks, consistently across 12 different models including GPT-3. What matters more is that the demonstrations provide examples of the label space, the distribution of the input text, and the overall format of the sequence. This offers a new understanding of why in-context learning works.

Core claim

Ground-truth demonstrations are in fact not required: randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multiple-choice tasks, consistently across 12 different models including GPT-3. Instead, other aspects of the demonstrations drive end-task performance: they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.

What carries the argument

Random label replacement as a probe to separate the contribution of correct input-label mappings from the provision of label space, input distribution, and sequence format in demonstrations.
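The probe can be illustrated with a minimal prompt-construction sketch. The helper and template below are hypothetical (the paper's actual prompt templates vary per task); the point is that randomization touches only the labels, leaving the inputs, label space, and format intact:

```python
import random

def build_icl_prompt(demos, test_input, label_space,
                     randomize_labels=False, seed=0):
    """Build a k-shot prompt; optionally replace each gold label with a
    label drawn uniformly at random from the task's label space."""
    rng = random.Random(seed)
    lines = []
    for text, gold in demos:
        label = rng.choice(label_space) if randomize_labels else gold
        lines.append(f"{text}\n{label}")   # input-label pair, format preserved
    lines.append(test_input)               # model predicts the final label
    return "\n\n".join(lines)

# Toy sentiment task: compare gold-label vs random-label prompts.
demos = [("great movie!", "positive"), ("dull and slow.", "negative")]
prompt = build_icl_prompt(demos, "a joy to watch.",
                          ["positive", "negative"], randomize_labels=True)
```

The paper's finding is that feeding the model the `randomize_labels=True` variant of such prompts degrades accuracy far less than removing the demonstrations entirely.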

Load-bearing premise

That randomly replacing labels does not introduce unintended statistical cues the models can exploit, and that the chosen classification and multiple-choice tasks represent broader in-context learning behavior.

What would settle it

A substantial performance drop when labels are randomized on a task where the input-label mapping cannot be inferred from format or distribution cues alone.

Original abstract

Large language models (LMs) are able to in-context learn -- perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance. In this paper, we show that ground truth demonstrations are in fact not required -- randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multi-choce tasks, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence. Together, our analysis provides a new way of understanding how and why in-context learning works, while opening up new questions about how much can be learned from large language models through inference alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines the mechanisms of in-context learning (ICL) in large language models. Through experiments across 12 models (including GPT-3) on classification and multiple-choice tasks, it shows that replacing ground-truth labels in demonstrations with random labels causes only minimal performance drops. The authors argue that demonstrations primarily contribute by exposing the label space, input text distribution, and overall sequence format rather than providing correct input-label mappings.

Significance. If the empirical findings hold, the work provides a substantive reframing of ICL: it shifts emphasis from label correctness to structural cues in the prompt. This has clear implications for prompt engineering, model interpretability, and the limits of what can be learned purely at inference time. The consistent results across model scales and task types strengthen the contribution.

major comments (1)
  1. [§4.1, §4.2] §4.1 and §4.2: The random-label generation procedure is described at a high level but lacks explicit confirmation that the random labels are drawn uniformly from the task's label set without replacement or frequency bias; this detail is needed to rule out unintended distributional cues that could explain the near-equivalent performance.
minor comments (2)
  1. [Abstract] Abstract: 'multi-choce' is a typo and should be corrected to 'multiple-choice'.
  2. [§3] §3: The description of the 12 models and their sizes could be consolidated into a single table for easier reference rather than scattered across paragraphs.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment and constructive feedback on our work. We address the major comment below.

Point-by-point responses
  1. Referee: [§4.1, §4.2] §4.1 and §4.2: The random-label generation procedure is described at a high level but lacks explicit confirmation that the random labels are drawn uniformly from the task's label set without replacement or frequency bias; this detail is needed to rule out unintended distributional cues that could explain the near-equivalent performance.

    Authors: We agree that greater precision on the sampling procedure is warranted to eliminate any ambiguity about potential distributional cues. In the revised manuscript we will explicitly state that, for each demonstration, the original label is replaced by a label drawn uniformly at random from the task's full label set, with independent draws across demonstrations (i.e., replacement is allowed). This procedure ensures that the random-label demonstrations expose the label space while preserving the original input distribution and sequence format, without introducing frequency bias or other unintended cues. revision: yes
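The sampling procedure the rebuttal describes (uniform, independent draws with replacement from the full label set) can be sketched as follows; the helper name and sanity check are illustrative, not from the paper:

```python
import random
from collections import Counter

def relabel_uniform(demos, label_space, seed=0):
    """Replace each demonstration's label with an independent uniform
    draw from the full label set (draws with replacement, as the
    rebuttal describes)."""
    rng = random.Random(seed)
    return [(text, rng.choice(label_space)) for text, _ in demos]

# Sanity check: aggregated over many seeds, relabeled frequencies approach
# uniform, so no label is systematically over-represented (no frequency bias).
labels = ["positive", "negative"]
demos = [("x%d" % i, "positive") for i in range(4)]
counts = Counter(lab
                 for s in range(1000)
                 for _, lab in relabel_uniform(demos, labels, seed=s))
```

Because each draw is independent and uniform, the relabeled demonstrations still expose the label space while carrying no signal about the true input-label mapping.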

Circularity Check

0 steps flagged

No significant circularity; purely empirical analysis

Full rationale

The paper presents an entirely empirical investigation with no mathematical derivations, fitted parameters, or first-principles claims that could reduce to their own inputs by construction. Central findings rest on direct experimental comparisons (random label replacement vs. ground-truth demonstrations across 12 models) that are reported as observed outcomes rather than predictions derived from self-referential definitions. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing support for the core results; the work explicitly frames its contribution around identifying roles of label space, input distribution, and format via controlled experiments. This satisfies the criteria for a self-contained empirical study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that random labels preserve performance; no free parameters are fitted to derive the result, and no new entities are postulated.

axioms (1)
  • Domain assumption: Large language models perform tasks by conditioning on input-label pairs provided in the prompt.
    This is the foundational premise of in-context learning research, invoked throughout the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1142 out tokens · 41467 ms · 2026-05-15T09:47:47.546394+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

    cs.CV 2026-04 unverdicted novelty 7.0

    Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.

  3. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  4. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  5. In-context Learning and Induction Heads

    cs.LG 2022-09 unverdicted novelty 7.0

    Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...

  6. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  7. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  8. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  9. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  10. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  11. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  12. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    cs.CL 2022-10 accept novelty 6.0

    Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

  13. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  14. Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

    cs.AI 2026-05 unverdicted novelty 5.0

    Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

  15. Can LLMs Take Retrieved Information with a Grain of Salt?

    cs.CL 2026-05 unverdicted novelty 5.0

    LLMs exhibit systematic failures in obeying expressed certainty in retrieved contexts, but a combination of prior reminders, certainty recalibration, and context simplification reduces obedience errors by 25%.

  16. When Context Sticks: Studying Interference in In-Context Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.

  17. SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

    cs.IR 2026-04 unverdicted novelty 5.0

    SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.

  18. Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

    cs.CL 2026-04 unverdicted novelty 4.0

    Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.

  19. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

Reference graph

Works this paper leans on

237 extracted references · 237 canonical work pages · cited by 19 Pith papers · 6 internal anchors
