Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-15 09:47 UTC · model grok-4.3
The pith
Randomly replacing labels in in-context demonstrations barely hurts performance on classification and multiple-choice tasks across many models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ground truth demonstrations are in fact not required: randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multiple-choice tasks, consistently across 12 different models including GPT-3. Instead, other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.
What carries the argument
Random label replacement as a probe to separate the contribution of correct input-label mappings from the provision of label space, input distribution, and sequence format in demonstrations.
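To make the probe concrete, here is a minimal, self-contained sketch (illustrative only, not the authors' code; the sentiment task, prompt template, and function names are assumptions) of how demonstrations with gold versus uniformly random labels can be assembled for an in-context learning query:

```python
import random

LABELS = ["positive", "negative"]          # hypothetical task label space
DEMOS = [("great movie", "positive"),
         ("waste of time", "negative"),
         ("loved every minute", "positive"),
         ("dull and slow", "negative")]

def build_prompt(query: str, randomize_labels: bool, seed: int = 0) -> str:
    """Format k demonstrations followed by the test input.

    With randomize_labels=True each demonstration keeps its input text and
    format, but its label is replaced by one drawn uniformly at random from
    LABELS, removing the gold input-label mapping.
    """
    rng = random.Random(seed)
    lines = []
    for text, gold in DEMOS:
        label = rng.choice(LABELS) if randomize_labels else gold
        lines.append(f"Review: {text}\nSentiment: {label}")
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

gold_prompt = build_prompt("a charming, funny film", randomize_labels=False)
random_prompt = build_prompt("a charming, funny film", randomize_labels=True)
# The paper's finding: scoring the label set under an LM for random_prompt
# yields accuracy close to gold_prompt across the tasks and models studied.
```

Under this sketch, the two conditions differ only in whether the demonstration labels carry the gold mapping; label space, input distribution, and sequence format are held fixed.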
Load-bearing premise
That randomly replacing labels does not introduce unintended statistical cues the models could exploit, and that the chosen classification and multiple-choice tasks are representative of in-context learning behavior more broadly.
What would settle it
A substantial performance drop when labels are randomized on a task where the input-label mapping cannot be inferred from format or distribution cues alone.
Original abstract
Large language models (LMs) are able to in-context learn -- perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance. In this paper, we show that ground truth demonstrations are in fact not required -- randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multi-choce tasks, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence. Together, our analysis provides a new way of understanding how and why in-context learning works, while opening up new questions about how much can be learned from large language models through inference alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines the mechanisms of in-context learning (ICL) in large language models. Through experiments across 12 models (including GPT-3) on classification and multiple-choice tasks, it shows that replacing ground-truth labels in demonstrations with random labels causes only minimal performance drops. The authors argue that demonstrations primarily contribute by exposing the label space, input text distribution, and overall sequence format rather than providing correct input-label mappings.
Significance. If the empirical findings hold, the work provides a substantive reframing of ICL: it shifts emphasis from label correctness to structural cues in the prompt. This has clear implications for prompt engineering, model interpretability, and the limits of what can be learned purely at inference time. The consistent results across model scales and task types strengthen the contribution.
major comments (1)
- [§4.1, §4.2] The random-label generation procedure is described at a high level but lacks explicit confirmation that the random labels are drawn uniformly from the task's label set without replacement or frequency bias; this detail is needed to rule out unintended distributional cues that could explain the near-equivalent performance.
minor comments (2)
- [Abstract] 'multi-choce' is a typo and should be corrected to 'multiple-choice'.
- [§3] The description of the 12 models and their sizes could be consolidated into a single table for easier reference rather than scattered across paragraphs.
Simulated Author's Rebuttal
We thank the referee for their positive assessment and constructive feedback on our work. We address the major comment below.
Point-by-point responses
- Referee: [§4.1, §4.2] The random-label generation procedure is described at a high level but lacks explicit confirmation that the random labels are drawn uniformly from the task's label set without replacement or frequency bias; this detail is needed to rule out unintended distributional cues that could explain the near-equivalent performance.
  Authors: We agree that greater precision on the sampling procedure is warranted to eliminate any ambiguity about potential distributional cues. In the revised manuscript we will explicitly state that, for each demonstration, the original label is replaced by a label drawn uniformly at random from the task's full label set, with independent draws across demonstrations (i.e., sampling with replacement); a minimal sketch of this procedure is given below. This ensures that the random-label demonstrations expose the label space while preserving the original input distribution and sequence format, without introducing frequency bias or other unintended cues.
  Revision: yes
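The sketch below makes the sampling procedure described in the response concrete (an assumption-laden illustration, not the authors' implementation; the NLI label set and counts are hypothetical), together with a quick check that the replacement labels carry no frequency bias:

```python
import random
from collections import Counter

def randomize_labels(gold_labels, label_set, seed=0):
    """Replace every demonstration label with an independent draw made
    uniformly at random from the full label set (i.e., sampling with
    replacement across demonstrations, as described in the response)."""
    rng = random.Random(seed)
    return [rng.choice(label_set) for _ in gold_labels]

# Hypothetical 3-way NLI label set and a deliberately skewed set of gold labels.
label_set = ["entailment", "neutral", "contradiction"]
gold = ["entailment"] * 8 + ["contradiction"] * 8

print(randomize_labels(gold, label_set))

# Over many draws the replacement labels are close to uniform over the label
# set regardless of the gold distribution, so they expose the label space
# without leaking the gold mapping or introducing frequency bias.
print(Counter(randomize_labels(gold * 1000, label_set)))
```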
Circularity Check
No significant circularity; purely empirical analysis
Full rationale
The paper presents an entirely empirical investigation with no mathematical derivations, fitted parameters, or first-principles claims that could reduce to their own inputs by construction. Central findings rest on direct experimental comparisons (random label replacement vs. ground-truth demonstrations across 12 models) that are reported as observed outcomes rather than predictions derived from self-referential definitions. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing support for the core results; the work explicitly frames its contribution around identifying roles of label space, input distribution, and format via controlled experiments. This satisfies the criteria for a self-contained empirical study with no circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models perform tasks by conditioning on input-label pairs provided in the prompt.
Forward citations
Cited by 19 Pith papers
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.
- Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks
  Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.
- LLM4Log: A Systematic Review of Large Language Model-based Log Analysis
  LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.
- Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
  PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
- In-context Learning and Induction Heads
  Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...
- Flamingo: a Visual Language Model for Few-Shot Learning
  Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
- OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
  OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
- Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA
  Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...
- Measuring Representation Robustness in Large Language Models for Geometry
  LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...
- Otter: A Multi-Modal Model with In-Context Instruction Tuning
  Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
- Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them
  Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.
- Emergent Abilities of Large Language Models
  Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.
- Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities
  Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.
- Can LLMs Take Retrieved Information with a Grain of Salt?
  LLMs exhibit systematic failures in obeying expressed certainty in retrieved contexts, but a combination of prior reminders, certainty recalibration, and context simplification reduces obedience errors by 25%.
- When Context Sticks: Studying Interference in In-Context Learning
  In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.
- SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition
  SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.
- Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering
  Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.
- SLM Finetuning for Natural Language to Domain Specific Code Generation in Production
  Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...