pith. machine review for the scientific record.

arxiv: 2202.12837 · v2 · submitted 2022-02-25 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 09:47 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: in-context learning · demonstrations · label replacement · large language models · few-shot prompting · prompt format · classification tasks · GPT-3

The pith

Randomly replacing labels in in-context demonstrations barely hurts performance on classification and multiple-choice tasks across many models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that ground-truth demonstrations are not required for in-context learning in large language models: randomly replacing the labels in the demonstrations barely hurts performance on classification and multiple-choice tasks, consistently across 12 different models including GPT-3. What matters more is that the demonstrations provide examples of the label space, the distribution of the input text, and the overall format of the sequence. This offers a new understanding of why in-context learning works.

Core claim

Ground-truth demonstrations are in fact not required: randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multiple-choice tasks, consistently across 12 different models including GPT-3. Instead, other aspects of the demonstrations drive end-task performance: they provide a few examples of the label space, the distribution of the input text, and the overall format of the sequence.

What carries the argument

Random label replacement as a probe to separate the contribution of correct input-label mappings from the provision of label space, input distribution, and sequence format in demonstrations.
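The probe can be illustrated with a minimal prompt-construction sketch. The helper and template below are hypothetical (the paper's actual prompt templates vary per task); the point is that randomization touches only the labels, leaving the inputs, label space, and format intact:

```python
import random

def build_icl_prompt(demos, test_input, label_space,
                     randomize_labels=False, seed=0):
    """Build a k-shot prompt; optionally replace each gold label with a
    label drawn uniformly at random from the task's label space."""
    rng = random.Random(seed)
    lines = []
    for text, gold in demos:
        label = rng.choice(label_space) if randomize_labels else gold
        lines.append(f"{text}\n{label}")   # input-label pair, format preserved
    lines.append(test_input)               # model predicts the final label
    return "\n\n".join(lines)

# Toy sentiment task: compare gold-label vs random-label prompts.
demos = [("great movie!", "positive"), ("dull and slow.", "negative")]
prompt = build_icl_prompt(demos, "a joy to watch.",
                          ["positive", "negative"], randomize_labels=True)
```

The paper's finding is that feeding the model the `randomize_labels=True` variant of such prompts degrades accuracy far less than removing the demonstrations entirely.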

Load-bearing premise

That randomly replacing labels does not introduce unintended statistical cues the models can exploit, and that the chosen classification and multiple-choice tasks represent broader in-context learning behavior.

What would settle it

A substantial performance drop when labels are randomized on a task where the input-label mapping cannot be inferred from format or distribution cues alone.

Original abstract

Large language models (LMs) are able to in-context learn -- perform a new task via inference alone by conditioning on a few input-label pairs (demonstrations) and making predictions for new inputs. However, there has been little understanding of how the model learns and which aspects of the demonstrations contribute to end task performance. In this paper, we show that ground truth demonstrations are in fact not required -- randomly replacing labels in the demonstrations barely hurts performance on a range of classification and multi-choce tasks, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence. Together, our analysis provides a new way of understanding how and why in-context learning works, while opening up new questions about how much can be learned from large language models through inference alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper examines the mechanisms of in-context learning (ICL) in large language models. Through experiments across 12 models (including GPT-3) on classification and multiple-choice tasks, it shows that replacing ground-truth labels in demonstrations with random labels causes only minimal performance drops. The authors argue that demonstrations primarily contribute by exposing the label space, input text distribution, and overall sequence format rather than providing correct input-label mappings.

Significance. If the empirical findings hold, the work provides a substantive reframing of ICL: it shifts emphasis from label correctness to structural cues in the prompt. This has clear implications for prompt engineering, model interpretability, and the limits of what can be learned purely at inference time. The consistent results across model scales and task types strengthen the contribution.

major comments (1)
  1. [§4.1, §4.2] §4.1 and §4.2: The random-label generation procedure is described at a high level but lacks explicit confirmation that the random labels are drawn uniformly from the task's label set without replacement or frequency bias; this detail is needed to rule out unintended distributional cues that could explain the near-equivalent performance.
minor comments (2)
  1. [Abstract] Abstract: 'multi-choce' is a typo and should be corrected to 'multiple-choice'.
  2. [§3] §3: The description of the 12 models and their sizes could be consolidated into a single table for easier reference rather than scattered across paragraphs.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment and constructive feedback on our work. We address the major comment below.

Point-by-point responses
  1. Referee: [§4.1, §4.2] §4.1 and §4.2: The random-label generation procedure is described at a high level but lacks explicit confirmation that the random labels are drawn uniformly from the task's label set without replacement or frequency bias; this detail is needed to rule out unintended distributional cues that could explain the near-equivalent performance.

    Authors: We agree that greater precision on the sampling procedure is warranted to eliminate any ambiguity about potential distributional cues. In the revised manuscript we will explicitly state that, for each demonstration, the original label is replaced by a label drawn uniformly at random from the task's full label set, with independent draws across demonstrations (i.e., replacement is allowed). This procedure ensures that the random-label demonstrations expose the label space while preserving the original input distribution and sequence format, without introducing frequency bias or other unintended cues. revision: yes
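The sampling procedure the rebuttal describes (uniform, independent draws with replacement from the full label set) can be sketched as follows; the helper name and sanity check are illustrative, not from the paper:

```python
import random
from collections import Counter

def relabel_uniform(demos, label_space, seed=0):
    """Replace each demonstration's label with an independent uniform
    draw from the full label set (draws with replacement, as the
    rebuttal describes)."""
    rng = random.Random(seed)
    return [(text, rng.choice(label_space)) for text, _ in demos]

# Sanity check: aggregated over many seeds, relabeled frequencies approach
# uniform, so no label is systematically over-represented (no frequency bias).
labels = ["positive", "negative"]
demos = [("x%d" % i, "positive") for i in range(4)]
counts = Counter(lab
                 for s in range(1000)
                 for _, lab in relabel_uniform(demos, labels, seed=s))
```

Because each draw is independent and uniform, the relabeled demonstrations still expose the label space while carrying no signal about the true input-label mapping.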

Circularity Check

0 steps flagged

No significant circularity; purely empirical analysis

Full rationale

The paper presents an entirely empirical investigation with no mathematical derivations, fitted parameters, or first-principles claims that could reduce to their own inputs by construction. Central findings rest on direct experimental comparisons (random label replacement vs. ground-truth demonstrations across 12 models) that are reported as observed outcomes rather than predictions derived from self-referential definitions. No self-citation chains, ansatzes, or uniqueness theorems are invoked as load-bearing support for the core results; the work explicitly frames its contribution around identifying roles of label space, input distribution, and format via controlled experiments. This satisfies the criteria for a self-contained empirical study with no circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that random labels preserve performance; no free parameters are fitted to derive the result, and no new entities are postulated.

axioms (1)
  • Domain assumption: Large language models perform tasks by conditioning on input-label pairs provided in the prompt.
    This is the foundational premise of in-context learning research, invoked throughout the abstract.

pith-pipeline@v0.9.0 · 5504 in / 1142 out tokens · 41467 ms · 2026-05-15T09:47:47.546394+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    cs.CL 2022-01 accept novelty 9.0

    Chain-of-thought prompting, by including intermediate reasoning steps in few-shot examples, elicits strong reasoning abilities in large language models on arithmetic, commonsense, and symbolic tasks.

  2. Why Multimodal In-Context Learning Lags Behind? Unveiling the Inner Mechanisms and Bottlenecks

    cs.CV 2026-04 unverdicted novelty 7.0

    Multimodal ICL lags text-only ICL in few-shot settings due to weak cross-modal reasoning alignment and unreliable task mapping transfer, with an inference-stage method proposed to strengthen transfer.

  3. LLM4Log: A Systematic Review of Large Language Model-based Log Analysis

    cs.SE 2026-03 accept novelty 7.0

    LLM4Log is a systematic review of 145 papers on LLM-based log analysis that delivers a unified taxonomy, design patterns, and open challenges for reliable adoption in AIOps.

  4. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  5. In-context Learning and Induction Heads

    cs.LG 2022-09 unverdicted novelty 7.0

    Induction heads, which implement pattern completion in attention, develop at the same training stage as a sudden rise in in-context learning, providing evidence they are the primary mechanism for in-context learning i...

  6. Flamingo: a Visual Language Model for Few-Shot Learning

    cs.CV 2022-04 unverdicted novelty 7.0

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  7. Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space

    cs.CL 2026-05 unverdicted novelty 6.0

    LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...

  8. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  9. Temporal Reasoning Is Not the Bottleneck: A Probabilistic Inconsistency Framework for Neuro-Symbolic QA

    cs.AI 2026-05 unverdicted novelty 6.0

    Temporal reasoning is not the core bottleneck for LLMs on time-based QA; the real issue is unstructured text-to-event mapping, addressed by a neuro-symbolic system with PIS that reaches 100% accuracy on benchmarks whe...

  10. Measuring Representation Robustness in Large Language Models for Geometry

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display accuracy gaps of up to 14 percentage points on the same geometry problems solely due to representation choice, with vector forms consistently weakest and a convert-then-solve prompt helping only high-capa...

  11. Otter: A Multi-Modal Model with In-Context Instruction Tuning

    cs.CV 2023-05 unverdicted novelty 6.0

    Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.

  12. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    cs.CL 2022-10 accept novelty 6.0

    Chain-of-thought prompting enables large language models to surpass average human performance on 17 of 23 challenging BIG-Bench tasks.

  13. Emergent Abilities of Large Language Models

    cs.CL 2022-06 unverdicted novelty 6.0

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  14. Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities

    cs.AI 2026-05 unverdicted novelty 5.0

    Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.

  15. Can LLMs Take Retrieved Information with a Grain of Salt?

    cs.CL 2026-05 unverdicted novelty 5.0

    LLMs exhibit systematic failures in obeying expressed certainty in retrieved contexts, but a combination of prior reminders, certainty recalibration, and context simplification reduces obedience errors by 25%.

  16. When Context Sticks: Studying Interference in In-Context Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    In-context learning shows persistent interference from prior examples, with more misleading linear examples degrading quadratic predictions and training curricula modulating recovery speed.

  17. SAKE: Self-aware Knowledge Exploitation-Exploration for Grounded Multimodal Named Entity Recognition

    cs.IR 2026-04 unverdicted novelty 5.0

    SAKE is an agentic framework for GMNER that uses uncertainty-based self-awareness and reinforcement learning to balance internal knowledge exploitation with adaptive external exploration.

  18. Reducing Redundancy in Retrieval-Augmented Generation through Chunk Filtering

    cs.CL 2026-04 unverdicted novelty 4.0

    Entity-based chunk filtering reduces RAG vector index size by 25-36% with retrieval quality near baseline levels.

  19. SLM Finetuning for Natural Language to Domain Specific Code Generation in Production

    cs.LG 2026-04 unverdicted novelty 3.0

    Fine-tuned small language models outperform larger models in natural language to domain-specific code generation with improved performance, latency, and the ability to adapt to customer-specific scenarios without losi...

Reference graph

Works this paper leans on

237 extracted references · 237 canonical work pages · cited by 19 Pith papers · 6 internal anchors
