Adversarial Examples for Evaluating Reading Comprehension Systems
pith:ZHU6TP4W
Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of $75\%$ F1 score to $36\%$; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to $7\%$. We hope our insights will motivate the development of new models that understand language more precisely.
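The evaluation idea in the abstract can be sketched in a few lines of Python: score a system's answer with token-level F1, append a distracting sentence to the paragraph, and score again. This is only an illustration under stated assumptions, not the authors' implementation. The function names (`normalize`, `f1_score`, `add_distractor`, `overlap_model`), the paragraph, the question, and the distractor are all invented for the example. The distractor is hand-written rather than automatically generated, and the word-overlap "model" is a toy stand-in, not one of the published models the abstract refers to. The normalization follows the general style of SQuAD-type scoring (lowercase, drop punctuation and articles).

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction, gold):
    """Token-level F1 between a predicted answer and a gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def add_distractor(paragraph, distractor):
    """Hypothetical helper: append a distracting sentence to the paragraph."""
    return paragraph.rstrip() + " " + distractor


def overlap_model(paragraph, question):
    """Toy stand-in model (an assumption, not a real QA system).

    Returns the sentence sharing the most words with the question;
    ties go to the later sentence.
    """
    q_words = set(normalize(question).split())
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]
    best, best_score = "", -1
    for sentence in sentences:
        score = len(q_words & set(normalize(sentence).split()))
        if score >= best_score:
            best, best_score = sentence, score
    return best


# Invented example data, not taken from SQuAD.
paragraph = ("The Riverside Library was established by Alice Moreno in 1910. "
             "It holds many rare maps.")
question = "Who founded the Riverside Library?"
gold_answer = "Alice Moreno"
distractor = "Bob Chen founded the Lakeside Library."

clean_f1 = f1_score(overlap_model(paragraph, question), gold_answer)
adversarial_paragraph = add_distractor(paragraph, distractor)
adversarial_f1 = f1_score(overlap_model(adversarial_paragraph, question), gold_answer)

print("F1 on original paragraph:", round(clean_f1, 3))
print("F1 on adversarial paragraph:", round(adversarial_f1, 3))
```

The distractor does not change the correct answer, and a human reader would not be misled by it, yet the overlap-based toy model switches to the inserted sentence and its F1 falls to zero. Averaging such per-question scores over a dataset, before and after insertion, gives the kind of comparison the abstract reports.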
Forward citations
Cited by 3 Pith papers
- Online Learning-to-Defer with Varying Experts
  Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
- Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
  Scaling multiple-choice questions to 100 options on a Korean error detection task shows that LLM performance on conventional benchmarks overstates true competence due to shortcut strategies.
- ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?
  ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactua...