Adversarial Examples for Evaluating Reading Comprehension Systems

arxiv: 1707.07328 · v1 · pith:ZHU6TP4W · submitted 2017-07-23 · cs.CL · cs.LG

Robin Jia, Percy Liang

keywords: systems, accuracy, adversarial, language, models, answer, average, comprehension

Standard accuracy metrics indicate that reading comprehension systems are making rapid progress, but the extent to which these systems truly understand language remains unclear. To reward systems with real language understanding abilities, we propose an adversarial evaluation scheme for the Stanford Question Answering Dataset (SQuAD). Our method tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which are automatically generated to distract computer systems without changing the correct answer or misleading humans. In this adversarial setting, the accuracy of sixteen published models drops from an average of $75\%$ F1 score to $36\%$; when the adversary is allowed to add ungrammatical sequences of words, average accuracy on four models decreases further to $7\%$. We hope our insights will motivate the development of new models that understand language more precisely.
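The evaluation idea in the abstract can be sketched in a few lines of Python: score a model on a paragraph, append a distracting sentence that leaves the correct answer unchanged, and score it again. Everything below is an illustrative assumption rather than the authors' code: `toy_reader` is a hypothetical shallow model, the paragraph and the distractor are invented and hand-written (the paper generates its sentences automatically), and `f1_score` is a simplified token-overlap F1, not the official SQuAD scoring script.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase and split into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def f1_score(prediction, gold):
    """Simplified token-overlap F1 between predicted and gold answers."""
    pred, ref = tokens(prediction), tokens(gold)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def add_distractor(paragraph, distractor):
    """Append a distracting sentence; the gold answer stays the same."""
    return paragraph.rstrip() + " " + distractor


def toy_reader(paragraph, question):
    """Hypothetical shallow model: choose the sentence sharing the most
    words with the question, then return the last capitalized word in it
    that does not already appear in the question."""
    q = set(tokens(question))
    sentences = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    best = max(sentences, key=lambda s: len(q & set(tokens(s))))
    candidates = [w for w in re.findall(r"[A-Za-z0-9]+", best)
                  if w[0].isupper() and w.lower() not in q]
    return candidates[-1] if candidates else ""


def average_f1(model, examples):
    """Mean F1 of a model over (paragraph, question, gold) triples."""
    scores = [f1_score(model(p, q), gold) for p, q, gold in examples]
    return sum(scores) / len(scores)


# Invented example; the place names are fictional.
paragraph = ("The river festival was first held in Norwick in 1990. "
             "It now attracts many visitors.")
question = "In what city was the river festival first held?"
gold = "Norwick"
distractor = ("In what was a surprise, the lake festival was first held "
              "in the city of Eastfield.")

clean = [(paragraph, question, gold)]
adversarial = [(add_distractor(paragraph, distractor), question, gold)]

clean_f1 = average_f1(toy_reader, clean)
adv_f1 = average_f1(toy_reader, adversarial)
print(clean_f1, adv_f1)  # 1.0 0.0
```

The distractor is about a different festival, so a careful human reader would still answer "Norwick", but the word-overlap heuristic latches onto the new sentence. A real evaluation would replace `toy_reader` with a trained model and average over a full dataset.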

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
cs.CL 2026-04 unverdicted novelty 7.0

Scaling multiple-choice questions to 100 options on a Korean error detection task shows that LLM performance on conventional benchmarks overstates true competence due to shortcut strategies.

ReDef: Do Code Language Models Truly Understand Code Changes for Just-in-Time Software Defect Prediction?
cs.SE 2025-09 unverdicted novelty 7.0

ReDef creates a revert-anchored dataset of 3,164 defective and 10,268 clean code modifications and shows that code language models perform better with diff encodings but maintain stable performance under counterfactua...