SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pith reviewed 2026-05-12 10:05 UTC · model grok-4.3
The pith
SQuAD supplies over 100,000 crowd-sourced questions on Wikipedia articles where each answer is a contiguous text segment from the passage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research.
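The 51.0%, 20%, and 86.8% figures are token-overlap F1 scores between predicted and gold answer spans. A minimal sketch of that metric (not the official SQuAD evaluation script, which additionally normalizes case, punctuation, and articles before tokenizing):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted span and one gold span."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, golds: list[str]) -> float:
    """SQuAD scores a prediction against each gold answer and keeps the max."""
    return max(token_f1(prediction, g) for g in golds)
```

Because the score is computed over tokens rather than exact strings, a prediction that clips or pads the gold span still earns partial credit, which is why F1 is reported alongside exact match.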
What carries the argument
The SQuAD dataset of crowdworker questions on Wikipedia passages with extractive text-span answers.
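The extractive constraint that carries the argument can be stated concretely: every answer must be recoverable from the passage by a character offset. A toy record in the shape of the released JSON (the `context`/`question`/`answers`/`answer_start` field names follow SQuAD v1.1; the example text itself is invented):

```python
# Illustrative SQuAD-style record; the answer is a contiguous span of the
# passage, located by a character offset into `context`.
record = {
    "context": "Tesla was born in 1856 in the village of Smiljan.",
    "question": "In what year was Tesla born?",
    "answers": [{"text": "1856", "answer_start": 18}],
}

def answer_is_extractive(rec: dict) -> bool:
    """Check the span constraint that defines SQuAD answers: the answer
    text must appear verbatim in the context at the stated offset."""
    ctx = rec["context"]
    return all(
        ctx[a["answer_start"]:a["answer_start"] + len(a["text"])] == a["text"]
        for a in rec["answers"]
    )
```

This constraint is what makes evaluation mechanical: systems are compared on span overlap rather than on free-form answer generation.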
If this is right
- Supplies a large training resource that can drive development of extractive question-answering systems.
- Exposes the specific reasoning steps (such as coreference or multi-sentence inference) needed to answer many questions.
- Creates a clear performance gap that future models must close to demonstrate real text understanding.
- Allows direct comparison of machine and human answers on identical passages and questions.
Where Pith is reading between the lines
- Success on SQuAD could transfer to other tasks that require pulling precise information from documents.
- The dataset's focus on Wikipedia may encourage models that generalize better across factual text domains.
- Repeated use of this benchmark could shift evaluation standards toward requiring explicit reasoning traces rather than end-to-end accuracy alone.
Load-bearing premise
Questions written by crowdworkers on Wikipedia articles will mainly test genuine reading comprehension and reasoning instead of surface patterns or outside knowledge.
What would settle it
A model that reaches human-level accuracy on the dataset by using only word overlap or external knowledge without reading the passages.
read the original abstract
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. It analyzes the types of reasoning required using dependency and constituency trees, builds a logistic regression model achieving 51.0% F1 (versus a 20% baseline), reports human performance at 86.8% F1, and releases the dataset publicly as a challenge problem.
Significance. If the questions predominantly require comprehension of the supplied passages, SQuAD provides a large-scale, span-based benchmark that has the potential to drive substantial progress in machine reading comprehension. Strengths include the dataset scale, public availability, explicit analysis of reasoning types via parse trees, and the clear gap between model and human performance. These elements support its role as a reproducible challenge for the field.
major comments (1)
- §2 (Dataset Construction): The question collection process does not include a control experiment such as passage-ablated accuracy to verify that questions cannot be answered using external world knowledge alone. This is load-bearing for the central claim that the dataset tests machine comprehension of the provided text rather than retrieval or prior knowledge.
minor comments (2)
- Abstract: The mention of 'leaning heavily on dependency and constituency trees' for analysis lacks specifics on which tree features are extracted or how they are encoded for the logistic regression; expanding this would improve clarity.
- §4 (Models): The simple baseline that achieves 20% F1 is referenced but not described in detail (e.g., whether it selects random spans or uses frequency heuristics); specifying it would aid interpretation of the reported improvement.
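For intuition about what a "simple baseline" in this setting looks like, here is a hypothetical sentence-level word-overlap heuristic: return the passage sentence sharing the most word types with the question. This is illustrative only, not the paper's actual 20% baseline, whose definition the comment above asks the authors to specify.

```python
def overlap_baseline(context: str, question: str) -> str:
    """Pick the context sentence with the largest word-type overlap with
    the question -- a crude stand-in for a simple lexical baseline.
    (Illustrative; not the baseline described in the paper.)"""
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))
```

A heuristic like this often lands near the answer sentence but cannot pick out the exact span, which is one reason token-level F1 for such baselines stays low.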
Simulated Author's Rebuttal
We thank the referee for the positive assessment of SQuAD's significance and for the constructive major comment. We address it point by point below.
read point-by-point responses
- Referee: §2 (Dataset Construction): The question collection process does not include a control experiment such as passage-ablated accuracy to verify that questions cannot be answered using external world knowledge alone. This is load-bearing for the central claim that the dataset tests machine comprehension of the provided text rather than retrieval or prior knowledge.
- Authors: We agree that a passage-ablated control would strengthen the central claim that SQuAD measures comprehension of the supplied text. The original manuscript does not contain such an experiment. Crowdworkers were instructed to pose questions answerable from the passage, and our parse-tree analysis of reasoning types provides indirect evidence that syntactic and semantic processing of the text is required for many questions. To directly address the concern, we will add a passage-ablated evaluation in the revised manuscript: both the logistic regression model and human annotators will be tested on the questions with the passage removed. We will report the resulting F1 scores to quantify the degree to which external knowledge alone suffices. revision: yes
Circularity Check
No circularity: dataset release and direct baseline measurement
full rationale
The paper's core contribution is the construction and public release of the SQuAD dataset (100k+ crowdworker questions with passage-span answers) plus straightforward analysis via dependency/constituency trees and a logistic regression baseline (F1 51.0%). No equations, predictions, or first-principles claims exist that reduce by construction to fitted inputs, self-citations, or renamed known results. Reported numbers are direct empirical measurements on the new data; the logistic regression is an off-the-shelf model whose features and performance are independently verifiable. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Crowdworkers can reliably generate questions whose answers are exact text spans within the provided passage.
- Domain assumption: Dependency and constituency trees are sufficient to categorize the reasoning types needed for the questions.
Forward citations
Cited by 37 Pith papers
- Online Learning-to-Defer with Varying Experts. Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
- Passage Re-ranking with BERT. Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
- PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts. PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
- TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations. TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity. EPGS detects stubborn hallucinations by perturbing embeddings with noise and tracking gradient magnitude spikes, outperforming entropy and representation baselines as a proxy for loss landscape sharpness.
- Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation. Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.
- Multitask Prompted Training Enables Zero-Shot Task Generalization. Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
- MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
- Structured Recurrent Mixers for Massively Parallelized Sequence Generation. Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
- Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls. Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation. Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
- Are Large Language Models Economically Viable for Industry Deployment? Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
- HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents. HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...
- Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents. DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs. Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
- Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation. MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning. MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures. Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.
- GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering. GCoT-decoding combines Fibonacci sampling, heuristic backtracking, span-based confidence scoring, and semantic consensus aggregation to enable general chain-of-thought reasoning without task-specific prompts.
- JU'A: A Benchmark for Information Retrieval in Brazilian Legal Text Collections. JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
- Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning. GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
- Textbooks Are All You Need II: phi-1.5 technical report. phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models. ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
- MDN: Parallelizing Stepwise Momentum for Delta Linear Attention. MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity. EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation. Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
- Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices. Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.
- Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models. Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
- Humanity's Last Exam. Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
- Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators. Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
- Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
- GLU Variants Improve Transformer. Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).
- [4] P. Clark and O. Etzioni. 2016. My computer is an honor student but how intelligent is it? Standardized tests as a measure of AI. AI Magazine, 37(1):5--12.
- [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248--255.
- [6] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59--79.
- [7] S. N. Gaikwad, D. Morina, R. Nistala, M. Agarwal, A. Cossette, R. Bhanu, S. Savage, V. Narwal, K. Rajpal, J. Regino, et al. 2015. Daemo: A self-governed crowdsourcing marketplace. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, pages 101--102.
- [8] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).
- [9] F. Hill, A. Bordes, S. Chopra, and J. Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. In International Conference on Learning Representations (ICLR).
- [10] L. Hirschman, M. Light, E. Breck, and J. D. Burger. 1999. Deep Read: A reading comprehension system. In Association for Computational Linguistics (ACL), pages 325--332.
- [11] M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Empirical Methods in Natural Language Processing (EMNLP), pages 523--533.
- [12] N. Kushman, Y. Artzi, L. Zettlemoyer, and R. Barzilay. 2014. Learning to automatically solve algebra word problems. In Association for Computational Linguistics (ACL).
- [13] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313--330.
- [14] K. Narasimhan and R. Barzilay. 2015. Machine comprehension with discourse relations. In Association for Computational Linguistics (ACL).
- [15] H. T. Ng, L. H. Teo, and J. L. P. Kwan. 2000. A machine learning approach to answering questions for reading comprehension tests. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 124--132.
- [16] D. Ravichandran and E. Hovy. 2002. Learning surface text patterns for a question answering system. In Association for Computational Linguistics (ACL), pages 41--47.
- [17] M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 193--203.
- [18] E. Riloff and M. Thelen. 2000. A rule-based question answering system for reading comprehension tests. In ANLP/NAACL Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems, pages 13--19.
- [19]
- [20] D. Shen and D. Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), pages 889--896.
- [21] M. Shirakawa, T. Hara, and S. Nishio. 2015. N-gram IDF: A global term weighting scheme based on information distance. In World Wide Web (WWW), pages 960--970.
- [22] H. Sun, N. Duan, Y. Duan, and M. Zhou. 2013. Answer extraction from passage graph for question answering. In International Joint Conference on Artificial Intelligence (IJCAI).
- [23] E. M. Voorhees and D. M. Tice. 2000. Building a question answering test collection. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 200--207.
- [24]
- [25] H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In Association for Computational Linguistics (ACL).
- [26]
- [27] Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2013--2018.