SQuAD: 100,000+ Questions for Machine Comprehension of Text
Pith reviewed 2026-05-12 10:05 UTC · model grok-4.3
The pith
SQuAD supplies over 100,000 crowd-sourced questions on Wikipedia articles where each answer is a contiguous text segment from the passage.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research.
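The 51.0%, 20%, and 86.8% figures are token-overlap F1 scores between predicted and gold answer spans. A minimal sketch of that metric (not the official SQuAD evaluation script, which additionally normalizes case, punctuation, and articles before tokenizing):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 between a predicted span and one gold span."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, golds: list[str]) -> float:
    """SQuAD scores a prediction against each gold answer and keeps the max."""
    return max(token_f1(prediction, g) for g in golds)
```

Because the score is computed over tokens rather than exact strings, a prediction that clips or pads the gold span still earns partial credit, which is why F1 is reported alongside exact match.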
What carries the argument
The SQuAD dataset of crowdworker questions on Wikipedia passages with extractive text-span answers.
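The extractive constraint that carries the argument can be stated concretely: every answer must be recoverable from the passage by a character offset. A toy record in the shape of the released JSON (the `context`/`question`/`answers`/`answer_start` field names follow SQuAD v1.1; the example text itself is invented):

```python
# Illustrative SQuAD-style record; the answer is a contiguous span of the
# passage, located by a character offset into `context`.
record = {
    "context": "Tesla was born in 1856 in the village of Smiljan.",
    "question": "In what year was Tesla born?",
    "answers": [{"text": "1856", "answer_start": 18}],
}

def answer_is_extractive(rec: dict) -> bool:
    """Check the span constraint that defines SQuAD answers: the answer
    text must appear verbatim in the context at the stated offset."""
    ctx = rec["context"]
    return all(
        ctx[a["answer_start"]:a["answer_start"] + len(a["text"])] == a["text"]
        for a in rec["answers"]
    )
```

This constraint is what makes evaluation mechanical: systems are compared on span overlap rather than on free-form answer generation.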
If this is right
- Supplies a large training resource that can drive development of extractive question-answering systems.
- Exposes the specific reasoning steps (such as coreference or multi-sentence inference) needed to answer many questions.
- Creates a clear performance gap that future models must close to demonstrate real text understanding.
- Allows direct comparison of machine and human answers on identical passages and questions.
Where Pith is reading between the lines
- Success on SQuAD could transfer to other tasks that require pulling precise information from documents.
- The dataset's focus on Wikipedia may encourage models that generalize better across factual text domains.
- Repeated use of this benchmark could shift evaluation standards toward requiring explicit reasoning traces rather than end-to-end accuracy alone.
Load-bearing premise
Questions written by crowdworkers on Wikipedia articles will mainly test genuine reading comprehension and reasoning instead of surface patterns or outside knowledge.
What would settle it
A model that reaches human-level accuracy on the dataset by using only word overlap or external knowledge without reading the passages.
read the original abstract
We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. It analyzes the types of reasoning required using dependency and constituency trees, builds a logistic regression model achieving 51.0% F1 (versus a 20% baseline), reports human performance at 86.8% F1, and releases the dataset publicly as a challenge problem.
Significance. If the questions predominantly require comprehension of the supplied passages, SQuAD provides a large-scale, span-based benchmark that has the potential to drive substantial progress in machine reading comprehension. Strengths include the dataset scale, public availability, explicit analysis of reasoning types via parse trees, and the clear gap between model and human performance. These elements support its role as a reproducible challenge for the field.
major comments (1)
- §2 (Dataset Construction): The question collection process does not include a control experiment such as passage-ablated accuracy to verify that questions cannot be answered using external world knowledge alone. This is load-bearing for the central claim that the dataset tests machine comprehension of the provided text rather than retrieval or prior knowledge.
minor comments (2)
- Abstract: The mention of 'leaning heavily on dependency and constituency trees' for analysis lacks specifics on which tree features are extracted or how they are encoded for the logistic regression; expanding this would improve clarity.
- §4 (Models): The simple baseline that achieves 20% F1 is referenced but not described in detail (e.g., whether it selects random spans or uses frequency heuristics); specifying it would aid interpretation of the reported improvement.
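For intuition about what a "simple baseline" in this setting looks like, here is a hypothetical sentence-level word-overlap heuristic: return the passage sentence sharing the most word types with the question. This is illustrative only, not the paper's actual 20% baseline, whose definition the comment above asks the authors to specify.

```python
def overlap_baseline(context: str, question: str) -> str:
    """Pick the context sentence with the largest word-type overlap with
    the question -- a crude stand-in for a simple lexical baseline.
    (Illustrative; not the baseline described in the paper.)"""
    q_words = set(question.lower().split())
    sentences = [s.strip() for s in context.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_words & set(s.lower().split())))
```

A heuristic like this often lands near the answer sentence but cannot pick out the exact span, which is one reason token-level F1 for such baselines stays low.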
Simulated Author's Rebuttal
We thank the referee for the positive assessment of SQuAD's significance and for the constructive major comment. We address it point by point below.
read point-by-point responses
- Referee: §2 (Dataset Construction): The question collection process does not include a control experiment such as passage-ablated accuracy to verify that questions cannot be answered using external world knowledge alone. This is load-bearing for the central claim that the dataset tests machine comprehension of the provided text rather than retrieval or prior knowledge.
- Authors: We agree that a passage-ablated control would strengthen the central claim that SQuAD measures comprehension of the supplied text. The original manuscript does not contain such an experiment. Crowdworkers were instructed to pose questions answerable from the passage, and our parse-tree analysis of reasoning types provides indirect evidence that syntactic and semantic processing of the text is required for many questions. To directly address the concern, we will add a passage-ablated evaluation in the revised manuscript: both the logistic regression model and human annotators will be tested on the questions with the passage removed. We will report the resulting F1 scores to quantify the degree to which external knowledge alone suffices. revision: yes
Circularity Check
No circularity: dataset release and direct baseline measurement
full rationale
The paper's core contribution is the construction and public release of the SQuAD dataset (100k+ crowdworker questions with passage-span answers) plus straightforward analysis via dependency/constituency trees and a logistic regression baseline (F1 51.0%). No equations, predictions, or first-principles claims exist that reduce by construction to fitted inputs, self-citations, or renamed known results. Reported numbers are direct empirical measurements on the new data; the logistic regression is an off-the-shelf model whose features and performance are independently verifiable. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Crowdworkers can reliably generate questions whose answers are exact text spans within the provided passage.
- Domain assumption: Dependency and constituency trees are sufficient to categorize the reasoning types needed for the questions.
Forward citations
Cited by 37 Pith papers
- Online Learning-to-Defer with Varying Experts. Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
- Passage Re-ranking with BERT. Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
- PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts. PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
- TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations. TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity. EPGS detects stubborn hallucinations by perturbing embeddings with noise and tracking gradient magnitude spikes, outperforming entropy and representation baselines as a proxy for loss landscape sharpness.
- Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation. Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.
- Multitask Prompted Training Enables Zero-Shot Task Generalization. Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
- BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
- MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
- Structured Recurrent Mixers for Massively Parallelized Sequence Generation. Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.
- Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls. Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation. Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...
- Are Large Language Models Economically Viable for Industry Deployment? Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.
- HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents. HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...
- Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents. DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.
- Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs. Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
- Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation. MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning. MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures. Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.
- GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering. GCoT-decoding combines Fibonacci sampling, heuristic backtracking, span-based confidence scoring, and semantic consensus aggregation to enable general chain-of-thought reasoning without task-specific prompts.
- JU'A: A Benchmark for Information Retrieval in Brazilian Legal Text Collections. JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
- Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning. GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.
- NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models. DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
- Textbooks Are All You Need II: phi-1.5 technical report. phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
- ST-MoE: Designing Stable and Transferable Sparse Expert Models. ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...
- MDN: Parallelizing Stepwise Momentum for Delta Linear Attention. MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
- From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity. EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.
- The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation. Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.
- Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices. Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.
- Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models. Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
- Humanity's Last Exam. Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
- Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators. Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.
- Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking. Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
- GLU Variants Improve Transformer. Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).
- [4] P. Clark and O. Etzioni. 2016. My computer is an honor student but how intelligent is it? Standardized tests as a measure of AI. AI Magazine, 37(1):5--12.
- [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248--255.
- [6] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59--79.
- [7] S. N. Gaikwad, D. Morina, R. Nistala, M. Agarwal, A. Cossette, R. Bhanu, S. Savage, V. Narwal, K. Rajpal, J. Regino, et al. 2015. Daemo: A self-governed crowdsourcing marketplace. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, pages 101--102.
- [8] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).
- [9] F. Hill, A. Bordes, S. Chopra, and J. Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. In International Conference on Learning Representations (ICLR).
- [10] L. Hirschman, M. Light, E. Breck, and J. D. Burger. 1999. Deep Read: A reading comprehension system. In Association for Computational Linguistics (ACL), pages 325--332.
- [11] M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Empirical Methods in Natural Language Processing (EMNLP), pages 523--533.
- [12] N. Kushman, Y. Artzi, L. Zettlemoyer, and R. Barzilay. 2014. Learning to automatically solve algebra word problems. In Association for Computational Linguistics (ACL).
- [13] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313--330.
- [14] K. Narasimhan and R. Barzilay. 2015. Machine comprehension with discourse relations. In Association for Computational Linguistics (ACL).
- [15] H. T. Ng, L. H. Teo, and J. L. P. Kwan. 2000. A machine learning approach to answering questions for reading comprehension tests. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 124--132.
- [16] D. Ravichandran and E. Hovy. 2002. Learning surface text patterns for a question answering system. In Association for Computational Linguistics (ACL), pages 41--47.
- [17] M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 193--203.
- [18] E. Riloff and M. Thelen. 2000. A rule-based question answering system for reading comprehension tests. In ANLP/NAACL Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems, pages 13--19.
- [19]
- [20] D. Shen and D. Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), pages 889--896.
- [21] M. Shirakawa, T. Hara, and S. Nishio. 2015. N-gram IDF: A global term weighting scheme based on information distance. In World Wide Web (WWW), pages 960--970.
- [22] H. Sun, N. Duan, Y. Duan, and M. Zhou. 2013. Answer extraction from passage graph for question answering. In International Joint Conference on Artificial Intelligence (IJCAI).
- [23] E. M. Voorhees and D. M. Tice. 2000. Building a question answering test collection. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 200--207.
- [24]
- [25] H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In Association for Computational Linguistics (ACL).
- [26]
- [27] Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2013--2018.