pith. machine review for the scientific record.

arxiv: 1606.05250 · v3 · submitted 2016-06-16 · 💻 cs.CL

Recognition: 2 theorem links

SQuAD: 100,000+ Questions for Machine Comprehension of Text

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, Percy Liang

Pith reviewed 2026-05-12 10:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords SQuAD · reading comprehension · question answering · machine comprehension · Wikipedia · crowdsourcing · natural language processing · extractive QA

The pith

SQuAD supplies over 100,000 crowd-sourced questions on Wikipedia articles where each answer is a contiguous text segment from the passage.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a large-scale dataset for testing whether machines can read and understand written text. It gathers questions from many people about encyclopedia passages and requires the answers to come directly from those passages as spans of words. This setup forces systems to perform actual comprehension rather than rely on memorized facts or simple keyword matches. The authors also break down the kinds of reasoning the questions demand and show that current models fall well short of human accuracy on the same questions.
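
To make the span constraint concrete, here is a minimal Python sketch of one SQuAD-style record. The field names mirror the released JSON ("context", "question", "answers" with "text" and "answer_start"), though the real file nests examples under articles and paragraphs; the passage shown is an invented placeholder, not an actual dataset entry.

    # One flattened SQuAD-style example; the invariant the dataset enforces
    # is that every answer occurs verbatim in the passage at a known offset.
    example = {
        "context": "Tesla was born on 10 July 1856 in the village of Smiljan.",
        "question": "When was Tesla born?",
        "answers": [{"text": "10 July 1856", "answer_start": 18}],
    }

    def is_valid_span(ex: dict) -> bool:
        """Check that each annotated answer is a contiguous span of the context."""
        ctx = ex["context"]
        return all(
            ctx[a["answer_start"] : a["answer_start"] + len(a["text"])] == a["text"]
            for a in ex["answers"]
        )

    assert is_valid_span(example)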

Core claim

We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research.
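
The 51.0%, 20%, and 86.8% figures are token-overlap F1 scores between predicted and gold answer spans, averaged over questions. Below is a minimal sketch of that metric, assuming the normalization the official SQuAD evaluator applies (lowercasing, punctuation and article removal); the official script additionally takes the maximum over all gold answers per question, which this single-reference version omits.

    import re
    import string
    from collections import Counter

    def normalize(text: str) -> list[str]:
        """Lowercase, strip punctuation and articles, split on whitespace."""
        text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
        return re.sub(r"\b(a|an|the)\b", " ", text).split()

    def span_f1(prediction: str, gold: str) -> float:
        """Token-overlap F1 between a predicted span and one gold span."""
        pred, ref = normalize(prediction), normalize(gold)
        overlap = sum((Counter(pred) & Counter(ref)).values())
        if overlap == 0:
            return 0.0
        precision, recall = overlap / len(pred), overlap / len(ref)
        return 2 * precision * recall / (precision + recall)

    print(span_f1("the 10th of July 1856", "10 July 1856"))  # ~0.57, partial credit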

What carries the argument

The SQuAD dataset of crowdworker questions on Wikipedia passages with extractive text-span answers.

If this is right

  • Supplies a large training resource that can drive development of extractive question-answering systems.
  • Exposes the specific reasoning steps (such as coreference or multi-sentence inference) needed to answer many questions.
  • Creates a clear performance gap that future models must close to demonstrate real text understanding.
  • Allows direct comparison of machine and human answers on identical passages and questions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Success on SQuAD could transfer to other tasks that require pulling precise information from documents.
  • The dataset's focus on Wikipedia may encourage models that generalize better across factual text domains.
  • Repeated use of this benchmark could shift evaluation standards toward requiring explicit reasoning traces rather than end-to-end accuracy alone.

Load-bearing premise

Questions written by crowdworkers on Wikipedia articles will mainly test genuine reading comprehension and reasoning instead of surface patterns or outside knowledge.

What would settle it

A model that reaches human-level accuracy on the dataset by using only word overlap or external knowledge without reading the passages.

read the original abstract

We present the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. We analyze the dataset to understand the types of reasoning required to answer the questions, leaning heavily on dependency and constituency trees. We build a strong logistic regression model, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%). However, human performance (86.8%) is much higher, indicating that the dataset presents a good challenge problem for future research. The dataset is freely available at https://stanford-qa.com

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents the Stanford Question Answering Dataset (SQuAD), a new reading comprehension dataset consisting of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage. It analyzes the types of reasoning required using dependency and constituency trees, builds a logistic regression model achieving 51.0% F1 (versus a 20% baseline), reports human performance at 86.8% F1, and releases the dataset publicly as a challenge problem.

Significance. If the questions predominantly require comprehension of the supplied passages, SQuAD provides a large-scale, span-based benchmark that has the potential to drive substantial progress in machine reading comprehension. Strengths include the dataset scale, public availability, explicit analysis of reasoning types via parse trees, and the clear gap between model and human performance. These elements support its role as a reproducible challenge for the field.

major comments (1)
  1. §2 (Dataset Construction): The question collection process does not include a control experiment such as passage-ablated accuracy to verify that questions cannot be answered using external world knowledge alone. This is load-bearing for the central claim that the dataset tests machine comprehension of the provided text rather than retrieval or prior knowledge.
minor comments (2)
  1. Abstract: The mention of 'leaning heavily on dependency and constituency trees' for analysis lacks specifics on which tree features are extracted or how they are encoded for the logistic regression; expanding this would improve clarity.
  2. §4 (Models): The simple baseline that achieves 20% F1 is referenced but not described in detail (e.g., whether it selects random spans or uses frequency heuristics); specifying it would aid interpretation of the reported improvement.
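
Since the page leaves that 20% baseline unspecified, here is one plausible naive reading for intuition only: a sliding-window heuristic that returns the passage span sharing the most word types with the question. This is an assumed illustration of how weak lexical matching behaves, not the paper's documented baseline.

    import string

    def words(text: str) -> set[str]:
        """Lowercased word types with surrounding punctuation stripped."""
        return {t.strip(string.punctuation).lower() for t in text.split()}

    def overlap_baseline(passage: str, question: str, max_len: int = 10) -> str:
        """Return the passage span (at most max_len tokens) with the
        highest word-type overlap with the question."""
        tokens, q = passage.split(), words(question)
        best_span, best_score = "", -1
        for i in range(len(tokens)):
            for j in range(i + 1, min(i + max_len, len(tokens)) + 1):
                score = len(words(" ".join(tokens[i:j])) & q)
                if score > best_score:
                    best_score, best_span = score, " ".join(tokens[i:j])
        return best_span

    passage = "Tesla was born on 10 July 1856 in the village of Smiljan."
    print(overlap_baseline(passage, "When was Tesla born?"))
    # -> "Tesla was born": the heuristic echoes the question and misses the
    #    answer, the kind of failure that keeps such baselines far below human F1.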

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of SQuAD's significance and for the constructive major comment. We address it point by point below.

read point-by-point responses
  1. Referee: §2 (Dataset Construction): The question collection process does not include a control experiment such as passage-ablated accuracy to verify that questions cannot be answered using external world knowledge alone. This is load-bearing for the central claim that the dataset tests machine comprehension of the provided text rather than retrieval or prior knowledge.

    Authors: We agree that a passage-ablated control would strengthen the central claim that SQuAD measures comprehension of the supplied text. The original manuscript does not contain such an experiment. Crowdworkers were instructed to pose questions answerable from the passage, and our parse-tree analysis of reasoning types provides indirect evidence that syntactic and semantic processing of the text is required for many questions. To directly address the concern, we will add a passage-ablated evaluation in the revised manuscript: both the logistic regression model and human annotators will be tested on the questions with the passage removed. We will report the resulting F1 scores to quantify the degree to which external knowledge alone suffices. revision: yes
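
The proposed control is straightforward to operationalize: score the same system on the same questions with the passage blanked out. A hedged sketch follows, reusing the span_f1 metric and record format from the snippets above; model stands for any hypothetical callable from (passage, question) to an answer string.

    def passage_ablated_f1(model, examples) -> float:
        """Mean F1 when the model sees only the question, never the passage.
        A score close to the with-passage number would indicate the questions
        leak answers through prior knowledge alone."""
        scores = [
            span_f1(model("", ex["question"]), ex["answers"][0]["text"])
            for ex in examples
        ]
        return sum(scores) / max(len(scores), 1)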

Circularity Check

0 steps flagged

No circularity: dataset release and direct baseline measurement

full rationale

The paper's core contribution is the construction and public release of the SQuAD dataset (100k+ crowdworker questions with passage-span answers), plus straightforward analysis via dependency/constituency trees and a logistic regression baseline (F1 51.0%). It contains no equations, predictions, or first-principles claims that could reduce by construction to fitted inputs, self-citations, or renamed known results. The reported numbers are direct empirical measurements on the new data; the logistic regression is an off-the-shelf model whose features and performance are independently verifiable. Nothing in the chain depends on the paper's own prior outputs, and each step can be checked against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on standard assumptions about crowdsourced data quality and Wikipedia suitability rather than new postulates or fitted constants.

axioms (2)
  • domain assumption Crowdworkers can reliably generate questions whose answers are exact text spans within the provided passage.
    Invoked in the dataset creation process described in the abstract.
  • domain assumption Dependency and constituency trees are sufficient to categorize the reasoning types needed for the questions.
    Used for the analysis of question types.

pith-pipeline@v0.9.0 · 5429 in / 1255 out tokens · 45712 ms · 2026-05-12T10:05:38.970799+00:00 · methodology


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Online Learning-to-Defer with Varying Experts

    stat.ML 2026-05 unverdicted novelty 8.0

    Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.

  2. Passage Re-ranking with BERT

    cs.IR 2019-01 unverdicted novelty 8.0

    Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.

  3. PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

    cs.CR 2026-05 unverdicted novelty 7.0

    PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.

  4. TCD-Arena: Assessing Robustness of Time Series Causal Discovery Methods Against Assumption Violations

    cs.LG 2026-05 unverdicted novelty 7.0

    TCD-Arena is a new customizable testing framework that runs millions of experiments to map how 33 different assumption violations affect time series causal discovery methods and shows ensembles can boost overall robustness.

  5. From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

    cs.LG 2026-05 unverdicted novelty 7.0

    EPGS detects stubborn hallucinations by perturbing embeddings with noise and tracking gradient magnitude spikes, outperforming entropy and representation baselines as a proxy for loss landscape sharpness.

  6. Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    Deep-Reporter introduces a unified agentic framework for grounded multimodal long-form generation via multimodal search, checklist-guided synthesis, and recurrent context management, plus the M2LongBench benchmark.

  7. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  8. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  9. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

    cs.CL 2019-10 accept novelty 7.0

    BART introduces a denoising pretraining method for seq2seq models that matches RoBERTa on GLUE and SQuAD while setting new state-of-the-art results on abstractive summarization, dialogue, and QA with up to 6 ROUGE gains.

  10. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  11. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    cs.CL 2016-11 accept novelty 7.0

    MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.

  12. Structured Recurrent Mixers for Massively Parallelized Sequence Generation

    cs.CL 2026-05 unverdicted novelty 6.0

    Structured Recurrent Mixers enable algebraic switching between parallel training and recurrent inference representations, delivering higher efficiency, information capacity, and throughput than other linear-complexity models.

  13. Effective Performance Measurement: Challenges and Opportunities in KPI Extraction from Earnings Calls

    cs.CL 2026-05 unverdicted novelty 6.0

    Encoder models trained on SEC filings struggle with earnings calls due to domain shift, while LLMs enable open-ended KPI extraction with 79.7% human-verified precision on newly introduced benchmarks.

  14. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 conditional novelty 6.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

  15. Are Large Language Models Economically Viable for Industry Deployment?

    cs.CL 2026-04 unverdicted novelty 6.0

    Small LLMs under 2B parameters achieve better economic break-even, energy efficiency, and hardware density than larger models on legacy GPUs for industrial tasks.

  16. HiGMem: A Hierarchical and LLM-Guided Memory System for Long-Term Conversational Agents

    cs.CL 2026-04 unverdicted novelty 6.0

    HiGMem combines hierarchical event-turn memory with LLM-guided selection to retrieve concise relevant evidence from long dialogues, improving F1 scores and cutting retrieved turns by an order of magnitude on the LoCoM...

  17. Domain-oriented RAG Assessment (DoRA): Synthetic Benchmarking for RAG-based Question Answering on Defense Documents

    cs.CL 2026-04 unverdicted novelty 6.0

    DoRA is a new synthetic benchmark for RAG-based QA on defense documents where fine-tuning Llama3.1-8B-Instruct on it improves task success by up to 26% and cuts hallucination rates by 47%.

  18. Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.

  19. Memory-Efficient Transfer Learning with Fading Side Networks via Masked Dual Path Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    MDPD mutually distills knowledge between a frozen backbone and a learnable side network during fine-tuning, then discards the side network at inference to accelerate speed by at least 25% while preserving accuracy.

  20. MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...

  21. Watt Counts: Energy-Aware Benchmark for Sustainable LLM Inference on Heterogeneous GPU Architectures

    cs.DC 2026-04 unverdicted novelty 6.0

    Watt Counts supplies over 5,000 energy measurements across 50 LLMs and 10 GPUs and shows that hardware-aware selection can reduce server-scenario energy use by up to 70 percent with little effect on user experience.

  22. GCoT-Decoding: Unlocking Deep Reasoning Paths for Universal Question Answering

    cs.CL 2026-04 unverdicted novelty 6.0

    GCoT-decoding combines Fibonacci sampling, heuristic backtracking, span-based confidence scoring, and semantic consensus aggregation to enable general chain-of-thought reasoning without task-specific prompts.

  23. JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

    cs.IR 2026-04 accept novelty 6.0

    JUÁ is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.

  24. Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    GMRL-BD detects untrustworthy topic boundaries for black-box LLMs by combining bias-diffusion on a Wikipedia KG with multi-agent RL, supported by a released dataset labeling biases in models like Llama2 and Qwen2.

  25. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  26. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  27. Textbooks Are All You Need II: phi-1.5 technical report

    cs.CL 2023-09 unverdicted novelty 6.0

    phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.

  28. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  29. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  30. From Flat Facts to Sharp Hallucinations: Detecting Stubborn Errors via Gradient Sensitivity

    cs.LG 2026-05 unverdicted novelty 5.0

    EPGS detects high-confidence factual errors in LLMs by using embedding perturbations to measure gradient sensitivity as a proxy for sharp versus flat minima.

  31. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 unverdicted novelty 5.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.

  32. Hardware-Efficient Softmax and Layer Normalization with Guaranteed Normalization for Edge Devices

    cs.AR 2026-04 conditional novelty 5.0

    Hardware approximations for Softmax and LayerNorm preserve exact normalization guarantees and deliver up to 14x area reduction in 28nm silicon with negligible accuracy loss on GLUE, SQuAD, and perplexity.

  33. Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

    cs.LG 2026-04 unverdicted novelty 5.0

    Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.

  34. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  35. Sparse-on-Dense: Area and Energy-Efficient Computing of Sparse Neural Networks on Dense Matrix Multiplication Accelerators

    cs.AR 2026-04 unverdicted novelty 4.0

    Sparse neural networks achieve better area and energy efficiency when executed on dense matrix multiplication accelerators using a Sparse-on-Dense approach than on dedicated sparse accelerators.

  36. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

  37. GLU Variants Improve Transformer

    cs.LG 2020-02 unverdicted novelty 4.0

    Some GLU variants using non-sigmoid nonlinearities improve Transformer quality over ReLU and GELU in feed-forward sublayers.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 35 Pith papers

  1. [1] J. Berant, V. Srikumar, P. Chen, A. V. Linden, B. Harding, B. Huang, P. Clark, and C. D. Manning. 2014. Modeling biological processes for reading comprehension. In Empirical Methods in Natural Language Processing (EMNLP).

  2. [2] E. Brill, S. Dumais, and M. Banko. 2002. An analysis of the AskMSR question-answering system. In Association for Computational Linguistics (ACL), pages 257–264.

  3. [3] D. Chen, J. Bolton, and C. D. Manning. 2016. A thorough examination of the CNN/Daily Mail reading comprehension task. In Association for Computational Linguistics (ACL).

  4. [4] P. Clark and O. Etzioni. 2016. My computer is an honor student but how intelligent is it? Standardized tests as a measure of AI. AI Magazine, 37(1):5–12.

  5. [5] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255.

  6. [6] D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. A. Kalyanpur, A. Lally, J. W. Murdock, E. Nyberg, J. Prager, N. Schlaefer, and C. Welty. 2010. Building Watson: An overview of the DeepQA project. AI Magazine, 31(3):59–79.

  7. [7] S. N. Gaikwad, D. Morina, R. Nistala, M. Agarwal, A. Cossette, R. Bhanu, S. Savage, V. Narwal, K. Rajpal, J. Regino, et al. 2015. Daemo: A self-governed crowdsourcing marketplace. In Proceedings of the 28th Annual ACM Symposium on User Interface Software & Technology, pages 101–102.

  8. [8] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems (NIPS).

  9. [9] F. Hill, A. Bordes, S. Chopra, and J. Weston. 2015. The Goldilocks principle: Reading children's books with explicit memory representations. In International Conference on Learning Representations (ICLR).

  10. [10] L. Hirschman, M. Light, E. Breck, and J. D. Burger. 1999. Deep Read: A reading comprehension system. In Association for Computational Linguistics (ACL), pages 325–332.

  11. [11] M. J. Hosseini, H. Hajishirzi, O. Etzioni, and N. Kushman. 2014. Learning to solve arithmetic word problems with verb categorization. In Empirical Methods in Natural Language Processing (EMNLP), pages 523–533.

  12. [12] N. Kushman, Y. Artzi, L. Zettlemoyer, and R. Barzilay. 2014. Learning to automatically solve algebra word problems. In Association for Computational Linguistics (ACL).

  13. [13] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini. 1993. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19:313–330.

  14. [14] K. Narasimhan and R. Barzilay. 2015. Machine comprehension with discourse relations. In Association for Computational Linguistics (ACL).

  15. [15] H. T. Ng, L. H. Teo, and J. L. P. Kwan. 2000. A machine learning approach to answering questions for reading comprehension tests. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora – Volume 13, pages 124–132.

  16. [16] D. Ravichandran and E. Hovy. 2002. Learning surface text patterns for a question answering system. In Association for Computational Linguistics (ACL), pages 41–47.

  17. [17] M. Richardson, C. J. Burges, and E. Renshaw. 2013. MCTest: A challenge dataset for the open-domain machine comprehension of text. In Empirical Methods in Natural Language Processing (EMNLP), pages 193–203.

  18. [18] E. Riloff and M. Thelen. 2000. A rule-based question answering system for reading comprehension tests. In ANLP/NAACL Workshop on Reading Comprehension Tests as Evaluation for Computer-Based Language Understanding Systems – Volume 6, pages 13–19.

  19. [19] M. Sachan, A. Dubey, E. P. Xing, and M. Richardson. 2015. Learning answer-entailing structures for machine comprehension. In Association for Computational Linguistics (ACL).

  20. [20] D. Shen and D. Klakow. 2006. Exploring correlation of dependency relation paths for answer extraction. In International Conference on Computational Linguistics and Association for Computational Linguistics (COLING/ACL), pages 889–896.

  21. [21] M. Shirakawa, T. Hara, and S. Nishio. 2015. N-gram IDF: A global term weighting scheme based on information distance. In World Wide Web (WWW), pages 960–970.

  22. [22] H. Sun, N. Duan, Y. Duan, and M. Zhou. 2013. Answer extraction from passage graph for question answering. In International Joint Conference on Artificial Intelligence (IJCAI).

  23. [23] E. M. Voorhees and D. M. Tice. 2000. Building a question answering test collection. In ACM Special Interest Group on Information Retrieval (SIGIR), pages 200–207.

  24. [24] S. Wang and J. Jiang. 2016. Machine comprehension using Match-LSTM and answer pointer. CoRR, abs/1608.07905.

  25. [25] H. Wang, M. Bansal, K. Gimpel, and D. McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In Association for Computational Linguistics (ACL).

  26. [26] J. Weston, A. Bordes, S. Chopra, and T. Mikolov. 2015. Towards AI-complete question answering: A set of prerequisite toy tasks. arXiv.

  27. [27] Y. Yang, W. Yih, and C. Meek. 2015. WikiQA: A challenge dataset for open-domain question answering. In Empirical Methods in Natural Language Processing (EMNLP), pages 2013–2018.