pith. machine review for the scientific record.

arxiv: 1611.09268 · v3 · submitted 2016-11-28 · 💻 cs.CL · cs.IR

Recognition: 2 theorem links · Lean Theorem

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

Alina Stoica, Andrew McNamara, Bhaskar Mitra, Daniel Campos, Jianfeng Gao, Li Deng, Mir Rosenberg, Nick Craswell, Payal Bajaj, Rangan Majumder, Saurabh Tiwary, Tong Wang, Tri Nguyen, XiaoDong Liu, Xia Song

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 05:49 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords machine reading comprehension · question answering · dataset · search query logs · human generated answers · passage ranking

The pith

MS MARCO supplies over a million real search questions with human answers to train and test reading comprehension systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MS MARCO, a dataset of 1,010,916 questions sampled from Bing search logs, each paired with a human-generated answer and passages from web documents. Unlike earlier collections that relied on synthetic or curated questions, this one draws directly from actual user queries to create a more realistic test bed. The authors define three tasks of increasing realism: deciding whether passages support an answer and synthesizing it, generating a fluent answer that stands alone, and ranking the passages themselves. The scale and origin in live search traffic allow models to be trained and measured on the kinds of information needs people express every day. If the dataset holds up, progress on it should translate more directly to practical question-answering tools.

Core claim

MS MARCO consists of 1,010,916 anonymized questions taken from Bing search logs, each supplied with at least one human-generated answer and a set of passages extracted from retrieved web documents. Questions may admit multiple answers or none at all. The dataset is accompanied by three tasks: (1) predict answerability from the passages and extract or synthesize the answer, (2) produce a well-formed answer understandable from the question and passages alone, and (3) rank the passages by relevance to the question.
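To make the record structure concrete, here is a minimal Python sketch of iterating over MS MARCO-style examples. The field names (query, passages, answers, wellFormedAnswers) and the "No Answer Present." sentinel follow the public JSON release as commonly distributed; they are assumptions of this sketch, not details stated in the abstract, so verify them against the actual download.

```python
import json

def iter_examples(path):
    """Yield one simplified record per line of a JSON Lines dump.

    Assumed fields: "query", "passages", "answers", "wellFormedAnswers".
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            yield {
                "query": ex["query"],                            # real Bing query text
                "passages": ex["passages"],                      # retrieved candidate passages
                "answers": ex.get("answers", []),                # human-generated answers
                "well_formed": ex.get("wellFormedAnswers", []),  # rewritten, self-contained answers
            }

def is_answerable(record):
    # Assumed convention: unanswerable questions carry the sentinel string below.
    answers = [a for a in record["answers"] if a and a != "No Answer Present."]
    return len(answers) > 0
```

The split between raw answers and well-formed rewrites mirrors tasks (1) and (2): the former may lean on passage wording, while the latter must read as a standalone response.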

What carries the argument

The MS MARCO dataset of real-user questions paired with human answers and retrieved passages, which supplies both training data and evaluation targets for the three defined reading-comprehension tasks.

If this is right

  • Question-answering models can be trained and scored on whether their outputs match human responses to everyday search queries rather than artificial test items.
  • The three tasks allow separate measurement of answerability detection, answer synthesis, and passage ranking (a minimal ranking-metric sketch follows this list).
  • Systems must learn to handle queries that have no answer or admit several valid answers.
  • Training at this scale supports development of models whose behavior on live search traffic can be measured directly.
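
As one concrete instance of that separate measurement, the passage-ranking task is commonly scored with MRR@10, the metric reported by later work on MS MARCO (see the BERT re-ranking entry under forward citations). The sketch below assumes binary relevance labels in ranked order; the data shapes are illustrative, not the official evaluation script.

```python
def mrr_at_10(ranked_relevance):
    """Mean reciprocal rank at cutoff 10.

    ranked_relevance: list of per-query lists of 0/1 relevance labels,
    ordered from the top-ranked passage downward.
    """
    total = 0.0
    for labels in ranked_relevance:
        reciprocal = 0.0
        for rank, relevant in enumerate(labels[:10], start=1):
            if relevant:
                reciprocal = 1.0 / rank
                break
        total += reciprocal
    return total / len(ranked_relevance)

# Relevant passage at rank 2, rank 1, and absent from the top 10:
print(mrr_at_10([[0, 1, 0], [1, 0], [0] * 10]))  # (0.5 + 1.0 + 0.0) / 3 = 0.5
```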

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same log-sampling method could be repeated on other search engines to produce comparable datasets in additional languages or vertical domains.
  • The presence of both original and rewritten human answers offers a way to quantify acceptable variation in response quality.
  • Models that improve on MS MARCO could be tested for transfer by running them on fresh, unlabeled search logs.

Load-bearing premise

Human annotators produce answers that are accurate, complete, and representative of how ordinary people would respond to the sampled questions.

What would settle it

An audit in which independent human raters judge a random sample of the dataset's answers; if a substantial fraction are found incomplete or incorrect, the load-bearing premise fails.
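
A hedged sketch of how such an audit could be quantified: given a random sample of audited questions, a Wilson score interval bounds the plausible error rate in the full dataset. The sample size and flagged count below are hypothetical.

```python
import math

def wilson_interval(flagged, n, z=1.96):
    """Approximate 95% Wilson score interval for the true error rate,
    given `flagged` answers judged incomplete or incorrect out of `n` audited."""
    p = flagged / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return centre - half, centre + half

# Hypothetical audit: 60 flagged answers out of 500 sampled questions.
low, high = wilson_interval(60, 500)
print(f"observed error rate 12.0%, 95% CI {low:.1%} to {high:.1%}")
```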

read the original abstract

We introduce a large scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises of 1,010,916 anonymized questions---sampled from Bing's search query logs---each with a human generated answer and 182,669 completely human rewritten generated answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary for curating the natural language answers. A question in the MS MARCO dataset may have multiple answers or no answers at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict if a question is answerable given a set of context passages, and extract and synthesize the answer as a human would (ii) generate a well-formed answer (if possible) based on the context passages that can be understood with the question and passage context, and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguishes MS MARCO from other well-known publicly available datasets for machine reading comprehension and question-answering. We believe that the scale and the real-world nature of this dataset makes it attractive for benchmarking machine reading comprehension and question-answering models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MS MARCO, a large-scale machine reading comprehension dataset comprising 1,010,916 questions sampled from Bing search query logs, each with a human-generated answer and associated context passages drawn from a pool of 8,841,823 passages extracted from 3,563,535 web documents retrieved by Bing. It defines three tasks: (i) predicting answerability and synthesizing an answer from context passages, (ii) generating a well-formed answer from passages, and (iii) ranking retrieved passages given a question. The central claim is that the dataset's scale and derivation from real user queries distinguish it from prior MRC and QA resources, making it suitable for benchmarking.

Significance. If the human annotations are shown to be high-quality and reliably grounded in the passages, the dataset would provide a valuable large-scale resource for training and evaluating models on realistic, open-domain questions that may be unanswerable or admit multiple responses, advancing MRC research beyond smaller or synthetic datasets.

major comments (3)
  1. [Dataset description] The manuscript provides no details on the sampling procedure, anonymization steps, or filtering criteria applied to the Bing query logs when selecting the 1,010,916 questions. This information is required to evaluate whether the questions retain a natural distribution of real user intent (Abstract and dataset description section).
  2. [Annotation and quality control] No annotation guidelines, quality control procedures, inter-annotator agreement statistics, or statistics on passage relevance/answer grounding are reported for the human-generated answers. These are load-bearing for the claim that the answers are accurate, complete, and derivable from the provided passages (Abstract).
  3. [Abstract] The paper states that questions 'may have multiple answers or no answers at all' but supplies no empirical breakdown of answerable vs. unanswerable cases or passage sufficiency rates, leaving the asserted realism advantage over prior datasets unsubstantiated.
minor comments (2)
  1. [Title] The title acronym expansion contains inconsistent capitalization ('MAchine Reading COmprehension').
  2. [Abstract] The abstract uses the nonstandard phrasing 'comprises of'; standard usage is 'comprises'.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas where the manuscript can be strengthened for clarity and completeness. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Dataset description] The manuscript provides no details on the sampling procedure, anonymization steps, or filtering criteria applied to the Bing query logs when selecting the 1,010,916 questions. This information is required to evaluate whether the questions retain a natural distribution of real user intent (Abstract and dataset description section).

    Authors: We agree that the current description is insufficient. In the revised manuscript, we will add a dedicated subsection on data collection that details the sampling procedure from Bing search query logs, the anonymization steps taken to protect user privacy, and the filtering criteria applied to arrive at the final set of 1,010,916 questions. This will allow readers to assess how well the questions reflect natural user intent. revision: yes

  2. Referee: [Annotation and quality control] No annotation guidelines, quality control procedures, inter-annotator agreement statistics, or statistics on passage relevance/answer grounding are reported for the human-generated answers. These are load-bearing for the claim that the answers are accurate, complete, and derivable from the provided passages (Abstract).

    Authors: We acknowledge the omission. The revised version will include the annotation guidelines given to workers, the quality control procedures (including review and validation steps), and statistics on passage relevance and answer grounding. We note that formal inter-annotator agreement was not computed during the original annotation process; we will instead describe the single-annotator-per-question workflow with post-hoc quality checks and discuss this as a limitation. revision: partial

  3. Referee: [Abstract] The paper states that questions 'may have multiple answers or no answers at all' but supplies no empirical breakdown of answerable vs. unanswerable cases or passage sufficiency rates, leaving the asserted realism advantage over prior datasets unsubstantiated.

    Authors: We agree that empirical statistics are needed to support this claim. We will add to the abstract and dataset section the observed proportions of answerable questions, questions with multiple valid answers, unanswerable questions, and cases where the provided passages are insufficient. These figures are derivable from the existing annotations and will be reported to substantiate the realism advantage. revision: yes
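
A minimal sketch of how those proportions could be tallied from the released annotations, reusing the assumed record shape and "No Answer Present." sentinel from the earlier sketch; both are assumptions of this review, not confirmed details from the paper.

```python
from collections import Counter

def answerability_breakdown(examples):
    """Proportions of unanswerable, single-answer, and multiple-answer questions."""
    counts = Counter()
    for ex in examples:
        answers = [a for a in ex["answers"] if a and a != "No Answer Present."]
        if not answers:
            counts["unanswerable"] += 1
        elif len(set(answers)) > 1:
            counts["multiple_answers"] += 1
        else:
            counts["single_answer"] += 1
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}
```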

Circularity Check

0 steps flagged

No circularity: dataset construction paper with no derivations

full rationale

The paper introduces the MS MARCO dataset by describing its construction from Bing query logs, human-generated answers, and retrieved passages. It contains no equations, predictions, fitted parameters, or first-principles derivations that could reduce to inputs by construction. The central claim (distinguishing scale and real-world queries) is a descriptive statement about the data resource itself, not a result derived from prior fitted quantities or self-citations. No load-bearing steps match any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset-construction paper with no mathematical derivations, fitted parameters, or postulated entities. No free parameters, axioms, or invented entities are required to support the central claim.

pith-pipeline@v0.9.0 · 5592 in / 1097 out tokens · 59967 ms · 2026-05-12T05:49:30.065418+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Passage Re-ranking with BERT

    cs.IR 2019-01 unverdicted novelty 8.0

    Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.

  2. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    cs.CL 2017-05 accept novelty 8.0

    TriviaQA is a new large-scale dataset for reading comprehension that features complex compositional questions, high lexical variability, and cross-sentence reasoning requirements, where current baselines reach only 40...

  3. The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

    cs.LG 2026-05 unverdicted novelty 7.0

    On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

  4. DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

    cs.IR 2026-05 unverdicted novelty 7.0

    DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.

  5. EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

    cs.IR 2026-05 conditional novelty 7.0

    EnterpriseRAG-Bench supplies a synthetic corpus of 500,000 documents across Slack, Gmail, GitHub and other tools plus 500 questions that probe lookup, multi-document reasoning, conflict resolution and absence detection.

  6. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

  7. Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

    cs.CL 2026-04 unverdicted novelty 7.0

    Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.

  8. UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.

  9. A Parametric Memory Head for Continual Generative Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    A product-key parametric memory head with selective sparse updates mitigates catastrophic forgetting in generative retrieval models during sequential addition of new documents.

  10. On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

    cs.IR 2026-04 unverdicted novelty 7.0

    LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulne...

  11. AdversarialCoT: Single-Document Retrieval Poisoning for LLM Reasoning

    cs.IR 2026-04 unverdicted novelty 7.0

    A single query-specific poisoned document, built by extracting and iteratively refining an adversarial chain-of-thought, can substantially degrade reasoning accuracy in retrieval-augmented LLM systems.

  12. Can You Trust the Vectors in Your Vector Database? Black-Hole Attack from Embedding Space Defects

    cs.CR 2026-04 unverdicted novelty 7.0

    Injecting a few malicious vectors near the centroid exploits centrality-driven hubness in high-dimensional embeddings, causing them to dominate top-k retrievals in up to 99.85% of cases.

  13. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  14. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

    cs.CL 2020-05 accept novelty 7.0

    RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

  15. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    cs.CL 2019-05 accept novelty 7.0

    BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.

  16. Reproducing Complex Set-Compositional Information Retrieval

    cs.CL 2026-05 unverdicted novelty 6.0

    Neural retrievers that double BM25 performance on QUEST collapse below 0.02 Recall@100 on the new LIMIT+ benchmark while lexical methods reach 0.96, with all methods degrading as compositional depth increases.

  17. NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

    cs.IR 2026-04 unverdicted novelty 6.0

    NuggetIndex manages atomic nuggets with temporal validity and lifecycle metadata to filter outdated information before ranking, yielding 42% higher nugget recall, 9pp better temporal correctness, and 55% fewer conflic...

  18. RAQG-QPP: Query Performance Prediction with Retrieved Query Variants and Retrieval Augmented Query Generation

    cs.IR 2026-04 unverdicted novelty 6.0

    Retrieved query variants from logs combined with LLM-augmented generation improve unsupervised QPP accuracy by up to 30% for neural rankers on TREC DL'19 and DL'20.

  19. JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training

    cs.LG 2026-04 unverdicted novelty 6.0

    JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.

  20. From Tokens to Concepts: Leveraging SAE for SPLADE

    cs.IR 2026-04 unverdicted novelty 6.0

    SAE-SPLADE substitutes SPLADE's backbone vocabulary with SAE-derived semantic concepts and matches standard SPLADE performance with better efficiency on in- and out-of-domain tasks.

  21. ORPHEAS: A Cross-Lingual Greek-English Embedding Model for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    ORPHEAS, a Greek-English embedding model created with knowledge graph fine-tuning, outperforms state-of-the-art multilingual models on monolingual and cross-lingual retrieval benchmarks.

  22. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  23. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  24. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR 2021-12 unverdicted novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  25. NAVIS: Concurrent Search and Update with Low Position-Seeking Overhead in On-SSD Graph-Based Vector Search

    cs.DC 2026-05 unverdicted novelty 5.0

    NAVIS improves concurrent search and update throughput in on-SSD graph vector search by up to 2.74x for insertions and 1.37x for searches through reduced position-seeking overhead.

  26. Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

    cs.IR 2026-05 unverdicted novelty 5.0

    SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.

  27. Gyan: An Explainable Neuro-Symbolic Language Model

    cs.CL 2026-05 unverdicted novelty 5.0

    Gyan is a novel explainable neuro-symbolic language model that decouples language modeling from knowledge representation using rhetorical and semantic theories and reports superior performance on multiple datasets.

  28. Efficient Listwise Reranking with Compressed Document Representations

    cs.IR 2026-04 unverdicted novelty 5.0

    RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.

  29. RefineRAG: Word-Level Poisoning Attacks via Retriever-Guided Text Refinement

    cs.CR 2026-04 unverdicted novelty 5.0

    RefineRAG achieves 90% attack success on NQ by generating toxic seeds then optimizing them via retriever-in-the-loop word refinement, outperforming prior methods on effectiveness and naturalness.

  30. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  31. Multilingual E5 Text Embeddings: A Technical Report

    cs.CL 2024-02 unverdicted novelty 5.0

    Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.

  32. Text Embeddings by Weakly-Supervised Contrastive Pre-training

    cs.CL 2022-12 unverdicted novelty 5.0

    E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.

  33. Gyan: An Explainable Neuro-Symbolic Language Model

    cs.CL 2026-05 unverdicted novelty 4.0

    Gyan is a novel explainable non-transformer language model that achieves SOTA results on multiple datasets by mimicking human-like compositional context and world models.

  34. DisastRAG: A Multi-Source Disaster Information Integration and Access System Based on Retrieval-Augmented Large Language Models

    cs.IR 2026-04 unverdicted novelty 4.0

    DisastRAG is a multi-source RAG system for disaster management that boosts LLM accuracy on disaster queries through integrated retrieval paths from documents, databases, and web fallback.

  35. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

  36. Qwen Goes Brrr: Off-the-Shelf RAG for Ukrainian Multi-Domain Document Understanding

    cs.CL 2026-05 unverdicted novelty 3.0

    A RAG pipeline with contextual PDF chunking, question-and-answer-aware retrieval and reranking using Qwen3 models reaches 0.96 accuracy on a Ukrainian multi-domain document QA shared task.

  37. LLMs Struggle with Abstract Meaning Comprehension More Than Expected

    cs.CL 2026-04 unverdicted novelty 3.0

    LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.

  38. DisastRAG: A Multi-Source Disaster Information Integration and Access System Based on Retrieval-Augmented Large Language Models

    cs.IR 2026-04 unverdicted novelty 3.0

    DisastRAG is a multi-source RAG framework for disaster information that routes queries across document retrieval, structured database access, and web fallback, delivering 12-23 point gains on multiple-choice tasks and...

  39. LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods

    cs.CL 2024-12 accept novelty 3.0

    A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 37 Pith papers · 3 internal anchors

  1. [1] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

  2. [3] C. Clark and M. Gardner. Simple and effective multi-paragraph reading comprehension. CoRR, abs/1710.10723.

  3. [4] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge.

  4. [5] M. Dunn, L. Sagun, M. Higgins, V. U. Güney, V. Cirik, and K. Cho. SearchQA: A new Q&A dataset augmented with context from a search engine. CoRR, abs/1704.05179.

  5. [6] B. H. Frank. Google Brain chief: Deep learning takes at least 100,000 examples. https://venturebeat.com/2017/10/23/google-brain-chief-says-100000-examples-is-enough-data-for-deep-learning/

  6. [7] J. Gao, M. Galley, and L. Li. Neural approaches to conversational AI. arXiv preprint arXiv:1809.08267.

  7. [8] W. He, K. Liu, Y. Lyu, S. Zhao, X. Xiao, Y. Liu, Y. Wang, H. Wu, Q. She, X. Liu, T. Wu, and H. Wang. DuReader: A Chinese machine reading comprehension dataset from real-world applications. CoRR, abs/1711.05073.

  8. [9] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, 2015. URL https://arxiv.org/abs/1506.03340.

  9. [10] R. Kadlec, M. Schmid, O. Bajgar, and J. Kleindienst. Text understanding with the attention sum reader network. arXiv preprint arXiv:1603.01547.

  10. [11] T. Kočiský, J. Schwarz, P. Blunsom, C. Dyer, K. M. Hermann, G. Melis, and E. Grefenstette. The NarrativeQA reading comprehension challenge. CoRR, abs/1712.07040.

  11. [12] P. Rajpurkar, R. Jia, and P. Liang. Know what you don't know: Unanswerable questions for SQuAD. arXiv preprint arXiv:1806.03822.

  12. [13] M. J. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional attention flow for machine comprehension. CoRR, abs/1611.01603.

  13. [14] Y. Shen, P.-S. Huang, J. Gao, and W. Chen. ReasoNet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284.

  14. [16] A. Trischler, T. Wang, X. Yuan, J. Harris, A. Sordoni, P. Bachman, and K. Suleman. NewsQA: A machine comprehension dataset. In Rep4NLP@ACL.

  15. [17] A. Wissner-Gross. Datasets over algorithms. Edge.com.

  16. [18] S. Zhang, X. Liu, J. Liu, J. Gao, K. Duh, and B. Van Durme. ReCoRD: Bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.