MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

37 Pith papers cite this work. Polarity classification is still indexing.

abstract

We introduce a large-scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions---sampled from Bing's search query logs---each paired with a human-generated answer, 182,669 of which were additionally rewritten by humans into well-formed answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary to curate the natural-language answers. A question in the MS MARCO dataset may have multiple answers or no answer at all. Using this dataset, we propose three different tasks with varying levels of difficulty: (i) predict whether a question is answerable given a set of context passages, and extract and synthesize the answer as a human would; (ii) generate a well-formed answer (if possible) based on the context passages, one that can be understood on its own, without the question and passage context; and finally (iii) rank a set of retrieved passages given a question. The size of the dataset and the fact that the questions are derived from real user search queries distinguish MS MARCO from other well-known publicly available datasets for machine reading comprehension and question answering. We believe that the scale and real-world nature of this dataset make it attractive for benchmarking machine reading comprehension and question-answering models.
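For readers who want to inspect the data directly, here is a minimal sketch using the Hugging Face datasets library. The hub name microsoft/ms_marco, the v2.1 configuration, and the field names below are assumptions about the commonly published layout, not anything specified in the paper; adjust to the release you actually use.

    # Minimal sketch: load and inspect MS MARCO via Hugging Face `datasets`.
    # Assumptions: the hub entry "microsoft/ms_marco", its "v2.1" config,
    # and the field names below.
    from datasets import load_dataset

    ds = load_dataset("microsoft/ms_marco", "v2.1", split="train")
    ex = ds[0]

    print(ex["query"])              # an anonymized Bing query
    print(ex["answers"])            # human-generated answer(s); may be empty
    print(ex["wellFormedAnswers"])  # human-rewritten answers (subset only)

    # Each question ships with retrieved passages; `is_selected` marks the
    # ones annotators used when composing the answer.
    for sel, text in zip(ex["passages"]["is_selected"],
                         ex["passages"]["passage_text"]):
        print(sel, text[:80])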

hub tools

citation-role summary

background: 1 · dataset: 1

citation-polarity summary

claims ledger

  • abstract We introduce a large-scale MAchine Reading COmprehension dataset, which we name MS MARCO. The dataset comprises 1,010,916 anonymized questions---sampled from Bing's search query logs---each paired with a human-generated answer, 182,669 of which were additionally rewritten by humans into well-formed answers. In addition, the dataset contains 8,841,823 passages---extracted from 3,563,535 web documents retrieved by Bing---that provide the information necessary to curate the natural-language answers. A question in the MS MARCO dataset may have multiple answers or no answer at all. Using this dataset, we propose three different tasks with varying levels of difficulty: …

co-cited works


representative citing papers

Passage Re-ranking with BERT

cs.IR · 2019-01-13 · unverdicted · novelty 8.0

Fine-tuning BERT for query-passage relevance classification achieves state-of-the-art results on TREC-CAR and MS MARCO, with a 27% relative gain in MRR@10 over prior methods.
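To make the setup concrete, a rough sketch of cross-encoder re-ranking in the style this paper describes: BERT scores each (query, passage) pair jointly and passages are sorted by score. The bert-base-uncased checkpoint and the classification head below are illustrative stand-ins, not the authors' released model, which is fine-tuned on MS MARCO relevance labels.

    # Rough sketch of cross-encoder passage re-ranking (illustrative only;
    # in the cited paper the classifier is fine-tuned on MS MARCO labels,
    # whereas the head below is randomly initialized).
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2)
    model.eval()

    query = "what is machine reading comprehension"
    passages = [
        "Machine reading comprehension asks models to answer questions over text.",
        "Bing is a web search engine operated by Microsoft.",
    ]

    # BERT sees query and passage as one sequence pair, so relevance is
    # judged with full cross-attention between the two.
    enc = tok([query] * len(passages), passages,
              truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():
        scores = model(**enc).logits.softmax(dim=-1)[:, 1]  # P(relevant)

    ranked = sorted(zip(scores.tolist(), passages), reverse=True)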

EnterpriseRAG-Bench: A RAG Benchmark for Company Internal Knowledge

cs.IR · 2026-05-05 · conditional · novelty 7.0

EnterpriseRAG-Bench supplies a synthetic corpus of 500,000 documents spanning Slack, Gmail, GitHub, and other tools, plus 500 questions that probe lookup, multi-document reasoning, conflict resolution, and absence detection.

GAIA: a benchmark for General AI Assistants

cs.CL · 2023-11-21 · unverdicted · novelty 7.0

The GAIA benchmark shows that humans, at 92% accuracy on simple real-world questions, far outperform current AI systems at 15%, and proposes closing this gap as a key milestone toward general AI.

Reproducing Complex Set-Compositional Information Retrieval

cs.CL · 2026-05-05 · unverdicted · novelty 6.0

Neural retrievers that double BM25's performance on QUEST collapse to below 0.02 Recall@100 on the new LIMIT+ benchmark, while lexical methods reach 0.96; all methods degrade as compositional depth increases.

NuggetIndex: Governed Atomic Retrieval for Maintainable RAG

cs.IR · 2026-04-30 · unverdicted · novelty 6.0

NuggetIndex manages atomic nuggets with temporal-validity and lifecycle metadata to filter outdated information before ranking, yielding 42% higher nugget recall, 9 percentage points better temporal correctness, and 55% fewer conflicts than passage-level or unmanaged-proposition baselines.

From Tokens to Concepts: Leveraging SAE for SPLADE

cs.IR · 2026-04-23 · unverdicted · novelty 6.0

SAE-SPLADE replaces SPLADE's backbone vocabulary with SAE-derived semantic concepts, matching standard SPLADE performance with better efficiency on both in-domain and out-of-domain tasks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • GAIA: a benchmark for General AI Assistants cs.CL · 2023-11-21 · unverdicted · none · ref 75 · internal anchor

    The GAIA benchmark shows that humans, at 92% accuracy on simple real-world questions, far outperform current AI systems at 15%, and proposes closing this gap as a key milestone toward general AI.