arxiv: 1809.09600 · v1 · submitted 2018-09-25 · 💻 cs.CL

Recognition: 1 theorem link

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, Christopher D. Manning


Pith reviewed 2026-05-12 04:59 UTC · model grok-4.3

classification 💻 cs.CL

keywords HotpotQA · multi-hop question answering · explainable QA · supporting facts · Wikipedia dataset · comparison questions · QA benchmark


The pith

HotpotQA introduces 113k Wikipedia questions that require multi-hop reasoning across documents along with sentence-level supporting facts for explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing QA datasets do not train systems for complex reasoning or to explain their answers. This paper presents HotpotQA, a large collection of questions based on Wikipedia articles that necessitate retrieving and reasoning over multiple documents. The dataset includes annotations for the specific sentences used in reasoning, enabling supervised training for explainability. It also features comparison questions that test fact extraction and comparison abilities. If successful, this would allow QA systems to handle more realistic, complex queries with transparent reasoning processes.

Core claim

HotpotQA provides 113k question-answer pairs from Wikipedia that demand finding and reasoning over multiple documents, include diverse questions not tied to schemas, supply sentence-level supporting facts, and introduce factoid comparison questions to test fact extraction and comparison. The supporting facts enable models to improve performance and make explainable predictions.

What carries the argument

The HotpotQA dataset with its sentence-level supporting fact annotations that provide strong supervision for multi-hop reasoning and explainability.

Load-bearing premise

That the questions genuinely require multi-hop reasoning over multiple documents rather than being answerable from single documents or surface patterns, and that the sentence-level supporting fact annotations are accurate and complete.

What would settle it

A demonstration that current QA models can answer most HotpotQA questions correctly by processing only a single document or without using the supporting facts annotations.
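One way to picture that test is a single-paragraph probe: feed a reader one paragraph at a time and count how often it still produces the gold answer. The sketch below is illustrative only. The `question`, `answer`, and `context` field names are assumed, `toy_reader` is a hypothetical stand-in for a real QA model, and the two examples are invented.

```python
import re

def single_doc_rate(examples, answer_fn):
    """Fraction of examples answered correctly from at least one single paragraph.

    Because any paragraph may produce the hit, this is an optimistic
    (oracle-over-paragraphs) estimate of single-document answerability.
    """
    hits = 0
    for ex in examples:
        gold = ex["answer"].strip().lower()
        for _title, sentences in ex["context"]:
            pred = answer_fn(ex["question"], " ".join(sentences))
            if pred is not None and pred.strip().lower() == gold:
                hits += 1
                break  # one successful paragraph is enough for this example
    return hits / len(examples) if examples else 0.0

def toy_reader(question, paragraph):
    """Hypothetical stand-in model: returns the first 4-digit year it sees."""
    match = re.search(r"\b\d{4}\b", paragraph)
    return match.group(0) if match else None

# Invented examples: the first is single-hop, the second needs a comparison.
examples = [
    {"question": "When was Alpha City founded?", "answer": "1801",
     "context": [["Alpha City", ["Alpha City was founded in 1801."]],
                 ["Beta Town", ["Beta Town is a port."]]]},
    {"question": "Which city was founded first?", "answer": "Alpha City",
     "context": [["Alpha City", ["Alpha City was founded in 1801."]],
                 ["Beta Town", ["Beta Town was founded in 1850."]]]},
]

rate = single_doc_rate(examples, toy_reader)
print(f"single-paragraph answerable: {rate:.0%}")
```

A high rate from a strong reader would undercut the multi-hop premise; a low rate is only suggestive, since a weak reader can fail for unrelated reasons.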

Original abstract

Existing question answering (QA) datasets fail to train QA systems to perform complex reasoning and provide explanations for answers. We introduce HotpotQA, a new dataset with 113k Wikipedia-based question-answer pairs with four key features: (1) the questions require finding and reasoning over multiple supporting documents to answer; (2) the questions are diverse and not constrained to any pre-existing knowledge bases or knowledge schemas; (3) we provide sentence-level supporting facts required for reasoning, allowing QA systems to reason with strong supervision and explain the predictions; (4) we offer a new type of factoid comparison questions to test QA systems' ability to extract relevant facts and perform necessary comparison. We show that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
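To make feature (3) concrete, here is a minimal sketch of how sentence-level supporting facts can be resolved against the paragraphs attached to a question. The field names (`supporting_facts` as `[title, sentence_index]` pairs, `context` as `[title, sentences]` pairs) are an assumption about the released JSON layout, which this page does not describe, and the record itself is invented.

```python
# Invented record; the field layout is an assumption, not taken from this page.
record = {
    "question": "Which of the two toy cities was founded first?",
    "answer": "Alpha City",
    "type": "comparison",
    "supporting_facts": [["Alpha City", 0], ["Beta Town", 1]],
    "context": [
        ["Alpha City", ["Alpha City was founded in 1801.", "It lies on a river."]],
        ["Beta Town", ["Beta Town is a port.", "It was founded in 1850."]],
        ["Gamma Village", ["Gamma Village is a distractor paragraph."]],
    ],
}

def supporting_sentences(rec):
    """Return (title, sentence) pairs for each annotated supporting fact."""
    paragraphs = {title: sentences for title, sentences in rec["context"]}
    resolved = []
    for title, idx in rec["supporting_facts"]:
        sentences = paragraphs.get(title, [])
        if 0 <= idx < len(sentences):  # ignore facts that point outside the paragraph
            resolved.append((title, sentences[idx]))
    return resolved

def supporting_titles(rec):
    """Distinct documents that the supporting facts draw on."""
    return sorted({title for title, _ in rec["supporting_facts"]})

for title, sentence in supporting_sentences(record):
    print(f"{title}: {sentence}")
```

In this toy case the facts span two documents, which is the property the multi-hop claim depends on; `supporting_titles` gives a quick way to check that per question.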

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HotpotQA gives the field a useful new benchmark for multi-hop QA over Wikipedia with sentence-level facts and comparison questions, though its value rests on how strictly the construction enforced genuine multi-hop cases.


The main takeaway is that this paper releases HotpotQA, a 113k-question dataset built on Wikipedia that targets multi-hop reasoning, supplies sentence-level supporting facts, stays free of KB constraints, and adds comparison questions. It also runs baselines showing that recent models struggle and that the facts help both accuracy and explainability. That combination is new relative to SQuAD-style single-hop sets and earlier multi-hop efforts tied to KBs.

The authors did solid work scaling the collection with crowdsourcing, defining clear question types, and releasing the data with enough structure for others to use right away. The reported gains from adding the supporting-fact supervision are concrete and worth having on record.

The soft spots sit mainly in the validation of the core claims. The paper needs to show stronger evidence that most questions cannot be answered from a single document or via shallow patterns, and that the supporting-fact annotations are both complete and free of systematic bias. Details on filtering steps, agreement rates, and any post-hoc checks for single-hop leakage would help. The results section could also include more targeted ablations on whether models are truly using the facts for reasoning or just as extra signals. These are standard concerns for crowdsourced QA data rather than fatal problems.

Readers working on QA architectures, explainability, or evaluation benchmarks will get immediate value from the dataset and the baseline numbers. The work is coherent on its own terms and engages the existing literature directly, so it deserves a serious referee. I would send it out for review; the community can use the resource and the discussion will likely tighten the construction details.

Referee Report

2 major / 2 minor

Summary. The paper introduces HotpotQA, a dataset of 113k Wikipedia-based QA pairs designed to require multi-hop reasoning over multiple documents. It features sentence-level supporting fact annotations for explainability, diverse questions unconstrained by KBs, and a new category of comparison questions. The authors claim current QA systems find it challenging and that access to supporting facts improves performance while enabling explainable predictions.

Significance. If the construction process robustly enforces genuine multi-hop requirements and produces accurate, complete supporting-fact labels, the dataset would be a significant contribution by providing strong supervision for reasoning and explainability in QA, addressing gaps in prior single-hop or schema-constrained datasets.

major comments (2)

[§3] Data Collection: The crowdsourcing pipeline for bridge and comparison questions is described at a high level, but no quantitative validation (e.g., percentage of questions answerable from a single paragraph or document) is provided to confirm that the multi-hop requirement is enforced and that surface-pattern shortcuts are filtered; this is load-bearing for the central claim that questions require reasoning over multiple supporting documents.
[§4.3] Experiments with Supporting Facts: Performance gains are reported when models use the provided sentence-level facts, yet there is no analysis of annotation completeness (e.g., whether all necessary sentences are labeled or if relevant ones are missed) or inter-annotator agreement; without this, the reliability of the 'strong supervision' and the source of the observed improvements remain unclear.
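The completeness and agreement checks requested in the second comment can be pictured as set overlap between two independent annotations of the same question. The sketch below is a hypothetical illustration: it assumes a second annotation pass exists, represents each supporting fact as a `(title, sentence_index)` pair, and uses invented labels. It is not a procedure taken from the paper.

```python
def fact_set(facts):
    """Normalize a list of [title, sentence_index] pairs into a set of tuples."""
    return {(title, int(idx)) for title, idx in facts}

def precision_recall_f1(candidate, reference):
    """Set-level overlap of candidate supporting facts against a reference set."""
    cand, ref = fact_set(candidate), fact_set(reference)
    true_pos = len(cand & ref)
    precision = true_pos / len(cand) if cand else 0.0
    recall = true_pos / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented labels from two hypothetical annotators for one question.
annotator_a = [["Alpha City", 0], ["Beta Town", 1]]
annotator_b = [["Alpha City", 0], ["Beta Town", 2], ["Gamma Village", 0]]

p, r, f1 = precision_recall_f1(annotator_a, annotator_b)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Averaging the F1 over a doubly annotated sample would give a rough agreement figure. Low recall of one annotator against the other would point to missed sentences, which is the completeness concern raised above.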

minor comments (2)

[Abstract] The abstract states the four key features but omits any quantitative results (e.g., model accuracies or dataset statistics beyond the total size), which would help readers immediately assess the claims.
[Table 1] Table 1 or dataset statistics section: Clarify the exact split between bridge and comparison questions and report any filtering rates from the validation stage to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their constructive feedback on our manuscript introducing HotpotQA. We address each major comment below, indicating where we will revise the paper to strengthen the presentation of our data collection and annotation processes.


Referee: [§3] Data Collection: The crowdsourcing pipeline for bridge and comparison questions is described at a high level, but no quantitative validation (e.g., percentage of questions answerable from a single paragraph or document) is provided to confirm that the multi-hop requirement is enforced and that surface-pattern shortcuts are filtered; this is load-bearing for the central claim that questions require reasoning over multiple supporting documents.

Authors: We agree that explicit quantitative validation would better substantiate the multi-hop nature of the questions. The manuscript describes the crowdsourcing pipeline, including the use of adversarial filtering to remove questions answerable from a single document or via surface patterns, but does not report specific percentages or validation statistics from that process. In the revision, we will add a new table and accompanying text with the number of questions at each filtering stage, along with results from a manual audit of a sample of final questions confirming that they require information from multiple documents.

revision: yes

Referee: [§4.3] Experiments with Supporting Facts: Performance gains are reported when models use the provided sentence-level facts, yet there is no analysis of annotation completeness (e.g., whether all necessary sentences are labeled or if relevant ones are missed) or inter-annotator agreement; without this, the reliability of the 'strong supervision' and the source of the observed improvements remain unclear.

Authors: We acknowledge the absence of completeness analysis and inter-annotator agreement (IAA) metrics for the supporting-fact annotations, which limits the ability to fully assess their reliability. The manuscript provides details on how supporting facts were collected but does not include these quantitative checks. We will revise §4.3 and the data collection section to include additional discussion of the annotation guidelines and any post-hoc manual checks performed. However, because each question received supporting-fact annotations from only a single worker, we do not have the data to compute IAA; we will explicitly note this as a limitation of the current release.

revision: partial

Standing simulated objections (not resolved)

Inter-annotator agreement for the supporting-fact annotations cannot be reported, because multiple independent annotations were not collected during the original crowdsourcing process.

Circularity Check

0 steps flagged

No circularity: empirical dataset construction with direct benchmarking


The paper introduces HotpotQA via a crowdsourcing pipeline for multi-hop questions and supporting-fact annotations, then reports direct model evaluations on the resulting dataset. No equations, fitted parameters, or predictions are presented; there is no derivation chain that reduces to self-definition, load-bearing self-citation, or renaming of inputs. Central claims rest on the described construction process and external model benchmarks, which are independent of any internal fit or prior self-result. This is a standard empirical dataset paper with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset introduction paper with no free parameters, axioms, or invented entities in a mathematical or theoretical sense; the contribution is the curated dataset and its properties.

pith-pipeline@v0.9.0 · 5462 in / 1231 out tokens · 85048 ms · 2026-05-12T04:59:23.206350+00:00 · methodology


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Online Learning-to-Defer with Varying Experts
stat.ML 2026-05 unverdicted novelty 8.0

Presents the first online learning-to-defer algorithm with regret bounds O((n + n_e) T^{2/3}) generally and O((n + n_e) sqrt(T)) under low noise for multiclass classification with varying experts.
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
cs.CL 2023-10 conditional novelty 8.0

DSPy compiles short declarative programs into LM pipelines that self-optimize and outperform both standard few-shot prompting and expert-written chains on math, retrieval, and QA tasks.
Logic-Regularized Verifier Elicits Reasoning from LLMs
cs.CL 2026-05 unverdicted novelty 7.0

LOVER creates an unsupervised logic-regularized verifier that reaches 95% of supervised verifier performance on reasoning tasks across 10 datasets.
HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads
cs.IR 2026-04 unverdicted novelty 7.0

HeadRank improves decoding-free passage reranking by preference-aligning attention heads to increase discriminability in middle-context documents, outperforming baselines on 14 benchmarks with only 211 training queries.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration
cs.AI 2026-04 unverdicted novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
Do We Still Need GraphRAG? Benchmarking RAG and GraphRAG for Agentic Search Systems
cs.IR 2026-04 unverdicted novelty 7.0

Agentic search narrows the gap between dense RAG and GraphRAG but does not remove GraphRAG's advantage on complex multi-hop reasoning.
Group-in-Group Policy Optimization for LLM Agent Training
cs.LG 2025-05 unverdicted novelty 7.0

GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
cs.CL 2024-02 unverdicted novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...
ASTRA-QA: A Benchmark for Abstract Question Answering over Documents
cs.CL 2026-05 unverdicted novelty 6.0

ASTRA-QA is a benchmark for abstract document question answering that uses explicit topic sets, unsupported content annotations, and evidence alignments to enable direct scoring of coverage and hallucination.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
cs.AR 2026-05 unverdicted novelty 6.0

KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL 2026-05 unverdicted novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
cs.LG 2026-05 unverdicted novelty 6.0

S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
CleanBase: Detecting Malicious Documents in RAG Knowledge Databases
cs.CR 2026-05 unverdicted novelty 6.0

CleanBase identifies malicious documents in RAG databases by detecting cliques in a semantic similarity graph constructed using embedding models and a statistical threshold.
SEARCH-R: Structured Entity-Aware Retrieval with Chain-of-Reasoning Navigator for Multi-hop Question Answering
cs.CL 2026-04 unverdicted novelty 6.0

SEARCH-R improves multi-hop question answering by training a fine-tuned Llama navigator for sub-question decomposition and using dependency-tree retrieval to quantify informational contribution of documents.
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
cs.NI 2026-04 unverdicted novelty 6.0

SparKV reduces time-to-first-token by 1.3x-5.1x and energy use by 1.5x-3.3x for on-device LLM inference by adaptively choosing between cloud KV streaming and local computation while overlapping execution and adjusting...
How Far Are Video Models from True Multimodal Reasoning?
cs.CV 2026-04 unverdicted novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
cs.IR 2026-04 unverdicted novelty 6.0

CARE, a context-aware LLM judge, outperforms standard methods when evaluating multi-hop retrieval quality in RAG systems.
Answer Only as Precisely as Justified: Calibrated Claim-Level Specificity Control for Agentic Systems
cs.CL 2026-04 unverdicted novelty 6.0

Compositional selective specificity (CSS) improves overcommitment-aware utility from 0.846 to 0.913 on LongFact while retaining 0.938 specificity by calibrating claim-level backoffs in agentic AI responses.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 uses reasoning-aligned memory growth from seed tokens, retracing via contribution functions, and path reorganization to mitigate memory dilution in LLM agentic search.
MemSearch-o1: Empowering Large Language Models with Reasoning-Aligned Memory Growth in Agentic Search
cs.IR 2026-04 unverdicted novelty 6.0

MemSearch-o1 mitigates memory dilution in agentic LLM search through reasoning-aligned token-level memory growth, retracing with a contribution function, and path reorganization, improving reasoning activation on benchmarks.
Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification
cs.AI 2026-04 unverdicted novelty 6.0

Cross-model semantic disagreement adds an epistemic uncertainty term that improves total uncertainty estimation over self-consistency alone, helping flag confident errors in LLMs.
Transforming External Knowledge into Triplets for Enhanced Retrieval in RAG of LLMs
cs.CL 2026-04 unverdicted novelty 6.0

Tri-RAG turns external knowledge into Condition-Proof-Conclusion triplets and retrieves via the Condition anchor to improve efficiency and quality in LLM RAG.
Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers
cs.IR 2026-04 unverdicted novelty 6.0

Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.
LLMs Should Express Uncertainty Explicitly
cs.LG 2026-04 unverdicted novelty 6.0

Training LLMs to verbalize uncertainty explicitly at the end or during reasoning reduces overconfident errors and improves answer quality on factual tasks while enabling RAG triggers.
LLMs Should Express Uncertainty Explicitly
cs.LG 2026-04 unverdicted novelty 6.0

Training LLMs to express uncertainty explicitly via global confidence or local markers enhances calibration and intervention triggers compared to post-hoc estimation.
OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search
cs.AI 2026-04 unverdicted novelty 6.0

OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.
TDA-RC: Task-Driven Alignment for Knowledge-Based Reasoning Chains in Large Language Models
cs.CL 2026-03 unverdicted novelty 6.0

TDA-RC embeds topological patterns from multi-round reasoning into CoT via persistent homology and a repair agent, yielding better accuracy-efficiency trade-offs than ToT or GoT on tested datasets.
MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens
cs.CL 2026-03 unverdicted novelty 6.0

MSA is an end-to-end trainable memory model using sparse attention and document-wise RoPE that scales to 100M tokens with linear complexity and less than 9% degradation.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
cs.CL 2025-07 unverdicted novelty 6.0

MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
cs.CL 2024-05 accept novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
SnapKV: LLM Knows What You are Looking for Before Generation
cs.CL 2024-04 conditional novelty 6.0

SnapKV selects clustered important KV positions per attention head from an observation window at the prompt end, yielding 3.6x faster generation and 8.2x better memory efficiency on 16K-token inputs with comparable pe...
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
cs.CL 2023-05 conditional novelty 6.0

ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
Humanity's Last Exam
cs.LG 2025-01 unverdicted novelty 5.0

Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
World Model on Million-Length Video And Language With Blockwise RingAttention
cs.LG 2024-02 unverdicted novelty 5.0

Presents open-source 7B models for million-token video and language understanding via Blockwise RingAttention, setting new benchmarks in retrieval and long video tasks.
Supplement Generation Training for Enhancing Agentic Task Performance
cs.LG 2026-04 unverdicted novelty 4.0

SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
LightRAG: Simple and Fast Retrieval-Augmented Generation
cs.IR 2024-10 unverdicted novelty 4.0

LightRAG builds graph structures into RAG indexing and retrieval with dual-level search and incremental updates to improve accuracy and speed.
Understanding the planning of LLM agents: A survey
cs.AI 2024-02 accept novelty 4.0

A survey that provides a taxonomy of methods for improving planning in LLM-based agents across task decomposition, plan selection, external modules, reflection, and memory.
A Reproducibility Study of Metacognitive Retrieval-Augmented Generation
cs.IR 2026-04 unverdicted novelty 3.0

MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · cited by 37 Pith papers

[1]

Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to answer open-domain questions. In Association for Computational Linguistics (ACL)

work page 2017
[2]

Christopher Clark and Matt Gardner. 2017. Simple and effective multi-paragraph reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics

work page 2017
[3]

Matthew Dunn, Levent Sagun, Mike Higgins, Ugur Guney, Volkan Cirik, and Kyunghyun Cho. 2017. SearchQA: A new Q&A dataset augmented with context from a search engine. arXiv preprint arXiv:1704.05179

work page arXiv 2017
[4]

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics

work page 2017
[5]

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics

work page 2018
[6]

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations, pages 55-60

work page 2014
[7]

Alexander H Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, and Jason Weston. 2017. ParlAI: A dialog research software platform. arXiv preprint arXiv:1705.06476

work page Pith review arXiv 2017
[8]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the 30th Annual Conference on Neural Information Processing Systems (NIPS)

work page 2016
[9]

Jekaterina Novikova, Ondřej Dušek, Amanda Cercas Curry, and Verena Rieser. 2017. Why we need new evaluation metrics for NLG. In Proceedings of the Conference on Empirical Methods in Natural Language Processing

work page 2017
[10]

Boyuan Pan, Hao Li, Zhou Zhao, Bin Cao, Deng Cai, and Xiaofei He. 2017. Memen: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098

work page arXiv 2017
[11]

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics

work page 2018
[12]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP)

work page 2016
[13]

Shimi Salant and Jonathan Berant. 2018. Contextualized word representations for reading comprehension. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics

work page 2018
[14]

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of the International Conference on Learning Representations

work page 2017
[15]

Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 16th Annual Conference of the North American Chapter of the Association for Computational Linguistics

work page 2018
[16]

Wenhui Wang, Nan Yang, Furu Wei, Baobao Chang, and Ming Zhou. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 189-198

work page 2017
[17]

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing datasets for multi-hop reading comprehension across documents. Transactions of the Association for Computational Linguistics

work page 2018
[18]

Caiming Xiong, Victor Zhong, and Richard Socher. 2018. DCN+: Mixed objective and deep residual coattention for question answering. In Proceedings of the International Conference on Learning Representations

work page 2018
[19]

Zhilin Yang, Saizheng Zhang, Jack Urbanek, Will Feng, Alexander H Miller, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Mastering the dungeon: Grounded language learning by mechanical turker descent. In Proceedings of the International Conference on Learning Representations

work page 2018