arxiv: 2604.22757 · v1 · submitted 2026-03-06 · 💻 cs.IR · cs.AI

StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems

Aryan Patodiya

Pith reviewed 2026-05-15 14:28 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords multi-hop retrieval · RAG evaluation · hybrid retrieval · noisy document pools · bridge questions · HotpotQA · retrieval benchmarks · distractor setting

The pith

Hybrid retrieval outperforms other methods on the StratRAG dataset of noisy multi-hop questions, though bridge questions remain harder.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StratRAG as an evaluation dataset for retrieval in RAG systems on multi-hop reasoning tasks. It contains 2,200 examples drawn from HotpotQA, each paired with a fixed pool of 15 documents that includes exactly two gold documents and 13 topical distractors. Benchmarks compare BM25, dense retrieval with all-MiniLM-L6-v2, and hybrid fusion, showing the hybrid approach reaches Recall@2 of 0.70 and MRR of 0.93 overall. Bridge questions prove more difficult than comparison or yes-no questions, with recall dropping to 0.67. The results establish the dataset as a tool for testing and improving retrieval strategies under realistic noise.

Core claim

StratRAG comprises 2,200 multi-hop examples across bridge, comparison, and yes-no question types, each paired with a pool of 15 candidate documents containing exactly 2 gold documents and 13 topically related distractors. Benchmarks of BM25, dense retrieval, and hybrid fusion on the validation set show hybrid retrieval achieving the best overall performance with Recall@2 = 0.70 and MRR = 0.93, yet bridge questions remain substantially harder with Recall@2 = 0.67.
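As a minimal sketch of how the two headline metrics can be computed for a single pool, assuming binary relevance and a ranked list of document ids (the function names and toy ids below are illustrative and do not come from the paper):

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], gold: Set[str], k: int) -> float:
    """Fraction of gold documents that appear in the top-k of the ranking."""
    if not gold:
        raise ValueError("gold set must not be empty")
    hits = sum(1 for doc_id in ranked[:k] if doc_id in gold)
    return hits / len(gold)


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first gold document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold:
            return 1.0 / rank
    return 0.0


# Toy pool: 15 hypothetical ids, two of them gold.
ranking = [f"d{i}" for i in range(15)]  # d0 is ranked first, d14 last
gold_ids = {"d0", "d3"}

r_at_2 = recall_at_k(ranking, gold_ids, k=2)  # one of two gold docs in top 2
rr = reciprocal_rank(ranking, gold_ids)       # first gold doc sits at rank 1
```

Averaging these per-query values over the validation set gives Recall@2 and MRR. With exactly two gold documents per pool, per-query Recall@2 can only be 0, 0.5, or 1, so one plausible reading of a high MRR next to a moderate Recall@2 is that the first gold document usually ranks near the top while the second often falls outside the top two.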

What carries the argument

The StratRAG dataset of 2,200 queries, each with a fixed 15-document pool containing exactly two gold documents amid 13 distractors, used to measure retrieval performance across question types.
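The fixed pool layout lends itself to a simple sanity check. The sketch below assumes a hypothetical record type with `doc_id`, `text`, and `is_gold` fields; the released dataset's actual schema may differ, and only the constants 15 and 2 come from the paper.

```python
from dataclasses import dataclass
from typing import List

POOL_SIZE = 15  # stated in the paper
NUM_GOLD = 2    # stated in the paper


@dataclass
class PoolDoc:
    doc_id: str    # hypothetical field name
    text: str      # hypothetical field name
    is_gold: bool  # hypothetical field name


def check_pool(docs: List[PoolDoc]) -> None:
    """Raise ValueError if a pool violates the 15-document / 2-gold layout."""
    if len(docs) != POOL_SIZE:
        raise ValueError(f"expected {POOL_SIZE} documents, got {len(docs)}")
    gold = sum(d.is_gold for d in docs)
    if gold != NUM_GOLD:
        raise ValueError(f"expected {NUM_GOLD} gold documents, got {gold}")
    if len({d.doc_id for d in docs}) != len(docs):
        raise ValueError("duplicate document ids in pool")


# Toy pool: the first two documents are marked gold, the other 13 are not.
example_pool = [PoolDoc(f"d{i}", f"text {i}", is_gold=(i < 2)) for i in range(15)]
check_pool(example_pool)  # passes silently
```

Running a check like this over every example before benchmarking guards against pools that were truncated or deduplicated during loading.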

If this is right

Hybrid fusion retrieval achieves the highest Recall@2, MRR, and NDCG@5 among the three tested strategies on the full StratRAG validation set.
Bridge questions show measurably lower recall than comparison and yes-no questions under the same noisy conditions.
The reported metrics motivate the design of new retrieval policies, such as reinforcement-learning-based selection, to handle harder question types.
The fixed noisy pools allow direct comparison of retrieval methods on the same multi-hop instances.
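The text here does not say which fusion rule the hybrid retriever uses. One common option is reciprocal rank fusion (RRF), which merges a lexical and a dense ranking by summing 1 / (k + rank) per document. The sketch below assumes RRF with the conventional constant k = 60 and hypothetical document ids; it should not be read as the paper's method.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several rankings with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the rankings that contain
    it (rank starts at 1). Ties are broken by document id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical rankings of the same four documents from two retrievers.
bm25_ranking = ["a", "b", "c", "d"]
dense_ranking = ["c", "a", "d", "b"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Because RRF uses only ranks, it avoids having to normalize BM25 scores and cosine similarities onto a common scale; a weighted score sum is the usual alternative when calibrated scores are available.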

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The dataset could serve as a training signal for retrieval models that learn to prioritize documents needed for chain-of-thought reasoning.
Similar construction methods might expose limitations in current dense embeddings when the required documents are only indirectly related to the query.
Integrating StratRAG-style noise into existing RAG pipelines could surface accuracy drops on multi-hop queries that current clean-pool evaluations miss.
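To make the noise-injection idea concrete, here is a minimal sketch that mixes gold documents with sampled distractors into a fixed-size pool. It assumes uniform random sampling from a candidate list, which is a simplification: the paper describes its distractors as topically related, and its actual selection procedure is not specified here. The function name and ids are hypothetical.

```python
import random
from typing import List, Sequence


def build_noisy_pool(
    gold_docs: Sequence[str],
    candidate_distractors: Sequence[str],
    pool_size: int = 15,
    seed: int = 0,
) -> List[str]:
    """Mix gold documents with sampled distractors and shuffle the result."""
    gold_set = set(gold_docs)
    candidates = [d for d in candidate_distractors if d not in gold_set]
    n_noise = pool_size - len(gold_docs)
    if n_noise < 0 or n_noise > len(candidates):
        raise ValueError("not enough room or candidates for the requested pool")
    rng = random.Random(seed)  # seeded so the pool is reproducible
    pool = list(gold_docs) + rng.sample(candidates, n_noise)
    rng.shuffle(pool)          # avoid leaking the gold positions
    return pool


pool = build_noisy_pool(["g1", "g2"], [f"n{i}" for i in range(40)])
```

Fixing the seed keeps the pool identical across runs, which mirrors the fixed-pool property that makes retrieval methods directly comparable.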

Load-bearing premise

The 13 distractors drawn from HotpotQA sufficiently represent real-world topical noise and the original gold labels transfer accurately to the new pools.

What would settle it

A new retrieval method that reaches Recall@2 above 0.80 on bridge questions without lowering overall scores, or a measurement showing real-world document pools contain substantially different noise distributions than the HotpotQA distractors.

Original abstract

We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool conditions. Derived from HotpotQA (distractor setting), StratRAG comprises 2,200 examples across three question types -- bridge, comparison, and yes-no -- each paired with a pool of 15 candidate documents containing exactly 2 gold documents and 13 topically related distractors. We benchmark three retrieval strategies -- BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion -- reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval achieves the best overall performance (Recall@2 = 0.70, MRR = 0.93), yet bridge questions remain substantially harder (Recall@2 = 0.67), motivating future work on reinforcement-learning-based retrieval policies. StratRAG is publicly available at https://huggingface.co/datasets/Aryanp088/StratRAG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

StratRAG is a clean dataset release that gives controlled noisy pools for multi-hop retrieval testing, with straightforward benchmarks that hold up.

StratRAG takes HotpotQA distractor examples and rebuilds them into 2,200 queries, each with a fixed 15-document pool containing exactly two gold documents and 13 topically related distractors. The questions are split into bridge, comparison, and yes-no types. The author tests BM25, dense retrieval with all-MiniLM-L6-v2, and a hybrid fusion method, reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid comes out on top with Recall@2 at 0.70 and MRR at 0.93, while bridge questions stay harder at a Recall@2 of 0.67. That gap is the most useful signal in the results. The fixed pool size and known gold count make the setup easy to reproduce and compare against, which is the main practical value. The construction is transparent and the metrics follow directly from standard retrievers applied to the described pools.

The main soft spot is that the distractors are drawn from the same HotpotQA source, so they may not capture the broader topical noise found in real web or enterprise collections. The paper also assumes the original gold labels transfer without error to the new pools, which is plausible but not re-checked. These are normal limitations for a dataset paper rather than fatal ones.

This work is aimed at IR and RAG groups that need repeatable multi-hop retrieval tests. Anyone running retrieval experiments will get immediate use from the released Hugging Face dataset. It deserves peer review because the artifact is new, the evaluation is simple and reproducible, and the numbers are presented without overclaiming.

Referee Report

0 major / 2 minor

Summary. The paper introduces StratRAG, an open-source dataset derived from HotpotQA for evaluating retrieval in multi-hop RAG systems. It contains 2,200 examples across bridge, comparison, and yes-no question types, each paired with a pool of 15 documents (exactly 2 gold documents and 13 topically related distractors). The authors benchmark BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion, reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval achieves the best overall results (Recall@2 = 0.70, MRR = 0.93), while bridge questions remain harder (Recall@2 = 0.67), motivating future reinforcement-learning-based retrieval policies. The dataset is released publicly on Hugging Face.

Significance. If the dataset construction holds, StratRAG provides a timely, reproducible benchmark resource for multi-hop retrieval under noisy conditions, which is increasingly relevant for RAG systems. The empirical results establish clear baselines and isolate the relative difficulty of bridge questions, directly supporting follow-on work on adaptive retrieval policies. The public release and use of standard metrics enhance community utility and reproducibility.

minor comments (2)

[Abstract] The total of 2,200 examples is stated without clarifying the train/validation split sizes or whether the reported metrics are computed only on the validation portion; this detail should be added for clarity.
[Dataset Construction] The exact procedure for selecting the 13 distractors (e.g., similarity threshold or sampling method from HotpotQA) is not fully specified, which could hinder exact replication of the pools.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of StratRAG as a timely and reproducible benchmark for multi-hop retrieval in noisy RAG settings.

Circularity Check

0 steps flagged

No significant circularity; dataset release with direct empirical benchmarks

The paper constructs StratRAG by taking HotpotQA examples, adding 13 distractors per pool, and then applies standard off-the-shelf retrievers (BM25, all-MiniLM-L6-v2 dense, hybrid fusion) to compute Recall@k, MRR, and NDCG using established formulas. No equations, fitted parameters, predictions, or self-citations appear in the load-bearing steps; the reported numbers (e.g., hybrid Recall@2 = 0.70) follow directly from the described construction and metric definitions without reduction to prior author work or internal redefinition.
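For reference, the third metric can be sketched the same way. The code below assumes the common binary-gain form of NDCG (gain 1 for a gold document, 0 otherwise, log2 discount); the paper's exact gain and discount choices are not stated here, and the document ids are hypothetical.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked: Sequence[str], gold: Set[str], k: int = 5) -> float:
    """NDCG@k with binary gains: gain 1 for a gold document, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in gold
    )
    # Ideal ranking places every gold document at the top of the list.
    ideal_hits = min(len(gold), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Both gold documents at ranks 1 and 2: the ideal ordering.
perfect = ndcg_at_k(["g1", "g2", "x", "y", "z"], {"g1", "g2"})
# Second gold document pushed down to rank 3 by a distractor.
partial = ndcg_at_k(["g1", "x", "g2", "y", "z"], {"g1", "g2"})
```

Unlike Recall@2, NDCG@5 still gives partial credit when the second gold document lands at rank 3, 4, or 5, which is why the two metrics can diverge on harder question types.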

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no free parameters, invented entities, or ad-hoc axioms beyond the standard assumption that HotpotQA gold labels remain valid when distractors are added.

axioms (1)

Domain assumption: the HotpotQA distractor setting supplies accurate gold document labels that transfer to the new noisy pools.
The entire dataset is derived from HotpotQA, so correctness rests on this transfer assumption.

pith-pipeline@v0.9.0 · 5479 in / 1162 out tokens · 42145 ms · 2026-05-15T14:28:21.195026+00:00 · methodology
