StratRAG: A Multi-Hop Retrieval Evaluation Dataset for Retrieval-Augmented Generation Systems
Pith reviewed 2026-05-15 14:28 UTC · model grok-4.3
The pith
Hybrid retrieval outperforms other methods on the StratRAG dataset of noisy multi-hop questions, though bridge questions remain harder.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
StratRAG comprises 2,200 multi-hop examples across bridge, comparison, and yes-no question types, each paired with a pool of 15 candidate documents containing exactly 2 gold documents and 13 topically related distractors. Benchmarks of BM25, dense retrieval, and hybrid fusion on the validation set show hybrid retrieval achieving the best overall performance with Recall@2 = 0.70 and MRR = 0.93, yet bridge questions remain substantially harder with Recall@2 = 0.67.
What carries the argument
The StratRAG dataset of 2,200 queries, each with a fixed 15-document pool containing exactly two gold documents amid 13 distractors, used to measure retrieval performance across question types.
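The fixed pool layout described above can be sketched as a small record with an invariant check. This is a minimal illustration, not the dataset's actual schema: the field names (`question`, `question_type`, `doc_ids`, `gold_ids`) and the helper `check_pool` are hypothetical; only the pool size of 15 and the 2 gold documents come from the paper.

```python
from dataclasses import dataclass
from typing import List

POOL_SIZE = 15   # documents per query, as stated in the paper
NUM_GOLD = 2     # gold documents per pool, as stated in the paper


@dataclass
class Example:
    # Hypothetical field names; the released dataset may use different ones.
    question: str
    question_type: str      # "bridge", "comparison", or "yes-no"
    doc_ids: List[str]      # the candidate pool
    gold_ids: List[str]     # the documents labelled gold


def check_pool(ex: Example) -> bool:
    """Return True if the pool matches the stated 2-gold / 13-distractor layout."""
    pool = set(ex.doc_ids)
    gold = set(ex.gold_ids)
    return (
        len(ex.doc_ids) == POOL_SIZE
        and len(pool) == POOL_SIZE   # no duplicate documents
        and len(gold) == NUM_GOLD
        and gold <= pool             # every gold document is in the pool
    )


# Placeholder example, invented for illustration only.
example = Example(
    question="(placeholder question)",
    question_type="bridge",
    doc_ids=[f"d{i}" for i in range(15)],
    gold_ids=["d3", "d11"],
)
```

A check of this kind is one way to confirm that gold labels still point inside the pool after the pools are rebuilt, which is the premise the benchmark rests on.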
If this is right
- Hybrid fusion retrieval achieves the highest Recall@2, MRR, and NDCG@5 among the three tested strategies on the full StratRAG validation set.
- Bridge questions show measurably lower recall than comparison and yes-no questions under the same noisy conditions.
- The reported metrics motivate the design of new retrieval policies, such as reinforcement-learning-based selection, to handle harder question types.
- The fixed noisy pools allow direct comparison of retrieval methods on the same multi-hop instances.
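The per-type comparison in the points above amounts to averaging Recall@2 separately for each question type. A minimal sketch follows, assuming each result is a `(question_type, ranked_doc_ids, gold_doc_ids)` tuple; that layout, the function names, and the toy rankings are all invented for illustration and are not StratRAG data.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def recall_at_k(ranked: Sequence[str], gold: Sequence[str], k: int) -> float:
    """Fraction of gold documents that appear in the top-k of the ranking."""
    top = set(ranked[:k])
    return sum(1 for g in gold if g in top) / len(gold)


def recall_by_type(
    results: List[Tuple[str, Sequence[str], Sequence[str]]], k: int = 2
) -> Dict[str, float]:
    """Average Recall@k per question type."""
    buckets = defaultdict(list)
    for qtype, ranked, gold in results:
        buckets[qtype].append(recall_at_k(ranked, gold, k))
    return {t: sum(v) / len(v) for t, v in buckets.items()}


# Toy rankings, invented for illustration only.
toy = [
    ("bridge", ["a", "x", "b"], ["a", "b"]),      # one gold in top-2
    ("bridge", ["a", "b", "x"], ["a", "b"]),      # both gold in top-2
    ("comparison", ["c", "d", "x"], ["c", "d"]),  # both gold in top-2
]
per_type = recall_by_type(toy)
```

With exactly two gold documents per pool, Recall@2 for a single query can only be 0, 0.5, or 1, so an average such as 0.67 reflects a mix of full and partial hits.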
Where Pith is reading between the lines
- The dataset could serve as a training signal for retrieval models that learn to prioritize documents needed for chain-of-thought reasoning.
- Similar construction methods might expose limitations in current dense embeddings when the required documents are only indirectly related to the query.
- Integrating StratRAG-style noise into existing RAG pipelines could surface accuracy drops on multi-hop queries that current clean-pool evaluations miss.
Load-bearing premise
The 13 distractors drawn from HotpotQA sufficiently represent real-world topical noise and the original gold labels transfer accurately to the new pools.
What would settle it
A new retrieval method that reaches Recall@2 above 0.80 on bridge questions without lowering overall scores, or a measurement showing real-world document pools contain substantially different noise distributions than the HotpotQA distractors.
Original abstract
We introduce StratRAG, an open-source retrieval evaluation dataset for benchmarking Retrieval-Augmented Generation (RAG) systems on multi-hop reasoning tasks under realistic, noisy document-pool conditions. Derived from HotpotQA (distractor setting), StratRAG comprises 2,200 examples across three question types -- bridge, comparison, and yes-no -- each paired with a pool of 15 candidate documents containing exactly 2 gold documents and 13 topically related distractors. We benchmark three retrieval strategies -- BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion -- reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval achieves the best overall performance (Recall@2 = 0.70, MRR = 0.93), yet bridge questions remain substantially harder (Recall@2 = 0.67), motivating future work on reinforcement-learning-based retrieval policies. StratRAG is publicly available at https://huggingface.co/datasets/Aryanp088/StratRAG.
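The abstract does not say how the hybrid fusion combines the BM25 and dense rankings. As one plausible stand-in, the sketch below uses reciprocal rank fusion (RRF), a common rank-based fusion method; the choice of RRF, the constant `c = 60`, and the toy rankings are assumptions, not the authors' stated method.

```python
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], c: int = 60) -> List[str]:
    """Fuse ranked lists: each document scores the sum of 1 / (c + rank)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (c + rank)
    # Sort by descending score; break ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy rankings standing in for BM25 and dense-retriever output.
bm25_ranking = ["d1", "d2", "d3", "d4"]
dense_ranking = ["d3", "d1", "d4", "d2"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

RRF works on ranks rather than raw scores, so it needs no normalization between BM25 scores and cosine similarities. A weighted sum of normalized scores would be an equally reasonable reading of "hybrid fusion".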
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces StratRAG, an open-source dataset derived from HotpotQA for evaluating retrieval in multi-hop RAG systems. It contains 2,200 examples across bridge, comparison, and yes-no question types, each paired with a pool of 15 documents (exactly 2 gold documents and 13 topically related distractors). The authors benchmark BM25, dense retrieval (all-MiniLM-L6-v2), and hybrid fusion, reporting Recall@k, MRR, and NDCG@5 on the validation set. Hybrid retrieval achieves the best overall results (Recall@2 = 0.70, MRR = 0.93), while bridge questions remain harder (Recall@2 = 0.67), motivating future reinforcement-learning-based retrieval policies. The dataset is released publicly on Hugging Face.
Significance. If the dataset construction holds, StratRAG provides a timely, reproducible benchmark resource for multi-hop retrieval under noisy conditions, which is increasingly relevant for RAG systems. The empirical results establish clear baselines and isolate the relative difficulty of bridge questions, directly supporting follow-on work on adaptive retrieval policies. The public release and use of standard metrics enhance community utility and reproducibility.
minor comments (2)
- [Abstract] The total of 2,200 examples is stated without giving the train/validation split sizes or confirming that the reported metrics are computed only on the validation portion; both details should be added.
- [Dataset Construction] The exact procedure for selecting the 13 distractors (e.g., similarity threshold or sampling method from HotpotQA) is not fully specified, which could hinder exact replication of the pools.
Simulated Author's Rebuttal
We thank the referee for their positive review and recommendation to accept the manuscript. We appreciate the recognition of StratRAG as a timely and reproducible benchmark for multi-hop retrieval in noisy RAG settings.
Circularity Check
No significant circularity; dataset release with direct empirical benchmarks
full rationale
The paper constructs StratRAG by taking HotpotQA examples, adding 13 distractors per pool, and then applies standard off-the-shelf retrievers (BM25, all-MiniLM-L6-v2 dense, hybrid fusion) to compute Recall@k, MRR, and NDCG using established formulas. No equations, fitted parameters, predictions, or self-citations appear in the load-bearing steps; the reported numbers (e.g., hybrid Recall@2 = 0.70) follow directly from the described construction and metric definitions without reduction to prior author work or internal redefinition.
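For reference, the two ranking metrics mentioned above can be sketched for a single query with binary relevance. The conventions here are assumptions rather than details confirmed by the paper: the reciprocal rank uses the first gold document found, and NDCG uses a log2 discount with the ideal ranking placing all gold documents first. The toy ranking is invented.

```python
import math
from typing import Sequence


def reciprocal_rank(ranked: Sequence[str], gold: Sequence[str]) -> float:
    """1 / rank of the first gold document, or 0.0 if none is retrieved."""
    gold_set = set(gold)
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in gold_set:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: Sequence[str], gold: Sequence[str], k: int = 5) -> float:
    """Binary-relevance NDCG@k with a log2 position discount."""
    gold_set = set(gold)
    dcg = sum(
        1.0 / math.log2(pos + 1)
        for pos, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in gold_set
    )
    # Ideal DCG: all gold documents ranked at the top positions.
    ideal_hits = min(len(gold_set), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking: gold documents at ranks 1 and 3.
ranked = ["g1", "x1", "g2", "x2", "x3"]
gold = ["g1", "g2"]
rr = reciprocal_rank(ranked, gold)
ndcg = ndcg_at_k(ranked, gold, k=5)
```

MRR is the mean of `reciprocal_rank` over all queries, so a value near 0.93 would indicate that a gold document sits at rank 1 for most queries even when the second gold document falls outside the top 2.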
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the HotpotQA distractor setting supplies accurate gold-document labels that transfer to the new noisy pools.