pith. machine review for the scientific record.

arxiv: 2604.13728 · v1 · submitted 2026-04-15 · 💻 cs.IR · cs.CL


Hybrid Retrieval for COVID-19 Literature: Comparing Rank Fusion and Projection Fusion with Diversity Reranking


Pith reviewed 2026-05-10 12:48 UTC · model grok-4.3

classification 💻 cs.IR cs.CL
keywords: hybrid retrieval · rank fusion · COVID-19 literature · TREC-COVID · diversity reranking · sparse-dense fusion · nDCG evaluation · projection fusion

The pith

RRF fusion achieves the highest relevance in hybrid retrieval for COVID-19 literature with nDCG@10 of 0.828.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates hybrid retrieval systems for COVID-19 scientific literature using the TREC-COVID benchmark of 171,332 papers and 50 expert queries. It compares sparse retrieval with SPLADE, dense retrieval with BGE, rank-level fusion via RRF, and a projection-based vector fusion approach called B5, plus MMR diversity reranking. RRF fusion produces the strongest relevance scores, outperforming dense-only by 6.1 percent and sparse-only by 14.9 percent. The B5 projection variant trades some relevance for 33 percent faster response times and more than double the intra-list diversity. Tests across expert, machine-generated, and paraphrased queries show both pipelines stay under the two-second latency target.

Core claim

The paper shows that reciprocal rank fusion of SPLADE sparse and BGE dense results delivers the best relevance on expert queries, reaching nDCG@10 of 0.828. The B5 projection fusion reaches nDCG@10 of 0.678 but runs in 847 ms versus 1271 ms for RRF and yields 2.2 times higher ILD@10. MMR reranking boosts diversity by 23.8 to 24.5 percent at a 20.4 to 25.4 percent relevance cost. B5 shows its largest gain on keyword-heavy query reformulations.
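
As context for these numbers, here is a minimal sketch of the two headline metrics under their standard definitions (log2-discounted nDCG; ILD as the mean pairwise 1 − cosine distance over the top-k result embeddings). The paper's exact gain scale and distance function are not reproduced here, so treat both as assumptions:

```python
import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k: discounted cumulative gain of the top-k ranking,
    normalized by the DCG of the ideal (relevance-sorted) ranking."""
    gains = [relevance.get(doc, 0) for doc in ranked_ids[:k]]
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(gains))
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def ild_at_k(vectors, k=10):
    """ILD@k: mean pairwise distance (1 - cosine similarity)
    among the top-k results' embedding vectors."""
    top = vectors[:k]

    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den if den else 0.0

    pairs = [(i, j) for i in range(len(top)) for j in range(i + 1, len(top))]
    if not pairs:
        return 0.0
    return sum(1 - cos(top[i], top[j]) for i, j in pairs) / len(pairs)
```

A perfectly ordered ranking scores nDCG@k = 1.0, and a top-k list of mutually orthogonal vectors scores ILD@k = 1.0, which is why ILD rewards varied result lists.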

What carries the argument

Reciprocal rank fusion (RRF) and projection-based vector fusion (B5) applied to SPLADE and BGE retrievers, followed by MMR diversity reranking.
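
The rank-level fusion step can be sketched as follows. The constant k = 60 is the conventional value from the original RRF formulation; the paper's own setting and fusion depth are assumptions here:

```python
def rrf_fuse(rankings, k=60, depth=100):
    """Reciprocal Rank Fusion: each input list contributes 1/(k + rank)
    for every document it ranks; documents are re-sorted by the sum."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc in enumerate(ranked_ids[:depth], start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sparse (SPLADE-style) and dense (BGE-style) result lists:
# a document ranked moderately by both retrievers can outrank one that
# only a single retriever placed near the top.
sparse = ["d1", "d2", "d3", "d4"]
dense = ["d3", "d2", "d5", "d1"]
fused = rrf_fuse([sparse, dense])
```

Because RRF uses only ranks, it needs no score normalization across the sparse and dense retrievers, which is part of its appeal as a fusion baseline.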

If this is right

  • RRF fusion is the strongest choice when maximum relevance is the priority on expert COVID-19 queries.
  • B5 projection fusion offers a practical speed-diversity trade-off for applications that value faster responses and varied result lists.
  • MMR reranking reliably increases intra-list diversity by roughly 24 percent across fusion methods.
  • Both fusion approaches meet sub-2-second latency on expert, machine-generated, and paraphrased queries.
  • Performance patterns remain consistent when queries are expanded to 400 total variants including paraphrases.
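
The MMR step the bullets refer to can be sketched as a greedy trade-off between query relevance and redundancy with already-selected documents. The trade-off weight lam and the similarity function are illustrative assumptions, not the paper's reported configuration:

```python
def mmr_rerank(candidates, rel, sim, lam=0.5, k=10):
    """Maximal Marginal Relevance: greedily pick the candidate maximizing
    lam * relevance(d) - (1 - lam) * max similarity to selected docs."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def mmr_score(d):
            max_sim = max((sim(d, s) for s in selected), default=0.0)
            return lam * rel[d] - (1 - lam) * max_sim
        best = max(pool, key=mmr_score)
        selected.append(best)
        pool.remove(best)
    return selected
```

With lam = 0.5, a near-duplicate of an already-selected document is penalized enough that a less relevant but dissimilar document is promoted, which is exactly the diversity-for-relevance trade the paper quantifies.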

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The speed and diversity advantages of projection fusion suggest it could suit interactive literature search tools where users scan many results quickly.
  • Testing on paraphrased queries indicates hybrid systems may handle varied user phrasing better than single retrievers alone.
  • The Streamlit deployment shows how these pipelines can be turned into accessible web applications for domain researchers.
  • The relative gains on keyword-heavy reformulations point to possible benefits in other scientific search tasks that mix technical terms and natural language.

Load-bearing premise

The TREC-COVID benchmark with its 50 expert queries and the specific SPLADE and BGE implementations are representative enough to support general claims about hybrid retrieval performance.

What would settle it

Running the same RRF and B5 pipelines on a different large document collection with at least 200 queries and finding no relevance gain over the best single retriever would show the results do not generalize.

Figures

Figures reproduced from arXiv: 2604.13728 by Harishkumar Kishorkumar Prajapati.

Figure 1: RRF retrieval pipeline. A user query is encoded with both SPLADE …
Original abstract

We present a hybrid retrieval system for COVID-19 scientific literature, evaluated on the TREC-COVID benchmark (171,332 papers, 50 expert queries). The system implements six retrieval configurations spanning sparse (SPLADE), dense (BGE), rank-level fusion (RRF), and a projection-based vector fusion (B5) approach. RRF fusion achieves the best relevance (nDCG@10 = 0.828), outperforming dense-only by 6.1% and sparse-only by 14.9%. Our projection fusion variant reaches nDCG@10 = 0.678 on expert queries while being 33% faster (847 ms vs. 1271 ms) and producing 2.2x higher ILD@10 than RRF. Evaluation across 400 queries -- including expert, machine-generated, and three paraphrase styles -- shows that B5 delivers the largest relative gain on keyword-heavy reformulations (+8.8%), although RRF remains best in absolute nDCG@10. On expert queries, MMR reranking increases intra-list diversity by 23.8-24.5% at a 20.4-25.4% nDCG@10 cost. Both fusion pipelines evaluated for latency remain below the sub-2 s target across all query sets. The system is deployed as a Streamlit web application backed by Pinecone serverless indices.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a hybrid retrieval system for COVID-19 scientific literature evaluated on the TREC-COVID benchmark (171,332 papers, 50 expert queries). It compares six configurations: sparse retrieval (SPLADE), dense retrieval (BGE), rank-level fusion via Reciprocal Rank Fusion (RRF), projection-based vector fusion (B5), and variants incorporating MMR reranking for diversity. The central empirical claims are that RRF achieves the highest nDCG@10 of 0.828 on expert queries (outperforming dense-only by 6.1% and sparse-only by 14.9%), while B5 offers lower nDCG@10 (0.678) but 33% faster latency (847 ms) and 2.2x higher ILD@10; MMR boosts diversity by 23.8-24.5% at a 20.4-25.4% nDCG cost. Results are also reported across 400 queries (expert, machine-generated, and paraphrased styles), with all pipelines meeting sub-2s latency, and the system is deployed as a Streamlit app using Pinecone indices.

Significance. If the results hold after addressing statistical validation, the work provides a practical, domain-specific case study on trade-offs between relevance, latency, and diversity in hybrid retrieval, with direct applicability to real-time systems. The multi-style query evaluation and explicit latency/ILD metrics add applied value for IR practitioners building COVID-19 or similar literature search tools. The deployment detail strengthens the systems contribution.

major comments (2)
  1. [Abstract] The assertion that RRF 'achieves the best relevance' (nDCG@10 = 0.828, +6.1% over dense-only, +14.9% over sparse-only) rests on point estimates from 50 queries without per-query scores, standard deviation, or any statistical significance test (paired t-test or Wilcoxon signed-rank). On a small query set, these deltas are vulnerable to topic-specific noise and may not reach p<0.05, so the 'best' ranking claim is not yet supported.
  2. [Evaluation] Evaluation across query sets: While results are reported on 400 queries (including machine-generated and paraphrased variants), the primary performance claims and outperformance statements remain anchored to the 50 expert queries; no variance analysis or significance testing is described for either set, leaving the generalizability of the fusion comparisons unclear.
minor comments (2)
  1. [Method] The exact vector projection mechanism for B5 and its relation to standard dense retrieval (BGE) is described at a high level only; adding a short equation or pseudocode would improve reproducibility.
  2. [Results] Hardware, batch size, and measurement protocol for the reported latency figures (847 ms vs. 1271 ms) are not stated, which affects interpretation of the 33% speedup claim.
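
For reference, the kind of measurement protocol this comment asks for can be as simple as the sketch below, where run_query is a hypothetical hook standing in for either pipeline (not an API from the paper):

```python
import statistics
import time

def measure_latency(run_query, queries, warmup=5, repeats=3):
    """End-to-end per-query latency: discard warm-up calls (cache and
    lazy-loading effects), time each query several times, and report
    the median and worst per-query medians in milliseconds."""
    for q in queries[:warmup]:
        run_query(q)  # warm caches / trigger lazy model loading
    timings_ms = []
    for q in queries:
        runs = []
        for _ in range(repeats):
            t0 = time.perf_counter()
            run_query(q)
            runs.append((time.perf_counter() - t0) * 1000.0)
        timings_ms.append(statistics.median(runs))
    return statistics.median(timings_ms), max(timings_ms)
```

Stating warm-up policy, repeat count, and whether figures are means or medians (plus the hardware) would make the 847 ms vs. 1271 ms comparison interpretable.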

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for statistical validation in our empirical claims. We agree that this strengthens the manuscript and will incorporate the requested analyses. Below we respond point-by-point to the major comments.

Point-by-point responses
  1. Referee: [Abstract] The assertion that RRF 'achieves the best relevance' (nDCG@10 = 0.828, +6.1% over dense-only, +14.9% over sparse-only) rests on point estimates from 50 queries without per-query scores, standard deviation, or any statistical significance test (paired t-test or Wilcoxon signed-rank). On a small query set, these deltas are vulnerable to topic-specific noise and may not reach p<0.05, so the 'best' ranking claim is not yet supported.

    Authors: We agree that the current claims rely on point estimates without statistical support. In the revised manuscript we will report per-query nDCG@10 values, standard deviations across the 50 expert queries, and apply paired significance tests (Wilcoxon signed-rank) to confirm whether the observed improvements over dense-only and sparse-only baselines reach statistical significance. This will allow readers to assess the robustness of the 'best relevance' statement. revision: yes

  2. Referee: [Evaluation] Evaluation across query sets: While results are reported on 400 queries (including machine-generated and paraphrased variants), the primary performance claims and outperformance statements remain anchored to the 50 expert queries; no variance analysis or significance testing is described for either set, leaving the generalizability of the fusion comparisons unclear.

    Authors: We acknowledge the absence of variance analysis and significance testing across query sets. We will revise the evaluation section to include standard deviations and paired statistical tests for all 400 queries (expert, machine-generated, and paraphrased styles). We will also add a brief discussion of how the relative performance of RRF and B5 generalizes (or varies) across query styles based on these new analyses. revision: yes
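
The promised significance analysis could be implemented along these lines. This sketch uses a paired sign-flip randomization test, a common nonparametric alternative to the Wilcoxon signed-rank test named in the review, applied to per-query nDCG@10 scores; the inputs are placeholders, not the paper's data:

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided paired sign-flip randomization test.

    Under the null hypothesis the per-query differences are symmetric
    around zero, so each difference's sign may be flipped at random.
    The p-value is the fraction of sign-flipped replicates whose mean
    absolute difference is at least as large as the observed one."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / trials
```

Run with per-query nDCG@10 for RRF as scores_a and for the dense-only baseline as scores_b; with only 50 queries, reporting the per-query deltas alongside the p-value is what lets readers judge robustness.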

Circularity Check

0 steps flagged

No circularity: empirical systems comparison on public benchmark

full rationale

The paper reports experimental results from running six retrieval configurations (SPLADE, BGE, RRF, projection fusion B5, MMR reranking) on the TREC-COVID benchmark and measuring nDCG@10, latency, and ILD@10. No equations, derivations, fitted parameters, or self-citation chains are used to derive the central claims; the nDCG@10 values are direct outputs of the retrieval runs. The evaluation is self-contained against the external benchmark and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical model, free parameters, axioms, or invented entities are present; the paper is a pure empirical comparison of retrieval pipelines.

pith-pipeline@v0.9.0 · 5558 in / 1093 out tokens · 25496 ms · 2026-05-10T12:48:31.994647+00:00 · methodology

