Beyond the Reranker: Do RAG Retrieval Enhancements Help Once a Strong Reranker Is Present?

Allam Reddy; Manan Chopra; Sadanand Singh

arxiv: 2606.28367 · v1 · pith:AL5WNVROnew · submitted 2026-06-14 · 💻 cs.IR · cs.AI

Beyond the Reranker: Do RAG Retrieval Enhancements Help Once a Strong Reranker Is Present?

Sadanand Singh , Allam Reddy , Manan Chopra This is my paper

Pith reviewed 2026-06-30 10:57 UTC · model grok-4.3

classification 💻 cs.IR cs.AI

keywords RAG retrievalrerankerquery expansionheterogeneous documentsHetDocQASSCCbenchmark evaluation

0 comments

The pith

A strong cross-encoder reranker accounts for most RAG pipeline quality, with only query expansion and SSCC adding reliable gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests common retrieval enhancements in RAG pipelines once a strong reranker is already applied, moving beyond tests limited to homogeneous Wikipedia-style text. It introduces HetDocQA, a benchmark with chunker-agnostic span-overlap labels on mixed collections of code, tables, PDFs, and prose, and compares eight methods against homogeneous controls using bootstrap intervals and corrections. Results show the reranker drives the bulk of quality, query expansion improves results consistently, and the new SSCC method adds gains only on heterogeneous data while hierarchical summarization, graph expansion, routing, and rank fusion add none.

Core claim

A strong cross-encoder reranker accounts for most of the pipeline's quality; beyond it, only two methods yield reliable gains: query expansion and SSCC. SSCC, a per-source calibrated corrector introduced here, sets a separate acceptance threshold for each score source and helps only on heterogeneous data. The remaining reranking and pool-expansion methods in common use, among them hierarchical summarization, graph expansion, routing, and rank fusion, give no reliable gain once that reranker is present.

What carries the argument

HetDocQA benchmark with chunker-agnostic span-overlap relevance labels, evaluated on a shared backbone with cross-encoder reranker and multiple-comparison-corrected bootstrap intervals across heterogeneous and homogeneous collections.

If this is right

Query expansion produces reliable gains even after a strong reranker is applied.
SSCC improves performance only when the collection contains heterogeneous document formats.
Hierarchical summarization, graph expansion, per-query routing, and rank fusion add no reliable improvement once the reranker is present.
Benefits observed for retrieval methods on homogeneous corpora do not appear on the mixed-format collections typical in practice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

RAG systems could simplify by focusing compute on the reranker stage rather than adding multiple retrieval layers.
Future RAG benchmarks should include heterogeneous document mixes to avoid overstating method gains.
SSCC may generalize to any retrieval setup that combines multiple score sources with differing calibration.

Load-bearing premise

Span-overlap relevance labels in HetDocQA accurately reflect downstream question-answering utility on mixed-format documents.

What would settle it

An end-to-end QA accuracy measurement on a real mixed-format corpus showing that graph expansion or rank fusion improves answers when paired with the same strong reranker and a different relevance labeling scheme.

Figures

Figures reproduced from arXiv: 2606.28367 by Allam Reddy, Manan Chopra, Sadanand Singh.

**Figure 1.** Figure 1: Methods by pipeline stage. Each method evaluated in this study is drawn under the stage of the shared pipeline it acts on, and shaded by whether its effect on HetDocQA is significant after Holm–Bonferroni correction (Section 6). The two methods with a reliable effect act on the query (HyDE) and at the answer decision (SSCC). Those that rerank or expand the candidate pool (hierarchical summarization, the cr… view at source ↗

**Figure 2.** Figure 2: HetDocQA construction and the fairness mechanism. (a) Permissive sources become mixed-format collections; a model distinct from any evaluated generator drafts questions, which are filtered, decontaminated, and human-validated, then split disjointly by collection. (b) Each evidence label is a character span in the source document. At evaluation, each system’s own chunks count as relevant when they cover at … view at source ↗

**Figure 3.** Figure 3: Effect of the reranker on recall. Recall@k on HetDocQA dev, with all configurations sharing the same first-stage retriever. Adding the cross-encoder reranker raises the curve sharply, whereas adding the pool-expansion methods (RAPTOR, graph, and all combined) on top of the reranker changes it little [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Routing degeneracy. Per-query nearest-neighbor distances are nearly identical across queries (left); the entropy that EGR routes on sits at its theoretical ceiling with very little variance (center); and the per-query best strategy gives almost no advantage over a single fixed strategy (right). The signal a per-query router would need is not present. MuSiQue (prose) QASPER (sci-prose) HetDocQA (heterog.) i… view at source ↗

**Figure 6.** Figure 6: Why one threshold fails. Relevance-score distributions for relevant vs. non-relevant passages under the bi-encoder (left) and the cross-encoder (right) on HetDocQA. The two sources separate relevant from non-relevant at very different score ranges, so the per-source thresholds SSCC fits (dashed) differ substantially; a single global threshold cannot be right for both. A query-side probe. If the gains that … view at source ↗

**Figure 7.** Figure 7: Difficulty by question type and modality. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Modality-aware query expansion, three-arm comparison. [PITH_FULL_IMAGE:figures/full_fig_p014_8.png] view at source ↗

read the original abstract

Retrieval-augmented generation (RAG) is routinely extended with methods meant to improve retrieval: query expansion, hierarchical and cross-document summarization, graph-based expansion, per-query routing, rank fusion, and corrective re-retrieval. The benefits reported for these methods come almost exclusively from homogeneous corpora, predominantly Wikipedia prose. Whether they hold on the mixed-format collections common in practice, where code, markdown, tables, scientific PDFs, and prose are interleaved within one corpus, has not been measured. To study this directly, we build \textbf{HetDocQA}, a heterogeneous benchmark with \emph{chunker-agnostic} span-overlap relevance labels and collection-disjoint splits, and pair it with MuSiQue and QASPER as homogeneous controls. We evaluate eight methods on a shared backbone, with bootstrap confidence intervals and multiple-comparison correction. A strong cross-encoder reranker accounts for most of the pipeline's quality; beyond it, only two methods yield reliable gains: query expansion and SSCC. SSCC, a per-source calibrated corrector introduced here, sets a separate acceptance threshold for each score source and helps only on heterogeneous data. The remaining reranking and pool-expansion methods in common use, among them hierarchical summarization, graph expansion, routing, and rank fusion, give no reliable gain once that reranker is present.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A strong reranker makes most retrieval add-ons redundant on mixed data, with the new benchmark and SSCC as the main contributions.

read the letter

The central result is that a strong cross-encoder reranker carries most of the quality in these pipelines. Once it is in place, the usual retrieval extras—hierarchical summarization, graph expansion, routing, rank fusion—show no reliable lift on the new heterogeneous test set. Query expansion still helps a little, and their SSCC method helps more on mixed-format collections.

The paper introduces HetDocQA with chunker-agnostic span-overlap labels and collection-disjoint splits, plus the SSCC corrector that calibrates thresholds per source. They run the comparison against MuSiQue and QASPER as homogeneous controls, using bootstrap intervals and multiple-comparison correction. That setup is cleaner than most prior RAG ablation work.

The experimental controls are the strongest part. Shared backbone, eight methods, proper error bars, and explicit focus on heterogeneous versus homogeneous data all make the negative findings easier to trust.

The main limitation is the label construction. Span overlap is straightforward but can undervalue paraphrases, cross-format synthesis, or partial matches that matter for actual QA on code, tables, and PDFs. If the labels systematically miss those cases, the claim that the other methods add nothing could be narrower than it appears. The backbone choice is reasonable but not the only strong option in use.

This is useful for teams running RAG on real mixed corpora who want evidence on where to stop adding components. The benchmark and the controlled negative results are worth checking. Send it to peer review; the design is solid enough to justify referee time even if the label question needs more discussion.

Referee Report

2 major / 2 minor

Summary. The paper constructs HetDocQA, a heterogeneous document QA benchmark using chunker-agnostic span-overlap relevance labels and collection-disjoint splits, and evaluates eight RAG retrieval enhancement methods (including query expansion, hierarchical summarization, graph expansion, routing, rank fusion, and the new SSCC) against homogeneous controls (MuSiQue, QASPER). Using a shared backbone and strong cross-encoder reranker, with bootstrap confidence intervals and multiple-comparison correction, it concludes that the reranker accounts for most pipeline quality and that only query expansion and SSCC yield reliable additional gains on heterogeneous data; the other methods show no reliable gains once the reranker is present.

Significance. If the results hold, the work indicates that many standard RAG retrieval enhancements add little value beyond a strong reranker on mixed-format corpora, which could simplify production pipelines and redirect research effort. Strengths include the new benchmark addressing a gap in heterogeneous evaluation, explicit use of bootstrap CIs with multiple-comparison correction for reliable empirical claims, collection-disjoint splits, and the introduction of SSCC which shows targeted gains on heterogeneous data.

major comments (2)

[§3] §3 (HetDocQA construction): The central claim that only query expansion and SSCC yield reliable gains (while hierarchical summarization, graph expansion, routing, and rank fusion do not) depends on span-overlap labels being faithful proxies for downstream QA utility. On mixed-format collections (code, tables, PDFs), these labels risk missing paraphrases, cross-format synthesis, and partial matches that matter for actual utility, which could make the 'no reliable gain' results metric artifacts rather than pipeline behavior.
[§4] §4 (experimental results): The no-gain conclusions for reranking and pool-expansion methods are reported with bootstrap CIs and correction, but without any reported correlation between the span-overlap labels and end-task QA accuracy on heterogeneous documents, it is unclear whether the metric choice drives the differential effectiveness claims across methods.

minor comments (2)

The abstract states the main finding clearly but could specify the exact number of methods and the shared backbone model used for all comparisons.
SSCC is described as setting 'a separate acceptance threshold for each score source'; a short additional sentence on how thresholds are calibrated (e.g., on a validation split) would improve clarity without lengthening the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. The points raised about the proxy nature of span-overlap labels are substantive and merit explicit discussion in the manuscript. We respond to each major comment below and will incorporate clarifications via partial revisions.

read point-by-point responses

Referee: [§3] §3 (HetDocQA construction): The central claim that only query expansion and SSCC yield reliable gains (while hierarchical summarization, graph expansion, routing, and rank fusion do not) depends on span-overlap labels being faithful proxies for downstream QA utility. On mixed-format collections (code, tables, PDFs), these labels risk missing paraphrases, cross-format synthesis, and partial matches that matter for actual utility, which could make the 'no reliable gain' results metric artifacts rather than pipeline behavior.

Authors: We agree that span-overlap labels are an imperfect proxy and may miss paraphrases, cross-format synthesis, or partial matches relevant to actual QA utility on heterogeneous collections. This limitation is inherent to many lexical overlap metrics. The design choice was driven by the requirement for chunker-agnostic labels to enable fair evaluation across retrieval methods with varying chunking strategies, which is a core contribution of HetDocQA. Similar span-based labeling is standard in IR benchmarks. We will add an explicit discussion of this proxy limitation and its implications for result interpretation in §3, while keeping the core empirical claims tied to the retrieval metric. revision: partial
Referee: [§4] §4 (experimental results): The no-gain conclusions for reranking and pool-expansion methods are reported with bootstrap CIs and correction, but without any reported correlation between the span-overlap labels and end-task QA accuracy on heterogeneous documents, it is unclear whether the metric choice drives the differential effectiveness claims across methods.

Authors: We did not report a correlation between span-overlap labels and end-task QA accuracy, as the study centers on retrieval performance metrics under a consistent labeling scheme, following standard practice in RAG retrieval research. Computing such a correlation would necessitate full end-to-end QA experiments for every method on heterogeneous documents, which lies outside the current scope though it represents a useful future direction. The bootstrap CIs and multiple-comparison correction provide statistical support for the retrieval-level claims. We will add a clarifying paragraph in the discussion section acknowledging the proxy nature and the absence of direct QA correlation. revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on new benchmark

full rationale

The paper reports experimental results from evaluating retrieval methods on the newly constructed HetDocQA benchmark (with MuSiQue and QASPER controls), using bootstrap confidence intervals and multiple-comparison correction. Central claims rest on direct measurements of retrieval quality with a shared backbone and cross-encoder reranker; no equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the load-bearing steps. The introduction of SSCC is an empirical method contribution, not a self-referential reduction. This matches the default case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the validity of the new benchmark labels and the statistical procedures used to declare 'reliable gains.' No free parameters are fitted to produce the headline result; the new entities are the benchmark and SSCC itself.

axioms (2)

standard math Bootstrap resampling yields valid confidence intervals for retrieval metrics
Invoked to support the reliability of reported gains.
standard math Multiple-comparison correction appropriately controls error rate across eight methods
Used to declare which methods yield reliable gains.

invented entities (2)

HetDocQA benchmark no independent evidence
purpose: Provide chunker-agnostic span-overlap labels and collection-disjoint splits for heterogeneous mixed-format corpora
Newly constructed for this study; no external validation cited in abstract.
SSCC (per-source calibrated corrector) no independent evidence
purpose: Apply separate acceptance thresholds per score source to improve retrieval on heterogeneous data
Introduced in this paper as the second method that yields reliable gains.

pith-pipeline@v0.9.1-grok · 5781 in / 1592 out tokens · 51968 ms · 2026-06-30T10:57:29.560134+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

58 extracted references · 6 canonical work pages · 6 internal anchors

[1]

Retrieval-augmented generation for knowledge- intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge- intensive NLP tasks. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2020

2020
[2]

REALM: Retrieval- augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval- augmented language model pre-training. InInter- national Conference on Machine Learning (ICML), 2020

2020
[3]

Leveraging passage retrieval with generative models for open domain question answering

Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. InProceedings of the European Chapter of the ACL (EACL), 2021

2021
[4]

Precise zero-shot dense retrieval without relevance labels

Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. InProceedings of the Association for Computational Linguistics (ACL), 2023

2023
[5]

Chi, Quoc V

Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou. Take a step back: Evoking reasoning via abstraction in large language models. InInter- national Conference on Learning Representations (ICLR), 2024

2024
[6]

Query2doc: Query expansion with large language models

Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. In Proceedings of Empirical Methods in Natural Lan- guage Processing (EMNLP), 2023

2023
[7]

Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree- organized retrieval. InInternational Conference on Learning Representations (ICLR), 2024

2024
[8]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

HippoRAG: Neu- robiologically inspired long-term memory for large language models

Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neu- robiologically inspired long-term memory for large language models. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2024

2024
[10]

Corrective Retrieval Augmented Generation

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884, 2024. 9

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Self-RAG: Learning to re- trieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to re- trieve, generate, and critique through self-reflection. InInternational Conference on Learning Represen- tations (ICLR), 2024

2024
[12]

Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2023

2023
[13]

SoyeongJeong, JinheonBaek, SukminCho, SungJu Hwang, and Jong C. Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. InProceedings of the North American Chapter of the ACL (NAACL), 2024

2024
[14]

Cormack, Charles L

Gordon V. Cormack, Charles L. A. Clarke, and Ste- fan Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. InProceedings of ACM SIGIR, 2009

2009
[15]

Natural questions: A bench- mark for question answering research.Transactions of the Association for Computational Linguistics (TACL), 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Al- berti, Danielle Epstein, Illia Polosukhin, Jacob De- vlin, Kenton Lee, et al. Natural questions: A bench- mark for question answering research.Transactions of the Association for Computational Linguistics (TACL), 2019

2019
[16]

Cohen, Ruslan Salakhutdinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben- gio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answer- ing. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2018

2018
[17]

MuSiQue: Multi- hop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multi- hop questions via single-hop question composition. Transactions of the Association for Computational Linguistics (TACL), 2022

2022
[18]

Constructing a multi- hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sug- awara, and Akiko Aizawa. Constructing a multi- hop QA dataset for comprehensive evaluation of reasoning steps. InProceedings of the International Conference on Computational Linguistics (COL- ING), 2020

2020
[19]

Smith, and Matt Gardner

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the North American Chapter of the ACL (NAACL), 2021

2021
[20]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thomp- son, He He, and Samuel R. Bowman. QuALITY: Question answering with long input texts, yes! In Proceedings of the North American Chapter of the ACL (NAACL), 2022

2022
[21]

Cross-granularity hypergraph retrieval-augmented generation for multi-hop ques- tion answering, 2025

Changjian Wang, Weihong Deng, Weili Guan, Quan Lu, and Ning Jiang. Cross-granularity hypergraph retrieval-augmented generation for multi-hop ques- tion answering, 2025

2025
[22]

A-RAG: Scaling agentic retrieval-augmented generation via hierarchical retrieval interfaces, 2026

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, and Zhendong Mao. A-RAG: Scaling agentic retrieval-augmented generation via hierarchical retrieval interfaces, 2026

2026
[23]

You don’t need pre-built graphs for RAG: Retrieval augmented gen- eration with adaptive reasoning structures, 2025

Shengyuan Chen, Chuang Zhou, Zheng Yuan, Qing- gang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, and Xiao Huang. You don’t need pre-built graphs for RAG: Retrieval augmented gen- eration with adaptive reasoning structures, 2025

2025
[24]

Rea- soning in trees: Improving retrieval-augmented gen- eration for multi-hop question answering, 2026

Yuling Shi, Maolin Sun, Zijun Liu, Mo Yang, Yix- iong Fang, Tianran Sun, and Xiaodong Gu. Rea- soning in trees: Improving retrieval-augmented gen- eration for multi-hop question answering, 2026

2026
[25]

RAGRouter: Learning to route queries to multiple retrieval-augmented language models, 2025

Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, and Guihai Chen. RAGRouter: Learning to route queries to multiple retrieval-augmented language models, 2025

2025
[26]

RouteRAG: Efficient retrieval-augmented genera- tion from text and graph via reinforcement learning, 2025

Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin, Jiafeng Guo, and Xueqi Cheng. RouteRAG: Efficient retrieval-augmented genera- tion from text and graph via reinforcement learning, 2025

2025
[27]

InfoGain-RAG: Boosting retrieval-augmented generation via document infor- mation gain-based reranking and filtering, 2025

Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei, Yuqing Ding, and Han Li. InfoGain-RAG: Boosting retrieval-augmented generation via document infor- mation gain-based reranking and filtering, 2025

2025
[28]

ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning, 2025

Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aa- ditya Shukla, and Rama Akkiraju. ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning, 2025

2025
[29]

Out of style: RAG’s fragility to linguistic variation, 2025

Tianyu Cao, Neel Bhandari, Akhila Yerukola, Akari Asai, and Maarten Sap. Out of style: RAG’s fragility to linguistic variation, 2025

2025
[30]

A system- atic review of key retrieval-augmented generation 10 (RAG) systems: Progress, gaps, and future direc- tions, 2025

Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, and Arpan Biswas. A system- atic review of key retrieval-augmented generation 10 (RAG) systems: Progress, gaps, and future direc- tions, 2025

2025
[31]

Retrieval-augmented genera- tion: A comprehensive survey of architectures, en- hancements, and robustness frontiers, 2025

Chaitanya Sharma. Retrieval-augmented genera- tion: A comprehensive survey of architectures, en- hancements, and robustness frontiers, 2025

2025
[32]

Towards global retrieval aug- mented generation: A benchmark for corpus-level reasoning, 2025

Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, and Xipeng Qiu. Towards global retrieval aug- mented generation: A benchmark for corpus-level reasoning, 2025

2025
[33]

Smucker, James Allan, and Ben Carterette

Mark D. Smucker, James Allan, and Ben Carterette. A comparison of statistical significance tests for information retrieval evaluation. InProceedings of ACM CIKM, 2007

2007
[34]

Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979

Bradley Efron. Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979

1979
[35]

A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70, 1979

Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70, 1979

1979
[36]

Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B, 57(1):289–300, 1995

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B, 57(1):289–300, 1995

1995
[37]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2020

2020
[38]

Unsupervised dense informa- tion retrieval with contrastive learning.Transac- tions on Machine Learning Research (TMLR), 2022

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense informa- tion retrieval with contrastive learning.Transac- tions on Machine Learning Research (TMLR), 2022

2022
[39]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[40]

Sentence-BERT: SentenceembeddingsusingsiameseBERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: SentenceembeddingsusingsiameseBERT-networks. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2019

2019
[41]

Document ranking with a pre- trained sequence-to-sequence model

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. Document ranking with a pre- trained sequence-to-sequence model. InFindings of the ACL: EMNLP, 2020

2020
[42]

ColBERTv2: Effective and efficient retrieval via lightweight late interaction

Keshav Santhanam, Omar Khattab, Jon Saad- Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. InProceedings of the North American Chapter of the ACL (NAACL), 2022

2022
[43]

Weld, and Luke Zettlemoyer

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehen- sion. InProceedings of the Association for Compu- tational Linguistics (ACL), 2017

2017
[44]

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Yixuan Tang and Yi Yang. MultiHop-RAG: Bench- marking retrieval-augmented generation for multi- hop queries.arXiv preprint arXiv:2401.15391, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

BEIR: A heterogeneous benchmark for zero-shot evalua- tion of information retrieval models

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evalua- tion of information retrieval models. InNeurIPS Datasets and Benchmarks Track, 2021

2021
[46]

KILT: a benchmark for knowl- edge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. KILT: a benchmark for knowl- edge intensive language tasks. InProceedings of the North American Chapter of the ACL (NAACL), 2021

2021
[47]

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Mil- tiadis Allamanis, and Marc Brockschmidt. Code- SearchNet challenge: Evaluating the state of seman- tic code search.arXiv preprint arXiv:1909.09436, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[48]

Cumulated gain-based evaluation of IR techniques.ACM Trans- actions on Information Systems (TOIS), 20(4):422– 446, 2002

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques.ACM Trans- actions on Information Systems (TOIS), 20(4):422– 446, 2002

2002
[49]

Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vec- chione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

2021
[50]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and pro- jection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[51]

Estimating the dimension of a model.The Annals of Statistics, 6(2):461–464, 1978

Gideon Schwarz. Estimating the dimension of a model.The Annals of Statistics, 6(2):461–464, 1978

1978
[52]

Hamilton, Rex Ying, and Jure Leskovec

William L. Hamilton, Rex Ying, and Jure Leskovec. Inductiverepresentationlearningonlargegraphs. In Advances in Neural Information Processing Systems (NeurIPS), 2017. 11

2017
[53]

Hamil- ton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm

Petar Veličković, William Fedus, William L. Hamil- ton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. InInternational Conference on Learning Representations (ICLR), 2019

2019
[54]

Haveliwala

Taher H. Haveliwala. Topic-sensitive PageRank. In Proceedings of the International World Wide Web Conference (WWW), 2002

2002
[55]

jina-reranker- v3: Last but not late interaction for listwise docu- ment reranking, 2025

Feng Wang, Yuqing Li, and Han Xiao. jina-reranker- v3: Last but not late interaction for listwise docu- ment reranking, 2025

2025
[56]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judg- ing LLM-as-a-judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Process- ing Systems (NeurIPS), Datasets and Benchmarks Track, 2023

2023
[57]

FActScore: Fine-grained atomic evaluation of fac- tual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of fac- tual precision in long form text generation. InPro- ceedings of Empirical Methods in Natural Language Processing (EMNLP), 2023

2023
[58]

RAGAs: Automated evaluation of retrieval augmented generation

Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. InProceedings of the European Chapter of the ACL (EACL): System Demonstrations, 2024. 12 A Appendix: additional analyses Thisappendixreportstwoanalysesthatsupportthemaintextbutarenotcentraltoitsargument: acharacterization of w...

2024

[1] [1]

Retrieval-augmented generation for knowledge- intensive NLP tasks

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge- intensive NLP tasks. InAdvances in Neural In- formation Processing Systems (NeurIPS), 2020

2020

[2] [2]

REALM: Retrieval- augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. REALM: Retrieval- augmented language model pre-training. InInter- national Conference on Machine Learning (ICML), 2020

2020

[3] [3]

Leveraging passage retrieval with generative models for open domain question answering

Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering. InProceedings of the European Chapter of the ACL (EACL), 2021

2021

[4] [4]

Precise zero-shot dense retrieval without relevance labels

Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. InProceedings of the Association for Computational Linguistics (ACL), 2023

2023

[5] [5]

Chi, Quoc V

Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H. Chi, Quoc V. Le, and Denny Zhou. Take a step back: Evoking reasoning via abstraction in large language models. InInter- national Conference on Learning Representations (ICLR), 2024

2024

[6] [6]

Query2doc: Query expansion with large language models

Liang Wang, Nan Yang, and Furu Wei. Query2doc: Query expansion with large language models. In Proceedings of Empirical Methods in Natural Lan- guage Processing (EMNLP), 2023

2023

[7] [7]

Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree- organized retrieval. InInternational Conference on Learning Representations (ICLR), 2024

2024

[8] [8]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

HippoRAG: Neu- robiologically inspired long-term memory for large language models

Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neu- robiologically inspired long-term memory for large language models. InAdvances in Neural Informa- tion Processing Systems (NeurIPS), 2024

2024

[10] [10]

Corrective Retrieval Augmented Generation

Shi-Qi Yan, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884, 2024. 9

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Self-RAG: Learning to re- trieve, generate, and critique through self-reflection

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to re- trieve, generate, and critique through self-reflection. InInternational Conference on Learning Represen- tations (ICLR), 2024

2024

[12] [12]

Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig

Zhengbao Jiang, Frank F. Xu, Luyu Gao, Zhiqing Sun, Qian Liu, Jane Dwivedi-Yu, Yiming Yang, Jamie Callan, and Graham Neubig. Active retrieval augmented generation. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2023

2023

[13] [13]

SoyeongJeong, JinheonBaek, SukminCho, SungJu Hwang, and Jong C. Park. Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity. InProceedings of the North American Chapter of the ACL (NAACL), 2024

2024

[14] [14]

Cormack, Charles L

Gordon V. Cormack, Charles L. A. Clarke, and Ste- fan Buettcher. Reciprocal rank fusion outperforms Condorcet and individual rank learning methods. InProceedings of ACM SIGIR, 2009

2009

[15] [15]

Natural questions: A bench- mark for question answering research.Transactions of the Association for Computational Linguistics (TACL), 2019

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Al- berti, Danielle Epstein, Illia Polosukhin, Jacob De- vlin, Kenton Lee, et al. Natural questions: A bench- mark for question answering research.Transactions of the Association for Computational Linguistics (TACL), 2019

2019

[16] [16]

Cohen, Ruslan Salakhutdinov, and Christopher D

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Ben- gio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answer- ing. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2018

2018

[17] [17]

MuSiQue: Multi- hop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. MuSiQue: Multi- hop questions via single-hop question composition. Transactions of the Association for Computational Linguistics (TACL), 2022

2022

[18] [18]

Constructing a multi- hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sug- awara, and Akiko Aizawa. Constructing a multi- hop QA dataset for comprehensive evaluation of reasoning steps. InProceedings of the International Conference on Computational Linguistics (COL- ING), 2020

2020

[19] [19]

Smith, and Matt Gardner

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. InProceedings of the North American Chapter of the ACL (NAACL), 2021

2021

[20] [20]

Richard Yuanzhe Pang, Alicia Parrish, Nitish Joshi, Nikita Nangia, Jason Phang, Angelica Chen, Vishakh Padmakumar, Johnny Ma, Jana Thomp- son, He He, and Samuel R. Bowman. QuALITY: Question answering with long input texts, yes! In Proceedings of the North American Chapter of the ACL (NAACL), 2022

2022

[21] [21]

Cross-granularity hypergraph retrieval-augmented generation for multi-hop ques- tion answering, 2025

Changjian Wang, Weihong Deng, Weili Guan, Quan Lu, and Ning Jiang. Cross-granularity hypergraph retrieval-augmented generation for multi-hop ques- tion answering, 2025

2025

[22] [22]

A-RAG: Scaling agentic retrieval-augmented generation via hierarchical retrieval interfaces, 2026

Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, and Zhendong Mao. A-RAG: Scaling agentic retrieval-augmented generation via hierarchical retrieval interfaces, 2026

2026

[23] [23]

You don’t need pre-built graphs for RAG: Retrieval augmented gen- eration with adaptive reasoning structures, 2025

Shengyuan Chen, Chuang Zhou, Zheng Yuan, Qing- gang Zhang, Zeyang Cui, Hao Chen, Yilin Xiao, Jiannong Cao, and Xiao Huang. You don’t need pre-built graphs for RAG: Retrieval augmented gen- eration with adaptive reasoning structures, 2025

2025

[24] [24]

Rea- soning in trees: Improving retrieval-augmented gen- eration for multi-hop question answering, 2026

Yuling Shi, Maolin Sun, Zijun Liu, Mo Yang, Yix- iong Fang, Tianran Sun, and Xiaodong Gu. Rea- soning in trees: Improving retrieval-augmented gen- eration for multi-hop question answering, 2026

2026

[25] [25]

RAGRouter: Learning to route queries to multiple retrieval-augmented language models, 2025

Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, and Guihai Chen. RAGRouter: Learning to route queries to multiple retrieval-augmented language models, 2025

2025

[26] [26]

RouteRAG: Efficient retrieval-augmented genera- tion from text and graph via reinforcement learning, 2025

Yucan Guo, Miao Su, Saiping Guan, Zihao Sun, Xiaolong Jin, Jiafeng Guo, and Xueqi Cheng. RouteRAG: Efficient retrieval-augmented genera- tion from text and graph via reinforcement learning, 2025

2025

[27] [27]

InfoGain-RAG: Boosting retrieval-augmented generation via document infor- mation gain-based reranking and filtering, 2025

Zihan Wang, Zihan Liang, Zhou Shao, Yufei Ma, Huangyu Dai, Ben Chen, Lingtao Mao, Chenyi Lei, Yuqing Ding, and Han Li. InfoGain-RAG: Boosting retrieval-augmented generation via document infor- mation gain-based reranking and filtering, 2025

2025

[28] [28]

ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning, 2025

Shu Zhao, Tan Yu, Anbang Xu, Japinder Singh, Aa- ditya Shukla, and Rama Akkiraju. ParallelSearch: Train your LLMs to decompose query and search sub-queries in parallel with reinforcement learning, 2025

2025

[29] [29]

Out of style: RAG’s fragility to linguistic variation, 2025

Tianyu Cao, Neel Bhandari, Akhila Yerukola, Akari Asai, and Maarten Sap. Out of style: RAG’s fragility to linguistic variation, 2025

2025

[30] [30]

A system- atic review of key retrieval-augmented generation 10 (RAG) systems: Progress, gaps, and future direc- tions, 2025

Agada Joseph Oche, Ademola Glory Folashade, Tirthankar Ghosal, and Arpan Biswas. A system- atic review of key retrieval-augmented generation 10 (RAG) systems: Progress, gaps, and future direc- tions, 2025

2025

[31] [31]

Retrieval-augmented genera- tion: A comprehensive survey of architectures, en- hancements, and robustness frontiers, 2025

Chaitanya Sharma. Retrieval-augmented genera- tion: A comprehensive survey of architectures, en- hancements, and robustness frontiers, 2025

2025

[32] [32]

Towards global retrieval aug- mented generation: A benchmark for corpus-level reasoning, 2025

Qi Luo, Xiaonan Li, Tingshuo Fan, Xinchi Chen, and Xipeng Qiu. Towards global retrieval aug- mented generation: A benchmark for corpus-level reasoning, 2025

2025

[33] [33]

Smucker, James Allan, and Ben Carterette

Mark D. Smucker, James Allan, and Ben Carterette. A comparison of statistical significance tests for information retrieval evaluation. InProceedings of ACM CIKM, 2007

2007

[34] [34]

Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979

Bradley Efron. Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979

1979

[35] [35]

A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70, 1979

Sture Holm. A simple sequentially rejective multiple test procedure.Scandinavian Journal of Statistics, 6(2):65–70, 1979

1979

[36] [36]

Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B, 57(1):289–300, 1995

Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing.Journal of the Royal Statistical Society: Series B, 57(1):289–300, 1995

1995

[37] [37]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2020

2020

[38] [38]

Unsupervised dense informa- tion retrieval with contrastive learning.Transac- tions on Machine Learning Research (TMLR), 2022

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense informa- tion retrieval with contrastive learning.Transac- tions on Machine Learning Research (TMLR), 2022

2022

[39] [39]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation. arXiv preprint arXiv:2402.03216, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[40] [40]

Sentence-BERT: SentenceembeddingsusingsiameseBERT-networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: SentenceembeddingsusingsiameseBERT-networks. InProceedings of Empirical Methods in Natural Language Processing (EMNLP), 2019

2019

[41] [41]

Document ranking with a pre- trained sequence-to-sequence model

Rodrigo Nogueira, Zhiying Jiang, Ronak Pradeep, and Jimmy Lin. Document ranking with a pre- trained sequence-to-sequence model. InFindings of the ACL: EMNLP, 2020

2020

[42] [42]

ColBERTv2: Effective and efficient retrieval via lightweight late interaction

Keshav Santhanam, Omar Khattab, Jon Saad- Falcon, Christopher Potts, and Matei Zaharia. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. InProceedings of the North American Chapter of the ACL (NAACL), 2022

2022

[43] [43]

Weld, and Luke Zettlemoyer

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehen- sion. InProceedings of the Association for Compu- tational Linguistics (ACL), 2017

2017

[44] [44]

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries

Yixuan Tang and Yi Yang. MultiHop-RAG: Bench- marking retrieval-augmented generation for multi- hop queries.arXiv preprint arXiv:2401.15391, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[45] [45]

BEIR: A heterogeneous benchmark for zero-shot evalua- tion of information retrieval models

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evalua- tion of information retrieval models. InNeurIPS Datasets and Benchmarks Track, 2021

2021

[46] [46]

KILT: a benchmark for knowl- edge intensive language tasks

Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vladimir Karpukhin, Jean Maillard, et al. KILT: a benchmark for knowl- edge intensive language tasks. InProceedings of the North American Chapter of the ACL (NAACL), 2021

2021

[47] [47]

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Mil- tiadis Allamanis, and Marc Brockschmidt. Code- SearchNet challenge: Evaluating the state of seman- tic code search.arXiv preprint arXiv:1909.09436, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[48] [48]

Cumulated gain-based evaluation of IR techniques.ACM Trans- actions on Information Systems (TOIS), 20(4):422– 446, 2002

Kalervo Järvelin and Jaana Kekäläinen. Cumulated gain-based evaluation of IR techniques.ACM Trans- actions on Information Systems (TOIS), 20(4):422– 446, 2002

2002

[49] [49]

Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vec- chione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86– 92, 2021

2021

[50] [50]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and pro- jection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[51] [51]

Estimating the dimension of a model.The Annals of Statistics, 6(2):461–464, 1978

Gideon Schwarz. Estimating the dimension of a model.The Annals of Statistics, 6(2):461–464, 1978

1978

[52] [52]

Hamilton, Rex Ying, and Jure Leskovec

William L. Hamilton, Rex Ying, and Jure Leskovec. Inductiverepresentationlearningonlargegraphs. In Advances in Neural Information Processing Systems (NeurIPS), 2017. 11

2017

[53] [53]

Hamil- ton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm

Petar Veličković, William Fedus, William L. Hamil- ton, Pietro Liò, Yoshua Bengio, and R Devon Hjelm. Deep graph infomax. InInternational Conference on Learning Representations (ICLR), 2019

2019

[54] [54]

Haveliwala

Taher H. Haveliwala. Topic-sensitive PageRank. In Proceedings of the International World Wide Web Conference (WWW), 2002

2002

[55] [55]

jina-reranker- v3: Last but not late interaction for listwise docu- ment reranking, 2025

Feng Wang, Yuqing Li, and Han Xiao. jina-reranker- v3: Last but not late interaction for listwise docu- ment reranking, 2025

2025

[56] [56]

Xing, Hao Zhang, Joseph E

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judg- ing LLM-as-a-judge with MT-Bench and Chatbot Arena. InAdvances in Neural Information Process- ing Systems (NeurIPS), Datasets and Benchmarks Track, 2023

2023

[57] [57]

FActScore: Fine-grained atomic evaluation of fac- tual precision in long form text generation

Sewon Min, Kalpesh Krishna, Xinxi Lyu, Mike Lewis, Wen-tau Yih, Pang Wei Koh, Mohit Iyyer, Luke Zettlemoyer, and Hannaneh Hajishirzi. FActScore: Fine-grained atomic evaluation of fac- tual precision in long form text generation. InPro- ceedings of Empirical Methods in Natural Language Processing (EMNLP), 2023

2023

[58] [58]

RAGAs: Automated evaluation of retrieval augmented generation

Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAs: Automated evaluation of retrieval augmented generation. InProceedings of the European Chapter of the ACL (EACL): System Demonstrations, 2024. 12 A Appendix: additional analyses Thisappendixreportstwoanalysesthatsupportthemaintextbutarenotcentraltoitsargument: acharacterization of w...

2024