pith. machine review for the scientific record.

arxiv: 2104.08663 · v4 · submitted 2021-04-17 · 💻 cs.IR · cs.AI · cs.CL

Recognition: 1 theorem link

· Lean Theorem

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Abhishek Srivastava, Andreas Rücklé, Iryna Gurevych, Nandan Thakur, Nils Reimers

Pith reviewed 2026-05-12 14:35 UTC · model grok-4.3

classification 💻 cs.IR cs.AI cs.CL
keywords information retrieval · zero-shot evaluation · benchmark · out-of-distribution generalization · neural retrieval models · BM25 · re-ranking
0 comments

The pith

A benchmark with 18 diverse datasets shows BM25 as a robust zero-shot baseline while re-ranking models lead in performance at higher cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a collection of 18 existing datasets spanning many different retrieval tasks and domains to test how retrieval systems behave when applied to new data without retraining. Ten different systems are run on all of them, ranging from a simple word-matching approach to complex neural models that use dense vectors or re-ranking steps. The results indicate that the basic word-matching method remains reliable across these unfamiliar settings, while certain neural methods reach higher accuracy but require far more computation. Dense-vector and sparse-vector methods run quickly yet frequently fall short of the others in these out-of-distribution cases. The benchmark is released publicly so others can measure and improve the generalization of future retrieval systems.
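
For readers who want to reproduce this kind of comparison, the sketch below shows a minimal zero-shot run with the publicly released BEIR toolkit (https://github.com/UKPLab/beir): download one of the 18 datasets, retrieve with a dense model trained on MS MARCO, and score the ranking. The module paths, dataset URL, and model name follow the repository's quickstart and are assumptions here; they may differ across toolkit versions.

```python
# Minimal zero-shot evaluation sketch following the BEIR repository's quickstart.
# Module paths, the download URL, and the model name are assumed from the public
# repo and may differ between versions.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download one of the 18 BEIR datasets (SciFact) and load its test split.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap a dense bi-encoder trained on MS MARCO and retrieve without any fine-tuning
# on the target domain, i.e. the zero-shot setting the benchmark is built to test.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# Report the benchmark's headline metrics, nDCG@10 and Recall@100 among them.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, recall)
```

Swapping the dense model for BM25 or adding a re-ranking stage follows the same pattern, which is what makes the ten-system comparison described above possible on every dataset in the collection.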

Core claim

BEIR aggregates 18 publicly available datasets from varied text retrieval tasks and domains and evaluates ten retrieval systems spanning lexical, sparse, dense, late-interaction, and re-ranking architectures in zero-shot mode. The evaluation establishes that BM25 is a robust baseline, that re-ranking and late-interaction models achieve the highest average performance though at high computational cost, and that dense and sparse models are efficient but often underperform, leaving clear room for better generalization.

What carries the argument

BEIR benchmark, a curated set of 18 heterogeneous public datasets used to measure out-of-distribution zero-shot performance of retrieval models.

If this is right

  • Systematic zero-shot comparisons become possible for any new retrieval model using the public datasets.
  • Model selection must weigh accuracy gains against the computational expense of re-ranking steps.
  • Dense and sparse retrieval approaches require further work to close the observed gap in generalization.
  • Progress toward robust retrieval systems can be tracked by repeated evaluation on the same fixed collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may seek hybrid systems that retain the efficiency of dense models while adding elements that improve cross-domain stability.
  • The benchmark could be reused to test whether training procedures that explicitly target out-of-distribution robustness close the performance differences.
  • Similar multi-dataset evaluation setups could be applied to related tasks such as passage ranking within question-answering pipelines.

Load-bearing premise

The 18 chosen datasets supply a sufficiently broad and representative sample of real-world out-of-distribution retrieval situations.

What would settle it

Running the ten systems on a fresh collection of retrieval datasets drawn from domains absent from the original 18 and finding that dense or sparse models now match or exceed the average performance of re-ranking models would contradict the reported pattern.

read the original abstract

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards better robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the BEIR benchmark for zero-shot evaluation of information retrieval models. It consists of 18 carefully selected publicly available datasets from diverse tasks and domains. The authors evaluate 10 state-of-the-art models spanning lexical, sparse, dense, late-interaction, and re-ranking architectures. Key findings include that BM25 serves as a robust baseline, re-ranking and late-interaction models achieve the highest average zero-shot performance but incur high computational costs, whereas dense and sparse retrieval models are more efficient yet frequently underperform, pointing to substantial room for enhancing their generalization abilities.

Significance. This benchmark addresses a critical gap in evaluating OOD generalization in IR, which has been limited by homogeneous settings. By providing a heterogeneous testbed and comprehensive evaluations, it can significantly influence future research towards more robust models. The public release at the GitHub link facilitates reproducibility and community use. The empirical nature with public datasets and standard metrics adds to its value, though the strength depends on the benchmark's representativeness.

major comments (2)
  1. [Dataset selection section] Dataset selection section: The manuscript refers to a 'careful selection' of the 18 datasets to achieve heterogeneity across tasks and domains but provides no explicit coverage criteria, sampling strategy, or analysis of potential composition biases (e.g., over-representation of domains where lexical overlap is strong). This directly affects the load-bearing claim that dense and sparse models 'often underperform' and have 'considerable room for improvement in their generalization capabilities,' as these conclusions rest on averages over the chosen set.
  2. [Results section (performance tables)] Results section (performance tables): While aggregate averages are presented, the manuscript does not report per-dataset breakdowns with variance measures or statistical tests for the performance gaps. This weakens the interpretation of 'often underperform' if a small number of datasets disproportionately influence the mean.
minor comments (2)
  1. [Title] Title: 'Heterogenous' is a misspelling and should read 'Heterogeneous'.
  2. [Abstract and introduction] Abstract and introduction: The description of model families and metrics (e.g., nDCG@10) is clear but could include a one-sentence note on the exact evaluation protocol for zero-shot transfer to aid quick comprehension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and robustness.

read point-by-point responses
  1. Referee: [Dataset selection section] Dataset selection section: The manuscript refers to a 'careful selection' of the 18 datasets to achieve heterogeneity across tasks and domains but provides no explicit coverage criteria, sampling strategy, or analysis of potential composition biases (e.g., over-representation of domains where lexical overlap is strong). This directly affects the load-bearing claim that dense and sparse models 'often underperform' and have 'considerable room for improvement in their generalization capabilities,' as these conclusions rest on averages over the chosen set.

    Authors: We appreciate the referee pointing this out. The datasets were chosen to span a wide range of retrieval tasks (e.g., ad-hoc search, QA, fact-checking, argument retrieval) and domains (e.g., Wikipedia, news, biomedical, scientific papers, tweets, forums) while ensuring public availability and standard evaluation protocols. However, we acknowledge that the original manuscript did not explicitly list the selection criteria or analyze potential biases such as lexical overlap. In the revised version, we will add a new subsection detailing the explicit criteria (task diversity, domain coverage, dataset scale, and annotation quality), include a summary table of task/domain characteristics, and provide a short discussion of possible composition biases with examples of how lexical vs. semantic matching varies across datasets. This will better ground the generalization claims. revision: yes

  2. Referee: [Results section (performance tables)] Results section (performance tables): While aggregate averages are presented, the manuscript does not report per-dataset breakdowns with variance measures or statistical tests for the performance gaps. This weakens the interpretation of 'often underperform' if a small number of datasets disproportionately influence the mean.

    Authors: We agree that statistical support would strengthen the interpretation. The original manuscript already contains a full per-dataset breakdown in Table 2 (nDCG@10 for every model on all 18 datasets), which allows readers to verify that underperformance of dense/sparse models is not driven by a few outliers but holds across the majority of datasets. To further address the concern, we will add in the revision: (i) a brief analysis of performance distribution and identification of any influential datasets, (ii) pairwise statistical significance tests (Wilcoxon signed-rank) on the average scores between model categories, and (iii) a note that, as these are single deterministic runs on fixed test sets, traditional variance is not applicable, but bootstrapped confidence intervals can be reported for the averages. These additions will make the 'often underperform' statement more rigorously supported. revision: yes
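
To make the proposed additions concrete, here is a minimal sketch (not taken from the paper or the rebuttal) of a Wilcoxon signed-rank test over paired per-dataset nDCG@10 scores for two model categories, plus a bootstrapped confidence interval for the mean difference. The score arrays are hypothetical placeholders, not BEIR numbers.

```python
# Hedged sketch of the rebuttal's proposed significance analysis: a paired,
# non-parametric Wilcoxon signed-rank test across the 18 datasets, plus a
# bootstrapped 95% CI for the mean paired difference. Scores are synthetic.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
dense_ndcg = rng.uniform(0.2, 0.6, size=18)        # hypothetical per-dataset scores
rerank_ndcg = dense_ndcg + rng.uniform(0.0, 0.1, size=18)

stat, p_value = wilcoxon(rerank_ndcg, dense_ndcg)   # paired signed-rank test
print(f"Wilcoxon W={stat:.1f}, p={p_value:.4f}")

# Bootstrap a 95% confidence interval for the mean paired difference.
diffs = rerank_ndcg - dense_ndcg
boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean diff={diffs.mean():.3f}, 95% CI=({low:.3f}, {high:.3f})")
```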

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no internal derivations or self-referential reductions

full rationale

The paper consists entirely of an empirical study: it selects 18 external public datasets, runs 10 retrieval systems on them, and reports observed performance numbers. No equations, fitted parameters, or predictions are defined inside the paper that later reduce to those same quantities by construction. Central claims (BM25 robustness, relative performance of model classes) are direct aggregates of the external evaluation results rather than tautological restatements. No load-bearing self-citation steps, uniqueness theorems, or ansatzes appear in the derivation chain. The work is therefore grounded entirely in external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies entirely on existing public datasets and previously published retrieval models without introducing new free parameters, axioms beyond standard IR evaluation practice, or invented entities.

axioms (1)
  • domain assumption: Standard IR metrics such as nDCG@10 and Recall@100 are appropriate measures for zero-shot generalization assessment.
    Implicit in the choice of reported results across all models and datasets.
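
As a worked illustration of that assumption, the sketch below computes the two named metrics for a single query from a ranked list and binary relevance judgments. It is a simplified stand-in: BEIR itself reports these metrics via pytrec_eval, and some of its datasets use graded rather than binary relevance.

```python
# Simplified binary-gain versions of nDCG@k and Recall@k for one query.
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k: DCG of the top-k ranking divided by the ideal DCG."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_doc_ids, qrels, k=100):
    """Fraction of relevant documents that appear in the top-k ranking."""
    relevant = {doc_id for doc_id, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_doc_ids[:k])) / len(relevant)

# Hypothetical example: two relevant documents, both retrieved, one at rank 1.
qrels = {"d1": 1, "d7": 1}
ranking = ["d1", "d3", "d5", "d7"]
print(ndcg_at_k(ranking, qrels, k=10))    # ~0.877
print(recall_at_k(ranking, qrels, k=100)) # 1.0
```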

pith-pipeline@v0.9.0 · 5527 in / 1180 out tokens · 58504 ms · 2026-05-12T14:35:26.616158+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  2. Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...

  3. DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

    cs.IR 2026-05 unverdicted novelty 7.0

    DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.

  4. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  5. HackerSignal: A Large-Scale Multi-Source Dataset Linking Hacker Community Discourse to the CVE Vulnerability Lifecycle

    cs.CR 2026-05 unverdicted novelty 7.0

    HackerSignal aggregates 7.45M documents from hacker communities, exploit databases, vulnerability reports, and fixes into a public benchmark for temporal OOD CVE linkage and exploit classification.

  6. UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.

  7. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  8. ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

    cs.IR 2026-04 conditional novelty 7.0

    ResRank unifies retrieval and listwise reranking by compressing passages to one token each, using residual connections and cosine-similarity scoring, achieving competitive effectiveness on TREC DL and BEIR benchmarks ...

  9. TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications

    cs.LG 2026-04 unverdicted novelty 7.0

    TeleEmbedBench is the first multi-corpus benchmark showing LLM-based embedding models significantly outperform traditional sentence-transformers on telecommunications specifications and code for retrieval accuracy and...

  10. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  11. MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

    cs.IR 2026-05 unverdicted novelty 6.0

    MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...

  12. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  13. Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.

  14. Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

  15. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with linear complexity on rationale-based retrieval.

  16. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient independent encoder using JEPA and auxiliary reverse KL loss to achieve linear complexity and strong performance on rationale-based retrieval tasks.

  17. ARHN: Answer-Centric Relabeling of Hard Negatives with Open-Source LLMs for Dense Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    ARHN refines hard-negative training data for dense retrieval by using LLMs to convert answer-containing passages into additional positives and exclude answer-containing passages from the negative set.

  18. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  19. Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

    cs.IR 2026-04 unverdicted novelty 6.0

    Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.

  20. Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead

    cs.IR 2026-04 accept novelty 6.0

    Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and co...

  21. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  22. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  23. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR 2021-12 unverdicted novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  24. Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

    cs.IR 2026-05 unverdicted novelty 5.0

    SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.

  25. AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

    cs.AI 2026-05 unverdicted novelty 5.0

    AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.

  26. Efficient Listwise Reranking with Compressed Document Representations

    cs.IR 2026-04 unverdicted novelty 5.0

    RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.

  27. Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

    cs.IR 2026-04 accept novelty 5.0

    ConstBERT and ColBERT-v2 reproduce on MS-MARCO but drop 86-97% on long queries because MaxSim cannot filter filler noise, and extra fine-tuning or backend changes do not overcome the architectural constraint.

  28. Dynamic Ranked List Truncation for Reranking Pipelines via LLM-generated Reference-Documents

    cs.IR 2026-04 unverdicted novelty 5.0

    LLM-generated reference documents enable dynamic ranked list truncation and adaptive batching for listwise reranking, outperforming prior RLT methods and accelerating processing by up to 66% on TREC benchmarks.

  29. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 4.0

    Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.

  30. A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

    cs.CL 2026-04 unverdicted novelty 4.0

    Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.

  31. Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval

    cs.IR 2026-04 conditional novelty 3.0

    Reproducibility study confirms Hypencoder's non-linear query-specific scoring improves retrieval over bi-encoders on standard benchmarks but standard methods remain faster and hard-task results are mixed due to implem...

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 29 Pith papers · 2 internal anchors

  1. [1]

    Amin Ahmad, Noah Constant, Yinfei Yang, and Daniel Cer. 2019. ReQA: An Evaluation for End-to-End Answer Retrieval Models. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 137–146, Hong Kong, China. Association for Computational Linguistics. 18

  2. [2]

    Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi

  3. [3]

    XOR QA: Cross-lingual Open-Retrieval Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 547–564, Online. Association for Computational Linguistics. 17

  4. [4]

    Petr Baudiš and Jan Šedivý. 2015. Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer. 6

  5. [5]

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics. 6

  6. [6]

    Adam Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal. 2000. Bridging the lexical chasm: statistical approaches to answer-finding. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 192–199. 1, 3

  7. [7]

    Alexander Bondarenko, Maik Fröbe, Meriem Beloucif, Lukas Gienapp, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, and Matthias Hagen. 2020. Overview of Touché 2020: Argument Retrieval. In Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings. 4, 19, 22

  8. [8]

    Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A full-text learning to rank dataset for medical information retrieval. In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016), pages 716–722. 4, 18

  9. [9]

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics. 1, 18

  10. [10]

    Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics. 4, 19

  11. [11]

    David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. 2016. What do a Million News Articles Look like? In Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval (ECIR 2016), pages 42–47. 18

  12. [12]

    Zhuyun Dai and Jamie Callan. 2020. Context-Aware Term Weighting For First Stage Passage Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 1533–1536, New York, NY, USA. Association for Computing Machinery. 3, 6

  13. [13]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 1

  14. [14]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Min...

  15. [15]

    Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. 2020. CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims. 4, 20

  16. [16]

    Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2020. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. 1, 17

  17. [17]

    Anlei Dong, Ruiqiang Zhang, Pranam Kolari, Jing Bai, Fernando Diaz, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. 2010. Time is of the Essence: Improving Recency Ranking Using Twitter Data. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, page 331–340, New York, NY, USA. Association for Computing Machinery. 17

  18. [18]

    Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan

  19. [19]

    Complementing Lexical Retrieval with Semantic Residual Embedding. 3, 17

  20. [20]

    Daniel Gillick, Alessandro Presta, and Gaurav Singh Tomar. 2018. End-to-End Retrieval in Continuous Space. 3

  21. [21]

    Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, and Noah Constant. 2020. MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models. 2, 3

  22. [22]

    Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA USA. 5

  23. [23]

    Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. 2017. DBpedia-Entity V2: A Test Collection for Entity Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 1265–1268. ACM. 4, 8, 19

  24. [24]

    Jerry L. Hintze and Ray D. Nelson. 1998. Violin Plots: A Box Plot-Density Trace Synergism. The American Statistician, 52(2):181–184. 8, 24

  25. [25]

    Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury

  26. [26]

    Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proc. of SIGIR. 6

  27. [27]

    Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. 2021. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. 6, 21

  28. [28]

    Doris Hoogeveen, Karin M Verspoor, and Timothy Baldwin. 2015. CQADupStack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian document computing symposium, pages 1–8. 4, 19

  29. [29]

    Sergey Ioffe. 2010. Improved consistent sampling, weighted minhash and l1 sketching. In 2010 IEEE International Conference on Data Mining, pages 246–255. IEEE. 4, 20

  30. [30]

    Ming Ji, Yizhou Sun, Marina Danilevsky, Jiawei Han, and Jing Gao. 2010. Graph Regularized Transductive Classification on Heterogeneous Information Networks. In Machine Learning and Knowledge Discovery in Databases, pages 570–586, Berlin, Heidelberg. Springer Berlin Heidelberg. 19

  31. [31]

    Jing Jiang and ChengXiang Zhai. 2007. An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval, 10(4-5):341–363. 18

  32. [32]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734. 6

  33. [33]

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics. 6

  34. [34]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics. 1, 3, 4, 6

  35. [35]

    Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 39–48, New York, NY, USA. Association for Computing Machinery. 3, 4, 6, 8

  36. [36]

    Jon M. Kleinberg. 1999. Authoritative Sources in a Hyperlinked Environment. J. ACM, 46(5):604–632. 17

  37. [37]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Research. Tr...

  38. [38]

    Davis Liang, Peng Xu, Siamak Shakeri, Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Embedding-based Zero-shot Retrieval through Query Generation. 3

  39. [39]

    Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward reproducible baselines: The open-source IR reproducibility challenge. In European Conference on Information Retrieval, pages 408–420. Springer. 5

  40. [40]

    Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained Transformers for Text Ranking: BERT and Beyond. 1, 3

  41. [41]

    Aldo Lipani. 2016. Fairness in Information Retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, page 1171, New York, NY, USA. Association for Computing Machinery. 9

  42. [42]

    Aldo Lipani. 2019. On Biases in Information retrieval models and evaluation. Ph.D. thesis, Technische Universität Wien. 8

  43. [43]

    Aldo Lipani, Mihai Lupu, and Allan Hanbury. 2016. The Curious Incidence of Bias Corrections in the Pool. In European Conference on Information Retrieval, pages 267–279. Springer. 9

  44. [44]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 6

  45. [45]

    Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2021. Sparse, Dense, and Attentional Representations for Text Retrieval. 3

  46. [46]

    Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2021. Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation. 3

  47. [47]

    Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW’18 Open Challenge: Financial Opinion Mining and Question Answering. In Companion Proceedings of the The Web Conference 2018, WWW ’18, page 1941–1942, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Stee...

  48. [48]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. choice, 2640:660. 1, 3, 4, 6, 17

  49. [49]

    Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085. 1, 3, 17

  50. [50]

    Rodrigo Nogueira, Jimmy Lin, and AI Epistemic. 2019. From doc2query to docTTTTTquery. Online preprint. 6

  51. [51]

    Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. 3

  52. [52]

    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab. Previous number = SIDL-WP-1999-0120. 17

  53. [53]

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel

  54. [54]

    KILT: a Benchmark for Knowledge Intensive Language Tasks. 2

  55. [55]

    Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How Does Clickthrough Data Reflect Retrieval Quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, page 43–52, New York, NY, USA. Association for Computing Machinery. 17

  56. [56]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67. 6

  57. [57]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 3, 4

  58. [58]

    Nils Reimers and Iryna Gurevych. 2020. The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes. arXiv preprint arXiv:2012.14210. 2

  59. [59]

    Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4):333–389. 1, 3, 5

  60. [60]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 6

  61. [61]

    Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4430–4441, Florence, Italy. Association for Computational Linguistics. 3

  62. [62]

    Ian Soboroff, Shudong Huang, and Donna Harman. 2019. TREC 2019 News Track Overview. In TREC. 4, 18

  63. [63]

    Axel Suarez, Dyaa Albakour, David Corney, Miguel Martinez, and Jose Esquivel. 2018. A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles. In 40th European Conference on Information Retrieval Research (ECIR 2018), Grenoble, France, March, 2018., pages 780–786. 4, 9, 18

  64. [64]

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisi...

  65. [65]

    George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):138. 4, 8, 18

  66. [66]

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2019. Representation Learning with Contrastive Predictive Coding. 21

  67. [67]

    Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An Extremely Fast Python Interface to trec_eval. In SIGIR. ACM. 5

  68. [68]

    Ellen Voorhees. 2005. Overview of the TREC 2004 Robust Retrieval Track. 4, 19

  69. [69]

    Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. SIGIR Forum, 54(1). 2, 4, 9, 18

  70. [70]

    Henning Wachsmuth, Martin Potthast, Khalid Al-Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an Argument Search Engine for the Web. In 4th Workshop on Argument Mining (ArgMining 2017) at EMNLP, pages 49–59. Association for Computational Linguistics. 19

  71. [71]

    Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the Best Counterargument without Prior Topic Knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251. Association for Computational Linguistics. 4, 19

  72. [72]

    David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, Online. Association for Computational Linguistics. 4, 20

  73. [73]

    Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex Wade, Kuansan Wang, Nancy Xin Ru Wang, Chris Wilhelm, Boya Xie, Douglas Ra...

  74. [74]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems, volume 33, pages 5776–5788. Curran Associates, Inc. 6

  75. [75]

    Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG ranking measures. In Proceedings of the 26th annual conference on learning theory (COLT 2013), volume 8, page 6. 5

  76. [76]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-t...

  77. [77]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. 6, 9

  78. [78]

    Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, page 1253–1256, New York, NY, USA. Association for Computing Machinery. 4

  79. [79]

    Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, et al. 2020. Multilingual Universal Sentence Encoder for Semantic Retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94. 4

  80. [80]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational L...

Showing first 80 references.