pith. machine review for the scientific record.

arxiv: 2104.08663 · v4 · submitted 2021-04-17 · 💻 cs.IR · cs.AI · cs.CL

Recognition: 1 theorem link

· Lean Theorem

BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

Abhishek Srivastava, Andreas Rücklé, Iryna Gurevych, Nandan Thakur, Nils Reimers

Pith reviewed 2026-05-12 14:35 UTC · model grok-4.3

classification 💻 cs.IR cs.AI cs.CL
keywords information retrieval · zero-shot evaluation · benchmark · out-of-distribution generalization · neural retrieval models · BM25 · re-ranking
0 comments

The pith

A benchmark with 18 diverse datasets shows BM25 as a robust zero-shot baseline while re-ranking models lead in performance at higher cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates a collection of 18 existing datasets spanning many different retrieval tasks and domains to test how retrieval systems behave when applied to new data without retraining. Ten different systems are run on all of them, ranging from a simple word-matching approach to complex neural models that use dense vectors or re-ranking steps. The results indicate that the basic word-matching method remains reliable across these unfamiliar settings, while certain neural methods reach higher accuracy but require far more computation. Dense-vector and sparse-vector methods run quickly yet frequently fall short of the others in these out-of-distribution cases. The benchmark is released publicly so others can measure and improve the generalization of future retrieval systems.
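
For readers who want to reproduce this kind of comparison, the sketch below shows a minimal zero-shot run with the publicly released BEIR toolkit (https://github.com/UKPLab/beir): download one of the 18 datasets, retrieve with a dense model trained on MS MARCO, and score the ranking. The module paths, dataset URL, and model name follow the repository's quickstart and are assumptions here; they may differ across toolkit versions.

```python
# Minimal zero-shot evaluation sketch following the BEIR repository's quickstart.
# Module paths, the download URL, and the model name are assumed from the public
# repo and may differ between versions.
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download one of the 18 BEIR datasets (SciFact) and load its test split.
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap a dense bi-encoder trained on MS MARCO and retrieve without any fine-tuning
# on the target domain, i.e. the zero-shot setting the benchmark is built to test.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-v3"), batch_size=128)
retriever = EvaluateRetrieval(model, score_function="cos_sim")
results = retriever.retrieve(corpus, queries)

# Report the benchmark's headline metrics, nDCG@10 and Recall@100 among them.
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg, recall)
```

Swapping the dense model for BM25 or adding a re-ranking stage follows the same pattern, which is what makes the ten-system comparison described above possible on every dataset in the collection.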

Core claim

BEIR aggregates 18 publicly available datasets from varied text retrieval tasks and domains and evaluates ten retrieval systems spanning lexical, sparse, dense, late-interaction, and re-ranking architectures in zero-shot mode. The evaluation establishes that BM25 is a robust baseline, that re-ranking and late-interaction models achieve the highest average performance though at high computational cost, and that dense and sparse models are efficient but often underperform, leaving clear room for better generalization.

What carries the argument

BEIR benchmark, a curated set of 18 heterogeneous public datasets used to measure out-of-distribution zero-shot performance of retrieval models.

If this is right

  • Systematic zero-shot comparisons become possible for any new retrieval model using the public datasets.
  • Model selection must weigh accuracy gains against the computational expense of re-ranking steps.
  • Dense and sparse retrieval approaches require further work to close the observed gap in generalization.
  • Progress toward robust retrieval systems can be tracked by repeated evaluation on the same fixed collection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers may seek hybrid systems that retain the efficiency of dense models while adding elements that improve cross-domain stability.
  • The benchmark could be reused to test whether training procedures that explicitly target out-of-distribution robustness close the performance differences.
  • Similar multi-dataset evaluation setups could be applied to related tasks such as passage ranking within question-answering pipelines.

Load-bearing premise

The 18 chosen datasets supply a sufficiently broad and representative sample of real-world out-of-distribution retrieval situations.

What would settle it

Running the ten systems on a fresh collection of retrieval datasets drawn from domains absent from the original 18 and finding that dense or sparse models now match or exceed the average performance of re-ranking models would contradict the reported pattern.

read the original abstract

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards better robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the BEIR benchmark for zero-shot evaluation of information retrieval models. It consists of 18 carefully selected publicly available datasets from diverse tasks and domains. The authors evaluate 10 state-of-the-art models spanning lexical, sparse, dense, late-interaction, and re-ranking architectures. Key findings include that BM25 serves as a robust baseline, re-ranking and late-interaction models achieve the highest average zero-shot performance but incur high computational costs, whereas dense and sparse retrieval models are more efficient yet frequently underperform, pointing to substantial room for enhancing their generalization abilities.

Significance. This benchmark addresses a critical gap in evaluating OOD generalization in IR, which has been limited by homogeneous settings. By providing a heterogeneous testbed and comprehensive evaluations, it can significantly influence future research towards more robust models. The public release at the GitHub link facilitates reproducibility and community use. The empirical nature with public datasets and standard metrics adds to its value, though the strength depends on the benchmark's representativeness.

major comments (2)
  1. [Dataset selection section] Dataset selection section: The manuscript refers to a 'careful selection' of the 18 datasets to achieve heterogeneity across tasks and domains but provides no explicit coverage criteria, sampling strategy, or analysis of potential composition biases (e.g., over-representation of domains where lexical overlap is strong). This directly affects the load-bearing claim that dense and sparse models 'often underperform' and have 'considerable room for improvement in their generalization capabilities,' as these conclusions rest on averages over the chosen set.
  2. [Results section (performance tables)] Results section (performance tables): While aggregate averages are presented, the manuscript does not report per-dataset breakdowns with variance measures or statistical tests for the performance gaps. This weakens the interpretation of 'often underperform' if a small number of datasets disproportionately influence the mean.
minor comments (2)
  1. [Title] Title: 'Heterogenous' is a misspelling and should read 'Heterogeneous'.
  2. [Abstract and introduction] Abstract and introduction: The description of model families and metrics (e.g., nDCG@10) is clear but could include a one-sentence note on the exact evaluation protocol for zero-shot transfer to aid quick comprehension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will update the manuscript accordingly to improve clarity and robustness.

read point-by-point responses
  1. Referee: [Dataset selection section] Dataset selection section: The manuscript refers to a 'careful selection' of the 18 datasets to achieve heterogeneity across tasks and domains but provides no explicit coverage criteria, sampling strategy, or analysis of potential composition biases (e.g., over-representation of domains where lexical overlap is strong). This directly affects the load-bearing claim that dense and sparse models 'often underperform' and have 'considerable room for improvement in their generalization capabilities,' as these conclusions rest on averages over the chosen set.

    Authors: We appreciate the referee pointing this out. The datasets were chosen to span a wide range of retrieval tasks (e.g., ad-hoc search, QA, fact-checking, argument retrieval) and domains (e.g., Wikipedia, news, biomedical, scientific papers, tweets, forums) while ensuring public availability and standard evaluation protocols. However, we acknowledge that the original manuscript did not explicitly list the selection criteria or analyze potential biases such as lexical overlap. In the revised version, we will add a new subsection detailing the explicit criteria (task diversity, domain coverage, dataset scale, and annotation quality), include a summary table of task/domain characteristics, and provide a short discussion of possible composition biases with examples of how lexical vs. semantic matching varies across datasets. This will better ground the generalization claims. revision: yes

  2. Referee: [Results section (performance tables)] Results section (performance tables): While aggregate averages are presented, the manuscript does not report per-dataset breakdowns with variance measures or statistical tests for the performance gaps. This weakens the interpretation of 'often underperform' if a small number of datasets disproportionately influence the mean.

    Authors: We agree that statistical support would strengthen the interpretation. The original manuscript already contains a full per-dataset breakdown in Table 2 (nDCG@10 for every model on all 18 datasets), which allows readers to verify that underperformance of dense/sparse models is not driven by a few outliers but holds across the majority of datasets. To further address the concern, we will add in the revision: (i) a brief analysis of performance distribution and identification of any influential datasets, (ii) pairwise statistical significance tests (Wilcoxon signed-rank) on the average scores between model categories, and (iii) a note that, as these are single deterministic runs on fixed test sets, traditional variance is not applicable, but bootstrapped confidence intervals can be reported for the averages. These additions will make the 'often underperform' statement more rigorously supported. revision: yes
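
To make the proposed additions concrete, here is a minimal sketch (not taken from the paper or the rebuttal) of a Wilcoxon signed-rank test over paired per-dataset nDCG@10 scores for two model categories, plus a bootstrapped confidence interval for the mean difference. The score arrays are hypothetical placeholders, not BEIR numbers.

```python
# Hedged sketch of the rebuttal's proposed significance analysis: a paired,
# non-parametric Wilcoxon signed-rank test across the 18 datasets, plus a
# bootstrapped 95% CI for the mean paired difference. Scores are synthetic.
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
dense_ndcg = rng.uniform(0.2, 0.6, size=18)        # hypothetical per-dataset scores
rerank_ndcg = dense_ndcg + rng.uniform(0.0, 0.1, size=18)

stat, p_value = wilcoxon(rerank_ndcg, dense_ndcg)   # paired signed-rank test
print(f"Wilcoxon W={stat:.1f}, p={p_value:.4f}")

# Bootstrap a 95% confidence interval for the mean paired difference.
diffs = rerank_ndcg - dense_ndcg
boot_means = [rng.choice(diffs, size=len(diffs), replace=True).mean()
              for _ in range(10_000)]
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean diff={diffs.mean():.3f}, 95% CI=({low:.3f}, {high:.3f})")
```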

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no internal derivations or self-referential reductions

full rationale

The paper consists entirely of an empirical study: it selects 18 external public datasets, runs 10 retrieval systems on them, and reports observed performance numbers. No equations, fitted parameters, or predictions are defined inside the paper that later reduce to those same quantities by construction. Central claims (BM25 robustness, relative performance of model classes) are direct aggregates of the external evaluation results rather than tautological restatements. No load-bearing self-citation steps, uniqueness theorems, or ansatzes appear in the derivation chain. The work is therefore grounded entirely in external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies entirely on existing public datasets and previously published retrieval models without introducing new free parameters, axioms beyond standard IR evaluation practice, or invented entities.

axioms (1)
  • domain assumption: Standard IR metrics such as nDCG@10 and Recall@100 are appropriate measures for zero-shot generalization assessment.
    Implicit in the choice of reported results across all models and datasets.
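
As a worked illustration of that assumption, the sketch below computes the two named metrics for a single query from a ranked list and binary relevance judgments. It is a simplified stand-in: BEIR itself reports these metrics via pytrec_eval, and some of its datasets use graded rather than binary relevance.

```python
# Simplified binary-gain versions of nDCG@k and Recall@k for one query.
import math

def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k: DCG of the top-k ranking divided by the ideal DCG."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))
    ideal = sorted(qrels.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(ranked_doc_ids, qrels, k=100):
    """Fraction of relevant documents that appear in the top-k ranking."""
    relevant = {doc_id for doc_id, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_doc_ids[:k])) / len(relevant)

# Hypothetical example: two relevant documents, both retrieved, one at rank 1.
qrels = {"d1": 1, "d7": 1}
ranking = ["d1", "d3", "d5", "d7"]
print(ndcg_at_k(ranking, qrels, k=10))    # ~0.877
print(recall_at_k(ranking, qrels, k=100)) # 1.0
```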

pith-pipeline@v0.9.0 · 5527 in / 1180 out tokens · 58504 ms · 2026-05-12T14:35:26.616158+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 31 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  2. Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

    cs.AI 2026-05 unverdicted novelty 7.0

    Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling...

  3. DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

    cs.IR 2026-05 unverdicted novelty 7.0

    DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.

  4. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  5. HackerSignal: A Large-Scale Multi-Source Dataset Linking Hacker Community Discourse to the CVE Vulnerability Lifecycle

    cs.CR 2026-05 unverdicted novelty 7.0

    HackerSignal aggregates 7.45M documents from hacker communities, exploit databases, vulnerability reports, and fixes into a public benchmark for temporal OOD CVE linkage and exploit classification.

  6. UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.

  7. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  8. ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

    cs.IR 2026-04 conditional novelty 7.0

    ResRank unifies retrieval and listwise reranking by compressing passages to one token each, using residual connections and cosine-similarity scoring, achieving competitive effectiveness on TREC DL and BEIR benchmarks ...

  9. TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications

    cs.LG 2026-04 unverdicted novelty 7.0

    TeleEmbedBench is the first multi-corpus benchmark showing LLM-based embedding models significantly outperform traditional sentence-transformers on telecommunications specifications and code for retrieval accuracy and...

  10. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  11. MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol

    cs.IR 2026-05 unverdicted novelty 6.0

    MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...

  12. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 6.0

    Experience-RAG Skill uses experience memory to dynamically select retrieval strategies for agents, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed single-retriever baselines.

  13. Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.

  14. Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

  15. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with linear complexity on rationale-based retrieval.

  16. Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA

    cs.IR 2026-04 unverdicted novelty 6.0

    Rabtriever distills a generative reranker into an efficient independent encoder using JEPA and auxiliary reverse KL loss to achieve linear complexity and strong performance on rationale-based retrieval tasks.

  17. ARHN: Answer-Centric Relabeling of Hard Negatives with Open-Source LLMs for Dense Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    ARHN refines hard-negative training data for dense retrieval by using LLMs to convert answer-containing passages into additional positives and exclude answer-containing passages from the negative set.

  18. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  19. Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

    cs.IR 2026-04 unverdicted novelty 6.0

    Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.

  20. Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead

    cs.IR 2026-04 accept novelty 6.0

    Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and co...

  21. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  22. Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

    cs.CL 2023-10 unverdicted novelty 6.0

    Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.

  23. Unsupervised Dense Information Retrieval with Contrastive Learning

    cs.IR 2021-12 unverdicted novelty 6.0

    Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

  24. Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

    cs.IR 2026-05 unverdicted novelty 5.0

    SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.

  25. AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases

    cs.AI 2026-05 unverdicted novelty 5.0

    AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.

  26. Efficient Listwise Reranking with Compressed Document Representations

    cs.IR 2026-04 unverdicted novelty 5.0

    RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.

  27. Reproduction Beyond Benchmarks: ConstBERT and ColBERT-v2 Across Backends and Query Distributions

    cs.IR 2026-04 accept novelty 5.0

    ConstBERT and ColBERT-v2 reproduce on MS-MARCO but drop 86-97% on long queries because MaxSim cannot filter filler noise, and extra fine-tuning or backend changes do not overcome the architectural constraint.

  28. Dynamic Ranked List Truncation for Reranking Pipelines via LLM-generated Reference-Documents

    cs.IR 2026-04 unverdicted novelty 5.0

    LLM-generated reference documents enable dynamic ranked list truncation and adaptive batching for listwise reranking, outperforming prior RLT methods and accelerating processing by up to 66% on TREC benchmarks.

  29. An Agent-Oriented Pluggable Experience-RAG Skill for Experience-Driven Retrieval Strategy Orchestration

    cs.AI 2026-05 unverdicted novelty 4.0

    Experience-RAG Skill is a reusable agent skill that selects retrieval strategies via experience memory, achieving 0.8924 nDCG@10 on BeIR/nq, hotpotqa, and scifact while outperforming fixed retriever baselines.

  30. A Systematic Study of Retrieval Pipeline Design for Retrieval-Augmented Medical Question Answering

    cs.CL 2026-04 unverdicted novelty 4.0

    Dense retrieval plus query reformulation and reranking reaches 60.49% accuracy on MedQA USMLE, outperforming other setups while domain-specialized models make better use of the retrieved evidence.

  31. Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval

    cs.IR 2026-04 conditional novelty 3.0

    Reproducibility study confirms Hypencoder's non-linear query-specific scoring improves retrieval over bi-encoders on standard benchmarks but standard methods remain faster and hard-task results are mixed due to implem...

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 29 Pith papers · 2 internal anchors

  1. [1]

    Amin Ahmad, Noah Constant, Yinfei Yang, and Daniel Cer. 2019. ReQA: An Evaluation for End-to-End Answer Retrieval Models. In Proceedings of the 2nd Workshop on Machine Reading for Question Answering, pages 137–146, Hong Kong, China. Association for Computational Linguistics. 18

  2. [2]

    Akari Asai, Jungo Kasai, Jonathan Clark, Kenton Lee, Eunsol Choi, and Hannaneh Hajishirzi

  3. [3]

    XOR QA: Cross-lingual Open-Retrieval Question Answering. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 547–564, Online. Association for Computational Linguistics. 17

  4. [4]

    Petr Baudiš and Jan Šedivý. 2015. Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pages 222–228. Springer. 6

  5. [5]

    Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1533–1544, Seattle, Washington, USA. Association for Computational Linguistics. 6

  6. [6]

    Adam Berger, Rich Caruana, David Cohn, Dayne Freitag, and Vibhu Mittal. 2000. Bridging the lexical chasm: statistical approaches to answer-finding. In Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval, pages 192–199. 1, 3

  7. [7]

    Alexander Bondarenko, Maik Fröbe, Meriem Beloucif, Lukas Gienapp, Yamen Ajjour, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, Martin Potthast, and Matthias Hagen. 2020. Overview of Touché 2020: Argument Retrieval. In Working Notes Papers of the CLEF 2020 Evaluation Labs, volume 2696 of CEUR Workshop Proceedings. 4, 19, 22

  8. [8]

    Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. 2016. A full-text learning to rank dataset for medical information retrieval. In Proceedings of the 38th European Conference on Information Retrieval (ECIR 2016), pages 716–722. 4, 18

  9. [9]

    Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1870–1879, Vancouver, Canada. Association for Computational Linguistics. 1, 18

  10. [10]

    Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. SPECTER: Document-level Representation Learning using Citation-informed Transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, Online. Association for Computational Linguistics. 4, 19

  11. [11]

    David Corney, Dyaa Albakour, Miguel Martinez, and Samir Moussa. 2016. What do a Million News Articles Look like? In Proceedings of the First International Workshop on Recent Trends in News Information Retrieval co-located with 38th European Conference on Information Retrieval (ECIR 2016), pages 42–47. 18

  12. [12]

    Zhuyun Dai and Jamie Callan. 2020. Context-Aware Term Weighting For First Stage Passage Retrieval. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 1533–1536, New York, NY, USA. Association for Computing Machinery. 3, 6

  13. [13]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. 1

  14. [14]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Min...

  15. [15]

    Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. 2020. CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims. 4, 20

  16. [16]

    Yingqi Qu, Yuchen Ding, Jing Liu, Kai Liu, Ruiyang Ren, Xin Zhao, Daxiang Dong, Hua Wu, and Haifeng Wang. 2020. RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering. 1, 17

  17. [17]

    Anlei Dong, Ruiqiang Zhang, Pranam Kolari, Jing Bai, Fernando Diaz, Yi Chang, Zhaohui Zheng, and Hongyuan Zha. 2010. Time is of the Essence: Improving Recency Ranking Using Twitter Data. In Proceedings of the 19th International Conference on World Wide Web, WWW ’10, page 331–340, New York, NY, USA. Association for Computing Machinery. 17

  18. [18]

    Luyu Gao, Zhuyun Dai, Tongfei Chen, Zhen Fan, Benjamin Van Durme, and Jamie Callan

  19. [19]

    Complementing Lexical Retrieval with Semantic Residual Embedding. 3, 17

  20. [20]

    Daniel Gillick, Alessandro Presta, and Gaurav Singh Tomar. 2018. End-to-End Retrieval in Continuous Space. 3

  21. [21]

    Mandy Guo, Yinfei Yang, Daniel Cer, Qinlan Shen, and Noah Constant. 2020. MultiReQA: A Cross-Domain Evaluation for Retrieval Question Answering Models. 2, 3

  22. [22]

    Aric A. Hagberg, Daniel A. Schult, and Pieter J. Swart. 2008. Exploring Network Structure, Dynamics, and Function using NetworkX. In Proceedings of the 7th Python in Science Conference, pages 11–15, Pasadena, CA USA. 5

  23. [23]

    Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. 2017. DBpedia-Entity V2: A Test Collection for Entity Search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, pages 1265–1268. ACM. 4, 8, 19

  24. [24]

    Jerry L. Hintze and Ray D. Nelson. 1998. Violin Plots: A Box Plot-Density Trace Synergism. The American Statistician, 52(2):181–184. 8, 24

  25. [25]

    Sebastian Hofstätter, Sheng-Chieh Lin, Jheng-Hong Yang, Jimmy Lin, and Allan Hanbury

  26. [26]

    Efficiently Teaching an Effective Dense Retriever with Balanced Topic Aware Sampling. In Proc. of SIGIR. 6

  27. [27]

    Sebastian Hofstätter, Sophia Althammer, Michael Schröder, Mete Sertkan, and Allan Hanbury. 2021. Improving Efficient Neural Ranking Models with Cross-Architecture Knowledge Distillation. 6, 21

  28. [28]

    Doris Hoogeveen, Karin M Verspoor, and Timothy Baldwin. 2015. CQADupStack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian document computing symposium, pages 1–8. 4, 19

  29. [29]

    Sergey Ioffe. 2010. Improved consistent sampling, weighted minhash and l1 sketching. In 2010 IEEE International Conference on Data Mining, pages 246–255. IEEE. 4, 20

  30. [30]

    Ming Ji, Yizhou Sun, Marina Danilevsky, Jiawei Han, and Jing Gao. 2010. Graph Regularized Transductive Classification on Heterogeneous Information Networks. In Machine Learning and Knowledge Discovery in Databases, pages 570–586, Berlin, Heidelberg. Springer Berlin Heidelberg. 19

  31. [31]

    Jing Jiang and ChengXiang Zhai. 2007. An empirical study of tokenization strategies for biomedical information retrieval. Information Retrieval, 10(4-5):341–363. 18

  32. [32]

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734. 6

  33. [33]

    Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics. 6

  34. [34]

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. 2020. Dense Passage Retrieval for Open-Domain Question Answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online. Association for Computational Linguistics. 1, 3, 4, 6

  35. [35]

    Omar Khattab and Matei Zaharia. 2020. ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, page 39–48, New York, NY, USA. Association for Computing Machinery. 3, 4, 6, 8

  36. [36]

    Jon M. Kleinberg. 1999. Authoritative Sources in a Hyperlinked Environment. J. ACM, 46(5):604–632. 17

  37. [37]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Research. Tr...

  38. [38]

    Davis Liang, Peng Xu, Siamak Shakeri, Cicero Nogueira dos Santos, Ramesh Nallapati, Zhiheng Huang, and Bing Xiang. 2020. Embedding-based Zero-shot Retrieval through Query Generation. 3

  39. [39]

    Jimmy Lin, Matt Crane, Andrew Trotman, Jamie Callan, Ishan Chattopadhyaya, John Foley, Grant Ingersoll, Craig Macdonald, and Sebastiano Vigna. 2016. Toward reproducible baselines: The open-source IR reproducibility challenge. In European Conference on Information Retrieval, pages 408–420. Springer. 5

  40. [40]

    Jimmy Lin, Rodrigo Nogueira, and Andrew Yates. 2020. Pretrained Transformers for Text Ranking: BERT and Beyond. 1, 3

  41. [41]

    Aldo Lipani. 2016. Fairness in Information Retrieval. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’16, page 1171, New York, NY, USA. Association for Computing Machinery. 9

  42. [42]

    Aldo Lipani. 2019. On Biases in Information retrieval models and evaluation. Ph.D. thesis, Technische Universität Wien. 8

  43. [43]

    Aldo Lipani, Mihai Lupu, and Allan Hanbury. 2016. The Curious Incidence of Bias Corrections in the Pool. In European Conference on Information Retrieval, pages 267–279. Springer. 9

  44. [44]

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. 6

  45. [45]

    Yi Luan, Jacob Eisenstein, Kristina Toutanova, and Michael Collins. 2021. Sparse, Dense, and Attentional Representations for Text Retrieval. 3

  46. [46]

    Ji Ma, Ivan Korotkov, Yinfei Yang, Keith Hall, and Ryan McDonald. 2021. Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation. 3

  47. [47]

    Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. 2018. WWW’18 Open Challenge: Financial Opinion Mining and Question Answering. In Companion Proceedings of the The Web Conference 2018, WWW ’18, page 1941–1942, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Stee...

  48. [48]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. choice, 2640:660. 1, 3, 4, 6, 17

  49. [49]

    Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage Re-ranking with BERT. arXiv preprint arXiv:1901.04085. 1, 3, 17

  50. [50]

    Rodrigo Nogueira, Jimmy Lin, and AI Epistemic. 2019. From doc2query to docTTTTTquery. Online preprint. 6

  51. [51]

    Rodrigo Nogueira, Wei Yang, Jimmy Lin, and Kyunghyun Cho. 2019. Document Expansion by Query Prediction. 3

  52. [52]

    Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66, Stanford InfoLab. Previous number = SIDL-WP-1999-0120. 17

  53. [53]

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel

  54. [54]

    KILT: a Benchmark for Knowledge Intensive Language Tasks. 2

  55. [55]

    Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How Does Clickthrough Data Reflect Retrieval Quality? In Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM ’08, page 43–52, New York, NY, USA. Association for Computing Machinery. 17

  56. [56]

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. Journal of Machine Learning Research, 21(140):1–67. 6

  57. [57]

    Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics. 3, 4

  58. [58]

    Nils Reimers and Iryna Gurevych. 2020. The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes. arXiv preprint arXiv:2012.14210. 2

  59. [59]

    Stephen Robertson and Hugo Zaragoza. 2009. The Probabilistic Relevance Framework: BM25 and Beyond. Foundations and Trends in Information Retrieval, 3(4):333–389. 1, 3, 5

  60. [60]

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2020. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. 6

  61. [61]

    Minjoon Seo, Jinhyuk Lee, Tom Kwiatkowski, Ankur Parikh, Ali Farhadi, and Hannaneh Hajishirzi. 2019. Real-Time Open-Domain Question Answering with Dense-Sparse Phrase Index. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4430–4441, Florence, Italy. Association for Computational Linguistics. 3

  62. [62]

    Ian Soboroff, Shudong Huang, and Donna Harman. 2019. TREC 2019 News Track Overview. In TREC. 4, 18

  63. [63]

    Axel Suarez, Dyaa Albakour, David Corney, Miguel Martinez, and Jose Esquivel. 2018. A Data Collection for Evaluating the Retrieval of Related Tweets to News Articles. In 40th European Conference on Information Retrieval Research (ECIR 2018), Grenoble, France, March, 2018., pages 780–786. 4, 9, 18

  64. [64]

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a Large-scale Dataset for Fact Extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisi...

  65. [65]

    George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. 2015. An overview of the BIOASQ large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16(1):138. 4, 8, 18

  66. [66]

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2019. Representation Learning with Contrastive Predictive Coding. 21

  67. [67]

    Christophe Van Gysel and Maarten de Rijke. 2018. Pytrec_eval: An Extremely Fast Python Interface to trec_eval. In SIGIR. ACM. 5

  68. [68]

    Ellen Voorhees. 2005. Overview of the TREC 2004 Robust Retrieval Track. 4, 19

  69. [69]

    Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. 2021. TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection. SIGIR Forum, 54(1). 2, 4, 9, 18

  70. [70]

    Henning Wachsmuth, Martin Potthast, Khalid Al-Khatib, Yamen Ajjour, Jana Puschmann, Jiani Qu, Jonas Dorsch, Viorel Morari, Janek Bevendorff, and Benno Stein. 2017. Building an Argument Search Engine for the Web. In 4th Workshop on Argument Mining (ArgMining 2017) at EMNLP, pages 49–59. Association for Computational Linguistics. 19

  71. [71]

    Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. Retrieval of the Best Counterargument without Prior Topic Knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251. Association for Computational Linguistics. 4, 19

  72. [72]

    David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, Online. Association for Computational Linguistics. 4, 20

  73. [73]

    Lucy Lu Wang, Kyle Lo, Yoganand Chandrasekhar, Russell Reas, Jiangjiang Yang, Doug Burdick, Darrin Eide, Kathryn Funk, Yannis Katsis, Rodney Kinney, Yunyao Li, Ziyang Liu, William Merrill, Paul Mooney, Dewey Murdick, Devvret Rishi, Jerry Sheehan, Zhihong Shen, Brandon Stilson, Alex Wade, Kuansan Wang, Nancy Xin Ru Wang, Chris Wilhelm, Boya Xie, Douglas Ra...

  74. [74]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers. In Advances in Neural Information Processing Systems, volume 33, pages 5776–5788. Curran Associates, Inc. 6

  75. [75]

    Yining Wang, Liwei Wang, Yuanzhi Li, Di He, Wei Chen, and Tie-Yan Liu. 2013. A theoretical analysis of NDCG ranking measures. In Proceedings of the 26th annual conference on learning theory (COLT 2013), volume 8, page 6. 5

  76. [76]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-t...

  77. [77]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. 2020. Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval. 6, 9

  78. [78]

    Peilin Yang, Hui Fang, and Jimmy Lin. 2017. Anserini: Enabling the Use of Lucene for Information Retrieval Research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’17, page 1253–1256, New York, NY, USA. Association for Computing Machinery. 4

  79. [79]

    Yinfei Yang, Daniel Cer, Amin Ahmad, Mandy Guo, Jax Law, Noah Constant, Gustavo Hernandez Abrego, Steve Yuan, Chris Tar, Yun-hsuan Sung, et al. 2020. Multilingual Universal Sentence Encoder for Semantic Retrieval. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 87–94. 4

  80. [80]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels, Belgium. Association for Computational L...

Showing first 80 references.