BEIR: A Heterogenous Benchmark for Zero-shot Evaluation of Information Retrieval Models

2021 · cs.IR · arXiv 2104.08663

Canonical reference. 89% of citing Pith papers cite this work as background.

61 Pith papers citing it

Background 89% of classified citations

abstract

Existing neural information retrieval (IR) models have often been studied in homogeneous and narrow settings, which has considerably limited insights into their out-of-distribution (OOD) generalization capabilities. To address this, and to facilitate researchers to broadly evaluate the effectiveness of their models, we introduce Benchmarking-IR (BEIR), a robust and heterogeneous evaluation benchmark for information retrieval. We leverage a careful selection of 18 publicly available datasets from diverse text retrieval tasks and domains and evaluate 10 state-of-the-art retrieval systems including lexical, sparse, dense, late-interaction and re-ranking architectures on the BEIR benchmark. Our results show BM25 is a robust baseline and re-ranking and late-interaction-based models on average achieve the best zero-shot performances, however, at high computational costs. In contrast, dense and sparse-retrieval models are computationally more efficient but often underperform other approaches, highlighting the considerable room for improvement in their generalization capabilities. We hope this framework allows us to better evaluate and understand existing retrieval systems, and contributes to accelerating progress towards better robust and generalizable systems in the future. BEIR is publicly available at https://github.com/UKPLab/beir.
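The abstract reports "zero-shot performances" without naming a metric; several of the citing-paper summaries on this page quote nDCG@10, which is the headline measure commonly reported on BEIR. As a rough illustration of how that number is computed for a single query, here is a minimal sketch. It assumes linear gains and a log2 rank discount, and the names `dcg_at_k`, `ndcg_at_k`, `ranked_doc_ids` and `qrels` are hypothetical, chosen for this example rather than taken from the BEIR codebase.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values.

    Assumption: linear gain with a log2(rank + 1) discount, ranks starting at 1.
    """
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: document ids in the order the system returned them.
    qrels: dict mapping doc id -> graded relevance; unjudged ids count as 0.
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    # The ideal ranking places the most relevant judged documents first.
    ideal_gains = sorted(qrels.values(), reverse=True)
    idcg = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Toy judgments for one query: d1 is highly relevant, d3 partially.
    toy_qrels = {"d1": 2, "d3": 1}
    print(ndcg_at_k(["d1", "d3", "d2"], toy_qrels))  # ideal order
    print(ndcg_at_k(["d2", "d3", "d1"], toy_qrels))  # relevant docs ranked lower
```

A benchmark-level score would then be the mean of this value over all test queries of a dataset, and typically the mean over datasets after that; the exact aggregation used by any given citing paper should be checked against that paper.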

citation-role summary

background: 7 · dataset: 1 · method: 1

citation-polarity summary

background: 8 · use method: 1

representative citing papers

Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms

stat.ML · 2026-06-29 · unverdicted · novelty 7.0

Embedding norms in contrastive models encode semantic properties via optimization dynamics under scale-invariant losses.

Test-Time Training for Zero-Resource Dense Retrieval Reranking

cs.IR · 2026-05-31 · unverdicted · novelty 7.0

DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.

Vector Linking via Cross-Model Local Isometric Consistency

cs.AI · 2026-05-29 · unverdicted · novelty 7.0

A reference-based geometric hashing method recovers cross-model vector correspondences by exploiting local isometric consistency in contrastive embeddings and iteratively bootstrapping from a seed of paired anchors.

Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems

cs.IR · 2026-05-23 · unverdicted · novelty 7.0

Spectral Retrieval uses multi-scale sinc convolutions on token embeddings to interpolate between per-token MaxSim and mean-pooling, achieving large gains on synthetic and LIMIT-small benchmarks for localized retrieval.

Block-Sphere Vector Quantization

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.

Very Efficient Listwise Multimodal Reranking for Long Documents

cs.IR · 2026-05-12 · unverdicted · novelty 7.0

ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics

cs.AI · 2026-05-09 · unverdicted · novelty 7.0

Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

HackerSignal: A Large-Scale Multi-Source Dataset Linking Hacker Community Discourse to the CVE Vulnerability Lifecycle

cs.CR · 2026-05-04 · unverdicted · novelty 7.0

HackerSignal aggregates 7.45M documents from hacker communities, exploit databases, vulnerability reports, and fixes into a public benchmark for temporal OOD CVE linkage and exploit classification.

UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval

cs.IR · 2026-04-28 · unverdicted · novelty 7.0

UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression

cs.IR · 2026-04-24 · conditional · novelty 7.0

ResRank unifies retrieval and listwise reranking by compressing passages to one token each, using residual connections and cosine-similarity scoring, achieving competitive effectiveness on TREC DL and BEIR benchmarks with zero generated tokens.

TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications

cs.LG · 2026-04-20 · unverdicted · novelty 7.0

TeleEmbedBench is the first multi-corpus benchmark showing LLM-based embedding models significantly outperform traditional sentence-transformers on telecommunications specifications and code for retrieval accuracy and noise robustness.

WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain

cs.CL · 2026-03-17 · unverdicted · novelty 7.0

WorkRB is the first open community-driven benchmark for AI in the work domain, organizing 13 tasks from 7 groups with dynamic multilingual ontology loading and modular design for proprietary task integration.

LMEB: Long-horizon Memory Embedding Benchmark

cs.CL · 2026-03-13 · unverdicted · novelty 7.0

LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

Scaling Laws for Cross-Encoder Reranking

cs.IR · 2026-03-05 · unverdicted · novelty 7.0

Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.

SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise

cs.IR · 2026-02-13 · unverdicted · novelty 7.0

SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.

Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

cs.CL · 2025-11-11 · unverdicted · novelty 7.0

UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.

From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems

cs.MA · 2025-06-05 · accept · novelty 7.0

A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

When Should Queries Be Decomposed? A Stage-Aware Study of Query Decomposition for Multi-Condition Retrieval

cs.IR · 2026-06-07 · unverdicted · novelty 6.0

Empirical study shows query decomposition is detrimental in initial retrieval due to semantic dilution but beneficial in reranking, proposing a stage-aware framework that improves performance on MultiConIR and SSRB benchmarks.

ColBERTSaR: Sparsified ColBERT Index via Product Quantization

cs.IR · 2026-06-04 · unverdicted · novelty 6.0

ColBERTSaR uses product quantization on ColBERT embeddings to create a true inverted index that is 50-70% smaller than one-bit PLAID while retaining retrieval effectiveness, and is theoretically equivalent to learned-sparse retrieval except for scoring.

SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics

cs.IR · 2026-05-29 · unverdicted · novelty 6.0

SPECTRA generates reproducible synthetic IR corpora up to 60,000 documents with controllable distractors, long-tail vocabulary, and graded relevance labels via a single-process Python prototype.

RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

RICE-PO is a policy optimization framework that converts retrieval interactions into credit signals for latent reasoning steps in agents by selecting high-uncertainty actions as anchors and propagating credit based on influence strength and residual stability, outperforming baselines on BRIGHT and B

citing papers explorer

Showing 50 of 61 citing papers.

Optimization Dynamics Imprint Semantic Specificity in Contrastive Embedding Norms stat.ML · 2026-06-29 · unverdicted · none · ref 8 · internal anchor
Embedding norms in contrastive models encode semantic properties via optimization dynamics under scale-invariant losses.
Test-Time Training for Zero-Resource Dense Retrieval Reranking cs.IR · 2026-05-31 · unverdicted · none · ref 27 · internal anchor
DART adapts a scoring matrix at inference time via gradient updates on pseudo-labels from top/bottom documents to gain +2.1% mean NDCG@10 on six BEIR benchmarks with under 10ms added latency.
Vector Linking via Cross-Model Local Isometric Consistency cs.AI · 2026-05-29 · unverdicted · none · ref 24 · internal anchor
A reference-based geometric hashing method recovers cross-model vector correspondences by exploiting local isometric consistency in contrastive embeddings and iteratively bootstrapping from a seed of paired anchors.
Spectral Retrieval: Multi-Scale Sinc Convolution over Token Embeddings for Localized Retrieval in LLM Multi-Agent Systems cs.IR · 2026-05-23 · unverdicted · none · ref 10 · internal anchor
Spectral Retrieval uses multi-scale sinc convolutions on token embeddings to interpolate between per-token MaxSim and mean-pooling, achieving large gains on synthetic and LIMIT-small benchmarks for localized retrieval.
Block-Sphere Vector Quantization cs.LG · 2026-05-19 · unverdicted · none · ref 36 · internal anchor
BlockQuant is a new block quantization algorithm on the sphere after random rotation that theoretically improves reconstruction MSE and expected inner-product distortion over EDEN, RabitQ, and TurboQuant.
Very Efficient Listwise Multimodal Reranking for Long Documents cs.IR · 2026-05-12 · unverdicted · none · ref 45 · internal anchor
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
Re$^2$Math: Benchmarking Theorem Retrieval in Research-Level Mathematics cs.AI · 2026-05-09 · unverdicted · none · ref 18 · internal anchor
Re²Math is a new benchmark that evaluates AI models on retrieving and verifying the applicability of theorems from math literature to advance steps in partial proofs, accepting any sufficient theorem while controlling for leakage.
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding cs.CL · 2026-05-06 · unverdicted · none · ref 31 · internal anchor
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
HackerSignal: A Large-Scale Multi-Source Dataset Linking Hacker Community Discourse to the CVE Vulnerability Lifecycle cs.CR · 2026-05-04 · unverdicted · none · ref 5 · internal anchor
HackerSignal aggregates 7.45M documents from hacker communities, exploit databases, vulnerability reports, and fixes into a public benchmark for temporal OOD CVE linkage and exploit classification.
UnIte: Uncertainty-based Iterative Document Sampling for Domain Adaptation in Information Retrieval cs.IR · 2026-04-28 · unverdicted · none · ref 17 · internal anchor
UnIte selects target-domain documents for pseudo-query generation by filtering high aleatoric uncertainty and prioritizing high epistemic uncertainty, yielding +2.45 to +3.49 nDCG@10 gains on BEIR with ~4k samples.
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models cs.IR · 2026-04-25 · unverdicted · none · ref 32 · internal anchor
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
ResRank: Unifying Retrieval and Listwise Reranking via End-to-End Joint Training with Residual Passage Compression cs.IR · 2026-04-24 · conditional · none · ref 27 · internal anchor
ResRank unifies retrieval and listwise reranking by compressing passages to one token each, using residual connections and cosine-similarity scoring, achieving competitive effectiveness on TREC DL and BEIR benchmarks with zero generated tokens.
TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications cs.LG · 2026-04-20 · unverdicted · none · ref 9 · internal anchor
TeleEmbedBench is the first multi-corpus benchmark showing LLM-based embedding models significantly outperform traditional sentence-transformers on telecommunications specifications and code for retrieval accuracy and noise robustness.
WorkRB: A Community-Driven Evaluation Framework for AI in the Work Domain cs.CL · 2026-03-17 · unverdicted · none · ref 26 · internal anchor
WorkRB is the first open community-driven benchmark for AI in the work domain, organizing 13 tasks from 7 groups with dynamic multilingual ontology loading and modular design for proprietary task integration.
LMEB: Long-horizon Memory Embedding Benchmark cs.CL · 2026-03-13 · unverdicted · none · ref 31 · internal anchor
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
Scaling Laws for Cross-Encoder Reranking cs.IR · 2026-03-05 · unverdicted · none · ref 36 · internal anchor
Cross-encoder reranker performance scales predictably via power laws with model size and training exposure, allowing accurate forecasts for 400M and 1B models and data-heavy compute allocation.
SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise cs.IR · 2026-02-13 · unverdicted · none · ref 30 · internal anchor
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker cs.CL · 2025-11-11 · unverdicted · none · ref 35 · internal anchor
UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound AI Systems cs.MA · 2025-06-05 · accept · none · ref 179 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
C-Pack: Packed Resources For General Chinese Embeddings cs.CL · 2023-09-14 · accept · none · ref 56 · internal anchor
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
When Should Queries Be Decomposed? A Stage-Aware Study of Query Decomposition for Multi-Condition Retrieval cs.IR · 2026-06-07 · unverdicted · none · ref 6 · internal anchor
Empirical study shows query decomposition is detrimental in initial retrieval due to semantic dilution but beneficial in reranking, proposing a stage-aware framework that improves performance on MultiConIR and SSRB benchmarks.
ColBERTSaR: Sparsified ColBERT Index via Product Quantization cs.IR · 2026-06-04 · unverdicted · none · ref 28 · internal anchor
ColBERTSaR uses product quantization on ColBERT embeddings to create a true inverted index that is 50-70% smaller than one-bit PLAID while retaining retrieval effectiveness, and is theoretically equivalent to learned-sparse retrieval except for scoring.
SPECTRA: Synthetic IR Test Collections with Relevance Oracles and Controlled Distractor Diagnostics cs.IR · 2026-05-29 · unverdicted · none · ref 12 · internal anchor
SPECTRA generates reproducible synthetic IR corpora up to 60,000 documents with controllable distractors, long-tail vocabulary, and graded relevance labels via a single-process Python prototype.
RICE-PO: Turning Retrieval Interactions into Credit Signals for Reasoning Agents cs.CL · 2026-05-25 · unverdicted · none · ref 26 · internal anchor
RICE-PO is a policy optimization framework that converts retrieval interactions into credit signals for latent reasoning steps in agents by selecting high-uncertainty actions as anchors and propagating credit based on influence strength and residual stability, outperforming baselines on BRIGHT and B
TubiFM: Unified Item, Carousel, and Search Ranking for Streaming Discovery cs.IR · 2026-05-22 · unverdicted · none · ref 20 · internal anchor
A Llama-based model trained on serialized user stories unifies item, carousel, and search ranking and outperforms specialist baselines offline while improving some online metrics and reducing latency.
When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering cs.CL · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
OGCaReBench is a new retrieval-focused benchmark for evaluating LLMs on off-guideline clinical questions from real case reports, showing retrieval augmentation raises accuracy from 56% to 82%.
Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix cs.IR · 2026-05-18 · unverdicted · none · ref 16 · internal anchor
A q-log odds variant of BM25 raises NDCG@10 by 89% relative on CodeSearchNet Go under fixed generic tokenization while recovering standard BM25 at q=1.
MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocol cs.IR · 2026-05-08 · unverdicted · none · ref 12 · internal anchor
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models cs.IR · 2026-05-08 · unverdicted · none · ref 9 · 2 links · internal anchor
DiffRetriever uses parallel masked tokens in diffusion LMs for retrieval representations, outperforming DiffEmbed and other baselines on aggregate effectiveness while supporting efficient multi-representation matching.
Superintelligent Retrieval Agent: The Next Frontier of Agentic Retrieval cs.IR · 2026-05-07 · unverdicted · none · ref 10 · 2 links · internal anchor
SIRA compresses multi-round exploratory retrieval into one corpus-discriminative BM25 action via LLM document enrichment, query-time term prediction, and corpus-statistic filtering, reporting top average performance on ten BEIR benchmarks and strong results on BrowseComp-Wikipedia without relevance
Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning cs.CL · 2026-05-02 · unverdicted · none · ref 49 · internal anchor
Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.
Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces cs.LG · 2026-05-01 · unverdicted · none · ref 46 · 2 links · internal anchor
KAHM yields a compute-efficient query encoder that outperforms matched learned adapters in reconstructing a frozen Mixedbread embedding space on an Austrian-law retrieval task while delivering an 8.53x CPU speedup.
Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA cs.IR · 2026-04-25 · unverdicted · none · ref 25 · 2 links · internal anchor
Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with linear complexity on rationale-based retrieval.
ARHN: Answer-Centric Relabeling of Hard Negatives with Open-Source LLMs for Dense Retrieval cs.IR · 2026-04-13 · unverdicted · none · ref 23 · internal anchor
ARHN refines hard-negative training data for dense retrieval by using LLMs to convert answer-containing passages into additional positives and exclude answer-containing passages from the negative set.
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval cs.IR · 2026-04-08 · unverdicted · none · ref 33 · internal anchor
HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and +14.1 over the best multimodal baseline.
Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers cs.IR · 2026-04-07 · unverdicted · none · ref 18 · internal anchor
Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.
Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead cs.IR · 2026-04-04 · accept · none · ref 51 · internal anchor
Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and confidence scores remain poorly calibrated.
LiteSemRAG: Lightweight LLM-Free Semantic-Aware Graph Retrieval for Robust RAG cs.IR · 2026-03-16 · unverdicted · none · ref 13 · internal anchor
LiteSemRAG delivers leading MRR@10 on three benchmarks using only lightweight semantic graph methods and zero LLM tokens.
Mitigating Membership Inference in Intermediate Representations with Differentially Private Training cs.LG · 2026-02-26 · unverdicted · none · ref 11 · internal anchor
LM-DP-SGD estimates layer-specific MIA risks from shadow models and reweights gradients to give stronger protection to vulnerable layers, improving the privacy-utility trade-off over uniform DP-SGD.
Where Relevance Emerges: A Layer-Wise Study of Internal Attention for Zero-Shot Re-Ranking cs.IR · 2026-02-26 · unverdicted · none · ref 23 · internal anchor
Internal attention in LLMs shows a bell-curve relevance distribution across layers, enabling Selective-ICR that cuts inference latency 30-50% and lets an 8B zero-shot model match 14B RL re-rankers on BRIGHT.
LEAF: Knowledge Distillation of Text Embedding Models with Teacher-Aligned Representations cs.IR · 2025-09-16 · conditional · none · ref 37 · internal anchor
LEAF distills teacher-aligned student embedding models that achieve new SOTA results on BEIR and MTEB for their size class while requiring only modest data and compute.
ProRank: Prompt Warmup via Reinforcement Learning for Small Language Models Reranking cs.IR · 2025-06-04 · unverdicted · none · ref 7 · internal anchor
ProRank uses RL-based prompt warmup and fine-grained scoring to train small language models that surpass LLM rerankers on BEIR.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models cs.CL · 2024-05-27 · accept · none · ref 126 · internal anchor
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection cs.CL · 2023-10-17 · unverdicted · none · ref 164 · internal anchor
Self-RAG trains LLMs to adaptively retrieve passages on demand and self-critique using reflection tokens, outperforming ChatGPT and retrieval-augmented Llama2 on QA, reasoning, and fact verification.
Atlas: Few-shot Learning with Retrieval Augmented Language Models cs.CL · 2022-08-05 · unverdicted · none · ref 129 · internal anchor
Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.
Unsupervised Dense Information Retrieval with Contrastive Learning cs.IR · 2021-12-16 · unverdicted · none · ref 172 · internal anchor
Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.
A Distribution-Free Framework for Rewrite-Based Human-text Detection via Knockoff Filtering stat.ME · 2026-05-29 · unverdicted · none · ref 21 · internal anchor
A distribution-free framework applies knockoff filtering to rewrite-based detectors to achieve finite-sample FDR control for human vs. LLM text detection.
DynaTree: Dynamic Agentic Retrieval Tree for Time-Sensitive News Retrieval cs.IR · 2026-05-29 · unverdicted · none · ref 28 · internal anchor
DynaTree separates offline agentic tree construction from online subtree selection to deliver better recall, ranking, and production survival rates than standard or prior agentic RAG for news retrieval.
PRA-RAG: Provably Robust Aggregation in Retrieval-Augmented Generation against Retrieval Corruption cs.IR · 2026-05-08 · unverdicted · none · ref 123 · internal anchor
PRA-RAG is a new aggregation algorithm for RAG that claims provable robustness bounds against poisoned retrieved texts and reduces attack success rate to 1% while keeping 71% accuracy.
AgenticRAG: Agentic Retrieval for Enterprise Knowledge Bases cs.AI · 2026-05-07 · unverdicted · none · ref 7 · internal anchor
AgenticRAG equips an LLM with iterative retrieval and navigation tools, delivering 49.6% recall@1 on BRIGHT, 0.96 factuality on WixQA, and 92% correctness on FinanceBench.
