pith. machine review for the scientific record.

arxiv: 2212.03533 · v2 · submitted 2022-12-07 · 💻 cs.CL · cs.IR

Recognition: no theorem link

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Binxing Jiao, Daxin Jiang, Furu Wei, Liang Wang, Linjun Yang, Nan Yang, Rangan Majumder, Xiaolong Huang

Pith reviewed 2026-05-11 04:49 UTC · model grok-4.3

classification 💻 cs.CL cs.IR
keywords text embeddings · contrastive pre-training · weak supervision · zero-shot retrieval · fine-tuning · retrieval benchmarks · embedding evaluation

The pith

Text embeddings trained via contrastive learning on weakly supervised pairs outperform the BM25 baseline on retrieval tasks without any labeled data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a family of embedding models can be trained contrastively using only weak supervision signals drawn from a large curated collection of text pairs. This matters to a sympathetic reader because it offers a route to general-purpose single-vector text representations that work across retrieval, clustering, and classification without expensive task-specific labels. If the approach holds, embedding models become cheaper to build and easier to deploy at scale while still delivering competitive accuracy in both zero-shot and fine-tuned regimes.

Core claim

The central claim is that contrastive pre-training on weak supervision signals extracted from a curated large-scale text pair dataset produces embeddings that transfer effectively to many downstream tasks. The resulting model is the first to outperform the BM25 baseline on the BEIR retrieval benchmark in a zero-shot setting, and it achieves the highest scores on the MTEB benchmark after fine-tuning, even against models with substantially more parameters.

What carries the argument

Contrastive pre-training on weak supervision signals from the curated large-scale text pair dataset, which supplies positive and negative pairs to shape the embedding space.
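The contrastive objective behind this kind of training is typically an InfoNCE loss over in-batch negatives: each query is pulled toward its paired passage and pushed away from every other passage in the batch. A minimal sketch, with an illustrative temperature and random toy data rather than the paper's exact recipe:

```python
import numpy as np

def info_nce_loss(q, p, temperature=0.05):
    """InfoNCE loss with in-batch negatives.

    q, p: (batch, dim) arrays of L2-normalized query/passage embeddings.
    Row i of p is the positive for row i of q; every other row of p
    serves as a negative for that query.
    """
    sims = q @ p.T / temperature              # (batch, batch) similarities
    sims -= sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))       # cross-entropy on the diagonal

# Toy batch: when queries and passages coincide, the loss is near its minimum.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
loss = info_nce_loss(q, q.copy())
```

Driving the diagonal terms up and the off-diagonal terms down is what shapes the embedding space described above; the quality of the weak pairs determines whether that shape transfers.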

If this is right

  • The embeddings function as drop-in single-vector representations for any task that needs them, including retrieval, clustering, and classification.
  • Zero-shot use already surpasses a strong traditional baseline on diverse retrieval problems.
  • Fine-tuning the same base model produces the strongest recorded results on broad embedding benchmarks while using far fewer parameters than prior leaders.
  • The same training recipe scales to produce models that maintain performance across varied tasks and domains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The curation step that creates the weak pairs appears to be the main lever for avoiding domain-specific artifacts.
  • The same weak-supervision contrastive recipe could be applied to construct embeddings for additional languages or narrow technical domains if suitable pair datasets can be assembled.
  • Iterative refinement of the pair-generation rules might further lift generalization without adding labeled data.

Load-bearing premise

The weak supervision signals drawn from the curated text pair dataset yield embeddings that generalize across tasks and domains without inheriting biases or artifacts from the pair-generation process.

What would settle it

If a new large-scale retrieval benchmark shows the embeddings failing to exceed the BM25 baseline in zero-shot evaluation, the performance claim would be refuted.
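For context, the BM25 baseline that the claim is measured against is a purely lexical scorer: term frequency saturated by k1, damped by document length, weighted by inverse document frequency. A minimal Okapi BM25 sketch (the k1 and b values are conventional defaults, not parameters taken from the paper):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
               k1=0.9, b=0.4):
    """Okapi BM25 score of one document for a bag-of-words query.

    doc_freq maps each term to the number of corpus documents containing it.
    """
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if term not in tf:
            continue
        idf = math.log(1 + (n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5))
        norm = (tf[term] * (k1 + 1)
                / (tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len)))
        score += idf * norm
    return score

# Tiny corpus: the document matching more (and rarer) query terms wins.
docs = [["weak", "supervision", "pairs"],
        ["contrastive", "pre", "training", "pairs"],
        ["unrelated", "text"]]
df = Counter(t for d in docs for t in set(d))
avg = sum(len(d) for d in docs) / len(docs)
scores = [bm25_score(["contrastive", "pairs"], d, df, len(docs), avg)
          for d in docs]
```

Because BM25 only sees exact term overlap, beating it zero-shot is evidence that the learned embeddings capture semantics beyond the lexical surface.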

original abstract

This paper presents E5, a family of state-of-the-art text embeddings that transfer well to a wide range of tasks. The model is trained in a contrastive manner with weak supervision signals from our curated large-scale text pair dataset (called CCPairs). E5 can be readily used as a general-purpose embedding model for any tasks requiring a single-vector representation of texts such as retrieval, clustering, and classification, achieving strong performance in both zero-shot and fine-tuned settings. We conduct extensive evaluations on 56 datasets from the BEIR and MTEB benchmarks. For zero-shot settings, E5 is the first model that outperforms the strong BM25 baseline on the BEIR retrieval benchmark without using any labeled data. When fine-tuned, E5 obtains the best results on the MTEB benchmark, beating existing embedding models with 40x more parameters.
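The "single-vector representation" use case in the abstract reduces to nearest-neighbor search over normalized vectors. A toy sketch with random stand-in token vectors and mean pooling (a common pooling convention; the model's actual pooling and input formatting may differ):

```python
import numpy as np

def embed(token_vectors):
    """Collapse per-token vectors to one L2-normalized text vector
    via mean pooling (illustrative; not necessarily E5's pooling)."""
    v = np.asarray(token_vectors).mean(axis=0)
    return v / np.linalg.norm(v)

def retrieve(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity (dot product of unit vectors)."""
    sims = np.stack(doc_vecs) @ query_vec
    return np.argsort(-sims)[:k]

# Stand-in corpus; a query near document 0 should rank it first.
rng = np.random.default_rng(1)
docs = [embed(rng.normal(size=(5, 16))) for _ in range(3)]
query = docs[0] + 0.05 * rng.normal(size=16)
query /= np.linalg.norm(query)
top = retrieve(query, docs)
```

The same vectors serve retrieval, clustering, and classification, which is what makes the single-vector framing attractive operationally.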

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces E5, a family of text embedding models trained in a contrastive manner using weak supervision signals from a curated large-scale text pair dataset called CCPairs. It claims strong performance on retrieval, clustering, and classification tasks, specifically being the first to outperform the BM25 baseline on the BEIR benchmark in a zero-shot setting without any labeled data, and achieving the best results on the MTEB benchmark when fine-tuned, surpassing models with 40 times more parameters. Evaluations are conducted on 56 datasets from BEIR and MTEB.

Significance. If the central claims hold, this work would be significant as it demonstrates that high-quality general-purpose text embeddings can be obtained through weakly-supervised contrastive pre-training without relying on labeled data, offering a parameter-efficient alternative to larger models. The extensive evaluation across multiple benchmarks supports its potential as a versatile embedding model for various NLP tasks.

major comments (3)
  1. The description of the CCPairs dataset curation and the weak supervision signal extraction is insufficient. Without details on how pairs are generated and any controls for label noise or domain biases, it is difficult to assess whether the outperformance on BEIR is truly due to generalizable signals or artifacts from the data collection process.
  2. The manuscript reports results on 56 datasets but does not provide information on training hyperparameters, batch sizes, contrastive temperature, or ablation studies isolating the contribution of the weak supervision. This leaves the central performance claims only moderately supported.
  3. The claim that E5 is the first model to outperform BM25 on BEIR without labeled data requires explicit comparison tables showing previous zero-shot models and confirmation that no labeled data from BEIR or similar was used in CCPairs construction.
minor comments (1)
  1. The abstract could clarify the model sizes of E5 variants for better context on the parameter efficiency claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight areas where additional clarity and evidence can strengthen the manuscript. We address each major comment below and will revise the paper to incorporate the suggested improvements while preserving the core contributions on weakly-supervised contrastive pre-training for text embeddings.

point-by-point responses
  1. Referee: The description of the CCPairs dataset curation and the weak supervision signal extraction is insufficient. Without details on how pairs are generated and any controls for label noise or domain biases, it is difficult to assess whether the outperformance on BEIR is truly due to generalizable signals or artifacts from the data collection process.

    Authors: We appreciate this observation. Section 3.1 of the manuscript outlines the CCPairs construction, including data sources (e.g., Wikipedia hyperlinks, Reddit threads, StackExchange Q&A) and weak supervision signals derived from co-occurrence and structural relations. To address the concern directly, we will expand this section with explicit details on pair generation heuristics, noise filtering steps (such as length-based pruning and duplicate removal), domain distribution statistics, and bias mitigation strategies. These additions will include quantitative analysis of label noise estimates and domain coverage to demonstrate that performance gains stem from generalizable signals rather than collection artifacts. revision: yes

  2. Referee: The manuscript reports results on 56 datasets but does not provide information on training hyperparameters, batch sizes, contrastive temperature, or ablation studies isolating the contribution of the weak supervision. This leaves the central performance claims only moderately supported.

    Authors: We note that core hyperparameters (batch size 1024, contrastive temperature 0.01, learning rate schedule, and optimizer) are specified in the appendix, along with the contrastive loss formulation. However, we agree that moving these to the main text and adding dedicated ablation studies would improve support for the claims. In revision, we will include a new subsection with ablations that isolate the weak supervision components (e.g., comparing different pair sources and loss variants) and report their impact on BEIR and MTEB performance. This will make the experimental setup fully transparent and better substantiate the role of weak supervision. revision: partial

  3. Referee: The claim that E5 is the first model to outperform BM25 on BEIR without labeled data requires explicit comparison tables showing previous zero-shot models and confirmation that no labeled data from BEIR or similar was used in CCPairs construction.

    Authors: We maintain the claim based on our literature review but concur that explicit evidence is warranted. We will add a comparison table in the experiments section listing zero-shot results of prior models (including Sentence-BERT, SimCSE, and other contrastive baselines) on BEIR, confirming none surpass BM25. Additionally, we will insert a clear statement and supporting details verifying that CCPairs was built exclusively from public, non-BEIR sources with no access to BEIR labels or test data, including checks for domain overlap. This will rigorously support the zero-shot, no-labeled-data assertion. revision: yes
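The contrastive temperature of 0.01 cited in the rebuttal is worth pausing on: dividing similarities by a small temperature sharpens the softmax so the positive pair dominates the loss and gradients. A small illustration (the similarity values here are made up):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

# Cosine similarities of one query to its positive (0.8) and two negatives.
sims = np.array([0.8, 0.5, 0.3])

mild = softmax(sims / 1.0)    # temperature 1: weights stay spread out
sharp = softmax(sims / 0.01)  # temperature 0.01: positive dominates
```

At temperature 0.01 the positive absorbs essentially all of the probability mass, which is why low temperatures pair naturally with the large batch sizes the rebuttal also reports.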

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's central claims consist of empirical performance results on external benchmarks (BEIR and MTEB across 56 datasets) after contrastive training on a separately curated CCPairs dataset. No derivation chain, equations, or first-principles predictions are presented that reduce by construction to the training inputs or self-citations. The zero-shot outperformance of BM25 and fine-tuned MTEB results are measured outcomes, not fitted or renamed quantities. The weak-supervision assumption is stated as an empirical hypothesis to be validated by the reported numbers rather than enforced by definition. This is a standard self-contained empirical paper with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The work rests on the standard assumption that contrastive objectives on weakly-labeled pairs yield semantically useful vectors; no new mathematical axioms or invented entities are introduced in the abstract.

free parameters (1)
  • contrastive temperature and batch size
    Typical hyperparameters of contrastive training that must be chosen or tuned but are not reported in the abstract.
axioms (1)
  • domain assumption Weakly-supervised text pairs from CCPairs provide sufficient semantic signal for generalization
    Central premise that the curated pairs are representative enough to train broadly useful embeddings.

pith-pipeline@v0.9.0 · 5462 in / 1081 out tokens · 49082 ms · 2026-05-11T04:49:18.678536+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. STRABLE: Benchmarking Tabular Machine Learning with Strings

    cs.LG 2026-05 unverdicted novelty 8.0

    A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

  2. FollowTable: A Benchmark for Instruction-Following Table Retrieval

    cs.IR 2026-05 unverdicted novelty 8.0

    FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond ...

  3. Very Efficient Listwise Multimodal Reranking for Long Documents

    cs.IR 2026-05 unverdicted novelty 7.0

    ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.

  4. Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  5. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...

  6. MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

    cs.LG 2026-05 unverdicted novelty 7.0

    MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

  7. Skill Description Deception Attack against Task Routing in Internet of Agents

    cs.MA 2026-05 conditional novelty 7.0

    Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.

  8. LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG

    cs.CL 2026-05 unverdicted novelty 7.0

    LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.

  9. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

  10. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  11. Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

    cs.CL 2026-05 unverdicted novelty 7.0

    EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

  12. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  13. From Static Analysis to Audience Dissemination: A Training-Free Multimodal Controversy Detection Multi-Agent Framework

    cs.LG 2026-05 unverdicted novelty 7.0

    AuDisAgent reformulates multimodal controversy detection as a dynamic audience dissemination process using screening, panel discussion, and arbitration agents, plus comment bootstrapping, and reports outperforming pri...

  14. Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

    cs.CL 2026-04 unverdicted novelty 7.0

    Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.

  15. Prism-Reranker: Beyond Relevance Scoring -- Jointly Producing Contributions and Evidence for Agentic Retrieval

    cs.IR 2026-04 accept novelty 7.0

    Prism-Reranker models output relevance, contribution statements, and evidence passages to support agentic retrieval beyond scalar scoring.

  16. HaS: Accelerating RAG through Homology-Aware Speculative Retrieval

    cs.IR 2026-04 unverdicted novelty 7.0

    HaS accelerates RAG retrieval via homology-aware speculative retrieval and homologous query re-identification validation, cutting latency 24-37% with 1-2% accuracy drop on tested datasets.

  17. Latent Abstraction for Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    LAnR unifies retrieval-augmented generation inside a single LLM by deriving dense retrieval vectors from a [PRED] token's hidden states and using entropy to adaptively stop retrieval, outperforming prior RAG on six QA...

  18. OmniGCD: Abstracting Generalized Category Discovery for Modality Agnosticism

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniGCD trains a Transformer once on synthetic data to enable zero-shot generalized category discovery across 16 datasets in four modalities without any dataset-specific fine-tuning.

  19. DRBENCHER: Can Your Agent Identify the Entity, Retrieve Its Properties and Do the Math?

    cs.AI 2026-04 unverdicted novelty 7.0

    DRBENCHER generates multi-hop questions across biochemistry, finance, geophysics, security, and history that test interleaved browsing and computation, where the strongest models reach only 20% accuracy and human vali...

  20. Retrieval Augmented Conversational Recommendation with Reinforcement Learning

    cs.IR 2026-04 unverdicted novelty 7.0

    RAR retrieves candidate items from a 300k-movie corpus then uses LLM generation with RL feedback to produce context-aware recommendations that outperform baselines on benchmarks.

  21. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  22. Group-in-Group Policy Optimization for LLM Agent Training

    cs.LG 2025-05 unverdicted novelty 7.0

    GiGPO adds a hierarchical grouping mechanism to group-based RL so that LLM agents receive both global trajectory and local step-level credit signals, yielding >12% gains on ALFWorld and >9% on WebShop over GRPO while ...

  23. M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    cs.CL 2024-02 unverdicted novelty 7.0

    M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual,...

  24. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  25. UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

    cs.AI 2026-05 unverdicted novelty 6.0

    A multi-agent council of Gemini agents using absence-based clinical rules achieves F1 0.406 for defense mechanism classification, placing second among 64 teams, with overrides from fine-tuned models adding 2.4pp.

  26. UTS at PsyDefDetect: Multi-Agent Councils and Absence-Based Reasoning for Defense Mechanism Classification

    cs.AI 2026-05 unverdicted novelty 6.0

    A deliberative council of Gemini agents using absence-based clinical rules achieves 0.382 F1 without fine-tuning and second place overall at 0.406 F1 on defense mechanism classification, with minority-class overrides ...

  27. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  28. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA improves RL for LLM search agents by defining process rewards around pivot steps that act as information peaks boosting final answer success probability via potential-based shaping.

  29. PiCA: Pivot-Based Credit Assignment for Search Agentic Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    PiCA uses pivot-based potential rewards derived from historical sub-queries to supply trajectory-aware step guidance in agentic RL, delivering 15% gains on QA benchmarks for 3B/7B models.

  30. SearchSkill: Teaching LLMs to Use Search Tools with Evolving Skill Banks

    cs.AI 2026-05 unverdicted novelty 6.0

    SearchSkill introduces an evolving SkillBank and two-stage SFT to make LLM search query planning explicit via skill selection, improving exact match on QA benchmarks and retrieval behavior.

  31. Do not copy and paste! Rewriting strategies for code retrieval

    cs.SE 2026-05 conditional novelty 6.0

    Full natural-language rewriting of code and queries boosts retrieval on code benchmarks while corpus-only rewriting often hurts, with token entropy difference serving as a cheap predictor of gains.

  32. RRCM: Ranking-Driven Retrieval over Collaborative and Meta Memories for LLM Recommendation

    cs.IR 2026-05 unverdicted novelty 6.0

    RRCM trains an LLM to dynamically retrieve from collaborative and meta memories using group relative policy optimization driven by final top-k recommendation quality.

  33. Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    cs.AI 2026-05 unverdicted novelty 6.0

    Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.

  34. A²TGPO: Agentic Turn-Group Policy Optimization with Adaptive Turn-level Clipping

    cs.CL 2026-05 unverdicted novelty 6.0

    A²TGPO improves RL policy optimization for multi-turn agentic LLMs by normalizing information gain within same-depth turn groups, rescaling cumulative advantages by sqrt of term count, and modulating clipping ranges p...

  35. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    Skill1 trains one policy to jointly evolve skill query generation, re-ranking, task solving, and distillation from a single task-success signal, with low-frequency trends crediting selection and high-frequency variati...

  36. CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

    cs.AI 2026-05 unverdicted novelty 6.0

    CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.

  37. NH-CROP: Robust Pricing for Governed Language Data Assets under Cost Uncertainty

    cs.AI 2026-05 unverdicted novelty 6.0

    NH-CROP introduces a robust online pricing method for governed language data with uncertain costs, using a selective verification gate that improves or matches baselines without relying heavily on paid information acq...

  38. Verbal-R3: Verbal Reranker as the Missing Bridge between Retrieval and Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    Verbal-R3 uses a verbal reranker to generate analytic narratives that guide retrieval and reasoning in LLMs, achieving SOTA results on complex QA benchmarks.

  39. Kernel Affine Hull Machines for Compute-Efficient Query-Side Semantic Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    Kernel Affine Hull Machines map lexical features to semantic embeddings via RKHS and least-mean-squares, outperforming adapters in reconstruction and retrieval metrics while reducing latency 8.5-fold on a legal benchmark.

  40. Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

    cs.CL 2026-05 unverdicted novelty 6.0

    Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

  41. Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.

  42. Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

    cs.SD 2026-04 unverdicted novelty 6.0

    Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.

  43. AutoSearch: Adaptive Search Depth for Efficient Agentic RAG via Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    AutoSearch applies RL with a self-answering reward to adaptively determine minimal sufficient search depth in agentic RAG, reducing over-searching while maintaining answer quality on complex questions.

  44. RoTRAG: Rule of Thumb Reasoning for Conversation Harm Detection with Retrieval-Augmented Generation

    cs.CL 2026-04 unverdicted novelty 6.0

    RoTRAG retrieves Rules of Thumb to ground LLM reasoning for harm detection and severity classification in multi-turn dialogues, reporting roughly 40% relative F1 gains and 8.4% lower distributional error on two safety...

  45. REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning

    cs.CL 2026-04 unverdicted novelty 6.0

    REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.

  46. Sketching the Readout of Large Language Models for Scalable Data Attribution and Valuation

    cs.LG 2026-04 unverdicted novelty 6.0

    RISE applies CountSketch to dual lexical and semantic channels derived from output-layer gradient outer products, cutting data attribution storage by up to 112x and enabling retrospective and prospective influence ana...

  47. $\pi$-Play: Multi-Agent Self-Play via Privileged Self-Distillation without External Data

    cs.LG 2026-04 unverdicted novelty 6.0

    π-Play uses self-generated question construction paths as privileged information in multi-agent self-distillation to convert sparse-reward self-play into a dense-feedback loop, surpassing supervised search agents and ...

  48. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  49. Rag Performance Prediction for Question Answering

    cs.CL 2026-04 unverdicted novelty 6.0

    A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.

  50. Reasoning-Based Refinement of Unsupervised Text Clusters with LLMs

    cs.CL 2026-04 unverdicted novelty 6.0

    LLM reasoning refines unsupervised text clusters via coherence checks, redundancy removal, and label grounding, yielding better coherence and human-aligned labels on social media data.

  51. AV-SQL: Decomposing Complex Text-to-SQL Queries with Agentic Views

    cs.DB 2026-04 unverdicted novelty 6.0

    AV-SQL uses a pipeline of LLM agents to generate intermediate CTE views that decompose complex Text-to-SQL queries, reaching 70.38% execution accuracy on Spider 2.0.

  52. Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

    cs.IR 2026-04 unverdicted novelty 6.0

    Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.

  53. JU'A -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections

    cs.IR 2026-04 accept novelty 6.0

    JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.

  54. Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead

    cs.IR 2026-04 accept novelty 6.0

    Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and co...

  55. OASES: Outcome-Aligned Search-Evaluation Co-Training for Agentic Search

    cs.AI 2026-04 unverdicted novelty 6.0

    OASES co-trains search policies and evaluators to generate outcome-aligned process rewards, outperforming standard RL baselines on five multi-hop QA benchmarks.

  56. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  57. Scaling Retrieval-Augmented Reasoning with Parallel Search and Explicit Merging

    cs.AI 2026-05 unverdicted novelty 5.0

    MultiSearch uses parallel multi-query retrieval plus explicit merging inside a reinforcement-learning loop to improve retrieval-augmented reasoning, outperforming baselines on seven QA benchmarks.

  58. Superintelligent Retrieval Agent: The Next Frontier of Information Retrieval

    cs.IR 2026-05 unverdicted novelty 5.0

    SIRA compresses multi-round exploratory retrieval into one LLM-guided, corpus-statistic-validated weighted BM25 query and reports superior results over dense retrievers and agentic baselines on BEIR benchmarks.

  59. Skill1: Unified Evolution of Skill-Augmented Agents via Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 5.0

    Skill1 trains a single RL policy to co-evolve skill selection, utilization, and distillation in language model agents from one task-outcome reward, using low-frequency trends to credit selection and high-frequency var...

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · cited by 72 Pith papers · 5 internal anchors

  1. [1]

    A simple but tough-to-beat baseline for sentence embeddings

    Sanjeev Arora, Yingyu Liang, and Tengyu Ma. A simple but tough-to-beat baseline for sentence embeddings. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=SyK00v5xx

  2. [2]

    Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond

    Mikel Artetxe and Holger Schwenk. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610, 2019. doi: 10.1162/tacl_a_00288. URL https://aclanthology.org/Q19-1038

  3. [3]

    Latent Dirichlet allocation

    David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. In Thomas G. Dietterich, Suzanna Becker, and Zoubin Ghahramani, editors, Advances in Neural Information Processing Systems 14 [Neural Information Processing Systems: Natural and Synthetic, NIPS 2001, December 3-8, 2001, Vancouver, British Columbia, Canada], pages 601–608. M...

  4. [4]

    Overview of touché 2022: argument retrieval

    Alexander Bondarenko, Maik Fröbe, Johannes Kiesel, Shahbaz Syed, Timon Gurcke, Meriem Beloucif, Alexander Panchenko, Chris Biemann, Benno Stein, Henning Wachsmuth, et al. Overview of touché 2022: argument retrieval. In International Conference of the Cross- Language Evaluation Forum for European Languages, pages 311–336. Springer, 2022

  5. [5]

    A full-text learning to rank dataset for medical information retrieval

    Vera Boteva, Demian Gholipour, Artem Sokolov, and Stefan Riezler. A full-text learning to rank dataset for medical information retrieval. In European Conference on Information Retrieval, pages 716–722. Springer, 2016

  6. [6]

    A large annotated corpus for learning natural language inference

    Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal, 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1075. URL https://aclanthology.org/D15-1075

  8. [8]

    Language Models are Few-Shot Learners

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Lit...

  9. [9]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    Daniel Fernando Campos, Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, Li Deng, and Bhaskar Mitra. MS MARCO: A human generated machine reading comprehension dataset. ArXiv, abs/1611.09268, 2016

  10. [10]

    Pre-training tasks for embedding-based large-scale retrieval

    Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=rkg-mA4FDr

  11. [11]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. A simple framework for contrastive learning of visual representations. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pages 1597–1607. PMLR, 2020. URL http://pro...

  12. [12]

    Salient phrase aware dense retrieval: Can a dense retriever imitate a sparse one?

    Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta, and Wen-tau Yih. Salient phrase aware dense retrieval: Can a dense retriever imitate a sparse one? arXiv preprint arXiv:2110.06918, 2021

  13. [13]

    Specter: Document-level representation learning using citation-informed transformers

    Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. Specter: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, 2020

  14. [14]

    SentEval: An evaluation toolkit for universal sentence representations

    Alexis Conneau and Douwe Kiela. SentEval: An evaluation toolkit for universal sentence representations. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 2018. European Language Resources Association (ELRA). URL https://aclanthology.org/L18-1269

  15. [15]

    Supervised Learning of Universal Sentence Representations from Natural Language Inference Data

    Alexis Conneau, Douwe Kiela, Holger Schwenk, Loïc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 670–680, Copenhagen, Denmark, 2017. Association for Computational Linguistics. doi...

  16. [16]

    Promptagator: Few-shot dense retrieval from 8 examples

    Zhuyun Dai, Vincent Zhao, Ji Ma, Yi Luan, Jianmo Ni, Jing Lu, Anton Bakalov, Kelvin Guu, Keith B. Hall, and Ming-Wei Chang. Promptagator: Few-shot dense retrieval from 8 examples. ArXiv, abs/2209.11755, 2022

  17. [17]

    Indexing by latent semantic analysis

    Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990

  18. [18]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapol...

  19. [19]

    Climate-FEVER: A dataset for verification of real-world climate claims

    Thomas Diggelmann, Jordan Boyd-Graber, Jannis Bulian, Massimiliano Ciaramita, and Markus Leippold. Climate-fever: A dataset for verification of real-world climate claims. arXiv preprint arXiv:2012.00614, 2020

  20. [20]

    What neural networks memorize and why: Discovering the long tail via influence estimation

    Vitaly Feldman and Chiyuan Zhang. What neural networks memorize and why: Discovering the long tail via influence estimation. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIP...

  21. [21]

    Language-agnostic BERT sentence embedding

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, 2022

  22. [22]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Rose Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, and Connor Leahy. The Pile: An 800GB dataset of diverse text for language modeling. ArXiv, abs/2101.00027, 2021

  23. [23]

    SimCSE: Simple contrastive learning of sentence embeddings

    Tianyu Gao, Xingcheng Yao, and Danqi Chen. SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.552. URL https://acla...

  24. [24]

    Co-teaching: Robust training of deep neural networks with extremely noisy labels

    Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett, editors, Advances in Neural Information Processing Systems 31:...

  25. [25]

    Dbpedia-entity v2: a test collection for entity search

    Faegheh Hasibi, Fedor Nikolaev, Chenyan Xiong, Krisztian Balog, Svein Erik Bratsberg, Alexander Kotov, and Jamie Callan. Dbpedia-entity v2: a test collection for entity search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1265–1268, 2017

  26. [26]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross B. Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, pages 9726–9735. IEEE, 2020. doi: 10.1109/CVPR42600.2020.00975. URL https://doi.org/10.1109/CVPR42600....

  27. [27]

    Cqadupstack: A benchmark data set for community question-answering research

    Doris Hoogeveen, Karin M Verspoor, and Timothy Baldwin. Cqadupstack: A benchmark data set for community question-answering research. In Proceedings of the 20th Australasian document computing symposium, pages 1–8, 2015

  28. [28]

    Parameter-efficient transfer learning for NLP

    Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for NLP. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, ...

  29. [29]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Towards unsupervised dense information retrieval with contrastive learning. ArXiv, abs/2112.09118, 2021

  30. [30]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V. Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. URL http://proceedings.mlr.press/v139/jia21b.html

  32. [32]

    Dense passage retrieval for open-domain question answering

    Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, 2020. Association for Computational Linguistics. doi: 10. 1...

  33. [33]

    ColBERT: Efficient and effective passage search via contextualized late interaction over BERT

    Omar Khattab and Matei Zaharia. ColBERT: Efficient and effective passage search via contextualized late interaction over BERT. In Jimmy Huang, Yi Chang, Xueqi Cheng, Jaap Kamps, Vanessa Murdock, Ji-Rong Wen, and Yiqun Liu, editors, Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval, SIGIR 2020,...

  34. [34]

    Natural questions: A benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

  35. [35]

    Learning dense representations of phrases at scale

    Jinhyuk Lee, Mujeen Sung, Jaewoo Kang, and Danqi Chen. Learning dense representations of phrases at scale. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6634–6647, Online, 2021. Association for Computationa...

  36. [36]

    Deduplicating training data makes language models better

    Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. In ACL, 2022

  37. [37]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. ArXiv, abs/1907.11692, 2019

  38. [38]

    S2ORC: The Semantic Scholar Open Research Corpus

    Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Daniel Weld. S2ORC: The semantic scholar open research corpus. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4969–4983, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.447. URL https://aclanthology.org/202...

  39. [39]

    WWW'18 open challenge: financial opinion mining and question answering

    Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. WWW'18 open challenge: financial opinion mining and question answering. In Companion proceedings of the the web conference 2018, pages 1941–1942, 2018

  40. [40]

    Efficient estimation of word representations in vector space

    Tomas Mikolov, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. In ICLR, 2013

  41. [41]

    SGPT: GPT sentence embeddings for semantic search

    Niklas Muennighoff. SGPT: GPT sentence embeddings for semantic search. ArXiv, abs/2202.08904, 2022

  42. [42]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. ArXiv, abs/2210.07316, 2022

  43. [43]

    Text and code embeddings by contrastive pre-training

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas A. Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David P. Schnurr, Felipe Petroski Such, Kenny Sai-Kin Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, ...

  44. [44]

    SELF: learning to filter noisy labels with self-ensembling

    Duc Tam Nguyen, Chaithanya Kumar Mummadi, Thi-Phuong-Nhung Ngo, Thi Hoai Phuong Nguyen, Laura Beggel, and Thomas Brox. SELF: learning to filter noisy labels with self-ensembling. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=HkgsPhNYPS

  45. [45]

    Large dual encoders are generalizable retrievers

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable retrievers. ArXiv, abs/2112.07899, 2021

  46. [46]

    Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models

    Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874, 2022

  47. [47]

    Domain-matched Pre-training Tasks for Dense Retrieval

    Barlas Oguz, Kushal Lakhotia, Anchit Gupta, Patrick Lewis, Vladimir Karpukhin, Aleksandra Piktus, Xilun Chen, Sebastian Riedel, Scott Yih, Sonal Gupta, and Yashar Mehdad. Domain-matched pre-training tasks for dense retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022, Seattle, WA, United States, July 10-15, 2022, pages 152...

  48. [48]

    KILT: a benchmark for knowledge intensive language tasks

    Fabio Petroni, Aleksandra Piktus, Angela Fan, Patrick Lewis, Majid Yazdani, Nicola De Cao, James Thorne, Yacine Jernite, Vassilis Plachouras, Tim Rocktäschel, and Sebastian Riedel. KILT: a benchmark for knowledge intensive language tasks. In North American Chapter of the Association for Computational Linguistics, 2020

  49. [49]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machi...

  50. [50]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21:1–67, 2020

  51. [51]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China, 2019. Association for Computational Linguis...

  52. [52]

    RocketQAv2: A Joint Training Method for Dense Passage Retrieval and Passage Re-ranking

    Ruiyang Ren, Yingqi Qu, Jing Liu, Wayne Xin Zhao, QiaoQiao She, Hua Wu, Haifeng Wang, and Ji-Rong Wen. RocketQAv2: A joint training method for dense passage retrieval and passage re-ranking. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2825–2835, Online and Punta Cana, Dominican Republic, 2021. Associati...

  53. [53]

    CCMatrix: Mining Billions of High-Quality Parallel Sentences on the Web

    Holger Schwenk, Guillaume Wenzek, Sergey Edunov, Edouard Grave, Armand Joulin, and Angela Fan. CCMatrix: Mining billions of high-quality parallel sentences on the web. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)...

  54. [54]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, A. Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In Conference on Empirical Methods in Natural Language Processing, 2013

  55. [55]

    BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021

  56. [56]

    FEVER: a large-scale dataset for fact extraction and VERification

    James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana, 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1074. URL https://aclanthology.org/N18-1074

  58. [58]

    TREC-COVID: constructing a pandemic information retrieval test collection

    Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. TREC-COVID: constructing a pandemic information retrieval test collection. In ACM SIGIR Forum, volume 54, pages 1–12. ACM New York, NY, USA, 2021

  59. [59]

    Retrieval of the best counterargument without prior topic knowledge

    Henning Wachsmuth, Shahbaz Syed, and Benno Stein. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241–251, 2018

  60. [60]

    Fact or fiction: Verifying scientific claims

    David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. Fact or fiction: Verifying scientific claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, 2020

  61. [61]

    SimLM: Pre-training with representation bottleneck for dense passage retrieval

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. SimLM: Pre-training with representation bottleneck for dense passage retrieval. ArXiv, abs/2207.02578, 2022

  62. [62]

    MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers

    Wenhui Wang, Hangbo Bao, Shaohan Huang, Li Dong, and Furu Wei. MiniLMv2: Multi-head self-attention relation distillation for compressing pretrained transformers. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 2140–2151, 2021

  63. [63]

    CCNet: Extracting high quality monolingual datasets from web crawl data

    Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 4003–4012, Marseille, France, 2020. European Language Resources Association. I...

  64. [64]

    Approximate nearest neighbor negative contrastive learning for dense text retrieval

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net, 2021. URL https://openreview. ne...

  65. [65]

    LaPraDoR: Unsupervised pretrained dense retriever for zero-shot text retrieval

    Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. LaPraDoR: Unsupervised pretrained dense retriever for zero-shot text retrieval. In Findings of the Association for Computational Linguistics: ACL 2022, pages 3557–3569, 2022

  66. [66]

    HotpotQA: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018