pith. machine review for the scientific record.

arxiv: 2308.03281 · v1 · submitted 2023-08-07 · 💻 cs.CL

Recognition: 3 theorem links

Towards General Text Embeddings with Multi-stage Contrastive Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords text embeddings · contrastive learning · multi-stage training · general text representations · MTEB benchmark · code retrieval · unified NLP models

The pith

A 110M-parameter text embedding model trained via multi-stage contrastive learning on mixed datasets outperforms OpenAI's black-box API and much larger models on the MTEB benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a single text embedding model called GTE by applying contrastive learning first in an unsupervised pre-training stage and then in a supervised fine-tuning stage, drawing from a large and varied collection of datasets. The central goal is to produce embeddings that work well across many different NLP tasks and even code retrieval without any extra per-task tuning. A sympathetic reader would care because this points to a route for building general-purpose, efficient embedding systems that do not require separate models or heavy customization for each application. The modest size of the base model makes the result practically relevant for deployment.

Core claim

GTE is a unified text embedding model trained with multi-stage contrastive learning over a diverse mixture of datasets. The base version, at 110M parameters, surpasses OpenAI's embedding API and exceeds the performance of embedding models more than ten times larger on the Massive Text Embedding Benchmark (MTEB), while also outperforming prior code retrievers of similar size when code is treated simply as text.

What carries the argument

Multi-stage contrastive learning over a mixture of unsupervised pre-training and supervised fine-tuning datasets, which unifies many NLP and code tasks into a single contrastive format.
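
The objective carrying both stages is, in essence, a contrastive loss with in-batch negatives. A minimal NumPy sketch of the standard InfoNCE form is below; the paper's exact loss variant, temperature, and negative-sampling details may differ, so treat this as an illustration of the mechanism rather than the authors' implementation:

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """In-batch-negative contrastive (InfoNCE) loss.

    `queries` and `positives` are (B, D) L2-normalized embeddings; row i of
    `positives` is the positive for query i, and the other B-1 rows act as
    negatives. Returns the mean negative log-likelihood of the true pairs.
    """
    sims = queries @ positives.T / temperature   # (B, B) scaled similarities
    sims -= sims.max(axis=1, keepdims=True)      # numerical stabilization
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # NLL of the diagonal (true pairs)

# toy batch: orthogonal one-hot "embeddings", so each pairing is unambiguous
q = np.eye(3, 4)
matched = info_nce_loss(q, q)               # correct pairings -> loss near 0
shuffled = info_nce_loss(q, q[[1, 2, 0]])   # wrong pairings -> large loss
```

The temperature scales how sharply the softmax concentrates on the hardest negatives; small values like 0.05 are common in this literature, but the paper's setting is not stated in the material above.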

If this is right

  • Embeddings can be made general enough to handle both natural language and programming language retrieval without language-specific retraining.
  • Modest-sized models can exceed the quality of much larger embedding models when trained on sufficiently diverse contrastive data.
  • A single trained embedder can replace multiple task-specific models in retrieval, semantic search, and clustering pipelines.
  • Performance gains come primarily from increasing the scale and variety of training data across the two stages rather than from model size alone.
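
The third point above reduces, at inference time, to nearest-neighbor search in one shared embedding space. The sketch below illustrates that pipeline with a toy hashed bag-of-words encoder standing in for a trained model such as GTE (the encoder, corpus, and queries here are invented for illustration):

```python
import re
import numpy as np

def toy_embed(text, dim=64):
    """Hashed bag-of-words stand-in for a trained encoder such as GTE.
    Returns an L2-normalized vector so dot products are cosine similarities."""
    vec = np.zeros(dim)
    for tok in re.findall(r"\w+", text.lower()):
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# one corpus, two "domains": source code indexed as plain text next to prose
corpus = [
    "def binary_search(arr, target): ...",           # code, treated as text
    "Binary search halves the interval each step.",  # natural-language doc
    "Paris is the capital of France.",
]
doc_vecs = np.stack([toy_embed(d) for d in corpus])

def retrieve(query):
    """Return the corpus entry with highest cosine similarity to the query."""
    sims = doc_vecs @ toy_embed(query)
    return corpus[int(np.argmax(sims))]
```

With a real embedder, the same `retrieve` loop would serve semantic search, clustering (by running k-means on `doc_vecs`), and code retrieval without swapping models.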

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the staged training pattern generalizes, future embedding work may shift focus from architecture search toward systematic data mixing and curriculum design.
  • The result raises the possibility that open models can close the gap with proprietary embedding services without requiring users to send data to external APIs.
  • Treating code as ordinary text inside the same contrastive framework suggests that cross-domain transfer between natural and formal languages may be easier than previously assumed.

Load-bearing premise

The particular combination of datasets and the two-stage contrastive procedure is sufficient to produce embeddings that transfer directly to new tasks without further per-task fine-tuning.

What would settle it

A controlled experiment in which a model of identical size is trained on the same total data volume but with single-stage contrastive learning instead of the multi-stage schedule, then evaluated on MTEB; if it matches or exceeds GTE_base, the necessity of the staged approach is called into question.

read the original abstract

We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the number of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.
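
Encoders in this family typically produce one vector per input by pooling per-token hidden states; masked mean pooling is the common choice, though the abstract above does not specify GTE's pooling. A minimal sketch under that assumption:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Masked mean pooling: average the hidden states of real tokens,
    ignoring padding, then L2-normalize for cosine-similarity use.

    hidden_states: (T, D) per-token vectors; attention_mask: (T,) with 1 for
    real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(float)   # (T, 1) broadcastable mask
    summed = (hidden_states * mask).sum(axis=0)    # sum over real tokens only
    emb = summed / mask.sum()                      # divide by real-token count
    return emb / np.linalg.norm(emb)

# toy sequence: 4 token slots, the last one is padding and must be ignored
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]])
m = np.array([1, 1, 1, 0])
emb = mean_pool(h, m)
```

Normalizing the pooled vector makes dot products between embeddings equal to cosine similarities, which is how MTEB-style retrieval and STS tasks consume them.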

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GTE, a family of text embedding models trained via multi-stage contrastive learning. Unsupervised pre-training on large web-scale corpora is followed by supervised fine-tuning on a diverse mixture of datasets; the resulting 110M-parameter GTE_base is reported to outperform OpenAI's text-embedding-ada-002 and multiple 1B+ parameter models on the Massive Text Embedding Benchmark (MTEB) while also delivering strong code-retrieval results by treating code as plain text without per-language fine-tuning.

Significance. If the reported gains are free of data contamination and the training mixture is fully documented, the work would demonstrate that carefully staged contrastive learning on heterogeneous data can produce compact, general-purpose embeddings competitive with much larger or proprietary systems. This would be a useful practical contribution for efficient open-source embedding models.

major comments (2)
  1. [§3.2] §3.2 (Supervised fine-tuning datasets): The manuscript states that supervised training uses 'a diverse mixture of datasets from multiple sources' but provides neither an exhaustive list of the datasets nor an explicit statement or appendix confirming zero overlap with MTEB test splits (or near-duplicates). Because MTEB aggregates tasks drawn from common supervised sources (e.g., STS, classification, retrieval corpora), any leakage would directly undermine the central claim that the model generalizes 'without additional fine-tuning' and outperforms larger models on truly unseen distributions.
  2. [§4.1, Table 2] §4.1 and Table 2 (MTEB results): The headline claim that GTE_base surpasses 10× larger models and OpenAI's API is presented as an average score; the paper does not report per-task breakdowns with confidence intervals or statistical significance tests against the strongest baselines. Without these, it is impossible to assess whether the reported gains are robust or driven by a few tasks where data overlap may exist.
minor comments (2)
  1. [Abstract, §1] The abstract and §1 use the subscript notation GTE$ _base $ inconsistently with the later text; standardize to GTE_base throughout.
  2. [§4.3] §4.3 (Code retrieval): The claim that the model outperforms prior code retrievers 'without additional fine-tuning on each programming language' would be strengthened by an explicit statement of the exact code corpora used in the supervised stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major concern point by point below, indicating where revisions will be made to enhance transparency and detail in the manuscript.

read point-by-point responses
  1. Referee: [§3.2] The manuscript states that supervised training uses 'a diverse mixture of datasets from multiple sources' but provides neither an exhaustive list of the datasets nor an explicit statement or appendix confirming zero overlap with MTEB test splits (or near-duplicates). Because MTEB aggregates tasks drawn from common supervised sources (e.g., STS, classification, retrieval corpora), any leakage would directly undermine the central claim that the model generalizes 'without additional fine-tuning' and outperforms larger models on truly unseen distributions.

    Authors: We agree that full documentation of the training mixture and explicit checks for overlap are essential to support the generalization claims. In the revised manuscript, we will add an appendix providing an exhaustive list of all supervised fine-tuning datasets, including their sources, sizes, and any preprocessing or filtering applied. We will also include a dedicated statement describing our procedures to avoid contamination with MTEB test splits, such as restricting to official training portions and applying deduplication steps. These additions will directly address the concern about potential data leakage. revision: yes

  2. Referee: [§4.1, Table 2] The headline claim that GTE_base surpasses 10× larger models and OpenAI's API is presented as an average score; the paper does not report per-task breakdowns with confidence intervals or statistical significance tests against the strongest baselines. Without these, it is impossible to assess whether the reported gains are robust or driven by a few tasks where data overlap may exist.

    Authors: We acknowledge that per-task breakdowns would allow readers to better evaluate the robustness of the reported average. In the revision, we will expand the results to include per-task scores for GTE_base and the primary baselines (including OpenAI's model and larger models) either in an extended version of Table 2 or a supplementary table. While confidence intervals and formal statistical significance tests are not standard practice for MTEB reporting due to computational demands and benchmark conventions, we will add discussion of performance consistency across task types (e.g., retrieval, classification, STS) to help assess whether gains are broadly distributed or concentrated in specific areas. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training pipeline with external benchmark validation

full rationale

The paper describes a standard multi-stage contrastive learning procedure (unsupervised pre-training followed by supervised fine-tuning) applied to a mixture of datasets, then reports empirical results on external benchmarks such as MTEB. No mathematical derivation chain, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations exist; performance claims rest on direct comparisons to independent models and APIs rather than reducing to the training inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical training of a neural network with standard contrastive loss; no new mathematical axioms or invented entities. Free parameters such as learning rates, batch sizes, data mixture weights, and stage-specific hyperparameters are implicit in the training process but not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5481 in / 1159 out tokens · 30370 ms · 2026-05-12T03:23:28.299970+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...

  2. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.

  3. DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

    cs.IR 2026-05 unverdicted novelty 7.0

    DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.

  4. Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild

    cs.HC 2026-05 unverdicted novelty 7.0

    Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited explorati...

  5. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  6. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

  7. Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

    cs.CL 2026-05 unverdicted novelty 7.0

    EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

  8. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  9. Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

    cs.CL 2026-04 unverdicted novelty 7.0

    Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.

  10. Interactive Episodic Memory with User Feedback

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.

  11. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  12. LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation

    cs.CV 2026-04 unverdicted novelty 7.0

    LookasideVLN improves aerial vision-and-language navigation by encoding directional cues from instructions into an egocentric graph and lightweight knowledge base, outperforming prior methods like CityNavAgent even wi...

  13. On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

    cs.IR 2026-04 unverdicted novelty 7.0

    LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulne...

  14. A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 7.0

    A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

  15. Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation

    cs.IR 2026-04 accept novelty 7.0

    Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.

  16. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    cs.CL 2024-12 unverdicted novelty 7.0

    GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.

  17. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  18. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  19. Adaptive Kernel Ridge Regression with Linear Structure: Sharp Oracle Inequalities and Minimax Optimality

    math.ST 2026-05 unverdicted novelty 6.0

    An augmented kernel ridge regression estimator separates linear and nonlinear components to achieve sharp oracle inequalities and minimax optimal prediction risk under general kernels.

  20. PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering

    cs.AI 2026-05 conditional novelty 6.0

    PathISE generates pseudo path-level supervision from answer labels alone via a transformer estimator, distills it to an LLM path generator, and achieves competitive or state-of-the-art KGQA performance on three benchm...

  21. PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality.

  22. CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

    cs.AI 2026-05 unverdicted novelty 6.0

    CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.

  23. AutoPPA: Automated Circuit PPA Optimization via Contrastive Code-based Rule Library Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoPPA learns generalizable PPA optimization rules automatically via contrastive abstraction from diverse code pairs and applies them through adaptive search, outperforming manual methods and prior tools SymRTLO and ...

  24. SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

  25. REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning

    cs.CL 2026-04 unverdicted novelty 6.0

    REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.

  26. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  27. Beyond Single Reports: Evaluating Automated ATT&CK Technique Extraction in Multi-Report Campaign Settings

    cs.SE 2026-04 unverdicted novelty 6.0

    Aggregating multiple CTI reports improves ATT&CK technique extraction F1 by about 26 percent over single-report baselines, with saturation after 5-15 reports and maximum F1 scores of 78.6 percent and 54.9 percent acro...

  28. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  29. ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

    cs.CL 2026-04 unverdicted novelty 6.0

    ChunQiuTR benchmark and CTD model improve time-keyed retrieval accuracy for Classical Chinese annals by combining semantic similarity with explicit calendrical temporal context.

  30. Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

    cs.IR 2026-04 unverdicted novelty 6.0

    Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.

  31. Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead

    cs.IR 2026-04 accept novelty 6.0

    Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and co...

  32. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  33. Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

    cs.SE 2026-05 unverdicted novelty 5.0

    Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.

  34. SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

    cs.CR 2026-04 unverdicted novelty 5.0

    SafeTune uses GNN-based structural anomaly detection and semantic prompt classification to filter poisoned data in LLM fine-tuning for RTL generation, enhancing robustness against hardware Trojan insertion without alt...

  35. A Gated Hybrid Contrastive Collaborative Filtering Recommendation

    cs.IR 2026-04 unverdicted novelty 5.0

    A gated hybrid contrastive collaborative filtering framework improves hit rate@10 and NDCG@10 on movie review datasets by layer-wise adaptive fusion of semantic and collaborative signals with contrastive objectives.

  36. SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

    cs.AI 2026-04 unverdicted novelty 5.0

    SAT reduces reasoning tokens by up to 40% across multiple large reasoning models and benchmarks by adaptively pruning steps based on difficulty while maintaining or improving accuracy.

  37. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  38. Granite Embedding Multilingual R2 Models

    cs.IR 2026-05 unverdicted novelty 4.0

    Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.

  39. Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data

    cs.CL 2026-04 conditional novelty 4.0

    Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting R...

  40. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

  41. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    cs.CL 2025-06 unverdicted novelty 4.0

    Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning,...

  42. A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 3.0

    MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 41 Pith papers · 8 internal anchors

  1. [1]

    Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2023. https://aclanthology.org/2023.findings-acl.225 Task-aware retrieval with instructions . In Findings of the Association for Computational Linguistics: ACL 2023, pages 3650--3675, Toronto, Canada. Association for Computational L...

  2. [2]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  3. [3]

    Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. https://openreview.net/forum?id=rkg-mA4FDr Pre-training tasks for embedding-based large-scale retrieval . In International Conference on Learning Representations

  4. [4]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  5. [5]

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. http://arxiv.org/abs/1604.06174 Training deep nets with sublinear memory cost

  6. [6]

    Hyunjin Choi, Judong Kim, Seongho Joe, and Youngjune Gwon. 2021. https://api.semanticscholar.org/CorpusID:231709235 Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks. 2020 25th International Conference on Pattern Recognition (ICPR), pages 5482--5487

  7. [10]

    Luyu Gao and Jamie Callan. 2021. https://api.semanticscholar.org/CorpusID:237581068 Condenser: a pre-training architecture for dense retrieval . In Conference on Empirical Methods in Natural Language Processing

  8. [14]

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. https://openreview.net/forum?id=jLoC4ez43PZ GraphCodeBERT: Pre-training code representations with data flow. ...

  9. [15]

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. https://proceedings.mlr.press/v119/guu20a.html Retrieval augmented language model pre-training . In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929--3938. PMLR

  10. [16]

    Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. https://openreview.net/forum?id=SkxgnnNFvH Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring . In International Conference on Learning Representations

  11. [18]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022a. https://openreview.net/forum?id=jKN1pXi7b0 Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research

  12. [19]

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022b. http://arxiv.org/abs/2208.03299 Few-shot learning with retrieval augmented language models

  13. [23]

    Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022. https://aclanthology.org/2022.emnlp-main.187 CodeRetriever: A large-scale contrastive pre-training method for code search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 289...

  14. [26]

    Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jianfeng Xu, Guanjun Jiang, Luxi Xing, and Ping Yang. 2022a. https://api.semanticscholar.org/CorpusID:247292113 Multi-CPR: A multi-domain Chinese dataset for passage retrieval. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

  15. [27]

    Dingkun Long, Yanzhao Zhang, Guangwei Xu, and Pengjun Xie. 2022b. https://api.semanticscholar.org/CorpusID:253157959 Retrieval oriented masking pre-training language model for dense passage retrieval. ArXiv, abs/2210.15133

  16. [28]

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. https://openreview.net/forum?id=r1gs9JgRZ Mixed precision training . In International Conference on Learning Representations

  17. [29]

    Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dornbach, Imed Zitouni, Enrique Alfonseca, and Zhe Dong. 2023. https://aclanthology.org/2023.findings-acl.761 SamToNe: Improving contrastive loss for dual encoder retrieval models with same tower negatives. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12028--12037, Toronto, Canada.

  18. [30]

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://aclanthology.org/2023.eacl-main.148 MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics.

  19. [33]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022b. https://aclanthology.org/2022.emnlp-main.669 Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844--9855, A...

  20. [35]

    OpenAI. 2023. https://api.semanticscholar.org/CorpusID:257532815 GPT-4 technical report. ArXiv, abs/2303.08774.

  21. [36]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. https://proceedings.mlr.press/v139/radford21a.html Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning.

  22. [37]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training

  23. [38]

    Thilina C. Rajapakse. 2023. https://api.semanticscholar.org/CorpusID:259949811 Dense passage retrieval: Architectures and augmentation methods. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.

  24. [39]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press.

  25. [40]

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. https://api.semanticscholar.org/CorpusID:256459451 In-context retrieval-augmented language models. ArXiv, abs/2302.00083.

  26. [43]

    Andrew Rosenberg and Julia Hirschberg. 2007. https://aclanthology.org/D07-1043 V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410--420, Prague, Czech Republic.

  27. [44]

    Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. https://api.semanticscholar.org/CorpusID:256389797 REPLUG: Retrieval-augmented black-box language models. ArXiv, abs/2301.12652.

  28. [45]

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2023. https://aclanthology.org/2023.findings-acl.71 One embedder, any task: Instruction-finetuned text embeddings. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1102--1121, Toronto, Canada.

  29. [46]

    Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021. http://arxiv.org/abs/2103.15316 Whitening sentence representations for better semantics and faster retrieval.

  30. [47]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/65b9eea6e1cc6bb9f0cd2a47751a186f-Paper-round2.pdf BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

  31. [48]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. https://api.semanticscholar.org/CorpusID:257219404 LLaMA: Open and efficient foundation language models. ArXiv.

  32. [50]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

  33. [51]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022a. https://api.semanticscholar.org/CorpusID:250311114 SimLM: Pre-training with representation bottleneck for dense passage retrieval. In Annual Meeting of the Association for Computational Linguistics.

  34. [53]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

  35. [54]

    Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. https://aclanthology.org/2020.lrec-1.494 CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003--4012, Marseille, France.

  36. [56]

    Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. https://api.semanticscholar.org/CorpusID:252917569 RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder. In Conference on Empirical Methods in Natural Language Processing.

  37. [58]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. https://openreview.net/forum?id=zeFrfgyZln Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.

  38. [61]

    Representation Learning with Contrastive Predictive Coding. 2018. arXiv:1807.03748.

  39. [62]

    Evaluating Large Language Models Trained on Code. 2021.

  40. [64]

    Retrieval Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning. 2020.

  41. [65]

    Pre-training Tasks for Embedding-based Large-scale Retrieval. In International Conference on Learning Representations.

  42. [69]

    Representation Degeneration Problem in Training Natural Language Generation Models. In International Conference on Learning Representations.

  43. [70]

    Training Deep Nets with Sublinear Memory Cost. 2016.

  44. [71]

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCode...

  45. [72]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. CoRR, abs/2212.04356. doi:10.48550/arXiv.2212.04356.

  46. [78]

    Yiqing Xie, Xiao Liu, and Chenyan Xiong. 2023. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. doi:10.1145/3539618.3592080.

  47. [79]

    Zehan Li, Yanzhao Zhang, Dingkun Long, and Pengjun Xie. 2023. CoRR, abs/2305.13197. doi:10.48550/arXiv.2305.13197.

  48. [81]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, et al. Language Models are Few-Shot Learners.

  49. [82]

    Hamel Husain et al. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv:1909.09436.

  50. [83]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533.

  51. [84]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, et al. arXiv preprint arXiv:2201.10005.

  52. [92]

    Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks. In 2020 25th International Conference on Pattern Recognition (ICPR).

  53. [101]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).