pith. machine review for the scientific record.

arxiv: 2308.03281 · v1 · submitted 2023-08-07 · 💻 cs.CL

Recognition: 3 theorem links

Towards General Text Embeddings with Multi-stage Contrastive Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords text embeddings · contrastive learning · multi-stage training · general text representations · MTEB benchmark · code retrieval · unified NLP models

The pith

A 110M-parameter text embedding model trained via multi-stage contrastive learning on mixed datasets outperforms OpenAI's black-box API and much larger models on the MTEB benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a single text embedding model called GTE by applying contrastive learning first in an unsupervised pre-training stage and then in a supervised fine-tuning stage, drawing from a large and varied collection of datasets. The central goal is to produce embeddings that work well across many different NLP tasks and even code retrieval without any extra per-task tuning. A sympathetic reader would care because this points to a route for building general-purpose, efficient embedding systems that do not require separate models or heavy customization for each application. The modest size of the base model makes the result practically relevant for deployment.

Core claim

GTE is a unified text embedding model trained with multi-stage contrastive learning over a diverse mixture of datasets. The base version, at 110M parameters, surpasses OpenAI's embedding API and exceeds the performance of embedding models more than ten times larger on the Massive Text Embedding Benchmark (MTEB), while also outperforming prior code retrievers of similar size when code is treated simply as text.

What carries the argument

Multi-stage contrastive learning over a mixture of unsupervised pre-training and supervised fine-tuning datasets, which unifies many NLP and code tasks into a single contrastive format.
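
The objective carrying both stages is, in essence, a contrastive loss with in-batch negatives. A minimal NumPy sketch of the standard InfoNCE form is below; the paper's exact loss variant, temperature, and negative-sampling details may differ, so treat this as an illustration of the mechanism rather than the authors' implementation:

```python
import numpy as np

def info_nce_loss(queries, positives, temperature=0.05):
    """In-batch-negative contrastive (InfoNCE) loss.

    `queries` and `positives` are (B, D) L2-normalized embeddings; row i of
    `positives` is the positive for query i, and the other B-1 rows act as
    negatives. Returns the mean negative log-likelihood of the true pairs.
    """
    sims = queries @ positives.T / temperature   # (B, B) scaled similarities
    sims -= sims.max(axis=1, keepdims=True)      # numerical stabilization
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))   # NLL of the diagonal (true pairs)

# toy batch: orthogonal one-hot "embeddings", so each pairing is unambiguous
q = np.eye(3, 4)
matched = info_nce_loss(q, q)               # correct pairings -> loss near 0
shuffled = info_nce_loss(q, q[[1, 2, 0]])   # wrong pairings -> large loss
```

The temperature scales how sharply the softmax concentrates on the hardest negatives; small values like 0.05 are common in this literature, but the paper's setting is not stated in the material above.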

If this is right

  • Embeddings can be made general enough to handle both natural language and programming language retrieval without language-specific retraining.
  • Modest-sized models can exceed the quality of much larger embedding models when trained on sufficiently diverse contrastive data.
  • A single trained embedder can replace multiple task-specific models in retrieval, semantic search, and clustering pipelines.
  • Performance gains come primarily from increasing the scale and variety of training data across the two stages rather than from model size alone.
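
The third point above reduces, at inference time, to nearest-neighbor search in one shared embedding space. The sketch below illustrates that pipeline with a toy hashed bag-of-words encoder standing in for a trained model such as GTE (the encoder, corpus, and queries here are invented for illustration):

```python
import re
import numpy as np

def toy_embed(text, dim=64):
    """Hashed bag-of-words stand-in for a trained encoder such as GTE.
    Returns an L2-normalized vector so dot products are cosine similarities."""
    vec = np.zeros(dim)
    for tok in re.findall(r"\w+", text.lower()):
        vec[hash(tok) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# one corpus, two "domains": source code indexed as plain text next to prose
corpus = [
    "def binary_search(arr, target): ...",           # code, treated as text
    "Binary search halves the interval each step.",  # natural-language doc
    "Paris is the capital of France.",
]
doc_vecs = np.stack([toy_embed(d) for d in corpus])

def retrieve(query):
    """Return the corpus entry with highest cosine similarity to the query."""
    sims = doc_vecs @ toy_embed(query)
    return corpus[int(np.argmax(sims))]
```

With a real embedder, the same `retrieve` loop would serve semantic search, clustering (by running k-means on `doc_vecs`), and code retrieval without swapping models.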

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the staged training pattern generalizes, future embedding work may shift focus from architecture search toward systematic data mixing and curriculum design.
  • The result raises the possibility that open models can close the gap with proprietary embedding services without requiring users to send data to external APIs.
  • Treating code as ordinary text inside the same contrastive framework suggests that cross-domain transfer between natural and formal languages may be easier than previously assumed.

Load-bearing premise

The particular combination of datasets and the two-stage contrastive procedure is sufficient to produce embeddings that transfer directly to new tasks without further per-task fine-tuning.

What would settle it

A controlled experiment in which a model of identical size is trained on the same total data volume but with single-stage contrastive learning instead of the multi-stage schedule, then evaluated on MTEB; if it matches or exceeds GTE_base, the necessity of the staged approach is called into question.

read the original abstract

We present GTE, a general-purpose text embedding model trained with multi-stage contrastive learning. In line with recent advancements in unifying various NLP tasks into a single format, we train a unified text embedding model by employing contrastive learning over a diverse mixture of datasets from multiple sources. By significantly increasing the number of training data during both unsupervised pre-training and supervised fine-tuning stages, we achieve substantial performance gains over existing embedding models. Notably, even with a relatively modest parameter count of 110M, GTE$_\text{base}$ outperforms the black-box embedding API provided by OpenAI and even surpasses 10x larger text embedding models on the massive text embedding benchmark. Furthermore, without additional fine-tuning on each programming language individually, our model outperforms previous best code retrievers of similar size by treating code as text. In summary, our model achieves impressive results by effectively harnessing multi-stage contrastive learning, offering a powerful and efficient text embedding model with broad applicability across various NLP and code-related tasks.
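
Encoders in this family typically produce one vector per input by pooling per-token hidden states; masked mean pooling is the common choice, though the abstract above does not specify GTE's pooling. A minimal sketch under that assumption:

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Masked mean pooling: average the hidden states of real tokens,
    ignoring padding, then L2-normalize for cosine-similarity use.

    hidden_states: (T, D) per-token vectors; attention_mask: (T,) with 1 for
    real tokens and 0 for padding.
    """
    mask = attention_mask[:, None].astype(float)   # (T, 1) broadcastable mask
    summed = (hidden_states * mask).sum(axis=0)    # sum over real tokens only
    emb = summed / mask.sum()                      # divide by real-token count
    return emb / np.linalg.norm(emb)

# toy sequence: 4 token slots, the last one is padding and must be ignored
h = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]])
m = np.array([1, 1, 1, 0])
emb = mean_pool(h, m)
```

Normalizing the pooled vector makes dot products between embeddings equal to cosine similarities, which is how MTEB-style retrieval and STS tasks consume them.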

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GTE, a family of text embedding models trained via multi-stage contrastive learning. Unsupervised pre-training on large web-scale corpora is followed by supervised fine-tuning on a diverse mixture of datasets; the resulting 110M-parameter GTE_base is reported to outperform OpenAI's text-embedding-ada-002 and multiple 1B+ parameter models on the Massive Text Embedding Benchmark (MTEB) while also delivering strong code-retrieval results by treating code as plain text without per-language fine-tuning.

Significance. If the reported gains are free of data contamination and the training mixture is fully documented, the work would demonstrate that carefully staged contrastive learning on heterogeneous data can produce compact, general-purpose embeddings competitive with much larger or proprietary systems. This would be a useful practical contribution for efficient open-source embedding models.

major comments (2)
  1. [§3.2] §3.2 (Supervised fine-tuning datasets): The manuscript states that supervised training uses 'a diverse mixture of datasets from multiple sources' but provides neither an exhaustive list of the datasets nor an explicit statement or appendix confirming zero overlap with MTEB test splits (or near-duplicates). Because MTEB aggregates tasks drawn from common supervised sources (e.g., STS, classification, retrieval corpora), any leakage would directly undermine the central claim that the model generalizes 'without additional fine-tuning' and outperforms larger models on truly unseen distributions.
  2. [§4.1, Table 2] §4.1 and Table 2 (MTEB results): The headline claim that GTE_base surpasses 10× larger models and OpenAI's API is presented as an average score; the paper does not report per-task breakdowns with confidence intervals or statistical significance tests against the strongest baselines. Without these, it is impossible to assess whether the reported gains are robust or driven by a few tasks where data overlap may exist.
minor comments (2)
  1. [Abstract, §1] The abstract and §1 use the subscript notation GTE$ _base $ inconsistently with the later text; standardize to GTE_base throughout.
  2. [§4.3] §4.3 (Code retrieval): The claim that the model outperforms prior code retrievers 'without additional fine-tuning on each programming language' would be strengthened by an explicit statement of the exact code corpora used in the supervised stage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major concern point by point below, indicating where revisions will be made to enhance transparency and detail in the manuscript.

read point-by-point responses
  1. Referee: [§3.2] The manuscript states that supervised training uses 'a diverse mixture of datasets from multiple sources' but provides neither an exhaustive list of the datasets nor an explicit statement or appendix confirming zero overlap with MTEB test splits (or near-duplicates). Because MTEB aggregates tasks drawn from common supervised sources (e.g., STS, classification, retrieval corpora), any leakage would directly undermine the central claim that the model generalizes 'without additional fine-tuning' and outperforms larger models on truly unseen distributions.

    Authors: We agree that full documentation of the training mixture and explicit checks for overlap are essential to support the generalization claims. In the revised manuscript, we will add an appendix providing an exhaustive list of all supervised fine-tuning datasets, including their sources, sizes, and any preprocessing or filtering applied. We will also include a dedicated statement describing our procedures to avoid contamination with MTEB test splits, such as restricting to official training portions and applying deduplication steps. These additions will directly address the concern about potential data leakage. revision: yes

  2. Referee: [§4.1, Table 2] The headline claim that GTE_base surpasses 10× larger models and OpenAI's API is presented as an average score; the paper does not report per-task breakdowns with confidence intervals or statistical significance tests against the strongest baselines. Without these, it is impossible to assess whether the reported gains are robust or driven by a few tasks where data overlap may exist.

    Authors: We acknowledge that per-task breakdowns would allow readers to better evaluate the robustness of the reported average. In the revision, we will expand the results to include per-task scores for GTE_base and the primary baselines (including OpenAI's model and larger models) either in an extended version of Table 2 or a supplementary table. While confidence intervals and formal statistical significance tests are not standard practice for MTEB reporting due to computational demands and benchmark conventions, we will add discussion of performance consistency across task types (e.g., retrieval, classification, STS) to help assess whether gains are broadly distributed or concentrated in specific areas. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical training pipeline with external benchmark validation

full rationale

The paper describes a standard multi-stage contrastive learning procedure (unsupervised pre-training followed by supervised fine-tuning) applied to a mixture of datasets, then reports empirical results on external benchmarks such as MTEB. No mathematical derivation chain, self-definitional equations, fitted parameters renamed as predictions, or load-bearing self-citations exist; performance claims rest on direct comparisons to independent models and APIs rather than reducing to the training inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical training of a neural network with standard contrastive loss; no new mathematical axioms or invented entities. Free parameters such as learning rates, batch sizes, data mixture weights, and stage-specific hyperparameters are implicit in the training process but not enumerated in the abstract.

pith-pipeline@v0.9.0 · 5481 in / 1159 out tokens · 30370 ms · 2026-05-12T03:23:28.299970+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 42 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...

  2. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.

  3. DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

    cs.IR 2026-05 unverdicted novelty 7.0

    DiffRetriever generates multiple representative tokens in parallel using diffusion language models, yielding consistent retrieval gains over single-token baselines and autoregressive multi-token variants on BEIR benchmarks.

  4. Priming, Path-dependence, and Plasticity: Understanding the molding of user-LLM interaction and its implications from (many) chat logs in the wild

    cs.HC 2026-05 unverdicted novelty 7.0

    Large-scale analysis of wild LLM chat logs finds that user interaction patterns stabilize quickly after initial use and correlate with long-term outcomes like retention, creating an agency paradox of limited explorati...

  5. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  6. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

  7. Embedding-based In-Context Prompt Training for Enhancing LLMs as Text Encoders

    cs.CL 2026-05 unverdicted novelty 7.0

    EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.

  8. Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory

    cs.CL 2026-05 unverdicted novelty 7.0

    MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

  9. Why Mean Pooling Works: Quantifying Second-Order Collapse in Text Embeddings

    cs.CL 2026-04 unverdicted novelty 7.0

    Modern text encoders resist second-order collapse under mean pooling because token embeddings concentrate tightly within texts, and this resistance correlates with stronger downstream performance.

  10. Interactive Episodic Memory with User Feedback

    cs.CV 2026-04 unverdicted novelty 7.0

    Introduces an interactive episodic memory task with user feedback and a Feedback Alignment Module that improves retrieval accuracy on video benchmarks while remaining efficient.

  11. ATIR: Towards Audio-Text Interleaved Contextual Retrieval

    cs.SD 2026-04 unverdicted novelty 7.0

    Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.

  12. LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation

    cs.CV 2026-04 unverdicted novelty 7.0

    LookasideVLN improves aerial vision-and-language navigation by encoding directional cues from instructions into an egocentric graph and lightweight knowledge base, outperforming prior methods like CityNavAgent even wi...

  13. On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

    cs.IR 2026-04 unverdicted novelty 7.0

    LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulne...

  14. A Unified Model and Document Representation for On-Device Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 7.0

    A single model unifies retrieval and context compression for on-device RAG via shared representations, matching traditional RAG performance at 1/10 context size with no extra storage.

  15. Tencent Advertising Algorithm Challenge 2025: All-Modality Generative Recommendation

    cs.IR 2026-04 accept novelty 7.0

    Releases TencentGR-1M and TencentGR-10M datasets with baselines for all-modality generative recommendation in advertising, including weighted evaluation for conversions.

  16. GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

    cs.CL 2024-12 unverdicted novelty 7.0

    GME achieves state-of-the-art results in universal multimodal retrieval by training on a balanced synthetic multimodal dataset.

  17. C-Pack: Packed Resources For General Chinese Embeddings

    cs.CL 2023-09 accept novelty 7.0

    C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

  18. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  19. Adaptive Kernel Ridge Regression with Linear Structure: Sharp Oracle Inequalities and Minimax Optimality

    math.ST 2026-05 unverdicted novelty 6.0

    An augmented kernel ridge regression estimator separates linear and nonlinear components to achieve sharp oracle inequalities and minimax optimal prediction risk under general kernels.

  20. PathISE: Learning Informative Path Supervision for Knowledge Graph Question Answering

    cs.AI 2026-05 conditional novelty 6.0

    PathISE generates pseudo path-level supervision from answer labels alone via a transformer estimator, distills it to an LLM path generator, and achieves competitive or state-of-the-art KGQA performance on three benchm...

  21. PASA: A Principled Embedding-Space Watermarking Approach for LLM-Generated Text under Semantic-Invariant Attacks

    cs.CR 2026-05 unverdicted novelty 6.0

    PASA is a semantic-level watermarking method for LLM text that uses embedding-space clusters and synchronized randomness to remain detectable after paraphrasing while preserving text quality.

  22. CASCADE: Case-Based Continual Adaptation for Large Language Models During Deployment

    cs.AI 2026-05 unverdicted novelty 6.0

    CASCADE enables LLMs to continually adapt at deployment via case-based episodic memory and contextual bandits, improving macro-averaged success by 20.9% over zero-shot on 16 tasks spanning medicine, law, code, and robotics.

  23. AutoPPA: Automated Circuit PPA Optimization via Contrastive Code-based Rule Library Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    AutoPPA learns generalizable PPA optimization rules automatically via contrastive abstraction from diverse code pairs and applies them through adaptive search, outperforming manual methods and prior tools SymRTLO and ...

  24. SPENCE: A Syntactic Probe for Detecting Contamination in NL2SQL Benchmarks

    cs.CL 2026-04 unverdicted novelty 6.0

    SPENCE shows older NL2SQL benchmarks like Spider have high performance sensitivity to syntactic changes, indicating likely training contamination, while newer ones like BIRD show little sensitivity and appear largely clean.

  25. REZE: Representation Regularization for Domain-adaptive Text Embedding Pre-finetuning

    cs.CL 2026-04 unverdicted novelty 6.0

    REZE controls representation shifts in contrastive pre-finetuning of text embeddings via eigenspace decomposition of anchor-positive pairs and adaptive soft-shrinkage on task-variant directions.

  26. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  27. Beyond Single Reports: Evaluating Automated ATT&CK Technique Extraction in Multi-Report Campaign Settings

    cs.SE 2026-04 unverdicted novelty 6.0

    Aggregating multiple CTI reports improves ATT&CK technique extraction F1 by about 26 percent over single-report baselines, with saturation after 5-15 reports and maximum F1 scores of 78.6 percent and 54.9 percent acro...

  28. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  29. ChunQiuTR: Time-Keyed Temporal Retrieval in Classical Chinese Annals

    cs.CL 2026-04 unverdicted novelty 6.0

    ChunQiuTR benchmark and CTD model improve time-keyed retrieval accuracy for Classical Chinese annals by combining semantic similarity with explicit calendrical temporal context.

  30. Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

    cs.IR 2026-04 unverdicted novelty 6.0

    Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.

  31. Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead

    cs.IR 2026-04 accept novelty 6.0

    Empirical comparison across 14 retrievers on the BRIGHT benchmark shows reasoning-specialized models can match strong accuracy with competitive speed while many large LLM bi-encoders add latency for small gains and co...

  32. NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    cs.CL 2024-05 accept novelty 6.0

    NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

  33. Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks

    cs.SE 2026-05 unverdicted novelty 5.0

    Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.

  34. SafeTune: Mitigating Data Poisoning in LLM Fine-Tuning for RTL Code Generation

    cs.CR 2026-04 unverdicted novelty 5.0

    SafeTune uses GNN-based structural anomaly detection and semantic prompt classification to filter poisoned data in LLM fine-tuning for RTL generation, enhancing robustness against hardware Trojan insertion without alt...

  35. A Gated Hybrid Contrastive Collaborative Filtering Recommendation

    cs.IR 2026-04 unverdicted novelty 5.0

    A gated hybrid contrastive collaborative filtering framework improves hit rate@10 and NDCG@10 on movie review datasets by layer-wise adaptive fusion of semantic and collaborative signals with contrastive objectives.

  36. SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking

    cs.AI 2026-04 unverdicted novelty 5.0

    SAT reduces reasoning tokens by up to 40% across multiple large reasoning models and benchmarks by adaptively pruning steps based on difficulty while maintaining or improving accuracy.

  37. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  38. Granite Embedding Multilingual R2 Models

    cs.IR 2026-05 unverdicted novelty 4.0

    Granite Embedding Multilingual R2 releases 311M and 97M parameter bi-encoder models that achieve state-of-the-art retrieval performance on multilingual text, code, long-document, and reasoning datasets.

  39. Mira-Embeddings-V1: Domain-Adapted Semantic Reranking for Recruitment via LLM-Synthesized Data

    cs.CL 2026-04 conditional novelty 4.0

    Mira-Embeddings-V1 adapts embeddings for recruitment reranking by synthesizing positive and hard-negative samples with LLMs, then applies JD-JD contrastive and JD-CV triplet training plus a BoundaryHead MLP, lifting R...

  40. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

  41. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    cs.CL 2025-06 unverdicted novelty 4.0

    Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning,...

  42. A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 3.0

    MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

Reference graph

Works this paper leans on

106 extracted references · 106 canonical work pages · cited by 41 Pith papers · 8 internal anchors

  1. [1]

    Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Hannaneh Hajishirzi, and Wen-tau Yih. 2023. https://aclanthology.org/2023.findings-acl.225 Task-aware retrieval with instructions . In Findings of the Association for Computational Linguistics: ACL 2023, pages 3650--3675, Toronto, Canada. Association for Computational L...

  2. [2]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gr...

  3. [3]

    Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. 2020. https://openreview.net/forum?id=rkg-mA4FDr Pre-training tasks for embedding-based large-scale retrieval . In International Conference on Learning Representations

  4. [4]

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian...

  5. [5]

    Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. http://arxiv.org/abs/1604.06174 Training deep nets with sublinear memory cost

  6. [6]

    Hyunjin Choi, Judong Kim, Seongho Joe, and Youngjune Gwon. 2021. https://api.semanticscholar.org/CorpusID:231709235 Evaluation of BERT and ALBERT sentence embedding performance on downstream NLP tasks. 2020 25th International Conference on Pattern Recognition (ICPR), pages 5482--5487

  7. [10]

    Luyu Gao and Jamie Callan. 2021. https://api.semanticscholar.org/CorpusID:237581068 Condenser: a pre-training architecture for dense retrieval . In Conference on Empirical Methods in Natural Language Processing

  8. [14]

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. https://openreview.net/forum?id=jLoC4ez43PZ GraphCodeBERT: Pre-training code representations with data flow. ...

  9. [15]

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. https://proceedings.mlr.press/v119/guu20a.html Retrieval augmented language model pre-training . In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929--3938. PMLR

  10. [16]

    Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. https://openreview.net/forum?id=SkxgnnNFvH Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring . In International Conference on Learning Representations

  11. [18]

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. 2022a. https://openreview.net/forum?id=jKN1pXi7b0 Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research

  12. [19]

    Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. 2022b. http://arxiv.org/abs/2208.03299 Few-shot learning with retrieval augmented language models

  13. [23]

    Xiaonan Li, Yeyun Gong, Yelong Shen, Xipeng Qiu, Hang Zhang, Bolun Yao, Weizhen Qi, Daxin Jiang, Weizhu Chen, and Nan Duan. 2022. https://aclanthology.org/2022.emnlp-main.187 CodeRetriever: A large-scale contrastive pre-training method for code search. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 289...

  14. [26]

    Dingkun Long, Qiong Gao, Kuan Zou, Guangwei Xu, Pengjun Xie, Ruijie Guo, Jianfeng Xu, Guanjun Jiang, Luxi Xing, and Ping Yang. 2022a. https://api.semanticscholar.org/CorpusID:247292113 Multi-CPR: A multi-domain Chinese dataset for passage retrieval. Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

  15. [27]

    Dingkun Long, Yanzhao Zhang, Guangwei Xu, and Pengjun Xie. 2022b. https://api.semanticscholar.org/CorpusID:253157959 Retrieval oriented masking pre-training language model for dense passage retrieval. ArXiv, abs/2210.15133

  16. [28]

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. 2018. https://openreview.net/forum?id=r1gs9JgRZ Mixed precision training . In International Conference on Learning Representations

  17. [29]

    Fedor Moiseev, Gustavo Hernandez Abrego, Peter Dornbach, Imed Zitouni, Enrique Alfonseca, and Zhe Dong. 2023. https://aclanthology.org/2023.findings-acl.761 SamToNe: Improving contrastive loss for dual encoder retrieval models with same tower negatives. In Findings of the Association for Computational Linguistics: ACL 2023, pages 12028--12037, Toronto, Canada.

  18. [30]

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://aclanthology.org/2023.eacl-main.148 MTEB: Massive text embedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics.

  19. [33]

    Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, and Yinfei Yang. 2022b. https://aclanthology.org/2022.emnlp-main.669 Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 9844--9855, A...

  20. [35]

    OpenAI. 2023. https://api.semanticscholar.org/CorpusID:257532815 GPT-4 technical report. ArXiv, abs/2303.08774.

  21. [36]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. https://proceedings.mlr.press/v139/radford21a.html Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning.

  22. [37]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf Improving language understanding by generative pre-training

  23. [38]

    Thilina C. Rajapakse. 2023. https://api.semanticscholar.org/CorpusID:259949811 Dense passage retrieval: Architectures and augmentation methods. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval.

  24. [39]

    Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '20. IEEE Press.

  25. [40]

    Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. 2023. https://api.semanticscholar.org/CorpusID:256459451 In-context retrieval-augmented language models. ArXiv, abs/2302.00083.

  26. [43]

    Andrew Rosenberg and Julia Hirschberg. 2007. https://aclanthology.org/D07-1043 V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410--420, Prague, Czech Republic.

  27. [44]

    Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Rich James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. 2023. https://api.semanticscholar.org/CorpusID:256389797 REPLUG: Retrieval-augmented black-box language models. ArXiv, abs/2301.12652.

  28. [45]

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2023. https://aclanthology.org/2023.findings-acl.71 One embedder, any task: Instruction-finetuned text embeddings. In Findings of the Association for Computational Linguistics: ACL 2023, pages 1102--1121, Toronto, Canada.

  29. [46]

    Jianlin Su, Jiarun Cao, Weijie Liu, and Yangyiwen Ou. 2021. http://arxiv.org/abs/2103.15316 Whitening sentence representations for better semantics and faster retrieval.

  30. [47]

    Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. 2021. https://datasets-benchmarks-proceedings.neurips.cc/paper_files/paper/2021/file/65b9eea6e1cc6bb9f0cd2a47751a186f-Paper-round2.pdf BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks.

  31. [48]

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. https://api.semanticscholar.org/CorpusID:257219404 LLaMA: Open and efficient foundation language models. ArXiv.

  32. [50]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.

  33. [51]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2022a. https://api.semanticscholar.org/CorpusID:250311114 SimLM: Pre-training with representation bottleneck for dense passage retrieval. In Annual Meeting of the Association for Computational Linguistics.

  34. [53]

    Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. 2020. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS'20, Red Hook, NY, USA. Curran Associates Inc.

  35. [54]

    Guillaume Wenzek, Marie-Anne Lachaux, Alexis Conneau, Vishrav Chaudhary, Francisco Guzmán, Armand Joulin, and Edouard Grave. 2020. https://aclanthology.org/2020.lrec-1.494 CCNet: Extracting high quality monolingual datasets from web crawl data. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4003--4012, Marseille, France.

  36. [56]

    Shitao Xiao, Zheng Liu, Yingxia Shao, and Zhao Cao. 2022. https://api.semanticscholar.org/CorpusID:252917569 RetroMAE: Pre-training retrieval-oriented language models via masked auto-encoder. In Conference on Empirical Methods in Natural Language Processing.

  37. [58]

    Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul N. Bennett, Junaid Ahmed, and Arnold Overwijk. 2021. https://openreview.net/forum?id=zeFrfgyZln Approximate nearest neighbor negative contrastive learning for dense text retrieval. In International Conference on Learning Representations.

  38. [61]

    Representation Learning with Contrastive Predictive Coding. 2018. arXiv:1807.03748.

  39. [62]

    Evaluating Large Language Models Trained on Code. 2021.

  40. [64]

    Retrieval Augmented Language Model Pre-Training. In Proceedings of the 37th International Conference on Machine Learning. 2020.

  41. [65]

    Pre-training Tasks for Embedding-based Large-scale Retrieval. In International Conference on Learning Representations.

  42. [69]

    Representation Degeneration Problem in Training Natural Language Generation Models. In International Conference on Learning Representations.

  43. [70]

    Training Deep Nets with Sublinear Memory Cost. 2016.

  44. [71]

    Daya Guo, Shuo Ren, Shuai Lu, Zhangyin Feng, Duyu Tang, Shujie Liu, Long Zhou, Nan Duan, Alexey Svyatkovskiy, Shengyu Fu, Michele Tufano, Shao Kun Deng, Colin Clement, Dawn Drain, Neel Sundaresan, Jian Yin, Daxin Jiang, and Ming Zhou. 2021. GraphCode...

  45. [72]

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. CoRR, abs/2212.04356. doi:10.48550/arXiv.2212.04356.

  46. [78]

    Yiqing Xie, Xiao Liu, and Chenyan Xiong. 2023. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. doi:10.1145/3539618.3592080.

  47. [79]

    Zehan Li, Yanzhao Zhang, Dingkun Long, and Pengjun Xie. 2023. CoRR, abs/2305.13197. doi:10.48550/arXiv.2305.13197.

  48. [81]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D. Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, et al. Language Models are Few-Shot Learners.

  49. [82]

    Hamel Husain et al. 2019. CodeSearchNet Challenge: Evaluating the State of Semantic Code Search. arXiv:1909.09436.

  50. [83]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training. arXiv preprint arXiv:2212.03533.

  51. [84]

    Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, et al. arXiv preprint arXiv:2201.10005.

  52. [92]

    Evaluation of BERT and ALBERT Sentence Embedding Performance on Downstream NLP Tasks. In 2020 25th International Conference on Pattern Recognition (ICPR).

  53. [101]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers).