hub Mixed citations

MTEB: Massive Text Embedding Benchmark

2022 · cs.CL · arXiv 2210.07316

Mixed citation behavior. Most common role is background (67%).

41 Pith papers citing it

Background 67% of classified citations


abstract

Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 58 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://github.com/embeddings-benchmark/mteb.


citation-role summary

background 8 · dataset 4
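The 67% background share reported above is consistent with these role counts if the percentage is taken over the 12 classified citations rather than all 41 citing papers. A minimal sketch of that arithmetic, assuming simple rounding to a whole percent (the helper name `role_shares` is hypothetical and not part of the page):

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 8, "dataset": 4}


def role_shares(counts):
    """Return each role's share of classified citations as a whole-number percentage.

    Assumption: shares are computed over classified citations only and
    rounded to the nearest integer.
    """
    total = sum(counts.values())
    return {role: round(100 * n / total) for role, n in counts.items()}


shares = role_shares(role_counts)
print(shares)  # {'background': 67, 'dataset': 33}
```

Under this reading, the remaining 29 of the 41 citing papers carry no classified role.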

citation-polarity summary

background 8 · use dataset 4

representative citing papers

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.

AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.

Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models

cs.IR · 2026-05-02 · unverdicted · novelty 7.0

CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture

MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

cs.IR · 2026-04-25 · unverdicted · novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack

cs.CR · 2025-12-18 · unverdicted · novelty 7.0

DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.

Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

cs.CL · 2025-11-11 · unverdicted · novelty 7.0

UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.

Representational Alignment Across Model Layers and Brain Regions with Multi-Level Optimal Transport

cs.LG · 2025-10-02 · accept · novelty 7.0

Multi-Level Optimal Transport (MOT) jointly infers soft layer couplings and neuron transport plans to produce global alignment scores and structured hierarchical correspondences between networks of varying depths.

C-Pack: Packed Resources For General Chinese Embeddings

cs.CL · 2023-09-14 · accept · novelty 7.0

C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.

When More Cores Hurts: The Vector Database Scaling Paradox in HPC

cs.DC · 2026-06-08 · unverdicted · novelty 6.0

Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

cs.CL · 2026-05-21 · accept · novelty 6.0

Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.

Sliced Inner Product Gromov-Wasserstein Distances

stat.ML · 2026-05-08 · unverdicted · novelty 6.0

A sliced IGW distance is introduced with closed-form 1D expressions, rotational invariance, and studied structural and computational properties for efficient data alignment.

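Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

cs.CL · 2026-05-01 · unverdicted · novelty 6.0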

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

JUÁ – A Benchmark for Information Retrieval in Brazilian Legal Text Collections

cs.IR · 2026-04-07 · accept · novelty 6.0

JUÁ is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.

Semantic Data Processing with Holistic Data Understanding

cs.DB · 2026-04-03 · unverdicted · novelty 6.0

HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 datasets.

Mitigating Membership Inference in Intermediate Representations with Differentially Private Training

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

LM-DP-SGD estimates layer-specific MIA risks from shadow models and reweights gradients to give stronger protection to vulnerable layers, improving the privacy-utility trade-off over uniform DP-SGD.

SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass

cs.CL · 2026-02-06 · unverdicted · novelty 6.0

SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.

Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed

cs.CL · 2025-12-16 · unverdicted · novelty 6.0

Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.

LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering

cs.CL · 2025-11-19 · unverdicted · novelty 6.0

LLM-MemCluster gives LLMs stateful memory and prompts that let them decide cluster count and iteratively refine groupings, outperforming baselines on benchmarks in a tuning-free end-to-end setup.

Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network

cs.CL · 2025-10-02 · unverdicted · novelty 6.0

Introduces FraudSquad, a hybrid model using language model embeddings and a gated graph transformer that outperforms baselines on newly created LLM-generated spam review datasets.

EmbeddingGemma: Powerful and Lightweight Text Representations

cs.CL · 2025-09-24 · unverdicted · novelty 6.0

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

cs.CL · 2024-05-27 · accept · novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

StarCoder 2 and The Stack v2: The Next Generation

cs.SE · 2024-02-29 · accept · novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · novelty 6.0

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

citing papers explorer

Showing 41 of 41 citing papers.

Chronicle: A Multimodal Foundation Model for Joint Language and Time Series Understanding cs.LG · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
Chronicle is the first model jointly pretrained from scratch on text and time series in a unified transformer that matches a comparable language model on NLU tasks and sets new bars for time series classification and multimodal forecasting.
AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions cs.CL · 2026-05-13 · unverdicted · none · ref 28 · internal anchor
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models cs.IR · 2026-05-02 · unverdicted · none · ref 33 · internal anchor
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models cs.IR · 2026-04-25 · unverdicted · none · ref 25 · internal anchor
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval cs.CV · 2026-04-18 · unverdicted · none · ref 29 · internal anchor
mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack cs.CR · 2025-12-18 · unverdicted · none · ref 41 · internal anchor
DualGuard uses adaptive dual-stream watermark signals to detect and trace both paraphrase and spoofing attacks in LLM outputs while preserving text quality.
Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker cs.CL · 2025-11-11 · unverdicted · none · ref 30 · internal anchor
UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.
Representational Alignment Across Model Layers and Brain Regions with Multi-Level Optimal Transport cs.LG · 2025-10-02 · accept · none · ref 10 · internal anchor
Multi-Level Optimal Transport (MOT) jointly infers soft layer couplings and neuron transport plans to produce global alignment scores and structured hierarchical correspondences between networks of varying depths.
C-Pack: Packed Resources For General Chinese Embeddings cs.CL · 2023-09-14 · accept · none · ref 38 · internal anchor
C-Pack releases a new Chinese embedding benchmark, large training dataset, and optimized models that outperform priors by up to 10% on C-MTEB while also delivering English SOTA results.
When More Cores Hurts: The Vector Database Scaling Paradox in HPC cs.DC · 2026-06-08 · unverdicted · none · ref 110 · internal anchor
Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.
One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation cs.CL · 2026-05-21 · accept · none · ref 23 · internal anchor
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
Sliced Inner Product Gromov-Wasserstein Distances stat.ML · 2026-05-08 · unverdicted · none · ref 48 · internal anchor
A sliced IGW distance is introduced with closed-form 1D expressions, rotational invariance, and studied structural and computational properties for efficient data alignment.
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus cs.CL · 2026-05-01 · unverdicted · none · ref 47 · internal anchor
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
JUÁ – A Benchmark for Information Retrieval in Brazilian Legal Text Collections cs.IR · 2026-04-07 · accept · none · ref 16 · internal anchor
JUÁ is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
Semantic Data Processing with Holistic Data Understanding cs.DB · 2026-04-03 · unverdicted · none · ref 45 · internal anchor
HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 datasets.
Mitigating Membership Inference in Intermediate Representations with Differentially Private Training cs.LG · 2026-02-26 · unverdicted · none · ref 8 · internal anchor
LM-DP-SGD estimates layer-specific MIA risks from shadow models and reweights gradients to give stronger protection to vulnerable layers, improving the privacy-utility trade-off over uniform DP-SGD.
SHINE: A Scalable In-Context Hypernetwork for Mapping Context to LoRA in a Single Pass cs.CL · 2026-02-06 · unverdicted · none · ref 15 · internal anchor
SHINE trains a scalable in-context hypernetwork to generate high-quality LoRA adapters from contexts in one pass, enabling efficient LLM adaptation that saves time and compute compared to standard fine-tuning.
Efficient-DLM: From Autoregressive to Diffusion Language Models, and Beyond in Speed cs.CL · 2025-12-16 · unverdicted · none · ref 23 · internal anchor
Efficient-DLM converts AR models to dLMs via block-wise causal attention and position-dependent masking, yielding higher accuracy and 2.7-4.5x throughput than Dream 7B and Qwen3 4B.
LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering cs.CL · 2025-11-19 · unverdicted · none · ref 10 · internal anchor
LLM-MemCluster gives LLMs stateful memory and prompts that let them decide cluster count and iteratively refine groupings, outperforming baselines on benchmarks in a tuning-free end-to-end setup.
Detecting LLM-Generated Spam Reviews by Integrating Language Model Embeddings and Graph Neural Network cs.CL · 2025-10-02 · unverdicted · none · ref 30 · internal anchor
Introduces FraudSquad, a hybrid model using language model embeddings and a gated graph transformer that outperforms baselines on newly created LLM-generated spam review datasets.
EmbeddingGemma: Powerful and Lightweight Text Representations cs.CL · 2025-09-24 · unverdicted · none · ref 15 · internal anchor
A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models cs.CL · 2024-05-27 · accept · none · ref 115 · internal anchor
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
StarCoder 2 and The Stack v2: The Next Generation cs.SE · 2024-02-29 · accept · none · ref 238 · internal anchor
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
Scaling Data-Constrained Language Models cs.CL · 2023-05-25 · conditional · none · ref 79 · internal anchor
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
REPLUG: Retrieval-Augmented Black-Box Language Models cs.CL · 2023-01-30 · conditional · none · ref 64 · internal anchor
REPLUG improves frozen black-box LMs by prepending LM-supervised retrieved documents, delivering 6.3% better language modeling on GPT-3 and 5.1% better five-shot MMLU on Codex.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 286 · internal anchor
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
CALMem : Application-Layer Dual Memory for Conversational AI cs.IR · 2026-05-20 · unverdicted · none · ref 6 · internal anchor
CALMem delivers virtually unbounded effective context for LLM conversations via an application-layer dual memory architecture with intra-session retrieval and token-adaptive injection.
Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings cs.CL · 2026-05-11 · unverdicted · none · ref 61 · internal anchor
Embeddings reliably capture authorial stylistic features in French literary texts, and these signals persist after LLM rewriting while showing model-specific patterns.
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study cs.SE · 2026-05-06 · conditional · none · ref 15 · internal anchor
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts cs.SE · 2026-04-20 · conditional · none · ref 31 · internal anchor
STAF applies sentence embeddings from transformers to classify SCA findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.
Text-as-Signal: Quantitative Semantic Scoring with Embeddings, Logprobs, and Noise Reduction cs.CL · 2026-03-17 · unverdicted · none · ref 4 · internal anchor
A configurable pipeline turns text corpora into quantitative semantic signals via embeddings, logprobs, and UMAP-based noise reduction for document positioning and corpus profiling.
Robustness Risk of Conversational Retrieval: Identifying and Mitigating Noise Sensitivity in Qwen3-Embedding Model cs.IR · 2026-02-03 · unverdicted · none · ref 5 · internal anchor
Qwen3-embedding models show noise sensitivity in conversational retrieval where dialogue artifacts rank highly despite lacking semantic value, a problem reduced by query prompting and more severe than in prior Qwen versions or other baselines.
GovScape: A Public Multimodal Search System for 70 Million Pages of Government PDFs cs.IR · 2025-11-14 · accept · none · ref 37 · internal anchor
GovScape delivers multimodal search over 10 million government PDFs using metadata, exact text, semantic embeddings, and visual page features at an estimated $1,500 preprocessing cost.
Search-R3: Unifying Reasoning and Embedding in Large Language Models cs.CL · 2025-10-08 · unverdicted · none · ref 49 · internal anchor
Search-R3 trains LLMs to output search embeddings as a direct product of step-by-step reasoning via supervised pre-training and a specialized RL environment that avoids full corpus re-encoding.
Text Embeddings by Weakly-Supervised Contrastive Pre-training cs.CL · 2022-12-07 · unverdicted · none · ref 42 · internal anchor
E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.
Domain-Adaptive Dense Retrieval for Brazilian Legal Search cs.IR · 2026-05-05 · unverdicted · none · ref 14 · internal anchor
Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with largest gains on out-of-domain question-based search.
Federated Learning for ICD Classification with Lightweight Models and Pretrained Embeddings cs.IR · 2025-07-03 · unverdicted · none · ref 19 · internal anchor
Lightweight federated learning with frozen embeddings and MLP heads reaches competitive micro and macro F1 scores for ICD-9 and ICD-10 coding on MIMIC-IV, nearly matching centralized training.
Query pipeline optimization for cancer patient question answering systems cs.CL · 2024-12-19 · unverdicted · none · ref 45 · internal anchor
Three-aspect RAG query pipeline optimization for cancer patient QA introduces HSRDR and SEOS and reports 5.24% accuracy gain on Claude-3-haiku versus chain-of-thought on a custom dataset.
Much of Geospatial Web Search Is Beyond Traditional GIS cs.IR · 2026-05-11 · unreviewed · ref 18 · internal anchor
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining cs.CL · 2026-04-27 · unreviewed · ref 27 · internal anchor
Verbalized Algorithms: Classical Algorithms are All You Need (Mostly) cs.CL · 2025-09-09 · unreviewed · ref 10 · internal anchor
