SimCSE: Simple Contrastive Learning of Sentence Embeddings

Tianyu Gao, Xingcheng Yao, Danqi Chen · 2021 · cs.CL · arXiv 2104.08821

Canonical reference. 100% of citing Pith papers cite this work as background.

32 Pith papers citing it

abstract

This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We find that dropout acts as minimal data augmentation, and removing it leads to a representation collapse. Then, we propose a supervised approach, which incorporates annotated pairs from natural language inference datasets into our contrastive learning framework by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT base achieve an average of 76.3% and 81.6% Spearman's correlation respectively, a 4.2% and 2.2% improvement compared to the previous best results. We also show -- both theoretically and empirically -- that the contrastive learning objective regularizes pre-trained embeddings' anisotropic space to be more uniform, and it better aligns positive pairs when supervised signals are available.
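The objective described in the abstract can be sketched in a few lines. The block below is a minimal illustration, not the authors' implementation: it assumes an in-batch softmax (InfoNCE-style) loss over cosine similarities, and `toy_encode`, `info_nce_loss`, the dropout rate, and the temperature value are hypothetical names and settings chosen for this example. A real setup would pass each sentence through a pre-trained encoder twice so that the two views differ only in their dropout masks.

```python
import numpy as np


def toy_encode(features, rng, drop_p=0.1):
    """Hypothetical stand-in for an encoder pass: applies a fresh dropout mask."""
    mask = rng.random(features.shape) >= drop_p
    return features * mask / (1.0 - drop_p)


def info_nce_loss(z1, z2, temperature=0.05, hard_negatives=None):
    """In-batch contrastive loss between two views of the same sentences.

    z1, z2: (batch, dim) embeddings; row i of z2 is the positive for row i of z1.
    hard_negatives: optional (batch, dim) embeddings appended as extra candidates,
    e.g. embeddings of "contradiction" hypotheses in the supervised setting.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    z1, z2 = normalize(z1), normalize(z2)
    candidates = z2
    if hard_negatives is not None:
        candidates = np.concatenate([z2, normalize(hard_negatives)], axis=0)

    # Cosine similarity of every anchor to every candidate, scaled by temperature.
    logits = z1 @ candidates.T / temperature
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The positive for anchor i sits in column i.
    idx = np.arange(z1.shape[0])
    return -log_probs[idx, idx].mean()


rng = np.random.default_rng(0)
features = rng.normal(size=(8, 32))   # stand-in for 8 sentence representations
view_a = toy_encode(features, rng)    # first pass, one dropout mask
view_b = toy_encode(features, rng)    # second pass, a different dropout mask
unsup_loss = info_nce_loss(view_a, view_b)

hard_neg = rng.normal(size=(8, 32))   # stand-in for hard-negative embeddings
sup_loss = info_nce_loss(view_a, view_b, hard_negatives=hard_neg)
```

Appending hard negatives only enlarges the softmax denominator, so in this sketch `sup_loss` is never smaller than `unsup_loss` for the same pair of views.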

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

cs.AI · 2026-05-22 · unverdicted · novelty 7.0

Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

cs.CL · 2026-05-18 · unverdicted · novelty 7.0

RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

Semantic Recall for Vector Search

cs.IR · 2026-04-22 · unverdicted · novelty 7.0

Semantic Recall is a new evaluation metric for approximate nearest neighbor search that focuses only on semantically relevant results, with Tolerant Recall as a proxy when relevance labels are unavailable.

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

Adapting MLLMs for Nuanced Video Retrieval

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

cs.CL · 2025-11-11 · unverdicted · novelty 7.0

UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · novelty 7.0

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

RePrompT uses recurrent prompt tuning to inject prior-visit latent states and cohort-derived population prompt tokens into LLMs, yielding better performance than pure EHR or pure LLM baselines on MIMIC clinical prediction tasks.

UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels

cs.LG · 2026-04-17 · unverdicted · novelty 6.0

UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.

Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

cs.CV · 2026-04-12 · unverdicted · novelty 6.0

Parameter-efficient fine-tuning lets MLLMs serve as effective retrievers for natural-language-guided cross-view geo-localization, beating dual-encoder baselines on GeoText-1652 and CVG-Text while using far fewer trainable parameters.

Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

cs.IR · 2026-04-07 · unverdicted · novelty 6.0

Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.

Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories

cs.CY · 2026-04-03 · conditional · novelty 6.0

A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation

cs.AI · 2026-01-31 · unverdicted · novelty 6.0

SPOT projects prompts to a tau-safe set via total variation to cut inappropriate content 14-44% relative to baselines while preserving benign prompt behavior in frozen T2I models.

EmbeddingGemma: Powerful and Lightweight Text Representations

cs.CL · 2025-09-24 · unverdicted · novelty 6.0

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation

cs.CL · 2025-02-18 · unverdicted · novelty 6.0

KaSLA applies knapsack optimization hierarchically to schema linking for LLM text-to-SQL, claiming better results than large models and improved SQL generation on Spider and BIRD.

Conjuring Semantic Similarity

cs.AI · 2024-10-21 · unverdicted · novelty 6.0

Semantic similarity between texts is measured by the Jeffreys divergence between the image distributions induced by conditioning a diffusion model on each text, computed via Monte-Carlo sampling of the reverse-time SDEs.

E5-V: Universal Embeddings with Multimodal Large Language Models

cs.CL · 2024-07-17 · unverdicted · novelty 6.0

E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

cs.CL · 2024-05-27 · accept · novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

Atlas: Few-shot Learning with Retrieval Augmented Language Models

cs.CL · 2022-08-05 · unverdicted · novelty 6.0

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

Unsupervised Dense Information Retrieval with Contrastive Learning

cs.IR · 2021-12-16 · unverdicted · novelty 6.0

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

cs.CL · 2026-05-16 · unverdicted · novelty 5.0

Retrieval with frozen embeddings and k-NN delivers competitive accuracy, high data efficiency, and zero hallucinations on legal multi-label annotation across ECtHR and Eurlex datasets.

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

cs.CL · 2026-05-09 · unverdicted · novelty 5.0

SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.

G-Loss: Graph-Guided Fine-Tuning of Language Models

cs.CL · 2026-04-28 · unverdicted · novelty 5.0

G-Loss builds a document-similarity graph and uses semi-supervised label propagation to guide fine-tuning of language models, yielding higher accuracy than standard losses on five classification benchmarks.

citing papers explorer

Showing 32 of 32 citing papers.

Beyond Binary Edits Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

cs.AI · 2026-05-22 · unverdicted · none · ref 14 · internal anchor

Introduces Latent Adversarial Robustification and Rank-Constrained Subspace Learning to enable robust generalization in multimodal knowledge editing through adversarial subspace alignment.

Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

cs.CL · 2026-05-18 · unverdicted · none · ref 25 · internal anchor

RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

cs.CL · 2026-05-06 · unverdicted · none · ref 8 · internal anchor

TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

Semantic Recall for Vector Search

cs.IR · 2026-04-22 · unverdicted · none · ref 11 · internal anchor

Semantic Recall is a new evaluation metric for approximate nearest neighbor search that focuses only on semantically relevant results, with Tolerant Recall as a proxy when relevance labels are unavailable.

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

cs.CV · 2026-04-18 · unverdicted · none · ref 14 · internal anchor

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

Adapting MLLMs for Nuanced Video Retrieval

cs.CV · 2025-12-15 · unverdicted · none · ref 27 · internal anchor

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

cs.CL · 2025-11-11 · unverdicted · none · ref 17 · internal anchor

UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

cs.CL · 2024-02-05 · unverdicted · none · ref 56 · internal anchor

M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.

RePrompT: Recurrent Prompt Tuning for Integrating Structured EHR Encoders with Large Language Models

cs.CL · 2026-04-20 · unverdicted · none · ref 121 · internal anchor

RePrompT uses recurrent prompt tuning to inject prior-visit latent states and cohort-derived population prompt tokens into LLMs, yielding better performance than pure EHR or pure LLM baselines on MIMIC clinical prediction tasks.

UniCon: Unified Framework for Efficient Contrastive Alignment via Kernels

cs.LG · 2026-04-17 · unverdicted · none · ref 2 · internal anchor

UniCon unifies contrastive alignment across encoders and alignment types using kernels to enable exact closed-form updates instead of stochastic optimization.

Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

cs.CV · 2026-04-12 · unverdicted · none · ref 13 · internal anchor

Parameter-efficient fine-tuning lets MLLMs serve as effective retrievers for natural-language-guided cross-view geo-localization, beating dual-encoder baselines on GeoText-1652 and CVG-Text while using far fewer trainable parameters.

Data, Not Model: Explaining Bias toward LLM Texts in Neural Retrievers

cs.IR · 2026-04-07 · unverdicted · none · ref 11 · internal anchor

Bias toward LLM texts in neural retrievers arises from artifact imbalances between positive and negative documents in training data that are absorbed during contrastive learning.

Policy-Governed LLM Routing with Intent Matching for Instrument Laboratories

cs.CY · 2026-04-03 · conditional · none · ref 12 · internal anchor

A governed LLM routing system for lab tutoring raises challenge-alignment from 0.90 to 0.98, boosts productive-struggle time, and cuts token costs by two-thirds while preserving answer accuracy.

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation

cs.AI · 2026-01-31 · unverdicted · none · ref 4 · internal anchor

SPOT projects prompts to a tau-safe set via total variation to cut inappropriate content 14-44% relative to baselines while preserving benign prompt behavior in frozen T2I models.

EmbeddingGemma: Powerful and Lightweight Text Representations

cs.CL · 2025-09-24 · unverdicted · none · ref 5 · internal anchor

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation

cs.CL · 2025-02-18 · unverdicted · none · ref 18 · internal anchor

KaSLA applies knapsack optimization hierarchically to schema linking for LLM text-to-SQL, claiming better results than large models and improved SQL generation on Spider and BIRD.

Conjuring Semantic Similarity

cs.AI · 2024-10-21 · unverdicted · none · ref 11 · internal anchor

Semantic similarity between texts is measured by the Jeffreys divergence between the image distributions induced by conditioning a diffusion model on each text, computed via Monte-Carlo sampling of the reverse-time SDEs.

E5-V: Universal Embeddings with Multimodal Large Language Models

cs.CL · 2024-07-17 · unverdicted · none · ref 3 · internal anchor

E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

cs.CL · 2024-05-27 · accept · none · ref 119 · internal anchor

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

Atlas: Few-shot Learning with Retrieval Augmented Language Models

cs.CL · 2022-08-05 · unverdicted · none · ref 40 · internal anchor

Atlas reaches over 42% accuracy on Natural Questions with only 64 examples, outperforming a 540B-parameter model by 3% with 50x fewer parameters.

Unsupervised Dense Information Retrieval with Contrastive Learning

cs.IR · 2021-12-16 · unverdicted · none · ref 131 · internal anchor

Contrastive learning trains unsupervised dense retrievers that beat BM25 on most BEIR datasets and support cross-lingual retrieval across scripts.

Retrieval-Based Multi-Label Legal Annotation: Extensible, Data-Efficient and Hallucination-Free

cs.CL · 2026-05-16 · unverdicted · none · ref 27 · internal anchor

Retrieval with frozen embeddings and k-NN delivers competitive accuracy, high data efficiency, and zero hallucinations on legal multi-label annotation across ECtHR and Eurlex datasets.

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization

cs.CL · 2026-05-09 · unverdicted · none · ref 4 · internal anchor

SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.

G-Loss: Graph-Guided Fine-Tuning of Language Models

cs.CL · 2026-04-28 · unverdicted · none · ref 12 · internal anchor

G-Loss builds a document-similarity graph and uses semi-supervised label propagation to guide fine-tuning of language models, yielding higher accuracy than standard losses on five classification benchmarks.

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

cs.CL · 2026-04-12 · unverdicted · none · ref 13 · internal anchor

A new pre-training task that maps languages bidirectionally in embedding space improves machine translation by up to 11.9 BLEU, cross-lingual QA by 6.72 BERTScore points, and understanding accuracy by over 5% over strong baselines.

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

cs.CL · 2023-11-09 · unverdicted · none · ref 99 · internal anchor

The paper surveys hallucination in LLMs with an innovative taxonomy, factors, detection methods, benchmarks, mitigation strategies, and open research directions.

StarCoder: may the source be with you!

cs.CL · 2023-05-09 · accept · none · ref 121 · internal anchor

StarCoderBase matches or beats OpenAI's code-cushman-001 on multi-language code benchmarks; the Python-fine-tuned StarCoder reaches 40% pass@1 on HumanEval while retaining other-language performance.

Beyond the Basics: Leveraging Large Language Model for Fine-Grained Medical Entity Recognition

cs.AI · 2026-04-19 · conditional · none · ref 7 · internal anchor

Fine-tuned LLaMA3 with LoRA reaches 81.24% F1 on 18-category fine-grained medical entity recognition, beating zero-shot by 63.11% and few-shot by 35.63%.

Are Decoder-Only Large Language Models the Silver Bullet for Code Search?

cs.SE · 2024-10-29 · unverdicted · none · ref 72 · internal anchor

Fine-tuned decoder-only LLMs achieve up to 40.4% higher MAP than UniXcoder on CoSQA+ for code search, with non-monotonic size scaling and data composition sensitivity.

Survey in Characterizing Semantic Change

cs.CL · 2024-02-29 · unverdicted · none · ref 29 · internal anchor

The survey organizes prior work on semantic change characterization into three classes, summarizes selected publications in a table, and discusses research needs and trends.

LLMs Uncertainty Quantification via Adaptive Conformal Semantic Entropy

cs.LG · 2026-05-05 · unreviewed · ref 6 · internal anchor

MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining

cs.CL · 2026-04-27 · unreviewed · ref 9 · internal anchor
