hub Mixed citations

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro · 2024 · cs.CL · arXiv 2405.17428

Mixed citation behavior. Most common role is background (43%).

44 Pith papers citing it

Background 43% of classified citations

open full Pith review browse 44 citing papers arXiv PDF

abstract

Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For training algorithm, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For training data, we utilize the hard-negative mining, synthetic data generation and existing public available datasets to boost the performance of embedding model. By combining these techniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position on the MTEB leaderboard (as of May 24 and August 30, 2024, respectively) across 56 tasks, demonstrating the sustained effectiveness of the proposed methods over time. It also achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB. We further provide the analysis of model compression techniques for generalist embedding models.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 method 3

citation-polarity summary

background 3 use method 3 unclear 1

representative citing papers

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

cs.LG · 2026-05-14 · unverdicted · novelty 7.0 · 2 refs

BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.

SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

cs.CV · 2026-04-18 · unverdicted · novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

Bottleneck Tokens for Unified Multimodal Retrieval

cs.LG · 2026-04-13 · unverdicted · novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

cs.SD · 2025-07-10 · unverdicted · novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

cs.IR · 2024-10-14 · conditional · novelty 7.0

VisRAG achieves 20-40% better end-to-end performance than text-based RAG by directly embedding and retrieving document images with VLMs.

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

cs.CV · 2024-10-07 · conditional · novelty 7.0

VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.

BitNet Text Embeddings

cs.CL · 2026-06-24 · unverdicted · novelty 6.0

BITEMBED converts LLM backbones to ternary BitNet-style encoders, adapts them with contrastive pre-training and teacher distillation, and produces text embeddings at multiple precisions that perform comparably to full-precision baselines on MMTEB.

When More Cores Hurts: The Vector Database Scaling Paradox in HPC

cs.DC · 2026-06-08 · unverdicted · novelty 6.0

Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

cs.LG · 2026-06-06 · unverdicted · novelty 6.0

FLaG is a frequency-domain module using FFT, latent queries, and gating that improves token aggregation and shows gains on ESM2 AMP prediction and CIFAR-100 image classification while staying competitive on text tasks.

Semantic Retrieval for Product Search in E-Commerce

cs.IR · 2026-05-31 · unverdicted · novelty 6.0

A dual-encoder LLM is trained via contrastive learning then ROAR to retrieve exact matches and rank substitutes in e-commerce search.

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

cs.CL · 2026-05-29 · unverdicted · novelty 6.0

Meta-study of MTEB rankings introduces dataset-composition and ranking-scheme robustness indicators and finds only a small subset of models stay consistently strong across tasks, languages, and evaluation variations.

Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

Test-Time Compute for Frozen Embedding Models through Agentic Program Search

cs.LG · 2026-05-12 · unverdicted · novelty 6.0 · 2 refs

Agentic program search over a frozen encoder API yields retrieval programs that improve nDCG@10 on held-out tasks and unseen encoder families with no per-domain training.

DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models

cs.IR · 2026-05-08 · unverdicted · novelty 6.0

DiffRetriever uses parallel masked tokens in diffusion LMs for retrieval representations, outperforming DiffEmbed and other baselines on aggregate effectiveness while supporting efficient multi-representation matching.

Aspect-Aware Content-Based Recommendations for Mathematical Research Papers

cs.IR · 2026-05-05 · unverdicted · novelty 6.0

The authors introduce aspect-aware datasets GoldRiM and SilverRiM for math papers and AchGNN, a heterogeneous GNN that outperforms prior methods by jointly modeling textual semantics, citations, and author lineage across aspects.

cs.CL · 2026-05-01 · unverdicted · novelty 6.0

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

cs.CL · 2026-04-30 · unverdicted · novelty 6.0

TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.

Exploring Audio Hallucination in Egocentric Video Understanding

cs.CV · 2026-04-26 · unverdicted · novelty 6.0

AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

ViLL-E: Video LLM Embeddings for Retrieval

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service

cs.CR · 2026-04-13 · unverdicted · novelty 6.0

GeoMark decouples local watermark triggering from centralized ownership attribution using geometry-separated anchors and adaptive neighborhoods to improve robustness against paraphrasing, dimension changes, and clustering attacks while preserving utility.

citing papers explorer

Showing 34 of 34 citing papers after filters.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment cs.LG · 2026-05-14 · unverdicted · none · ref 89 · 2 links · internal anchor
BBCritic reframes GUI critique as continuous semantic alignment via contrastive learning in an affordance space, outperforming larger binary SOTA models on a new four-level hierarchical benchmark without extra annotations.
SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning cs.LG · 2026-05-13 · unverdicted · none · ref 31 · internal anchor
SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude fewer than standard approaches.
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents cs.AI · 2026-05-07 · unverdicted · none · ref 15 · internal anchor
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding cs.CL · 2026-05-06 · unverdicted · none · ref 16 · internal anchor
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems cs.CL · 2026-05-05 · unverdicted · none · ref 4 · internal anchor
BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval cs.CV · 2026-04-18 · unverdicted · none · ref 23 · internal anchor
mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
Bottleneck Tokens for Unified Multimodal Retrieval cs.LG · 2026-04-13 · unverdicted · none · ref 11 · internal anchor
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
BitNet Text Embeddings cs.CL · 2026-06-24 · unverdicted · none · ref 28 · internal anchor
BITEMBED converts LLM backbones to ternary BitNet-style encoders, adapts them with contrastive pre-training and teacher distillation, and produces text embeddings at multiple precisions that perform comparably to full-precision baselines on MMTEB.
When More Cores Hurts: The Vector Database Scaling Paradox in HPC cs.DC · 2026-06-08 · unverdicted · none · ref 53 · internal anchor
Large-scale HPC evaluation of Qdrant, Milvus, and Weaviate reveals that workload patterns limit scaling and extra cores can reduce throughput, exposing a cloud-to-HPC design mismatch.
Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation cs.LG · 2026-06-06 · unverdicted · none · ref 1 · internal anchor
FLaG is a frequency-domain module using FFT, latent queries, and gating that improves token aggregation and shows gains on ESM2 AMP prediction and CIFAR-100 image classification while staying competitive on text tasks.
Semantic Retrieval for Product Search in E-Commerce cs.IR · 2026-05-31 · unverdicted · none · ref 4 · internal anchor
A dual-encoder LLM is trained via contrastive learning then ROAR to retrieve exact matches and rank substitutes in e-commerce search.
On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets cs.CL · 2026-05-29 · unverdicted · none · ref 29 · internal anchor
Meta-study of MTEB rankings introduces dataset-composition and ranking-scheme robustness indicators and finds only a small subset of models stay consistently strong across tasks, languages, and evaluation variations.
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs cs.LG · 2026-05-12 · unverdicted · none · ref 35 · internal anchor
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
Test-Time Compute for Frozen Embedding Models through Agentic Program Search cs.LG · 2026-05-12 · unverdicted · none · ref 9 · 2 links · internal anchor
Agentic program search over a frozen encoder API yields retrieval programs that improve nDCG@10 on held-out tasks and unseen encoder families with no per-domain training.
DiffRetriever: Parallel Representative Tokens for Retrieval with Diffusion Language Models cs.IR · 2026-05-08 · unverdicted · none · ref 5 · internal anchor
DiffRetriever uses parallel masked tokens in diffusion LMs for retrieval representations, outperforming DiffEmbed and other baselines on aggregate effectiveness while supporting efficient multi-representation matching.
Aspect-Aware Content-Based Recommendations for Mathematical Research Papers cs.IR · 2026-05-05 · unverdicted · none · ref 29 · internal anchor
The authors introduce aspect-aware datasets GoldRiM and SilverRiM for math papers and AchGNN, a heterogeneous GNN that outperforms prior methods by jointly modeling textual semantics, citations, and author lineage across aspects.
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus cs.CL · 2026-05-01 · unverdicted · none · ref 37 · internal anchor
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding cs.CL · 2026-04-30 · unverdicted · none · ref 14 · internal anchor
TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.
Exploring Audio Hallucination in Egocentric Video Understanding cs.CV · 2026-04-26 · unverdicted · none · ref 20 · internal anchor
AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
ViLL-E: Video LLM Embeddings for Retrieval cs.CV · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service cs.CR · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
GeoMark decouples local watermark triggering from centralized ownership attribution using geometry-separated anchors and adaptive neighborhoods to improve robustness against paraphrasing, dimension changes, and clustering attacks while preserving utility.
Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders cs.IR · 2026-04-13 · unverdicted · none · ref 1 · internal anchor
New CMedTEB benchmark and CARE asymmetric retriever outperform symmetric models on Chinese medical retrieval tasks while preserving low latency.
Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA cs.IR · 2026-04-10 · conditional · none · ref 7 · internal anchor
Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.
BridgeRAG: Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering cs.IR · 2026-04-03 · conditional · none · ref 13 · internal anchor
BridgeRAG improves training-free multi-hop retrieval by using a bridge-conditioned LLM scorer to rank evidence chains, achieving new best R@5 scores on MuSiQue, 2WikiMultiHopQA, and HotpotQA.
MeMo: Memory as a Model cs.CL · 2026-05-14 · unverdicted · none · ref 13 · 2 links · internal anchor
MeMo encodes new knowledge into a separate memory model that integrates with frozen LLMs, showing strong performance on QA benchmarks while avoiding catastrophic forgetting and working without access to model weights.
DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining cs.CL · 2026-04-24 · unverdicted · none · ref 13 · internal anchor
DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce cs.CL · 2026-04-22 · unverdicted · none · ref 29 · internal anchor
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
Legal Retrieval for Public Defenders cs.IR · 2026-01-20 · conditional · none · ref 20 · internal anchor
NJ BriefBank is a domain-adapted legal retrieval tool for public defenders that improves on standard benchmarks by incorporating legal reasoning, domain data, and synthetic examples, with a new released taxonomy and annotated evaluation dataset.
TALAS: Teacher-Anchored Layer Alignment with Adaptive Sharpness-Aware Minimization for Embedding Distillation cs.CL · 2026-06-20 · unverdicted · none · ref 83 · internal anchor
TALAS is a knowledge distillation method that selectively aligns upper student layers to teacher sentence embeddings, propagates knowledge top-down via relational constraints in lower layers, and uses ASAM to seek flatter minima.
MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching cs.CV · 2026-06-10 · unverdicted · none · ref 25 · internal anchor
MLT-Dedup achieves 91% reduction in online video repetition rates at 90% precision and 5x indexing capacity using multi-level representations and differential feature-enhanced similarity on a real-world platform.
UniNote: A Unified Embedding Model for Multimodal Representation and Ranking cs.IR · 2026-05-28 · unverdicted · none · ref 12 · internal anchor
UniNote proposes a two-stage trained unified embedding model (contrastive SFT then RL) for multimodal I2I retrieval that claims SOTA results and was deployed at Xiaohongshu with MRL for improved quality and efficiency.
Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini cs.CV · 2026-05-26 · unverdicted · none · ref 16 · internal anchor
A native multimodal embedding model from Gemini achieves reported state-of-the-art results on retrieval benchmarks across modalities via large-scale contrastive learning.
Frame of Reference: Addressing the Challenges of Common Ground Representation in Situational Dialogs cs.CL · 2026-01-14 · unverdicted · none · ref 2 · internal anchor
Reinforcement learning on synthetic data improves language models' ability to represent and use common ground with relational references in situated dialogs.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking cs.CL · 2026-01-08 · unverdicted · none · ref 13 · internal anchor
Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer