Improving Text Embeddings with Large Language Models
15 Pith papers cite this work, alongside 79 external citations. Polarity classification is still indexing.
citing papers explorer
- SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
- IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
- Beyond Bag-of-Patches: Learning Global Layout via Textual Supervision for Late-Interaction Visual Document Retrieval
A text-supervised global layout embedding augments local patch representations in late-interaction VDR, yielding +2.4 nDCG@5 and +2.3 MAP@5 gains over ColPali/ColQwen baselines on ViDoRe-v2.
- BitNet Text Embeddings
BITEMBED converts LLM backbones to ternary BitNet-style encoders, adapts them with contrastive pre-training and teacher distillation, and produces text embeddings at multiple precisions that perform comparably to full-precision baselines on MMTEB.
- Task-Adaptive Embedding Refinement via Test-time LLM Guidance
Test-time LLM feedback refines query embeddings to deliver up to 25% relative gains on zero-shot literature search, intent detection, and related benchmarks.
- Reproducing Complex Set-Compositional Information Retrieval
Neural retrievers that double BM25 performance on QUEST collapse below 0.02 Recall@100 on the new LIMIT+ benchmark while lexical methods reach 0.96, with all methods degrading as compositional depth increases.
- Efficient Rationale-based Retrieval: On-policy Distillation from Generative Rerankers based on JEPA
Rabtriever distills a generative reranker into an efficient bi-encoder using on-policy JEPA to achieve near-reranker accuracy with linear complexity on rationale-based retrieval.
- PETRA: Transforming Web Text for Petroleum-Engineering Domain Adaptation
PETRA is a curated 1.36M-chunk petroleum-engineering retrieval dataset and pipeline that raises in-domain nDCG from 0.703 to 0.763 via score fusion and delivers 44% relative gain on an Earth Science benchmark through reranker adaptation on synthetic supervision.
- Linear-Time and Constant-Memory Text Embeddings Based on Recurrent Language Models
Fine-tuned recurrent models like Mamba2 produce competitive text embeddings with linear-time constant-memory inference via vertical chunking, outperforming transformers in memory use.
- Multi-Faceted Self-Consistent Preference Alignment for Query Rewriting in Conversational Search
MSPA-CQR improves conversational query rewriting by constructing self-consistent preference data across rewriting, retrieval, and response dimensions and training with prefix-guided multi-faceted direct preference optimization, showing effectiveness in both in- and out-of-distribution settings.
- Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.
- Comparison of Modern Multilingual Text Embedding Techniques for Hate Speech Detection Task
Supervised models using embeddings like jina and e5 reach up to 92% accuracy on multilingual hate speech detection, substantially outperforming anomaly detection, while PCA to 64 dimensions preserves most performance in the supervised case.
- K-Quantization and its Impact on Output Performance
An empirical evaluation of quantization effects on eight LLMs across bit widths shows that performance generally declines at lower precision, with model-size-dependent resilience and acceptable accuracy at 2 bits in many cases.
- Hypencoder Revisited: Reproducibility and Analysis of Non-Linear Scoring for First-Stage Retrieval
A reproducibility study confirms that Hypencoder's non-linear, query-specific scoring improves retrieval over bi-encoders on standard benchmarks, but standard methods remain faster, and hard-task results are mixed due to implementation issues.
- To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios