IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
Multilingual E5 Text Embeddings: A Technical Report
Mixed citation behavior. Most common role is method (43%).
abstract
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between inference efficiency and embedding quality. The training procedure follows the English E5 model recipe: contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
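The abstract mentions contrastive pre-training on text pairs. As a rough illustration only, the sketch below computes a common contrastive objective (InfoNCE with in-batch negatives) over toy vectors. The abstract does not specify the exact loss, so the function name `info_nce_loss`, the temperature value 0.05, and the use of cosine similarity are all assumptions made for this example, not the report's implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce_loss(queries, passages, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (illustrative sketch).

    queries[i] is assumed to pair with passages[i]; every other passage in
    the batch acts as a negative. The temperature is a hypothetical default.
    """
    q = [_normalize(v) for v in queries]
    p = [_normalize(v) for v in passages]
    total = 0.0
    for i, qi in enumerate(q):
        # Similarity of query i to every passage in the batch, scaled.
        logits = [sum(a * b for a, b in zip(qi, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability assigned to the matching passage.
        total += log_denom - logits[i]
    return total / len(q)


if __name__ == "__main__":
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(info_nce_loss(eye, eye))               # matched pairs: loss near zero
    print(info_nce_loss(eye, eye[1:] + eye[:1])) # mismatched pairs: large loss
```

The loss is small when each query is closest to its own passage and grows when the pairing is broken, which is the signal such pre-training relies on.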
representative citing papers
Co-citation predictability for statute retrieval decays over 20 years in Ukrainian court data, dropping 33-47% in MRR with non-uniform patterns across legal domains.
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
Defines the ATIR task and benchmark for mixed audio-text queries; an MLLM-based model with token compression shows substantial gains over strong baselines.
RARE builds redundancy-aware benchmarks via atomic fact decomposition and CRRF-enhanced LLM generation, showing retriever PerfRecall@10 dropping from 66.4% on general data to 5.0-27.9% on high-similarity finance/legal/patent corpora.
Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expansion insufficient to fix it.
Claim2Vec is a contrastively fine-tuned multilingual encoder that improves claim clustering performance and embedding space structure on multilingual fact-check datasets.
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
Single-prompt evaluations of instruction-tuned embedding models misrepresent performance and allow any model to be ranked first by favorable prompt choice.
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
Introduces the MUSA benchmark and evaluates LALMs showing that strong single-speaker performance fails to ensure robust selective attention under multilingual interference, with errors from source confusion and unresolved attribution after separation.
An extended annotation scheme with new categories and attributes plus a Gemma-300M-based multi-head classifier achieves 81.6% macro F1 on personal fact classification, outperforming few-shot LLM baselines by nearly 9 points with lower compute.
MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.
Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.
JFinTEB is the first benchmark for evaluating Japanese financial text embeddings across retrieval and classification tasks derived from realistic financial scenarios.
HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and +14.1 over the best multimodal baseline.
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
Retrievers trained on agent trajectories via the LRAT framework improve evidence recall, task success, and efficiency in agentic search benchmarks.
Adaptive Prompt Elicitation (APE) uses an information-theoretic framework to generate visual queries that elicit and compile user intent into better prompts for text-to-image models, showing improved alignment in benchmarks and a user study.
Proposes High-Precision Scoring (HPS) and Tie-aware Retrieval Metrics (TRM) to reduce tie-induced instability in low-precision retrieval evaluation.
Causal2Vec prepends a BERT-generated contextual token to decoder-only LLMs and pools its hidden state with the EOS token to reach new SOTA on MTEB among public-data-trained embedding models.
QCEA reformulates entity alignment as a query-conditioned ranking task with semantic encoding, graph learning, and direction-aware transformation to handle context-dependent, asymmetric correspondences in medical knowledge graphs.
citing papers explorer
- Can Large Audio Language Models Ignore Multilingual Distractors? An Evaluation of Their Selective Auditory Attention Capabilities