ALEE generates AMR-based English minimal pairs with fine-grained semantic shifts, translates them, and evaluates embedding models on 275+ languages to expose cross-lingual gaps linked to training data and tokenization.
hub Mixed citations
Multilingual E5 Text Embeddings: A Technical Report
Mixed citation behavior. Most common role is method (43%).
abstract
This technical report presents the training methodology and evaluation results of the open-source multilingual E5 text embedding models, released in mid-2023. Three embedding models of different sizes (small / base / large) are provided, offering a balance between the inference efficiency and embedding quality. The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets. Additionally, we introduce a new instruction-tuned embedding model, whose performance is on par with state-of-the-art, English-only models of similar sizes. Information regarding the model release can be found at https://github.com/microsoft/unilm/tree/master/e5 .
hub tools
citation-role summary
citation-polarity summary
representative citing papers
HAKARI-Bench reconstructs 35 benchmarks into 551 tasks across 43 languages, reproducing full MTEB, MMTEB, and BEIR rankings with Spearman correlation above 0.97 while supporting efficiency variant comparisons.
CORE-Bench is a benchmark for code retrieval in agentic coding settings, built from curated tasks and SWE-bench instances, showing performance drops and gains from fine-tuning.
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
Co-citation predictability for statute retrieval decays over 20 years in Ukrainian court data, dropping 33-47% in MRR with non-uniform patterns across legal domains.
DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.
EPIC trains LLMs to treat continuous embeddings as in-context prompts, yielding state-of-the-art text embedding performance on MTEB with or without prompts at inference and lower compute.
Defines ATIR task and benchmark for mixed audio-text queries; MLLM model with token compression shows substantial gains over strong baselines.
Code-switching creates a fundamental performance bottleneck for multilingual retrievers, causing drops of up to 27% on new benchmarks CSR-L and CS-MTEB, with embedding divergence as the key cause and vocabulary expansion insufficient to fix it.
Claim2Vec is a contrastively fine-tuned multilingual encoder that improves claim clustering performance and embedding space structure on multilingual fact-check datasets.
LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.
SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme acoustic noise.
MultiSynt/MT supplies 4.8 trillion translated tokens in 36 languages from 100B English tokens, letting LLMs match native-data baselines with 72% fewer tokens and beat them by 15% at equal budget.
A Feedback Network model is developed showing online semantic exploration is more concentrated than physical mobility, with stable retail-business linkages and greater COVID disruption to spatial than cognitive routines, as a step toward hybrid digital twins of society.
EvoEmbedding generates evolvable embeddings via a latent memory updated during sequential processing, outperforming larger models on long-context retrieval and generalizing to 10x longer contexts in downstream tasks.
Proposes a pretrained Universal Row Encoder using transformers and global statistics to generate table-width invariant row embeddings for modular relational graph models, claiming improved transfer, convergence, and memory on RelBench.
ARIADNE routes queries to the best adapter via embedding-space centroid proximity, recovering 97.44% of upper-bound performance on 23 NLP tasks and 89.7% selection accuracy on 44 tasks without training or internal access.
HyGRAG is a hierarchical graph RAG framework that constructs LLM summaries over hybrid chunk-entity graphs, retrieves via context and relation awareness across levels, and enables dynamic updates, reporting a 9.7% average accuracy gain on multi-hop reasoning tasks.
BBLP uses a multi-modal object encoder for label propagation in object detection and reaches 81.6% of fully-supervised mAP on D4LA with only 10% labelled data.
A multi-dimensional taxonomy filtering approach recovers high-performing data from deprioritized web corpora, with filtered low-tier subsets outperforming unfiltered top-tier data on reasoning and coding benchmarks.
FIGMA proposes a multi-view contrastive architecture plus the FGMCaps dataset to retrieve music from fine-grained textual descriptions of musical attributes, reporting up to 73.3% relative gains over CLAP baselines.
StoryVideoQA provides the largest auto-generated deep video understanding dataset to date with 363K QAs across TV and movies, paired with the PlotTree agent for hierarchical plot-based reasoning that existing VideoQA models struggle to match.
citing papers explorer
-
Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets
BBLP uses a multi-modal object encoder for label propagation in object detection and reaches 81.6% of fully-supervised mAP on D4LA with only 10% labelled data.
-
StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset
StoryVideoQA provides the largest auto-generated deep video understanding dataset to date with 363K QAs across TV and movies, paired with the PlotTree agent for hierarchical plot-based reasoning that existing VideoQA models struggle to match.
-
Iterative Definition Refinement for Zero-Shot Classification via LLM-Based Semantic Prototype Optimization
Iterative LLM-based refinement of category definitions improves zero-shot classification performance across 13 embedding models on a new 10-category web URL benchmark.
-
VERTIGO: Visual Preference Optimization for Cinematic Camera Trajectory Generation
VERTIGO post-trains camera trajectory generators with visual preference signals from Unity-rendered previews scored by a cinematically fine-tuned VLM, cutting character off-screen rates from 38% to near zero while improving framing and prompt adherence.
-
DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
DocRetriever introduces a framework using layout-aware sparse embeddings for hybrid encoding without OCR and a generalizable reasoning-augmented reranker for few-shot settings, plus the MultiDocR benchmark for evaluation.
-
AffectAgent: Collaborative Multi-Agent Reasoning for Retrieval-Augmented Multimodal Emotion Recognition
AffectAgent deploys a query planner, evidence filter, and emotion generator as collaborative agents trained via MAPPO with shared reward, plus MB-MoE and RAAF modules, to achieve superior multimodal emotion recognition on MER-UniBench.