SkMTEB is the first comprehensive text embedding benchmark for Slovak, and vocabulary-trimmed E5 adaptations achieve competitive performance with much smaller models.
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Mixed citation behavior. Most common role is background (46%).
abstract
In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.
representative citing papers
DiscourseFlip is a graph-guided attack allocating limited poisoning budget to induce targeted opinion shifts over semantic query networks in black-box RAG.
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.
Tailored queries enable identification of the embedding model used by a black-box IR system from the unordered set of retrieved documents, even when a reranker is present.
STEB is a new benchmark of 96 datasets in 7 languages for evaluating style text embeddings on authorship, detection, and linguistic probing tasks.
Tabular foundation models excel on tiny- to medium-sized IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, and high-dimensional datasets, based on evaluations across 11 models and 142 datasets in the new BeyondArena benchmark.
Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.
A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.
EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.
TAA-k finds query-adaptive retrieval cutoffs by first using knee detection to isolate a candidate window around the relevance-to-noise transition, then applying EVT goodness-of-fit tests inside that window.
CORE-Bench is a benchmark for code retrieval in agentic coding settings, built from curated tasks and SWE-bench instances, showing performance drops and gains from fine-tuning.
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
Identifies the generative-discriminative gap in LLM hard negative synthesis for retrieval and proposes CausalNeg using CoT counterfactual perturbation plus query-view entropy maximization to generate more effective negatives.
Sakura is a multi-agent system that generates structurally complex tests from NL descriptions, achieving 50-78% higher compilability and 38-66% higher coverage overlap than baselines on 1,464 scenarios from 20 Apache Commons applications.
HEART-Bench evaluates LLM agents on psychological consistency using 11 Big-Five-grounded characters with 1,000 episodic memories each and 64 DIAMONDS-based decision scenarios, yielding 673 validated MCQs.
VeriTrip is a new benchmark using a Multimodal Retrieval Base and Verifiable Knowledge Base to evaluate evidence-grounded reasoning and factual reliability in travel planning agents over unstructured multimodal web data.
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
citing papers explorer
FINER-SQL: Boosting Small Language Models for Text-to-SQL
FINER-SQL boosts 3B-parameter small language models to 67.73% and 85% execution accuracy on the BIRD and Spider benchmarks, respectively, via dense memory and atomic rewards in group relative policy optimization, matching larger LLMs at lower latency.