Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Baosong Yang, Dingkun Long, Huan Lin, Mingxin Li, Xin Zhang, Yanzhao Zhang · 2025 · cs.CL · arXiv 2506.05176

Mixed citation behavior. Most common role is background (46%).

307 Pith papers citing it

Background 46% of classified citations

abstract

In this work, we introduce the Qwen3 Embedding series, a significant advancement over its predecessor, the GTE-Qwen series, in text embedding and reranking capabilities, built upon the Qwen3 foundation models. Leveraging the Qwen3 LLMs' robust capabilities in multilingual text understanding and generation, our innovative multi-stage training pipeline combines large-scale unsupervised pre-training with supervised fine-tuning on high-quality datasets. Effective model merging strategies further ensure the robustness and adaptability of the Qwen3 Embedding series. During the training process, the Qwen3 LLMs serve not only as backbone models but also play a crucial role in synthesizing high-quality, rich, and diverse training data across multiple domains and languages, thus enhancing the training pipeline. The Qwen3 Embedding series offers a spectrum of model sizes (0.6B, 4B, 8B) for both embedding and reranking tasks, addressing diverse deployment scenarios where users can optimize for either efficiency or effectiveness. Empirical evaluations demonstrate that the Qwen3 Embedding series achieves state-of-the-art results across diverse benchmarks. Notably, it excels on the multilingual evaluation benchmark MTEB for text embedding, as well as in various retrieval tasks, including code retrieval, cross-lingual retrieval and multilingual retrieval. To facilitate reproducibility and promote community-driven research and development, the Qwen3 Embedding models are publicly available under the Apache 2.0 license.

citation-role summary

background 14 · method 6 · baseline 3 · dataset 1

citation-polarity summary

background 11 · use method 6 · baseline 3 · unclear 2 · support 1 · use dataset 1
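The headline "46%" figure can be checked against the two tallies above. The sketch below is only an illustration: the dictionary and function names are hypothetical, and the counts are copied from the summaries as shown. Under that reading, the 46% appears to match the polarity tally (11 of 24) rather than the role tally (14 of 24, about 58%); how the page actually derives the figure is not stated.

```python
# Counts copied from the citation-role and citation-polarity summaries above.
# The variable and function names are illustrative, not part of the page.
role_counts = {"background": 14, "method": 6, "baseline": 3, "dataset": 1}
polarity_counts = {
    "background": 11,
    "use method": 6,
    "baseline": 3,
    "unclear": 2,
    "support": 1,
    "use dataset": 1,
}


def shares(counts):
    """Return each label's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Both tallies cover the same 24 classified citations.
print(sum(role_counts.values()), sum(polarity_counts.values()))
print(role_shares["background"], polarity_shares["background"])
```

Both tallies sum to 24 classified citations, which is a small sample of the 307 citing papers, so the percentages describe only the classified subset.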

authors

Baosong Yang Dingkun Long Huan Lin Mingxin Li Xin Zhang Yanzhao Zhang

representative citing papers

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

cs.CL · 2026-06-11 · unverdicted · novelty 8.0

SkMTEB is the first comprehensive text embedding benchmark for Slovak, and vocabulary-trimmed E5 adaptations achieve competitive performance with much smaller models.

DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation

cs.CL · 2026-05-31 · unverdicted · novelty 8.0

DiscourseFlip is a graph-guided attack allocating limited poisoning budget to induce targeted opinion shifts over semantic query networks in black-box RAG.

STRABLE: Benchmarking Tabular Machine Learning with Strings

cs.LG · 2026-05-12 · unverdicted · novelty 8.0

A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.

SLAM: Structural Linguistic Activation Marking for Language Models

cs.CL · 2026-05-06 · unverdicted · novelty 8.0 · 2 refs

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.

ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval

cs.AI · 2026-05-05 · unverdicted · novelty 8.0 · 2 refs

ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.

FollowTable: A Benchmark for Instruction-Following Table Retrieval

cs.IR · 2026-05-01 · unverdicted · novelty 8.0

FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.

Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale

cs.CL · 2026-07-01 · unverdicted · novelty 7.0

A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.

MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models

cs.CV · 2026-07-01 · unverdicted · novelty 7.0

MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.

Embedding Inference Attack

cs.CR · 2026-07-01 · unverdicted · novelty 7.0

Tailored queries enable identification of the embedding model used by a black-box IR system from the unordered set of retrieved documents, even when a reranker is present.

STEB: Style Text Embedding Benchmark

cs.CL · 2026-06-30 · unverdicted · novelty 7.0

STEB is a new benchmark of 96 datasets in 7 languages for evaluating style text embeddings on authorship, detection, and linguistic probing tasks.

Beyond IID: How General Are Tabular Foundation Models, Really?

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Tabular foundation models excel on tiny- to medium-sized IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, and high-dimensional datasets, based on evaluations across 11 models and 142 datasets in the new BeyondArena benchmark.

Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution

cs.CL · 2026-06-26 · unverdicted · novelty 7.0

Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.

Agreement in Representation Space for Open-Ended Self-Consistency

cs.CL · 2026-06-10 · unverdicted · novelty 7.0

EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.

Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation

cs.IR · 2026-06-10 · unverdicted · novelty 7.0

TAA-k finds query-adaptive retrieval cutoffs by first using knee detection to isolate a candidate window around the relevance-to-noise transition, then applying EVT goodness-of-fit tests inside that window.

CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding

cs.IR · 2026-06-10 · accept · novelty 7.0

CORE-Bench is a benchmark for code retrieval in agentic coding settings, built from curated tasks and SWE-bench instances, showing performance drops and gains from fine-tuning.

ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies

cs.RO · 2026-06-07 · unverdicted · novelty 7.0

ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.

SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia

cs.CL · 2026-06-02 · unverdicted · novelty 7.0

SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.

When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

Identifies the generative-discriminative gap in LLM hard negative synthesis for retrieval and proposes CausalNeg using CoT counterfactual perturbation plus query-view entropy maximization to generate more effective negatives.

Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions

cs.SE · 2026-05-30 · unverdicted · novelty 7.0

Sakura is a multi-agent system that generates structurally complex tests from NL descriptions, achieving 50-78% higher compilability and 38-66% higher coverage overlap than baselines on 1,464 scenarios from 20 Apache Commons applications.

HEART-Bench: Do LLM Agents Exhibit Human-like Psychology?

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

HEART-Bench evaluates LLM agents on psychological consistency using 11 Big-Five-grounded characters with 1,000 episodic memories each and 64 DIAMONDS-based decision scenarios, yielding 673 validated MCQs.

VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora

cs.AI · 2026-05-27 · unverdicted · novelty 7.0

VeriTrip is a new benchmark using a Multimodal Retrieval Base and Verifiable Knowledge Base to evaluate evidence-grounded reasoning and factual reliability in travel planning agents over unstructured multimodal web data.

citing papers explorer

Showing 50 of 307 citing papers.

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges cs.CL · 2026-06-18 · unverdicted · none · ref 8 · internal anchor
Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation cs.CL · 2026-06-11 · unverdicted · none · ref 10 · internal anchor
SkMTEB is the first comprehensive text embedding benchmark for Slovak, and vocabulary-trimmed E5 adaptations achieve competitive performance with much smaller models.
DiscourseFlip: An Oblique Discourse-Level Opinion Manipulation Attack against Black-box Retrieval-Augmented Generation cs.CL · 2026-05-31 · unverdicted · none · ref 45 · internal anchor
DiscourseFlip is a graph-guided attack allocating limited poisoning budget to induce targeted opinion shifts over semantic query networks in black-box RAG.
STRABLE: Benchmarking Tabular Machine Learning with Strings cs.LG · 2026-05-12 · unverdicted · none · ref 71 · internal anchor
A new corpus of 108 mixed string-numeric tables shows that advanced tabular learners with basic string embeddings perform well on most real-world data, while large LLM encoders help on free-text heavy tables.
SLAM: Structural Linguistic Activation Marking for Language Models cs.CL · 2026-05-06 · unverdicted · none · ref 28 · 2 links · internal anchor
SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
ReasonAudio: A Benchmark for Evaluating Reasoning Beyond Matching in Text-Audio Retrieval cs.AI · 2026-05-05 · unverdicted · none · ref 37 · 2 links · internal anchor
ReasonAudio benchmark reveals that state-of-the-art text-audio retrieval models struggle with reasoning tasks like negation and duration, and multimodal LLMs lose reasoning ability after contrastive fine-tuning.
FollowTable: A Benchmark for Instruction-Following Table Retrieval cs.IR · 2026-05-01 · unverdicted · none · ref 61 · internal anchor
FollowTable is the first large-scale benchmark for instruction-following table retrieval, paired with an Instruction Responsiveness Score, showing that existing models fail to adapt to fine-grained constraints beyond topical similarity.
Can Language Models Actually Retrieve In-Context? Drowning in Documents at Million Token Scale cs.CL · 2026-07-01 · unverdicted · none · ref 36 · internal anchor
A 0.6B LM with length-aware attention adjustments performs competitive in-context retrieval at million-token scale on MS MARCO, NQ, and LIMIT benchmarks.
MoHallBench: A Benchmark for Motion Hallucination in Video Large Language Models cs.CV · 2026-07-01 · unverdicted · none · ref 47 · internal anchor
MoHallBench is a new benchmark evaluating motion hallucination in VideoLLMs from co-occurrence priors, sequential inference, and similarity confusion, revealing decoupling from action recognition performance.
Embedding Inference Attack cs.CR · 2026-07-01 · unverdicted · none · ref 39 · internal anchor
Tailored queries enable identification of the embedding model used by a black-box IR system from the unordered set of retrieved documents, even when a reranker is present.
STEB: Style Text Embedding Benchmark cs.CL · 2026-06-30 · unverdicted · none · ref 10 · internal anchor
STEB is a new benchmark of 96 datasets in 7 languages for evaluating style text embeddings on authorship, detection, and linguistic probing tasks.
Beyond IID: How General Are Tabular Foundation Models, Really? cs.LG · 2026-06-29 · unverdicted · none · ref 203 · internal anchor
Tabular foundation models excel on tiny- to medium-sized IID data but are outperformed by traditional tree-based and deep learning models on non-IID, large, and high-dimensional datasets, based on evaluations across 11 models and 142 datasets in the new BeyondArena benchmark.
Turn-Averaged SAEs for Feature Discovery and Long-Context Attribution cs.CL · 2026-06-26 · unverdicted · none · ref 21 · internal anchor
Turn-averaged SAEs reconstruct average activations over conversation turns to represent high-level turn characteristics with a fixed number of features, simplifying long-context interpretability compared to per-token SAEs.
Measuring Semantic Progress in Multi-turn Dialogue via Information Gain cs.CL · 2026-06-10 · unverdicted · none · ref 43 · internal anchor
A Gaussian information-gain metric in embedding space quantifies semantic progress in dialogues via uncertainty reduction and shows competitive agreement with human judgments on MT-Bench and UltraFeedback.
Agreement in Representation Space for Open-Ended Self-Consistency cs.CL · 2026-06-10 · unverdicted · none · ref 10 · internal anchor
EBA clusters sampled LLM generations in representation space to estimate agreement, outperforming random selection with stable scaling and showing that central positions correlate with higher generation quality.
Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation cs.IR · 2026-06-10 · unverdicted · none · ref 39 · internal anchor
TAA-k finds query-adaptive retrieval cutoffs by first using knee detection to isolate a candidate window around the relevance-to-noise transition, then applying EVT goodness-of-fit tests inside that window.
CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding cs.IR · 2026-06-10 · accept · none · ref 37 · internal anchor
CORE-Bench is a benchmark for code retrieval in agentic coding settings, built from curated tasks and SWE-bench instances, showing performance drops and gains from fine-tuning.
ActProbe: Action-Space Probe for Early Failure Detection of Generative Robot Policies cs.RO · 2026-06-07 · unverdicted · none · ref 44 · internal anchor
ActProbe is an action-space detector that uses temporal consistency error and action chunk magnitude from policy outputs, mapped via LSTM-MLP, to predict failures earlier than baselines across policies and real-robot tasks.
SEA-Embedding: Open and Reproducible Text Embeddings for Southeast Asia cs.CL · 2026-06-02 · unverdicted · none · ref 48 · internal anchor
SEA-Embedding is a fully open text embedding pipeline for Southeast Asian languages that achieves state-of-the-art performance on the SEA-BED benchmark by analyzing data composition, training objectives, and base encoder choices.
When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation cs.CL · 2026-06-01 · unverdicted · none · ref 22 · internal anchor
Defines cost-aware RAG with evidence cost tiers and shows static selectors are brittle while agentic LLM-based selection is promising but model-dependent.
When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval cs.LG · 2026-05-31 · unverdicted · none · ref 50 · internal anchor
Identifies the generative-discriminative gap in LLM hard negative synthesis for retrieval and proposes CausalNeg using CoT counterfactual perturbation plus query-view entropy maximization to generate more effective negatives.
Sakura: An Approach for Generating Complex Tests from Natural Language Test Descriptions cs.SE · 2026-05-30 · unverdicted · none · ref 85 · internal anchor
Sakura is a multi-agent system that generates structurally complex tests from NL descriptions, achieving 50-78% higher compilability and 38-66% higher coverage overlap than baselines on 1,464 scenarios from 20 Apache Commons applications.
HEART-Bench: Do LLM Agents Exhibit Human-like Psychology? cs.CL · 2026-05-28 · unverdicted · none · ref 57 · internal anchor
HEART-Bench evaluates LLM agents on psychological consistency using 11 Big-Five-grounded characters with 1,000 episodic memories each and 64 DIAMONDS-based decision scenarios, yielding 673 validated MCQs.
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora cs.AI · 2026-05-27 · unverdicted · none · ref 30 · internal anchor
VeriTrip is a new benchmark using a Multimodal Retrieval Base and Verifiable Knowledge Base to evaluate evidence-grounded reasoning and factual reliability in travel planning agents over unstructured multimodal web data.
Towards Cost-effective LLMs Routing with Batch Prompting cs.DB · 2026-05-27 · unverdicted · none · ref 39 · internal anchor
RoBatch is a two-stage framework that formulates and solves the joint Route with Batching Problem via a batch-aware proxy utility model and greedy scheduling, outperforming separate routing or batching baselines on six benchmarks.
The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness cs.CL · 2026-05-27 · unverdicted · none · ref 70 · internal anchor
HTEB introduces dynamic, multi-axis evaluation of text embedding robustness using LLM transformations, finding decoupled profiles across models and that scaling does not close all robustness gaps.
Retrieval as Reasoning: Self-Evolving Agent-Native Retrieval via LLM-Wiki cs.CL · 2026-05-25 · unverdicted · none · ref 23 · internal anchor
LLM-Wiki structures external knowledge as compilable wiki pages with links and persistent self-correction, achieving SOTA results on HotpotQA, MuSiQue, and 2WikiMultiHopQA by 2.0-8.1 F1 points over prior RAG systems.
How Reliable Are Semantic-ID Tokenizer Comparisons in Generative Recommendation? cs.IR · 2026-05-25 · conditional · none · ref 49 · internal anchor
Semantic-ID tokenizers produce collisions affecting up to 30.5% of items across four datasets, inflating Hit@10 by up to 103.36% and making prior tokenizer comparisons unreliable.
IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions cs.CL · 2026-05-21 · unverdicted · none · ref 71 · internal anchor
IdioLink introduces a benchmark dataset and evaluation showing that strong embedding models struggle to retrieve equivalent meanings across idiomatic and literal forms, relying on shallow cues instead.
Decision-Aware Quadratic ReLU Replacement for HE-Friendly Inference cs.CR · 2026-05-21 · unverdicted · none · ref 31 · 2 links · internal anchor
Formulates quadratic ReLU replacement as a linear separation problem in lifted space, with exact conditions for calibration-lossless replacement and convex relaxations for approximate cases, achieving plaintext accuracy at lower cost under CKKS.
DermAgent: A Self-Reflective Agentic System for Dermatological Image Analysis with Multi-Tool Reasoning and Traceable Decision-Making cs.CV · 2026-05-14 · unverdicted · none · ref 32 · internal anchor
DermAgent orchestrates seven vision-language tools in a Plan-Execute-Reflect loop with dual-modality retrieval from 413k cases and a critic module to outperform GPT-4o by 17.6% in zero-shot dermatological diagnosis accuracy.
BOOKMARKS: Efficient Active Storyline Memory for Role-playing cs.CL · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
BOOKMARKS introduces searchable bookmarks as reusable answers to storyline questions, enabling active initialization and passive synchronization for more consistent role-playing agent memory than recurrent summarization.
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding cs.CV · 2026-05-13 · unverdicted · none · ref 51 · internal anchor
ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
AcquisitionSynthesis: Targeted Data Generation using Acquisition Functions cs.CL · 2026-05-13 · unverdicted · none · ref 21 · internal anchor
AcquisitionSynthesis uses acquisition functions as rewards to train generators that produce higher-quality synthetic data, delivering 2-7% gains on math, medical QA, and coding tasks with improved robustness to forgetting.
LeanSearch v2: Global Premise Retrieval for Lean 4 Theorem Proving cs.IR · 2026-05-13 · conditional · none · ref 26 · 2 links · internal anchor
LeanSearch v2 recovers 46.1% of ground-truth premise groups for research-level Lean 4 theorems within 10 candidates and raises fixed-loop proof success to 20%.
AssemblyBench: Physics-Aware Assembly of Complex Industrial Objects cs.CV · 2026-05-13 · unverdicted · none · ref 45 · internal anchor
AssemblyBench dataset and AssemblyDyno transformer model enable physics-aware prediction of assembly sequences and trajectories for complex industrial objects from multimodal instructions and 3D shapes.
Very Efficient Listwise Multimodal Reranking for Long Documents cs.IR · 2026-05-12 · unverdicted · none · ref 41 · internal anchor
ZipRerank delivers state-of-the-art multimodal listwise reranking accuracy for long documents at up to 10x lower latency via early interaction and single-pass scoring.
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents cs.AI · 2026-05-11 · unverdicted · none · ref 38 · internal anchor
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
Skill Description Deception Attack against Task Routing in Internet of Agents cs.MA · 2026-05-11 · conditional · none · ref 18 · internal anchor
Malicious agents can deceive LLM-based task routers in Internet of Agents systems by generating fake skill descriptions, achieving up to 98% success rate across nine domains.
CHASM: Online Changepoint Detection in Temporal and Cross-Variable Dependence stat.ME · 2026-05-08 · unverdicted · none · ref 70 · internal anchor
CHASM detects changes in temporal and cross-variable dependence in multivariate time series by monitoring the truncated eigenvalue sequence of a recursively estimated DMD operator, using optimal assignment and augmented monitoring for complex values.
Toward Privileged Foundation Models: LUPI for Accelerated and Improved Learning cs.LG · 2026-05-08 · unverdicted · none · ref 43 · 2 links · internal anchor
PIQL integrates privileged information to accelerate convergence, lower loss, and improve generalization in tabular foundation models.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment cs.CL · 2026-05-08 · unverdicted · none · ref 241 · internal anchor
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale cs.LG · 2026-05-07 · conditional · none · ref 75 · 3 links · internal anchor
Starling, a multi-agent LLM system, extracts ~6.3 million nuanced structured records from PubMed across six tasks with reported error rates of 0.6-7.7%, lower than several curated databases.
LatentRAG: Latent Reasoning and Retrieval for Efficient Agentic RAG cs.CL · 2026-05-07 · unverdicted · none · ref 33 · internal anchor
LatentRAG performs agentic RAG by generating latent tokens for thoughts and subqueries in one forward pass, matching explicit methods' accuracy on seven benchmarks while reducing latency by ~90%.
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents cs.AI · 2026-05-07 · unverdicted · none · ref 34 · internal anchor
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding cs.CL · 2026-05-06 · unverdicted · none · ref 45 · internal anchor
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
Rational Communication Shapes Morphological Composition cs.CL · 2026-05-05 · unverdicted · none · ref 25 · internal anchor
Using historical corpora and the Rational Speech Act framework, attested English morphological compositions are ranked higher than plausible alternatives from the same time period when both semantic recoverability and production cost are considered.
Is It Novel and Why? Fine-Grained Patent Novelty Prediction Based on Passage Retrieval cs.CL · 2026-05-04 · unverdicted · none · ref 71 · internal anchor
Introduces a feature-level annotated patent dataset and LLM retrieval-reasoning workflows that outperform embedding baselines on passage retrieval and novel feature identification while avoiding spurious correlations in novelty prediction.
Prosa: Rubric-Based Evaluation of LLMs on Real User Chats in Brazilian Portuguese cs.CL · 2026-05-02 · conditional · none · ref 22 · internal anchor
Prosa demonstrates that rubric-based binary scoring with multi-judge filtering yields full agreement on 16 LLM rankings across judges on Brazilian Portuguese chats, compared to only 7/16 under holistic scoring, while widening score gaps by 47%.
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models cs.IR · 2026-05-02 · unverdicted · none · ref 59 · internal anchor
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture
