Improving text embeddings with large language models
20 Pith papers cite this work.
citing papers explorer
-
MathAtlas: A Benchmark for Autoformalization in the Wild
MathAtlas is the first large-scale benchmark for autoformalizing graduate mathematics, where even strong models reach only 9.8% correctness on theorem statements and drop to 2.6% on the hardest dependency-deep subset.
-
Fine-grained Claim-level RAG Benchmark for Law
ClaimRAG-LAW is a French-English legal RAG benchmark with claim-level granularity for experts and non-experts that reveals limitations in current retrieval and generation performance.
-
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling
RISE is an inference-time semantic reranking framework that refines low-confidence predictions in rhetorical role labeling using contrastively learned label representations, delivering an average +9.15 macro-F1 gain on hard examples across eight datasets and seven models.
-
IntrAgent: An LLM Agent for Content-Grounded Information Retrieval through Literature Review
IntrAgent uses a two-stage pipeline of section ranking and iterative reading to perform content-grounded literature information retrieval, achieving 13.2% higher accuracy than RAG and agent baselines on the new IntraBench benchmark.
-
Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker
UWE is a task-agnostic bi-encoder that uses many-to-many InfoNCE and token-level soft late interaction to achieve zero-shot ranking across unseen work-related target spaces while using far fewer parameters than Qwen3-8B and improving MAP by 4.4 points.
-
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
A panel of smaller diverse LLMs outperforms a single large model as an evaluator of generations, showing less intra-model bias and over 7x lower cost.
-
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
M3-Embedding is a single model for multi-lingual, multi-functional, and multi-granular text embeddings trained via self-knowledge distillation that achieves new state-of-the-art results on multilingual, cross-lingual, and long-document retrieval benchmarks.
-
Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance
Embedding model performance on MTEB tasks correlates strongly with nearest-neighbor overlap and ICA magnitude differences in their embedding spaces.
-
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
-
Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA
Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.
-
E5-V: Universal Embeddings with Multimodal Large Language Models
E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.
-
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.
-
Not All RAGs Are Created Equal: A Component-Wise Empirical Study for Software Engineering Tasks
Retriever-side choices, particularly the retrieval algorithm, exert more influence on RAG performance than generator selection across code generation, summarization, and repair tasks.
-
Resolving the Robustness-Precision Trade-off in Financial RAG through Hybrid Document-Routed Retrieval
HDRR combines document-level semantic routing with scoped chunk retrieval to outperform both pure chunk-based retrieval and semantic file routing on the FinDER benchmark, delivering higher average scores, lower failure rates, and more perfect answers.
-
Legal Retrieval for Public Defenders
NJ BriefBank is a domain-adapted legal retrieval tool for public defenders that improves on standard benchmarks by incorporating legal reasoning, domain data, and synthetic examples, with a new released taxonomy and annotated evaluation dataset.
-
Retrofitting Small Multilingual Models for Retrieval: Matching 7B Performance with 300M Parameters
A 300M multilingual embedding model matches or exceeds 7B retrieval performance via optimized data scale, hard negatives, and task diversity over language diversity.
-
Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging
Language composition in training data creates opposing effects on CLIR and mono-IR performance for Korean-English retrieval, which model merging can partially resolve.
-
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.
-
Multilingual E5 Text Embeddings: A Technical Report
Open-source multilingual E5 embedding models are trained via contrastive pre-training on 1 billion text pairs and fine-tuning, with an instruction-tuned model matching English SOTA performance.
-
Are Decoder-Only Large Language Models the Silver Bullet for Code Search?
Fine-tuned decoder-only LLMs achieve up to 40.4% higher MAP than UniXcoder on CoSQA+ for code search, with non-monotonic size scaling and data composition sensitivity.