citation dossier
Mteb: Massive text embedding benchmark
why this work matters in Pith
Pith has found this work in 17 reviewed papers. Its strongest current cluster is cs.CL (6 papers). The largest review-status bucket among citing papers is UNVERDICTED (12 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.
representative citing papers
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture transferability.
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
C-Pack releases a new Chinese embedding benchmark, a large training dataset, and optimized models that outperform prior models by up to 10% on C-MTEB while also delivering English SOTA results.
A sliced inner-product Gromov-Wasserstein (IGW) distance is introduced with closed-form 1D expressions and rotational invariance, and its structural and computational properties are studied for efficient data alignment.
Using a new non-inferiority testing framework on the Manifesto Corpus, machine translation is shown to preserve embedding similarity structure for ten languages but to distort it for four.
MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance, especially at extremely low dimensions.
JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 datasets.
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Embeddings reliably capture authorial stylistic features in French literary texts, and these signals persist after LLM rewriting while showing model-specific patterns.
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor.
STAF applies sentence-transformer embeddings to classify static code analysis (SCA) findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects.
E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models.
Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with largest gains on out-of-domain question-based search.
citing papers explorer
-
Much of Geospatial Web Search Is Beyond Traditional GIS
Analysis of 1.01 million unfiltered Bing queries identifies 18% as geospatial, dominated by transactional categories like costs (15.3%) that exceed traditional GIS scope.
-
Led to Mislead: Adversarial Content Injection for Attacks on Neural Ranking Models
CRAFT is a supervised LLM framework using retrieval-augmented generation, self-refinement, fine-tuning, and preference optimization to create fluent adversarial content that boosts target ranks in neural ranking models, outperforming baselines on MS MARCO and TREC benchmarks with cross-architecture transferability.
-
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
-
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
-
C-Pack: Packed Resources For General Chinese Embeddings
C-Pack releases a new Chinese embedding benchmark, a large training dataset, and optimized models that outperform prior models by up to 10% on C-MTEB while also delivering English SOTA results.
-
Sliced Inner Product Gromov-Wasserstein Distances
A sliced inner-product Gromov-Wasserstein (IGW) distance is introduced with closed-form 1D expressions and rotational invariance, and its structural and computational properties are studied for efficient data alignment (a generic slicing sketch appears after this list).
-
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Using a new non-inferiority testing framework on the Manifesto Corpus, machine translation is shown to preserve embedding similarity structure for ten languages but to distort it for four (a minimal similarity-structure check is sketched after this list).
-
MIPIC: Matryoshka Representation Learning via Self-Distilled Intra-Relational and Progressive Information Chaining
MIPIC trains nested Matryoshka representations via self-distilled intra-relational alignment with top-k CKA and progressive information chaining across depths, yielding competitive performance, especially at extremely low dimensions (the truncate-and-renormalize property is sketched after this list).
-
JU'A: A Benchmark for Information Retrieval in Brazilian Legal Text Collections
JU'A is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
-
Semantic Data Processing with Holistic Data Understanding
HoldUp uses LLM-guided clustering to provide holistic dataset context for semantic operators, yielding up to 33% higher classification accuracy and 30% higher scoring accuracy than row-by-row LLM processing across 15 datasets.
-
StarCoder 2 and The Stack v2: The Next Generation
StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
-
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
-
Measuring Embedding Sensitivity to Authorial Style in French: Comparing Literary Texts with Language Model Rewritings
Embeddings reliably capture authorial stylistic features in French literary texts, and these signals persist after LLM rewriting while showing model-specific patterns.
-
How Does Chunking Affect Retrieval-Augmented Code Completion? A Controlled Empirical Study
Function-based chunking underperforms other strategies in RAG code completion by 3.57-5.64 points, with context length as the dominant factor (two chunking strategies are sketched after this list).
-
Towards Better Static Code Analysis Reports: Sentence Transformer-based Filtering of Non-Actionable Alerts
STAF applies sentence-transformer embeddings to classify static code analysis (SCA) findings, reaching 89% F1 and beating prior filters by 11% within projects and 6% across projects (an embed-then-classify sketch appears after this list).
-
Text Embeddings by Weakly-Supervised Contrastive Pre-training
E5 text embeddings trained with weakly-supervised contrastive pre-training on CCPairs outperform BM25 on BEIR zero-shot and achieve top results on MTEB, beating much larger models (the in-batch contrastive objective is sketched after this list).
-
Domain-Adaptive Dense Retrieval for Brazilian Legal Search
Mixed training of Qwen3-Embedding-4B on legal data plus SQuAD-pt yields higher average NDCG@10 (0.447), MRR@10 (0.595), and MAP@10 (0.308) across six Portuguese retrieval datasets than legal-only or base models, with largest gains on out-of-domain question-based search (the ranking metrics are sketched after this list).
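technique sketches

Several entries above name standard mechanisms; the short Python sketches below illustrate them under stated assumptions. None reproduces a paper's exact method.

For the sliced IGW entry: the slicing recipe projects both point clouds onto random directions and averages a closed-form 1D distance over the projections. The paper's closed-form 1D inner-product GW expression is not reproduced here; 1D Wasserstein-2 between equal-size clouds stands in for it, purely to show the construction.

```python
import numpy as np

def sliced_distance(X, Y, n_projections=128, seed=0):
    # X, Y: (n, d) point clouds of equal size.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    total = 0.0
    for _ in range(n_projections):
        theta = rng.normal(size=d)
        theta /= np.linalg.norm(theta)       # random unit direction
        x = np.sort(X @ theta)               # sorted 1D projections
        y = np.sort(Y @ theta)
        total += np.mean((x - y) ** 2)       # closed-form 1D W2^2
    return np.sqrt(total / n_projections)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Y = rng.normal(size=(200, 3)) + 0.5
print(sliced_distance(X, Y))                 # grows with the mean shift
```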
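For the machine-translation entry: the paper's non-inferiority testing framework is not reproduced; this minimal proxy, assuming embeddings of the same sentences before and after translation, only checks whether the pairwise cosine-similarity structure is preserved.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_structure_correlation(E_src, E_mt):
    # E_src, E_mt: (n, d) embeddings of the same n sentences in the source
    # language and after machine translation. Returns the Spearman
    # correlation between the two sets of pairwise cosine similarities;
    # values near 1 mean the similarity structure survived translation.
    def upper_cosines(E):
        E = E / np.linalg.norm(E, axis=1, keepdims=True)
        S = E @ E.T
        return S[np.triu_indices_from(S, k=1)]
    rho, _ = spearmanr(upper_cosines(E_src), upper_cosines(E_mt))
    return rho

rng = np.random.default_rng(0)
E = rng.normal(size=(30, 64))
print(similarity_structure_correlation(E, E + 0.05 * rng.normal(size=E.shape)))
```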
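For the MIPIC entry: its self-distillation losses are not reproduced; the sketch shows only the underlying Matryoshka property, namely that prefixes of one trained embedding serve as smaller embeddings after renormalization.

```python
import numpy as np

def truncate_and_normalize(emb, dim):
    # Keep the first `dim` coordinates of each embedding and re-normalize
    # so cosine similarity stays well defined at the smaller width.
    sub = emb[:, :dim]
    return sub / np.linalg.norm(sub, axis=1, keepdims=True)

rng = np.random.default_rng(0)
full = rng.normal(size=(4, 768))               # stand-in for trained embeddings
for dim in (768, 256, 64, 16):                 # nested Matryoshka widths
    e = truncate_and_normalize(full, dim)
    print(dim, round(float(e[0] @ e[1]), 4))   # pairwise similarity per width
```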
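For the chunking study: the summary does not specify the splitters, so both strategies below are assumptions; a fixed-size line chunker versus a naive function-boundary chunker for Python sources.

```python
import re

def fixed_size_chunks(code, max_lines=20):
    # Fixed-size chunking: consecutive blocks of `max_lines` lines.
    lines = code.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]

def function_chunks(code):
    # Function-based chunking: split at top-level `def`/`class` lines
    # (a stand-in for an AST-based splitter).
    chunks, current = [], []
    for line in code.splitlines():
        if re.match(r"^(def |class )", line) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

src = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
print(len(fixed_size_chunks(src, 4)), len(function_chunks(src)))  # 2 3
```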
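For the STAF entry: a minimal embed-then-classify sketch; the encoder name and the logistic-regression head are assumptions, not STAF's reported pipeline.

```python
# pip install sentence-transformers scikit-learn
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Toy alerts with labels: 1 = actionable, 0 = non-actionable.
alerts = [
    "Possible null dereference of 'conn' in close()",
    "Unused import 'os'",
    "SQL query built by string concatenation",
    "Line longer than 120 characters",
]
labels = [1, 0, 1, 0]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
X = encoder.encode(alerts)                         # one vector per alert
clf = LogisticRegression(max_iter=1000).fit(X, labels)

new = encoder.encode(["Resource 'fh' may leak on the exception path"])
print(clf.predict(new))                            # 1 keeps the alert
```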
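For the E5 entry: the basic in-batch contrastive (InfoNCE) objective behind this style of weakly-supervised pre-training; E5's temperature, negatives, and batch construction differ in detail.

```python
import numpy as np

def in_batch_contrastive_loss(q, p, tau=0.05):
    # q, p: (batch, dim) query and passage embeddings; row i of q is the
    # positive for row i of p, and all other rows act as in-batch negatives.
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    logits = (q @ p.T) / tau                        # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))      # NLL of matching passages

rng = np.random.default_rng(0)
q = rng.normal(size=(8, 32))
print(in_batch_contrastive_loss(q, q))                          # aligned: low
print(in_batch_contrastive_loss(q, rng.normal(size=(8, 32))))   # random: high
```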
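Finally, for the Portuguese legal-retrieval entry, the three reported metrics. The exponential gain and the MAP@k normalization below are the common variants and may differ from the paper's exact choices.

```python
import numpy as np

def dcg(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return float(np.sum((2 ** rels - 1) / np.log2(np.arange(2, rels.size + 2))))

def ndcg_at_k(ranked_rels, all_rels, k=10):
    # ranked_rels: relevance grades of the returned ranking, best first;
    # all_rels: grades of every judged-relevant document (for the ideal DCG).
    ideal = dcg(sorted(all_rels, reverse=True), k)
    return dcg(ranked_rels, k) / ideal if ideal > 0 else 0.0

def mrr_at_k(ranked_rels, k=10):
    # Reciprocal rank of the first relevant result in the top k.
    for rank, rel in enumerate(ranked_rels[:k], start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def map_at_k(ranked_rels, n_relevant, k=10):
    # Mean of precision at each relevant hit within the top k.
    hits, score = 0, 0.0
    for rank, rel in enumerate(ranked_rels[:k], start=1):
        if rel > 0:
            hits += 1
            score += hits / rank
    return score / min(n_relevant, k) if n_relevant else 0.0

ranked = [1, 0, 1, 0, 0]   # one query's binary relevance, best first
print(ndcg_at_k(ranked, [1, 1, 1]), mrr_at_k(ranked), map_at_k(ranked, 3))
```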