On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

Ana Gjorgjevikj; Barbara Koroušić Seljak; Tome Eftimov

arxiv: 2605.31142 · v1 · pith:M6HZBOVRnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: multilingual text embeddings · MTEB benchmark · ranking robustness · dataset composition · aggregation methods · LLM-based models · multi-task evaluation · sensitivity analysis

The pith

Rankings of multilingual text embedding models shift depending on which datasets are included and how scores are aggregated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a meta-study of multilingual model performance in the MTEB benchmark to test whether reported model superiority holds when dataset compositions or performance aggregation methods change. It defines two indicators, dataset-composition robustness and ranking-scheme robustness, that quantify how much rankings move under different evaluation designs. Across five languages and nine tasks, the task-specific results show that large-scale LLM-based models frequently rank at the top, though not uniformly (retrieval is one exception), while the task-agnostic view finds that only a small subset of models stays strong across all tasks, schemes, and subsamples. Results for roughly 230 further languages are also released.

Core claim

Benchmarking conclusions about which multilingual text embedding models perform best depend on implicit choices of dataset compositions and performance aggregation methods; applying a range of multi-criteria decision-making ranking schemes and the two new robustness indicators shows that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples, although large-scale LLM-based models are often robust top performers except in tasks such as retrieval.
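
To make the dependence on aggregation concrete, the sketch below ranks three hypothetical models on three hypothetical datasets with a plain mean, a Borda count, and a TOPSIS-style closeness score. The model names, the scores, and the choice of these three schemes are assumptions made for illustration; this page does not list which multi-criteria decision-making schemes or weightings the paper applies, and equal dataset weights are assumed throughout.

```python
import math


def mean_score_ranking(scores):
    """Rank models by unweighted mean score across datasets (higher is better)."""
    means = {m: sum(v) / len(v) for m, v in scores.items()}
    return sorted(means, key=means.get, reverse=True)


def borda_ranking(scores):
    """Rank models by Borda count: on each dataset the worst model gets 0 points,
    the next one 1 point, and so on; points are summed over datasets."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    points = {m: 0 for m in models}
    for j in range(n_datasets):
        worst_first = sorted(models, key=lambda m: scores[m][j])
        for pts, m in enumerate(worst_first):
            points[m] += pts
    return sorted(models, key=points.get, reverse=True)


def topsis_ranking(scores, weights=None):
    """Rank models by TOPSIS-style relative closeness to the ideal point.
    Assumes every dataset column has a non-zero norm and that the models
    are not all identical (otherwise a division by zero occurs)."""
    models = list(scores)
    n = len(next(iter(scores.values())))
    weights = weights or [1.0 / n] * n  # equal weights: an assumption
    norms = [math.sqrt(sum(scores[m][j] ** 2 for m in models)) for j in range(n)]
    v = {m: [weights[j] * scores[m][j] / norms[j] for j in range(n)] for m in models}
    best = [max(v[m][j] for m in models) for j in range(n)]
    worst = [min(v[m][j] for m in models) for j in range(n)]
    closeness = {}
    for m in models:
        d_best = math.dist(v[m], best)
        d_worst = math.dist(v[m], worst)
        closeness[m] = d_worst / (d_best + d_worst)
    return sorted(models, key=closeness.get, reverse=True)


# Hypothetical scores: rows are models, columns are datasets.
scores = {
    "model_a": [0.90, 0.50, 0.52],
    "model_b": [0.60, 0.62, 0.64],
    "model_c": [0.55, 0.48, 0.50],
}

print(mean_score_ranking(scores))
print(borda_ranking(scores))
print(topsis_ranking(scores))
```

On this toy matrix the mean puts model_a first because of its large lead on a single dataset, while the Borda count puts model_b first because it wins two of the three datasets. That is the kind of disagreement the core claim is about.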

What carries the argument

Two robustness indicators—dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change)—that quantify stability of model orderings under altered evaluation designs.
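
The page does not give the exact computation of the two indicators, so the following is only a minimal sketch of one way such checks could be expressed. It assumes a simple top-k criterion: a model counts as ranking-scheme robust if it stays in the top k under every scheme for a fixed dataset composition, and as dataset-composition robust if it stays in the top k under every composition for a fixed scheme. The function names, the top-k rule, and the toy rankings are all hypothetical.

```python
def rank_of(model, ranking):
    """1-based position of `model` in an ordered list (best first)."""
    return ranking.index(model) + 1


def rs_robust(model, results, composition, k):
    """True if `model` stays in the top k under every ranking scheme,
    holding the dataset composition fixed (illustrative criterion)."""
    ranks = [rank_of(model, r) for (s, c), r in results.items() if c == composition]
    return bool(ranks) and max(ranks) <= k


def ds_robust(model, results, scheme, k):
    """True if `model` stays in the top k under every dataset composition,
    holding the ranking scheme fixed (illustrative criterion)."""
    ranks = [rank_of(model, r) for (s, c), r in results.items() if s == scheme]
    return bool(ranks) and max(ranks) <= k


# Hypothetical rankings keyed by (ranking scheme, dataset composition).
results = {
    ("mean", "full"): ["m1", "m2", "m3"],
    ("borda", "full"): ["m2", "m1", "m3"],
    ("mean", "subsample_1"): ["m1", "m3", "m2"],
    ("borda", "subsample_1"): ["m3", "m1", "m2"],
}

for model in ("m1", "m2", "m3"):
    print(model,
          "RS:", rs_robust(model, results, "full", k=2),
          "DS:", ds_robust(model, results, "mean", k=2))
```

In this toy setup m2 passes the scheme check on the full composition but fails the composition check, which mirrors the distinction between ranking-scheme robustness alone and robustness under both kinds of change.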

If this is right

Large-scale LLM-based models are often robust top performers across most tasks, but not uniformly; retrieval is one exception.
Only a small subset of models remains consistently strong when tasks, ranking schemes, and data subsamples are varied together.
Task-specific analyses reveal that stability of model rankings differs by learning task.
Results released for approximately 230 additional languages extend the sensitivity findings beyond the five languages examined in depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluators who weight languages or tasks differently from the MTEB default may obtain different model recommendations.
Applications that rely on a single top-ranked model should test that model on their own data compositions rather than assuming benchmark stability.
Future benchmark releases could report the two robustness indicators alongside raw scores so users can judge ranking reliability directly.

Load-bearing premise

The MTEB dataset collection and the chosen five languages plus nine tasks supply a representative sample for detecting sensitivity of rankings to composition and aggregation changes.

What would settle it

Recomputing all rankings on every possible subset of the MTEB datasets and under every aggregation method, and finding that the identical set of models always occupies the top positions, would falsify the claim that rankings are sensitive to those choices.
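
As a toy illustration of that test, the sketch below enumerates every non-empty subset of datasets and every aggregation function, and checks whether the set of top-k models ever changes. The helper names, the two aggregators (mean and median), and the score tables are assumptions for illustration only. Exhaustive enumeration grows exponentially with the number of datasets, so this is feasible only at toy scale.

```python
from itertools import combinations
from statistics import mean, median


def top_k_set(scores, dataset_idx, aggregate, k):
    """Set of the k best models when scores on the chosen datasets
    are combined with `aggregate` (higher is better)."""
    agg = {m: aggregate([v[j] for j in dataset_idx]) for m, v in scores.items()}
    return frozenset(sorted(agg, key=agg.get, reverse=True)[:k])


def top_k_is_invariant(scores, aggregators, k, min_size=1):
    """True if the same top-k set appears for every dataset subset
    (of at least `min_size` datasets) and every aggregator."""
    n = len(next(iter(scores.values())))
    seen = set()
    for size in range(min_size, n + 1):
        for idx in combinations(range(n), size):
            for aggregate in aggregators:
                seen.add(top_k_set(scores, idx, aggregate, k))
    return len(seen) == 1


# Hypothetical case 1: one model dominates on every dataset.
dominant = {"a": [0.9, 0.8, 0.7], "b": [0.5, 0.6, 0.4], "c": [0.1, 0.2, 0.3]}
# Hypothetical case 2: the winner depends on which dataset is included.
mixed = {"a": [0.9, 0.1], "b": [0.2, 0.8]}

print(top_k_is_invariant(dominant, [mean, median], k=1))
print(top_k_is_invariant(mixed, [mean, median], k=1))
```

Only an outcome like the first case, observed across the whole benchmark, would count against the sensitivity claim; a single changed top-k set, as in the second case, is enough to support it.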

Figures

Figures reproduced from arXiv: 2605.31142 by Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov.

**Figure 1.** Figure 1 presents a comprehensive overview of the most robust models for each task-language combination. Several patterns emerge from the analysis. First, model Qwen3-Embedding-8B exhibits remarkable consistency across classification-oriented tasks with regard to the RS robustness, achieving top performance in classification and pair classification across all five languages. Second, for retrieval, bilingual-embeddi…

**Figure 2.** Ranking sensitivity of top-6 models across MCDM schemes for clustering in French. Boxplots show rank distributions across MCDM methods, each with three weighting schemes and three dataset compositions. based models such as Octen-Embedding-8B, SFR-Embedding-Mistral, SFR-Embedding-2_R, GritLM8x7B, and multilingual-e5-large-instruct. While some (e.g., Octen-Embedding-8B, multilingual-e5-large-instruct) achi…

**Figure 3.** Robustness distribution across tasks (left) and languages (right). Stacked horizontal bars show the number of languages achieving full robustness (green), ranking scheme robustness only (yellow), or not applicable (gray) for each task. Grouped vertical bars compare robustness levels across languages. Full robustness (DS+RS) indicates stable model rankings under both dataset subsampling and ranking schemes.…

**Figure 4.** Figure 4 presents a high-level summary of the task-agnostic analysis (see Appendix E for detailed heatmaps), identifying models that maintain the most robust performance across all nine tasks within each language. This addresses a key practical scenario: understanding the robustness of models across diverse downstream applications when task requirements are unknown or varied. The results reveal a clear hierarchy of…

**Figure 5.** Dendrogram for the French language on the clustering task.

**Figure 6.** Model sensitivity across ranking schemes for the French language on the clustering task.

**Figure 7.** Model sensitivity to dataset compositions by ranking scheme for the French language on the …

**Figure 8.** Ranking sensitivity of the embedding models from Figure 5 across the ranking schemes …

**Figure 9.** Ranking sensitivity of each individual embedding model from Figure 5 across the ranking …

**Figure 10.** Dendrogram for the French language on the classification task.

**Figure 11.** Dendrogram for the French language on the retrieval task.

**Figure 12.** Dendrogram for the French language on the reranking task.

**Figure 13.** Dendrogram for the French language on the pair classification task.

**Figure 14.** Dendrogram for the French language on the STS task.

**Figure 15.** Dendrogram for the French language on the bitext mining task.

**Figure 16.** Dendrogram for the French language on the multilabel classification task.

**Figure 17.** Cross-task consistency by ranking scheme for the French language.

**Figure 18.** Cross-task consistency by ranking scheme for the English language.

**Figure 19.** Cross-task consistency by ranking scheme for the German language.

**Figure 20.** Cross-task consistency by ranking scheme for the Hindi language.

**Figure 21.** Cross-task consistency by ranking scheme for the Spanish language.

**Figure 22.** Relationship between clustering similarity and ranking consistency across languages on …

**Figure 23.** Relationship between clustering similarity and ranking consistency across languages on …

**Figure 24.** Relationship between clustering similarity and ranking consistency across languages on …

**Figure 25.** Relationship between clustering similarity and ranking consistency across languages on …

**Figure 26.** Relationship between clustering similarity and ranking consistency across languages on …

**Figure 27.** Relationship between clustering similarity and ranking consistency across languages on …

**Figure 28.** Cross-task consistency by ranking scheme for the French language for …

**Figure 29.** Cross-task consistency by ranking scheme for the French language for …

**Figure 30.** Cross-task consistency by ranking scheme for the English language for …

**Figure 31.** Cross-task consistency by ranking scheme for the English language for …

**Figure 32.** Cross-task consistency by ranking scheme for the German language for …

**Figure 33.** Cross-task consistency by ranking scheme for the German language for …

**Figure 34.** Cross-task consistency by ranking scheme for the Hindi language for …

**Figure 35.** Cross-task consistency by ranking scheme for the Hindi language for …

**Figure 36.** Cross-task consistency by ranking scheme for the Spanish language for …

**Figure 37.** Cross-task consistency by ranking scheme for the Spanish language for …

**Figure 38.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 39.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 40.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 41.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 42.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 43.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 44.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 45.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 46.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 47.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 48.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 49.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 50.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 51.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 52.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 53.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 54.** Spearman correlation between the MTEB model rank and the rankings produced by …

**Figure 55.** Spearman correlation between the MTEB model rank and the rankings produced by …
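
Figures 38 to 55 compare the MTEB model rank with alternative rankings through Spearman correlation. As a rough sketch of that statistic, the function below computes Spearman's rho for two orderings of the same models using the closed form for untied ranks, rho = 1 - 6 Σd² / (n(n² - 1)). The function name and the example orderings are hypothetical, and the no-ties assumption is a simplification; a library routine such as `scipy.stats.spearmanr` would be the usual choice when ties can occur.

```python
def spearman_rho(ranking_a, ranking_b):
    """Spearman rank correlation between two orderings of the same models.

    Both arguments are lists of distinct model names, best first.
    Uses the closed form that is valid only when there are no ties.
    """
    if sorted(ranking_a) != sorted(ranking_b):
        raise ValueError("both rankings must contain the same models")
    n = len(ranking_a)
    if n < 2:
        raise ValueError("need at least two models")
    position_b = {m: i for i, m in enumerate(ranking_b)}
    # Sum of squared rank differences over all models.
    d2 = sum((i - position_b[m]) ** 2 for i, m in enumerate(ranking_a))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical orderings: a reference rank and an alternative scheme's rank.
reference = ["m1", "m2", "m3", "m4"]
one_swap = ["m2", "m1", "m3", "m4"]
reversed_order = ["m4", "m3", "m2", "m1"]

print(spearman_rho(reference, reference))
print(spearman_rho(reference, one_swap))
print(spearman_rho(reference, reversed_order))
```

A value close to 1 means the alternative scheme largely reproduces the reference order, while lower values indicate that the choice of scheme reshuffles the models.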

Original abstract

Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change). These indicators enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in the retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The two new robustness indicators are a concrete addition, but the 5-language/9-task scope is too narrow to support the claim that only a small subset of models is consistently strong.

The paper defines dataset-composition robustness and ranking-scheme robustness, then applies both to MTEB rankings across five languages and nine tasks while releasing numbers for roughly 230 more languages. That is the actual new piece: a pair of explicit sensitivity measures rather than another set of model scores.

It does the work of showing that large LLM-based models tend to sit near the top in most task-specific views but drop in retrieval, and that the task-agnostic picture shrinks to a small group that survives changes in composition and aggregation. The release of the extra-language results is also practical for people who need numbers outside the five-language slice.

The load-bearing assumption is that the chosen five languages and nine tasks are diverse enough to expose real instability. If they under-sample the variability across MTEB’s full set of 250-plus languages and dozens of datasets per task, the “small subset remains consistently strong” result could be an artifact of the slice rather than a general property. The abstract gives no detail on subsampling rules or statistical tests, so it is not yet clear whether the indicators are applied in a way that avoids post-hoc choices.

This is a meta-study on public data with no new parameters or fitted models, so the citation pattern is not an issue. The work is aimed at people who design or interpret multilingual benchmarks and want concrete ways to test how stable their rankings are. It is not broad enough or methodologically tight enough to change practice on its own, but the indicators themselves are worth having in the toolbox.

I would send it to peer review. The contribution is modest and scoped, yet the indicators are reproducible and address a real practical question in evaluation methodology.

Referee Report

1 major / 1 minor

Summary. The manuscript conducts a meta-study of multilingual text embedding model performance on the MTEB benchmark. It introduces two new indicators—dataset-composition robustness (sensitivity of rankings to dataset composition changes) and ranking-scheme robustness (sensitivity to aggregation method changes)—and applies multi-criteria decision-making ranking schemes to analyze stability. The study focuses on five languages (English, French, German, Hindi, Spanish) across nine tasks, finds that large-scale LLM-based models are often robust top performers in task-specific settings (though not uniformly, e.g., retrieval), and concludes that only a small subset of models remains consistently strong across tasks, schemes, and subsamples in task-agnostic analyses. Extended results for ~230 additional languages are released.

Significance. If the central findings hold, the work offers a systematic framework for assessing how benchmarking conclusions depend on evaluation design choices, which is relevant for reliable model selection in multilingual settings. The release of results for additional languages is a concrete positive contribution that supports reproducibility and further analysis.

major comments (1)

[meta-study design and analysis scope] The task-agnostic conclusion that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples is load-bearing on the representativeness of the chosen five languages and nine tasks (plus implicit MTEB dataset compositions). The meta-study design applies the new robustness indicators only within this scope; without explicit justification, comparison to the full MTEB variability (250+ languages, dozens of datasets per task), or external validation, the observed consistency may not generalize beyond the selected slice.

minor comments (1)

[methods] The abstract and methods description omit precise details on dataset subsampling rules, the exact computation of the two robustness indicators, statistical tests applied, and error handling; adding these would strengthen verifiability of the reported sensitivities.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the scope of our meta-study. We address the major comment below.

Referee: [meta-study design and analysis scope] The task-agnostic conclusion that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples is load-bearing on the representativeness of the chosen five languages and nine tasks (plus implicit MTEB dataset compositions). The meta-study design applies the new robustness indicators only within this scope; without explicit justification, comparison to the full MTEB variability (250+ languages, dozens of datasets per task), or external validation, the observed consistency may not generalize beyond the selected slice.

Authors: We appreciate the referee highlighting this point. The five languages (English, French, German, Hindi, Spanish) were selected to span high- and low-resource settings and multiple language families while enabling computationally intensive robustness analyses across nine tasks and multiple ranking schemes; extending the full indicator computation to all 250+ languages would have been prohibitive. The task-agnostic conclusions are scoped to this representative slice, and the manuscript already releases per-language results for ~230 additional languages to support community extensions. We agree that the current version lacks explicit justification for the selection and a limitations discussion on generalizability; we will add both in the revision.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical meta-analysis on public benchmarks

The paper defines two new robustness indicators (dataset-composition robustness and ranking-scheme robustness) and applies them to existing MTEB data for five languages and nine tasks. No equations, predictions, or derivations are present that reduce claimed results to fitted parameters, self-citations, or self-definitions by construction. The analysis consists of sensitivity checks on public benchmark outputs using multi-criteria ranking schemes; conclusions about model consistency follow directly from the computed indicators on the chosen data slice without circular reduction. Self-citations, if any, are not load-bearing for uniqueness theorems or ansatzes. This is a standard empirical meta-study whose central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the representativeness of MTEB data and the validity of the newly defined indicators; no free parameters are mentioned, but two invented entities are introduced.

axioms (1)

domain assumption: MTEB benchmark results across languages and tasks form a suitable base for sensitivity analysis of rankings.
Invoked when the meta-study is defined and when conclusions about robustness are drawn from the platform data.

invented entities (2)

dataset-composition robustness indicator (no independent evidence)
purpose: Quantifies sensitivity of model rankings to changes in dataset composition.
Newly defined in the paper to enable the sensitivity analysis.

ranking-scheme robustness indicator (no independent evidence)
purpose: Quantifies sensitivity of model rankings to changes in aggregation method.
Newly defined in the paper to enable the sensitivity analysis.

pith-pipeline@v0.9.1-grok · 5764 in / 1338 out tokens · 18800 ms · 2026-06-28T22:41:44.628165+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 16 canonical work pages · 8 internal anchors

[1]

Retrieval-augmented generation for ai-generated content: A survey.Data Science and Engineering, pages 1–29, 2026

Penghao Zhao, Hailin Zhang, Qinhan Yu, Zhengren Wang, Yunteng Geng, Fangcheng Fu, Ling Yang, Wentao Zhang, Jie Jiang, and Bin Cui. Retrieval-augmented generation for ai-generated content: A survey.Data Science and Engineering, pages 1–29, 2026

2026
[2]

Embedding-informed adaptive retrieval-augmented generation of large language models

Chengkai Huang, Yu Xia, Rui Wang, Kaige Xie, Tong Yu, Julian McAuley, and Lina Yao. Embedding-informed adaptive retrieval-augmented generation of large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 1403– 1412, 2025

2025
[3]

Mmteb: Massive multilingual text embedding benchmark.arXiv preprint arXiv:2502.13595, 2025

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi ´nski, Genta Indra Winata, et al. Mmteb: Massive multilingual text embedding benchmark.arXiv preprint arXiv:2502.13595, 2025

work page arXiv 2025
[4]

Kenneth Enevoldsen, Márton Kardos, Niklas Muennighoff, and Kristoffer L Nielbo. The scan- dinavian embedding benchmarks: Comprehensive assessment of multilingual and monolingual text embedding.Advances in Neural Information Processing Systems, 37:40336–40358, 2024

2024
[5]

What are the best systems? new perspectives on nlp benchmarking.Advances in neural information processing systems, 35:26915–26932, 2022

Pierre Colombo, Nathan Noiry, Ekhine Irurozki, and Stéphan Clémençon. What are the best systems? new perspectives on nlp benchmarking.Advances in neural information processing systems, 35:26915–26932, 2022

2022
[6]

V ote’n’rank: Revision of 10 benchmarking with social choice theory

Mark Rofin, Vladislav Mikhailov, Mikhail Florinsky, Andrey Kravchenko, Tatiana Shavrina, Elena Tutubalina, Daniel Karabekyan, and Ekaterina Artemova. V ote’n’rank: Revision of 10 benchmarking with social choice theory. InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 670–686, 2023

2023
[7]

Generative representational instruction tuning

Niklas Muennighoff, SU Hongjin, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning. InThe Thirteenth International Conference on Learning Representations, 2024

2024
[8]

Llm2vec: Large language models are secretly powerful text encoders.arXiv preprint arXiv:2404.05961, 2024

Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders.arXiv preprint arXiv:2404.05961, 2024

work page arXiv 2024
[9]

MTEB: Massive Text Embedding Benchmark

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark.arXiv preprint arXiv:2210.07316, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[10]

Maintaining mteb: Towards long term usability and reproducibility of embedding benchmarks

Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, and Kenneth Enevoldsen. Maintaining mteb: Towards long term usability and reproducibility of embedding benchmarks. arXiv preprint arXiv:2506.21182, 2025

work page arXiv 2025
[11]

Fuzzy multiple attribute decision making methods

Shu-Jen Chen and Ching-Lai Hwang. Fuzzy multiple attribute decision making methods. In Fuzzy multiple attribute decision making: Methods and applications, pages 289–486. Springer, 1992

1992
[12]

Multicriteria optimization of civil engineering systems.Faculty of civil engineering, Belgrade, 2(1):5–21, 1998

Serafim Opricovic. Multicriteria optimization of civil engineering systems.Faculty of civil engineering, Belgrade, 2(1):5–21, 1998

1998
[13]

L’ingénierie de la décision.Elaboration d’instruments d’aide à la décision

Jean-Pierre Brans, R Nadeau, and M Landry. L’ingénierie de la décision.Elaboration d’instruments d’aide à la décision. La méthode PROMETHEE. In l’Aide à la Décision: Nature, Instruments et Perspectives d’Avenir, pages 183–213, 1982

1982
[14]

Determination of objective weights using a new method based on the removal effects of criteria (merec).Symmetry, 13(4):525, 2021

Mehdi Keshavarz-Ghorabaee, Maghsoud Amiri, Edmundas Kazimieras Zavadskas, Zenonas Turskis, and Jurgita Antucheviciene. Determination of objective weights using a new method based on the removal effects of criteria (merec).Symmetry, 13(4):525, 2021

2021
[15]

Determining objective weights in multiple criteria problems: The critic method.Computers & operations research, 22(7):763– 770, 1995

Danae Diakoulaki, George Mavrotas, and Lefteris Papayannakis. Determining objective weights in multiple criteria problems: The critic method.Computers & operations research, 22(7):763– 770, 1995

1995
[16]

Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks.arXiv preprint arXiv:2511.07025, 2025

Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, and Even Oldridge. Llama-embed-nemotron-8b: A universal text embedding model for multilingual and cross-lingual tasks.arXiv preprint arXiv:2511.07025, 2025

work page arXiv 2025
[17]

Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, et al. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint arXiv:2506.05176, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Octen-embedding-8b: A fine-tuned multilingual text embedding model, 2025

Octen Team. Octen-embedding-8b: A fine-tuned multilingual text embedding model, 2025

2025
[19]

Multilingual E5 Text Embeddings: A Technical Report

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Multilingual e5 text embeddings: A technical report.arXiv preprint arXiv:2402.05672, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3- embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation.arXiv preprint arXiv:2402.03216, 4(5), 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Linq-embed-mistral technical report.arXiv preprint arXiv:2412.03223, 2024

Chanyeol Choi, Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, and Jy-yong Sohn. Linq-embed-mistral technical report.arXiv preprint arXiv:2412.03223, 2024

work page arXiv 2024
[22]

Unsupervised cross-lingual representation learning at scale

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wen- zek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. Unsupervised cross-lingual representation learning at scale. InProceedings of the 58th annual meeting of the association for computational linguistics, pages 8440–8451, 2020. 11

2020
[23]

K., G¨unther, M., Wang, B., Krimmel, M., Wang, F., Mastrapas, G., Koukounas, A., Wang, N., et al

Saba Sturua, Isabelle Mohr, Mohammad Kalim Akram, Michael Günther, Bo Wang, Markus Krimmel, Feng Wang, Georgios Mastrapas, Andreas Koukounas, Andreas Koukounas, Nan Wang, and Han Xiao. jina-embeddings-v3: Multilingual embeddings with task lora.arXiv preprint arXiv:2409.10173, 2024

work page arXiv 2024
[24]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert- networks.arXiv preprint arXiv:1908.10084, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1908
[25]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

[26]

Angle-optimized text embeddings

Xianming Li and Jing Li. Angle-optimized text embeddings. arXiv preprint arXiv:2309.12871, 2023

[27]

Open source strikes bread - new fluffy embeddings model

Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. Open source strikes bread - new fluffy embeddings model, 2024

[28]

Arctic-embed: Scalable, efficient, and accurate text embedding models

Luke Merrick, Danmei Xu, Gaurav Nuti, and Daniel Campos. Arctic-embed: Scalable, efficient, and accurate text embedding models. arXiv preprint arXiv:2405.05374, 2024

[29]

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. NV-Embed: Improved techniques for training LLMs as generalist embedding models. arXiv preprint arXiv:2405.17428, 2024

[30]

EmbeddingGemma: Powerful and Lightweight Text Representations

Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, et al. EmbeddingGemma: Powerful and lightweight text representations. arXiv preprint arXiv:2509.20354, 2025

[31]

Language-agnostic BERT sentence embedding

Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. Language-agnostic BERT sentence embedding. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 878–891, 2022

Table 1: Distribution of datasets by task for ten languages with the largest total number of datas...

[32]

This cluster captures embedding models whose semantic spaces show weak alignment with the evaluated tasks and poor task-agnostic robustness, indicating limited general-purpose applicability. all-wsm-equal all-wsm-critic all-wsm-merec all-topsis-equal all-topsis-critic all-topsis-merec all-vikor-equal all-vikor-critic all-vikor-merec all-promethee_ii_usual...
