arxiv: 2605.31142 · v1 · pith:M6HZBOVRnew · submitted 2026-05-29 · 💻 cs.CL · cs.AI

On the Robustness of Multilingual Text Embedding Rankings Across Learning Tasks, Languages, and Benchmark Datasets

Pith reviewed 2026-06-28 22:41 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual text embeddings · MTEB benchmark · ranking robustness · dataset composition · aggregation methods · LLM-based models · multi-task evaluation · sensitivity analysis

The pith

Rankings of multilingual text embedding models shift depending on which datasets are included and how scores are aggregated.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a meta-study of multilingual model performance in the MTEB benchmark to test whether reported model superiority holds when dataset compositions or performance aggregation methods change. It defines two indicators, dataset-composition robustness and ranking-scheme robustness, that quantify how much rankings move under different evaluation designs. Across five languages and nine tasks, the task-specific results show that large-scale LLM-based models frequently rank at the top, though not uniformly (retrieval is one exception), while the task-agnostic view finds that only a small subset of models stays strong across all tasks, schemes, and subsamples. Results for roughly 230 further languages are also released.

Core claim

Benchmarking conclusions about which multilingual text embedding models perform best depend on implicit choices of dataset composition and performance aggregation method. Applying a range of multi-criteria decision-making ranking schemes together with the two new robustness indicators shows that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples. Large-scale LLM-based models are often robust top performers, with exceptions in tasks such as retrieval.
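
To make the dependence on aggregation concrete, here is a minimal sketch (not the paper's pipeline) that ranks one toy score matrix twice: once by a plain mean and once by TOPSIS, a standard multi-criteria decision-making method. The scores and the helper names `rank_desc`, `mean_rank`, and `topsis_rank` are illustrative assumptions, as is the choice of vector normalisation and equal weights.

```python
import numpy as np

def rank_desc(values):
    """Turn scores into ranks, 1 = best (highest value)."""
    order = np.argsort(-values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def mean_rank(scores):
    """Rank models by the plain average over datasets."""
    return rank_desc(scores.mean(axis=1))

def topsis_rank(scores, weights=None):
    """Rank models by TOPSIS closeness to the ideal solution.

    All columns are treated as benefit criteria (higher is better).
    """
    n_datasets = scores.shape[1]
    w = np.full(n_datasets, 1.0 / n_datasets) if weights is None else np.asarray(weights)
    # Vector normalisation per dataset column, then weighting.
    v = scores / np.linalg.norm(scores, axis=0) * w
    ideal, anti_ideal = v.max(axis=0), v.min(axis=0)
    d_pos = np.linalg.norm(v - ideal, axis=1)
    d_neg = np.linalg.norm(v - anti_ideal, axis=1)
    return rank_desc(d_neg / (d_pos + d_neg))

# Hypothetical scores: rows are three models, columns are two datasets.
scores = np.array([
    [0.90, 0.10],
    [0.80, 0.18],
    [0.70, 0.08],
])

mean_ranks = mean_rank(scores)
topsis_ranks = topsis_rank(scores)
print("mean:  ", mean_ranks)
print("TOPSIS:", topsis_ranks)
```

On this toy data the mean puts the first model on top, while TOPSIS prefers the second: column-wise normalisation rescales the low-magnitude second dataset, so differences there count for more than they do in a raw average.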

What carries the argument

Two robustness indicators quantify how stable model orderings are under altered evaluation designs: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to a change of aggregation method).
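
The exact formulas for the two indicators are not given on this page, so the following is only a rough proxy under stated assumptions: sensitivity is measured as the spread between a model's best and worst rank, first across dataset compositions with the aggregation fixed, then across aggregation functions with the composition fixed. The scores, the leave-one-out compositions, and the use of `mean`, `min`, and `max` as stand-in schemes are all hypothetical.

```python
from statistics import mean

# Hypothetical scores: model -> dataset -> score (higher is better).
SCORES = {
    "model_a": {"d1": 0.80, "d2": 0.60, "d3": 0.70},
    "model_b": {"d1": 0.70, "d2": 0.75, "d3": 0.68},
    "model_c": {"d1": 0.50, "d2": 0.55, "d3": 0.52},
}

def rank_models(scores, datasets, aggregate=mean):
    """Rank models (1 = best) by an aggregate over the chosen datasets."""
    agg = {m: aggregate([s[d] for d in datasets]) for m, s in scores.items()}
    ordered = sorted(agg, key=agg.get, reverse=True)
    return {m: pos for pos, m in enumerate(ordered, start=1)}

def rank_spread(rankings):
    """Per model: worst rank minus best rank across several rankings."""
    return {
        m: max(r[m] for r in rankings) - min(r[m] for r in rankings)
        for m in rankings[0]
    }

datasets = ["d1", "d2", "d3"]

# Composition axis: full set plus each leave-one-out subset, mean aggregation.
compositions = [datasets] + [[d for d in datasets if d != out] for out in datasets]
ds_spread = rank_spread([rank_models(SCORES, c) for c in compositions])

# Scheme axis: full composition, three stand-in aggregation functions.
rs_spread = rank_spread([rank_models(SCORES, datasets, f) for f in (mean, min, max)])

print("spread across compositions:", ds_spread)
print("spread across schemes:     ", rs_spread)
```

A spread of 0 means the model keeps its position however the evaluation is set up; larger values flag models whose standing depends on the design.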

If this is right

  • Large-scale LLM-based models are often robust top performers across most tasks, but not uniformly; retrieval is one exception.
  • Only a small subset of models remains consistently strong when tasks, ranking schemes, and data subsamples are varied together.
  • Task-specific analyses reveal that stability of model rankings differs by learning task.
  • Results released for approximately 230 additional languages extend the sensitivity findings beyond the five languages examined in depth.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Evaluators who weight languages or tasks differently from the MTEB default may obtain different model recommendations.
  • Applications that rely on a single top-ranked model should test that model on their own data compositions rather than assuming benchmark stability.
  • Future benchmark releases could report the two robustness indicators alongside raw scores so users can judge ranking reliability directly.
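
The first point above can be illustrated with a small sketch: the same per-task scores yield a different recommendation once the task weights change. The model names, scores, and weight profiles are invented for illustration and do not come from the paper or from MTEB.

```python
def weighted_score(task_scores, weights):
    """Weighted average of per-task scores; weights need not sum to 1."""
    total = sum(weights.values())
    return sum(task_scores[t] * w for t, w in weights.items()) / total

def best_model(all_scores, weights):
    """Name of the model with the highest weighted score."""
    return max(all_scores, key=lambda m: weighted_score(all_scores[m], weights))

# Hypothetical per-task scores for two models.
TASK_SCORES = {
    "model_a": {"retrieval": 0.70, "classification": 0.60},
    "model_b": {"retrieval": 0.55, "classification": 0.72},
}

equal = {"retrieval": 1, "classification": 1}
classification_heavy = {"retrieval": 1, "classification": 3}

pick_equal = best_model(TASK_SCORES, equal)
pick_heavy = best_model(TASK_SCORES, classification_heavy)
print(pick_equal, pick_heavy)
```

With equal weights the first model wins; tripling the weight on classification hands the recommendation to the second.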

Load-bearing premise

The MTEB dataset collection and the chosen five languages plus nine tasks supply a representative sample for detecting sensitivity of rankings to composition and aggregation changes.

What would settle it

The claim that rankings are sensitive to these choices would be falsified if re-computing all rankings for every possible subset of the MTEB datasets under every aggregation method always put the identical set of models in the top positions.
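
A toy version of that check can be sketched as follows, under simplifying assumptions: a single aggregation method (the mean), hypothetical scores, and a hypothetical helper `top_k_is_invariant`. A full test would repeat the loop for every aggregation method, and the number of non-empty subsets grows as 2^n − 1, so exhaustive enumeration is only feasible for small dataset collections.

```python
from itertools import combinations
from statistics import mean

# Hypothetical scores: model -> dataset -> score (higher is better).
SCORES = {
    "model_a": {"d1": 0.80, "d2": 0.60, "d3": 0.70},
    "model_b": {"d1": 0.70, "d2": 0.75, "d3": 0.68},
    "model_c": {"d1": 0.50, "d2": 0.55, "d3": 0.52},
}

def top_k(scores, datasets, k):
    """Set of the k best models by mean score over the given datasets."""
    agg = {m: mean(s[d] for d in datasets) for m, s in scores.items()}
    return frozenset(sorted(agg, key=agg.get, reverse=True)[:k])

def top_k_is_invariant(scores, k):
    """True if every non-empty dataset subset yields the same top-k set."""
    datasets = sorted(next(iter(scores.values())))
    seen = set()
    for size in range(1, len(datasets) + 1):
        for subset in combinations(datasets, size):
            seen.add(top_k(scores, subset, k))
    return len(seen) == 1

top1_stable = top_k_is_invariant(SCORES, k=1)
top2_stable = top_k_is_invariant(SCORES, k=2)
print(top1_stable, top2_stable)
```

Here the single best model changes with the subset, while the top-2 set does not, which shows that the outcome of such a test also depends on how "top positions" is defined.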

Figures

Figures reproduced from arXiv: 2605.31142 by Ana Gjorgjevikj, Barbara Koroušić Seljak, Tome Eftimov.

Figure 1 presents a comprehensive overview of the most robust models for each task-language combination. Several patterns emerge from the analysis. First, model Qwen3-Embedding-8B exhibits remarkable consistency across classification-oriented tasks with regard to the RS robustness, achieving top performance in classification and pair classification across all five languages. Second, for retrieval, bilingual-embeddi…
Figure 2. Ranking sensitivity of top-6 models across MCDM schemes for clustering in French. Boxplots show rank distributions across MCDM methods, each with three weighting schemes and three dataset compositions. … based models such as Octen-Embedding-8B, SFR-Embedding-Mistral, SFR-Embedding-2_R, GritLM-8x7B, and multilingual-e5-large-instruct. While some (e.g., Octen-Embedding-8B, multilingual-e5-large-instruct) achi…
Figure 3. Robustness distribution across tasks (left) and languages (right). Stacked horizontal bars show the number of languages achieving full robustness (green), ranking scheme robustness only (yellow), or not applicable (gray) for each task. Grouped vertical bars compare robustness levels across languages. Full robustness (DS+RS) indicates stable model rankings under both dataset subsampling and ranking schemes.…
Figure 4 presents a high-level summary of the task-agnostic analysis (see Appendix E for detailed heatmaps), identifying models that maintain the most robust performance across all nine tasks within each language. This addresses a key practical scenario: understanding the robustness of models across diverse downstream applications when task requirements are unknown or varied. The results reveal a clear hierarchy of…
Figure 5. Dendrogram for French language on the clustering task.
Figure 6. Model sensitivity across ranking schemes for French language on the clustering task.
Figure 7. Model sensitivity to dataset compositions by ranking scheme for French language on the
Figure 8. Ranking sensitivity of the embedding models from Figure 5 across the ranking schemes
Figure 9. Ranking sensitivity of each individual embedding model from Figure 5 across the ranking
Figure 10. Dendrogram for French language on the classification task.
Figure 11. Dendrogram for French language on the retrieval task.
Figure 12. Dendrogram for French language on the reranking task.
Figure 13. Dendrogram for French language on the pair classification task.
Figure 14. Dendrogram for French language on the STS task.
Figure 15. Dendrogram for French language on the bitext mining task.
Figure 16. Dendrogram for French language on the multilabel classification task.
Figure 17. Cross-task consistency by ranking scheme for French language.
Figure 18. Cross-task consistency by ranking scheme for English language.
Figure 19. Cross-task consistency by ranking scheme for German language.
Figure 20. Cross-task consistency by ranking scheme for Hindi language.
Figure 21. Cross-task consistency by ranking scheme for Spanish language.
Figure 22. Relationship between clustering similarity and ranking consistency across languages on
Figure 23. Relationship between clustering similarity and ranking consistency across languages on
Figure 24. Relationship between clustering similarity and ranking consistency across languages on
Figure 25. Relationship between clustering similarity and ranking consistency across languages on
Figure 26. Relationship between clustering similarity and ranking consistency across languages on
Figure 27. Relationship between clustering similarity and ranking consistency across languages on
Figure 28. Cross-task consistency by ranking scheme for French language for
Figure 29. Cross-task consistency by ranking scheme for French language for
Figure 30. Cross-task consistency by ranking scheme for English language for
Figure 31. Cross-task consistency by ranking scheme for English language for
Figure 32. Cross-task consistency by ranking scheme for German language for
Figure 33. Cross-task consistency by ranking scheme for German language for
Figure 34. Cross-task consistency by ranking scheme for Hindi language for
Figure 35. Cross-task consistency by ranking scheme for Hindi language for
Figure 36. Cross-task consistency by ranking scheme for Spanish language for
Figure 37. Cross-task consistency by ranking scheme for Spanish language for
Figure 38. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 39. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 40. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 41. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 42. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 43. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 44. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 45. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 46. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 47. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 48. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 49. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 50. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 51. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 52. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 53. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 54. Spearman correlation between the MTEB model rank and the rankings produced by
Figure 55. Spearman correlation between the MTEB model rank and the rankings produced by

Original abstract

Large-scale multilingual text embedding models play a crucial role in both research and industry, yet their behavior in language-specific, multi-task settings remains insufficiently understood. Although benchmarking platforms such as MTEB report results across more than 250 languages, conclusions about model superiority often depend on implicit choices of dataset compositions and performance aggregation methods. To address this gap, we present a meta-study of multilingual model performance robustness in MTEB, applying a diverse set of multi-criteria decision-making ranking schemes and introducing two robustness indicators: dataset-composition robustness (sensitivity of rankings to changing dataset compositions) and ranking-scheme robustness (sensitivity to aggregation method change). They enable systematic sensitivity analysis of whether benchmarking conclusions remain stable under different evaluation designs. We conduct an in-depth analysis on five languages (English, French, German, Hindi, and Spanish) across nine tasks (e.g., classification, clustering, retrieval) and release results for approximately 230 additional languages. The task-specific analyses show that large-scale LLM-based models are often robust top performers, though not uniformly (e.g., in the retrieval task), while task-agnostic results reveal that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript conducts a meta-study of multilingual text embedding model performance on the MTEB benchmark. It introduces two new indicators—dataset-composition robustness (sensitivity of rankings to dataset composition changes) and ranking-scheme robustness (sensitivity to aggregation method changes)—and applies multi-criteria decision-making ranking schemes to analyze stability. The study focuses on five languages (English, French, German, Hindi, Spanish) across nine tasks, finds that large-scale LLM-based models are often robust top performers in task-specific settings (though not uniformly, e.g., retrieval), and concludes that only a small subset of models remains consistently strong across tasks, schemes, and subsamples in task-agnostic analyses. Extended results for ~230 additional languages are released.

Significance. If the central findings hold, the work offers a systematic framework for assessing how benchmarking conclusions depend on evaluation design choices, which is relevant for reliable model selection in multilingual settings. The release of results for additional languages is a concrete positive contribution that supports reproducibility and further analysis.

major comments (1)
  1. [meta-study design and analysis scope] The task-agnostic conclusion that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples is load-bearing on the representativeness of the chosen five languages and nine tasks (plus implicit MTEB dataset compositions). The meta-study design applies the new robustness indicators only within this scope; without explicit justification, comparison to the full MTEB variability (250+ languages, dozens of datasets per task), or external validation, the observed consistency may not generalize beyond the selected slice.
minor comments (1)
  1. [methods] The abstract and methods description omit precise details on dataset subsampling rules, the exact computation of the two robustness indicators, statistical tests applied, and error handling; adding these would strengthen verifiability of the reported sensitivities.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the scope of our meta-study. We address the major comment below.

point-by-point responses
  1. Referee: [meta-study design and analysis scope] The task-agnostic conclusion that only a small subset of models remains consistently strong across tasks, ranking schemes, and data subsamples is load-bearing on the representativeness of the chosen five languages and nine tasks (plus implicit MTEB dataset compositions). The meta-study design applies the new robustness indicators only within this scope; without explicit justification, comparison to the full MTEB variability (250+ languages, dozens of datasets per task), or external validation, the observed consistency may not generalize beyond the selected slice.

    Authors: We appreciate the referee highlighting this point. The five languages (English, French, German, Hindi, Spanish) were selected to span high- and low-resource settings and multiple language families while enabling computationally intensive robustness analyses across nine tasks and multiple ranking schemes; extending the full indicator computation to all 250+ languages would have been prohibitive. The task-agnostic conclusions are scoped to this representative slice, and the manuscript already releases per-language results for ~230 additional languages to support community extensions. We agree that the current version lacks explicit justification for the selection and a limitations discussion on generalizability; we will add both in the revision.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical meta-analysis on public benchmarks

full rationale

The paper defines two new robustness indicators (dataset-composition robustness and ranking-scheme robustness) and applies them to existing MTEB data for five languages and nine tasks. No equations, predictions, or derivations are present that reduce claimed results to fitted parameters, self-citations, or self-definitions by construction. The analysis consists of sensitivity checks on public benchmark outputs using multi-criteria ranking schemes; conclusions about model consistency follow directly from the computed indicators on the chosen data slice without circular reduction. Self-citations, if any, are not load-bearing for uniqueness theorems or ansatzes. This is a standard empirical meta-study whose central claims remain independent of the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the representativeness of MTEB data and the validity of the newly defined indicators; no free parameters are mentioned, but two invented entities are introduced.

axioms (1)
  • domain assumption: MTEB benchmark results across languages and tasks form a suitable base for sensitivity analysis of rankings
    Invoked when the meta-study is defined and when conclusions about robustness are drawn from the platform data.
invented entities (2)
  • dataset-composition robustness indicator (no independent evidence)
    purpose: Quantifies sensitivity of model rankings to changes in dataset composition
    Newly defined in the paper to enable the sensitivity analysis.
  • ranking-scheme robustness indicator (no independent evidence)
    purpose: Quantifies sensitivity of model rankings to changes in aggregation method
    Newly defined in the paper to enable the sensitivity analysis.

pith-pipeline@v0.9.1-grok · 5764 in / 1338 out tokens · 18800 ms · 2026-06-28T22:41:44.628165+00:00 · methodology
