pith. sign in

arxiv: 2606.13647 · v1 · pith:EZMN3N2Rnew · submitted 2026-06-11 · 💻 cs.CL · cs.AI· cs.LG

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Pith reviewed 2026-06-27 06:37 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords Slovaktext embeddingsbenchmarkmultilingual modelsmodel adaptationlow-resource languagessemantic searchRAG
0
0 comments X

The pith

Large instruction-tuned multilingual models outperform Slovak-specific NLU models on text embeddings, and adapted smaller versions compete with proprietary APIs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SkMTEB, the first comprehensive benchmark for Slovak text embeddings with 31 datasets across 7 task types. Evaluation of 31 models shows that large multilingual instruction-tuned models perform best while existing Slovak NLU models transfer poorly to embeddings. The authors adapt Multilingual E5 models by vocabulary trimming and fine-tuning to create e5-sk-small and e5-sk-large, which despite being up to 62% smaller achieve competitive results with closed APIs. These open models support local deployment for semantic search and RAG. The work provides a path for other low-resource languages by releasing all resources openly.

Core claim

SkMTEB reveals that for Slovak embeddings, instruction-tuned multilingual models lead in performance, Slovak NLU models do not transfer effectively, and vocabulary-trimmed and fine-tuned versions of Multilingual E5 models deliver competitive results at reduced sizes suitable for local use.

What carries the argument

Vocabulary trimming and fine-tuning applied to Multilingual E5 models on the SkMTEB benchmark to produce smaller Slovak embedding models.

If this is right

  • Large multilingual models are preferable for Slovak embedding tasks over language-specific NLU models.
  • Model size can be reduced by up to 62% while maintaining competitive performance through adaptation.
  • Open-source adapted models enable local semantic search and RAG without relying on proprietary APIs.
  • The benchmark and adaptation method offer a template for other under-resourced languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Embedding tasks may require different training signals than traditional NLU tasks even in the same language.
  • Similar adaptation techniques could reduce computational costs for embeddings in many low-resource settings.
  • Performance gaps between open and closed models may narrow further with targeted fine-tuning on language-specific benchmarks.

Load-bearing premise

The 31 datasets across 7 task types represent a fair sample of actual Slovak embedding applications.

What would settle it

A new embedding model achieving substantially higher average scores on SkMTEB than the adapted e5-sk models, or a demonstration that key real-world Slovak use cases are missing from the benchmark.

Figures

Figures reproduced from arXiv: 2606.13647 by Andrej Ridzik, Daniel Hl\'adek, Marek \v{S}uppa, Nat\'alia K\v{n}a\v{z}ekov\'a, Vikt\'oria Ondrejov\'a.

Figure 1
Figure 1. Figure 1: Overview of the SkMTEB benchmark comprising 31 datasets across 7 task types. Lighter shading [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: SkMTEB Pareto frontier (All). Each point [PITH_FULL_IMAGE:figures/full_fig_p014_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Model Size vs. Average Performance selected from the full generated dataset, with 50 pairs drawn from each similarity score. The an￾notation guidelines did not explicitly specify the target distribution of similarity scores. Three native Slovak speakers served as annotators: two are coau￾thors of this work, and the third is a colleague who volunteered without any expectation of remuner￾ation. All annotator… view at source ↗
Figure 4
Figure 4. Figure 4: Abbreviated system prompt for STS sentence generation (Slovak). The full prompt includes detailed [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
read the original abstract

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop \texttt{e5-sk-small} (45M parameters) and \texttt{e5-sk-large} (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62\%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SkMTEB, the first MTEB-style benchmark for Slovak text embeddings comprising 31 datasets across 7 task types (nearly 4× existing multilingual coverage for the language). It evaluates 31 embedding models, finding that large instruction-tuned multilingual models perform best while Slovak-specific NLU models transfer poorly to embeddings; it also presents vocabulary-trimmed and fine-tuned e5-sk-small (45M params) and e5-sk-large (365M params) models that achieve competitive results with proprietary APIs while remaining open-source and locally deployable for semantic search and RAG.

Significance. If the benchmark datasets are representative, the work supplies a much-needed evaluation resource for a low-resource language and demonstrates a replicable adaptation path (vocabulary trimming + fine-tuning) that yields practical, smaller open models. The open release of the benchmark, models, datasets, and code strengthens the contribution for the community working on multilingual embeddings.

major comments (2)
  1. [Abstract and dataset section] Abstract and § on dataset construction: the central performance claims (large instruction-tuned models dominate; e5-sk models are competitive with proprietary APIs) rest on the 31 datasets being a representative proxy for real Slovak semantic-search and RAG workloads, yet the manuscript supplies no quantitative evidence such as the fraction of native versus machine-translated text, domain-coverage statistics, or validation metrics on native data. This is load-bearing for the headline rankings and generalizability statements.
  2. [Model adaptation section] Model adaptation section (description of e5-sk-small/large): the vocabulary-trimming procedure and fine-tuning data/protocol are described at high level, but the manuscript does not report the exact trimming ratio, the composition or size of the fine-tuning corpus, or ablation results isolating the contribution of trimming versus fine-tuning; without these, the claim that the size reductions (up to 62%) preserve competitive performance cannot be fully assessed.
minor comments (2)
  1. [Results tables] Table reporting model sizes and scores: ensure all 31 models are listed with consistent parameter counts and that the two new e5-sk variants are clearly distinguished from the base Multilingual E5 models.
  2. [Abstract and introduction] The phrase “nearly 4× the depth of existing multilingual benchmark coverage” should be accompanied by an explicit citation or table comparing prior Slovak coverage to the new 31 datasets.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper accordingly to improve transparency and support for our claims.

read point-by-point responses
  1. Referee: [Abstract and dataset section] Abstract and § on dataset construction: the central performance claims (large instruction-tuned models dominate; e5-sk models are competitive with proprietary APIs) rest on the 31 datasets being a representative proxy for real Slovak semantic-search and RAG workloads, yet the manuscript supplies no quantitative evidence such as the fraction of native versus machine-translated text, domain-coverage statistics, or validation metrics on native data. This is load-bearing for the headline rankings and generalizability statements.

    Authors: We agree that quantitative details on dataset composition would better substantiate the representativeness claims. The manuscript describes the 31 datasets and their sources but does not include breakdowns such as native vs. machine-translated fractions or domain statistics. In the revision we will add these metrics (e.g., a table reporting the proportion of native text per task type and domain coverage) drawn from the dataset documentation and curation process to strengthen the generalizability discussion. revision: yes

  2. Referee: [Model adaptation section] Model adaptation section (description of e5-sk-small/large): the vocabulary-trimming procedure and fine-tuning data/protocol are described at high level, but the manuscript does not report the exact trimming ratio, the composition or size of the fine-tuning corpus, or ablation results isolating the contribution of trimming versus fine-tuning; without these, the claim that the size reductions (up to 62%) preserve competitive performance cannot be fully assessed.

    Authors: We acknowledge the high-level description limits assessment of the adaptation method. We will expand the section to report the exact trimming ratios, the fine-tuning corpus size and composition (including sources and example counts), and any existing comparisons between trimmed and untrimmed variants. Full isolating ablations of trimming versus fine-tuning were not performed in the original experiments; we will therefore provide the requested details while noting this as a limitation and potential future direction. revision: partial

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper introduces an empirical benchmark (SkMTEB with 31 datasets) and performs direct model evaluations plus standard vocabulary-trimming + fine-tuning adaptations on Multilingual E5. All performance claims rest on measured results against the new datasets rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation chain. The representativeness of the 31 datasets is an external validity assumption, not a circularity in the reported chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark and adaptation study with no free parameters, axioms, or invented entities required for the central claims beyond standard supervised fine-tuning practices.

pith-pipeline@v0.9.1-grok · 5743 in / 966 out tokens · 12297 ms · 2026-06-27T06:37:44.112893+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

23 extracted references · 3 linked inside Pith

  1. [1]

    Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa

    MTEB-NL and E5-NL: Embedding bench- mark and models for dutch.arXiv preprint arXiv:2509.12340. Lucas Bandarkar, Davis Liang, Benjamin Muller, Mikel Artetxe, Satya Narayan Shukla, Donald Husa, Naman Goyal, Abhinandan Krishnan, Luke Zettlemoyer, and Madian Khabsa. 2024. The belebele benchmark: a parallel reading comprehension dataset in 122 lan- guage varia...

  2. [2]

    Kenneth Enevoldsen, Márton Kardos, Niklas Muen- nighoff, and Kristoffer Laigaard Nielbo

    Webfaq: A multilingual collection of nat- ural q&a datasets for dense retrieval.Preprint, arXiv:2502.20936. Kenneth Enevoldsen, Márton Kardos, Niklas Muen- nighoff, and Kristoffer Laigaard Nielbo. 2024. The scandinavian embedding benchmarks: Comprehen- sive assessment of multilingual and monolingual text embedding. InAdvances in Neural Information Pro- ce...

  3. [3]

    InInternational Conference on Learning Representations

    MMTEB: Massive multilingual text embed- ding benchmark. InInternational Conference on Learning Representations. Christian Federmann, Tom Kocmi, and Ying Xin. 2022. NTREX-128 – news test references for MT evalua- tion of 128 languages. InProceedings of the First Workshop on Scaling Up Multilingual Evaluation, pages 21–24, Online. Association for Computatio...

  4. [4]

    InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16024–16036, Torino, Italy

    The ParlaSent multilingual training dataset for sentiment identification in parliamentary proceed- ings. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 16024–16036, Torino, Italy. ELRA and ICCL. Sepideh Mollanorozy, Marc Tanti, and Malvina Nissim

  5. [5]

    InProceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 89–95, Dubrovnik, Croatia

    Cross-lingual transfer learning with Persian. InProceedings of the 5th Workshop on Research in Computational Linguistic Typology and Multilingual NLP, pages 89–95, Dubrovnik, Croatia. Association for Computational Linguistics. Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. 2023. MTEB: Massive text embedding benchmark. InProceedings of th...

  6. [6]

    The russian-focused embedders’ exploration: ruMTEB benchmark and Russian embedding model design. InProceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies (Volume 1: Long Papers), pages 236–254, Albuquerque, New Mexico. Association for Compu- tational Linguistics. Zuz...

  7. [7]

    InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 14725–14739, Singapore

    Efficient multilingual language model com- pression through vocabulary trimming. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 14725–14739, Singapore. Asso- ciation for Computational Linguistics. Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Pa- nyam, Sara Smoot, Iftekhar Naim, ...

  8. [8]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman

    Embeddinggemma: Powerful and lightweight text representations.ArXiv, abs/2509.20354. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for nat- ural language understanding. InProceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Net...

  9. [9]

    Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos

    Association for Computational Linguistics. Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos. 2024. Arctic-embed 2.0: Multilingual retrieval without compromise.arXiv preprint arXiv:2412.04506. Biao Zhang, Philip Williams, Ivan Titov, and Rico Sen- nrich. 2020. Improving massively multilingual neu- ral machine translation and zero-shot translation. I...

  10. [10]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou

    Association for Computational Linguistics. Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. Qwen3 embedding: Advancing text embedding and reranking through foundation models. ArXiv, abs/2506.05176. Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi...

  11. [11]

    Read an original Slovak sentence from the SlovakSum corpus

  12. [12]

    Generate six variant sentences, one for each STS score (0–5)

  13. [13]

    Provide a Slovak explanation justifying each score assignment

  14. [14]

    5, score 1 vs

    Follow detailed score boundary definitions distinguishing between adjacent scores (e.g., score 4 vs. 5, score 1 vs. 2)

  15. [15]

    Maintain natural Slovak language quality suit- able for native speakers

  16. [16]

    nearly equivalent with minor gener- alizations

    Return output in a structured JSON format. The prompt (in Figure 4) includes two complete worked examples demonstrating expected outputs for different source sentences. Key design choices include: Annotator pair Overall Accuracy Pearson Corr. Spearman Corr. Cohen’sκ [1,2] 0.67 0.94 0.94 0.60 [1,3] 0.63 0.91 0.92 0.56 [2,3] 0.61 0.93 0.93 0.53 Table 5: Pai...

  17. [17]

    Prečítať si jednu slovenskú vetu (ORIGINÁL)

  18. [18]

    0": {"sentence

    Pre KAŽDÉ skóre podobnosti 0–5 (STS) vytvor: – jednu slovenskú vetu (VARIANT), – jedno krátke vysvetlenie v slovenčine. DEFINÍCIE SKÓRE 0–5 Všeobecné pravidlo:„Dôležitá informácia" = hlavná udalost’, aktér, miesto,ˇcas, množstvo, základná pointa. Ak zmeníš alebo vynecháš dôležitú informáciu, skóre NEMÔŽE byt’ 4 ani 5. Skóre 0– ÚPLNE INÁ TÉMA Vety sú o úpl...

  19. [19]

    Compute token frequencies on FineWeb2-Slovak corpus

  20. [20]

    Rank tokens by frequency in target language

  21. [21]

    Retain top 60K tokens (recommended threshold from original paper)

  22. [22]

    Resize embedding and output projection matrices

  23. [23]

    MTEB(slk, v1)

    Fine-tune on downstream tasks The 60K threshold was chosen based on the original vocabulary trimming paper’s recommendation, which found this value provides good coverage while maximizing compression. For comparison, MTEB- NL (Banar et al., 2025) uses 50K tokens for Dutch. F.4 Model and Data Availability All released models and datasets are collected on H...