arxiv: 2605.28190 · v1 · pith:C46NGGTTnew · submitted 2026-05-27 · 💻 cs.CL

The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

Pith reviewed 2026-06-29 12:40 UTC · model grok-4.3

classification 💻 cs.CL
keywords text embeddings · robustness evaluation · dynamic benchmarks · multilingual embeddings · lexical variation · embedding models · benchmarking

The pith

Embedding robustness is multidimensional and requires dynamic evaluation to expose failures hidden by static benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that benchmarks reporting one score per embedding model treat robustness as a fixed scalar property, whereas models actually respond differently to different kinds of input change. It introduces HTEB, which tests robustness at evaluation time by using an LLM to apply stochastic transformations along three axes: lexical and stylistic variation, length changes, and language shifts. The transformations are validated by human ratings on an English subsample. Results from 16 models across 32 datasets in 42 languages show three patterns: models have partly independent robustness profiles across the axes; scaling raises absolute scores but does not close the gap between original and transformed inputs, except on the language axis; and English datasets react more strongly to the transformations than multilingual ones.

Core claim

Embedding robustness is multidimensional because models respond differently to variations along lexical/stylistic, length, and language axes, and dynamic evaluation with stochastic LLM transformations at test time is needed to reveal weaknesses that static benchmarks miss. HTEB applies these transformations to 32 datasets covering 42 languages, validated by 4,800 human ratings on English data, and finds that models exhibit specific, partly decoupled robustness profiles, that scale increases absolute scores but does not close the gap between original and transformed evaluations except on the language axis, and that English datasets are more sensitive to the transformations than multilingual ones.

What carries the argument

HTEB, a dynamic evaluation framework that applies stochastic LLM transformations to inputs at evaluation time along the Lexical/Stylistic, Length, and Language axes.
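
To make the idea of evaluation-time transformation concrete, here is a minimal sketch of such a loop. It is not the paper's implementation: HTEB uses an LLM to transform inputs, whereas the two rule-based functions below (`lexical`, `length`), the character-trigram `embed` function, and the toy retrieval metric are all hypothetical stand-ins chosen so the example runs without external dependencies.

```python
import math
import random
from collections import Counter

# Hypothetical stand-ins: HTEB uses an LLM; these rules only mimic stochasticity.
SYNONYMS = {"quick": "fast", "buy": "purchase", "cheap": "inexpensive"}
FILLERS = ["to be honest", "as far as I can tell", "in any case"]

def lexical(text, rng):
    """Swap each known word for a synonym with probability 0.5."""
    return " ".join(
        SYNONYMS[w] if w in SYNONYMS and rng.random() < 0.5 else w
        for w in text.split()
    )

def length(text, rng):
    """Lengthen the input by appending a randomly chosen filler phrase."""
    return f"{text} {rng.choice(FILLERS)}"

TRANSFORMS = {"lexical": lexical, "length": length}

def embed(text):
    """Toy embedder (assumption): L2-normalised character-trigram counts."""
    t = text.lower()
    counts = Counter(t[i:i + 3] for i in range(max(len(t) - 2, 1)))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {k: v / norm for k, v in counts.items()}

def cosine(a, b):
    """Dot product of two sparse unit vectors."""
    return sum(v * b.get(k, 0.0) for k, v in a.items())

def accuracy_at_1(queries, docs, gold):
    """Fraction of queries whose nearest document is the gold document."""
    doc_vecs = [embed(d) for d in docs]
    hits = 0
    for query, gold_idx in zip(queries, gold):
        qv = embed(query)
        best = max(range(len(docs)), key=lambda i: cosine(qv, doc_vecs[i]))
        hits += int(best == gold_idx)
    return hits / len(queries)

def dynamic_eval(queries, docs, gold, n_runs=3, seed=0):
    """Score the original queries, then re-score them under each
    stochastic transformation over n_runs and report the mean and delta."""
    rng = random.Random(seed)
    original = accuracy_at_1(queries, docs, gold)
    report = {}
    for name, transform in TRANSFORMS.items():
        scores = [
            accuracy_at_1([transform(q, rng) for q in queries], docs, gold)
            for _ in range(n_runs)
        ]
        mean_score = sum(scores) / n_runs
        report[name] = {"mean": mean_score, "delta": mean_score - original}
    return original, report
```

Calling `dynamic_eval(queries, docs, gold)` returns the untransformed score together with a per-transformation mean and delta. Because the transformations are re-sampled on every run, the test set is never fixed, which is the property that distinguishes a dynamic benchmark from a static one.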

If this is right

  • Models exhibit specific, partly decoupled robustness profiles across the lexical/stylistic, length, and language axes.
  • Across model families, increasing scale raises absolute scores on both original and transformed inputs but does not close the performance gap except on the language axis.
  • English datasets prove more sensitive to the HTEB transformations than multilingual datasets.
  • Single-score embedding benchmarks miss distinctions in how models handle different types of variation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Multilingual training data may confer robustness advantages that could be tested by comparing matched English-only and multilingual model pairs.
  • Future benchmarks could report separate scores per axis rather than a single aggregate.
  • Training objectives might be adjusted to target specific axes independently to improve overall robustness.
  • The approach of on-the-fly stochastic transformations could extend to other NLP tasks such as classification or retrieval.
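
As an illustration of per-axis reporting, the sketch below groups score deltas by axis instead of collapsing them into one number. It uses the delta definition given in the Figure 6 caption (HTEB score minus Original score); the record layout, the function name `axis_profile`, and the transformation-to-axis mapping passed in by the caller are assumptions made for this example, not details taken from the paper.

```python
from collections import defaultdict
from statistics import mean

def axis_profile(records, axis_of):
    """Average the delta (transformed - original) per model and axis.

    records: iterable of (model, transformation, original, transformed)
    axis_of: dict mapping each transformation name to its axis (assumed
             to be supplied by the caller; names are hypothetical)
    """
    buckets = defaultdict(list)
    for model, transformation, original, transformed in records:
        buckets[(model, axis_of[transformation])].append(transformed - original)

    profile = defaultdict(dict)
    for (model, axis), deltas in buckets.items():
        profile[model][axis] = mean(deltas)
    return {model: dict(axes) for model, axes in profile.items()}
```

The result is one small vector per model, for example `{"lexical": ..., "length": ..., "language": ...}`, so two models with the same aggregate score can still be told apart by where they lose performance.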

Load-bearing premise

The stochastic LLM transformations along the three axes represent practically relevant real-world variations without introducing artifacts, and the human ratings on an English subsample validate the transformations for all 42 languages and 32 datasets.

What would settle it

Finding that all tested models show identical robustness levels across the three axes or that increasing scale fully eliminates the performance difference between original and transformed inputs on every axis.
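
The two falsifying outcomes can be phrased as simple checks, sketched below under stated assumptions. The function names, the tolerance values, and the reading of "identical robustness levels" as "per-axis deltas within a tolerance of each other for every model" are choices made for this example; the paper does not specify such thresholds.

```python
def profiles_identical(profile, tol=0.01):
    """True if, for every model, the per-axis deltas differ by at most tol.

    profile: {model: {axis: delta}}, with delta = transformed - original.
    """
    return all(
        max(axes.values()) - min(axes.values()) <= tol
        for axes in profile.values()
    )

def gap_by_size(results):
    """Return (size, original - transformed) pairs ordered by model size.

    results: {model_size: (original_score, transformed_score)}
    """
    return [(size, orig - trans) for size, (orig, trans) in sorted(results.items())]

def scale_closes_gap(results, tol=0.005):
    """True if the largest model's gap is within tol of zero."""
    largest_gap = gap_by_size(results)[-1][1]
    return abs(largest_gap) <= tol
```

If `profiles_identical` held for all evaluated models, or `scale_closes_gap` held on every axis, the multidimensionality claim or the scaling claim respectively would be undercut.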

Figures

Figures reproduced from arXiv: 2605.28190 by Haithem Afli, Manuel Frank.

Figure 1: (M)MTEB datasets are transformed over n runs using eight transformations that generate lexical/stylistic-, length- or language-related variations.
Figure 2: Four-step method to select the HTEB transformation model.
Figure 3: Per-item embedding drift versus mean human transformation quality rating, by transformation. Each dot …
Figure 4: Left: Embedding model performance on English data (19 datasets, 3 runs). Right: Embedding model performance on multilingual data (13 datasets, 3 runs). Both: Right columns report Original and total HTEB score with ø denoting the average. Exact scores per axis and model can be found in …
Figure 5: Size ablations for Qwen3-Embedding, …
Figure 6: Mean score delta (∆ = HTEB − Original) per embedding model (model size in parentheses) and transformation, averaged over datasets and seeds. Negative values (red) indicate degradation under HTEB's transformations; positive values (green) indicate improvement.
Figure 7: HL location shift with per-transformation 95% …
Figure 8: Google Form to collect the human quality …
Figure 9: HTEB performance drop vs. Original score
Figure 10: English benchmark (n = 11 models). Per-model |∆| score by robustness axis, stacked left-to-right in order of within-model disruption and grouped by axis ordering.
Figure 11: Multilingual benchmark (n = 9 models). Per-model |∆| score by robustness axis, stacked left-to-right in order of within-model disruption and grouped by axis ordering.
Original abstract

Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of variation, and requires dynamic evaluation to expose failures hidden by static benchmarks. We introduce the Harder Text Embedding Benchmark (HTEB), a dynamic evaluation framework that challenges model robustness along three practically interpretable axes (Lexical/Stylistic, Length and Language) by stochastically transforming inputs at evaluation time with an LLM. Evaluating 16 open-weight embedding models on 32 datasets covering 42 languages under transformations validated by 4,800 human ratings on an English subsample, we find three patterns: (1) Models exhibit specific, partly decoupled robustness profiles across axes. (2) Across three model families, scale increases absolute scores but does not close the gap between original and transformed evaluations. Here, scaling tends to improve specifically the Language axis. (3) English datasets are more sensitive to HTEB transformations than multilingual datasets. This demonstrates that HTEB identifies strengths and weaknesses of models along deployment-relevant axes, challenging current embedding benchmarks and arguing for multidimensional, dynamic robustness evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that embedding robustness is multidimensional (models respond differently to different variations) rather than the static scalar property implied by MTEB, and introduces the Harder Text Embedding Benchmark (HTEB) as a dynamic framework. HTEB applies stochastic LLM transformations along three axes (Lexical/Stylistic, Length, Language) to inputs from 32 datasets covering 42 languages at evaluation time; after validation via 4,800 human ratings on an English subsample, evaluation of 16 open-weight models reveals three patterns: (1) specific partly decoupled robustness profiles across axes, (2) scale increases absolute scores but does not close gaps (except on the Language axis), and (3) English datasets are more sensitive than multilingual ones.

Significance. If the transformations are shown to be free of systematic artifacts across languages, HTEB would offer a practically useful dynamic evaluation tool that exposes robustness failures hidden by static benchmarks and could inform model selection and development along deployment-relevant axes.

major comments (1)
  1. [Abstract and evaluation description section] The central patterns (2) and (3) rest on the claim that LLM transformations produce comparable, practically relevant variations across 42 languages, yet the only external validation cited is 4,800 human ratings on an English subsample. No ratings, grammaticality checks, or artifact analysis are described for the remaining 41 languages or the full set of 32 datasets; if transformation quality (semantic fidelity, introduced bias, or grammaticality) varies systematically by language family or script, the reported higher English sensitivity and the Language-axis scaling effect could be driven by differential artifact rates rather than genuine robustness differences.
minor comments (1)
  1. The abstract refers to 'three model families' without naming them; the main text should explicitly list the families and the specific models within each to allow readers to assess whether the scale and axis-specific patterns generalize.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback. The concern regarding the scope of human validation for the LLM transformations is well-taken and directly relevant to the strength of patterns (2) and (3). We address this point below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and evaluation description section] The central patterns (2) and (3) rest on the claim that LLM transformations produce comparable, practically relevant variations across 42 languages, yet the only external validation cited is 4,800 human ratings on an English subsample. No ratings, grammaticality checks, or artifact analysis are described for the remaining 41 languages or the full set of 32 datasets; if transformation quality (semantic fidelity, introduced bias, or grammaticality) varies systematically by language family or script, the reported higher English sensitivity and the Language-axis scaling effect could be driven by differential artifact rates rather than genuine robustness differences.

    Authors: We agree that the human validation is limited to an English subsample and that this constitutes a genuine limitation for claims involving cross-lingual comparisons. The 4,800 ratings were obtained on a stratified English subsample drawn from the 32 datasets to assess semantic fidelity, grammaticality, and absence of introduced bias for the three transformation axes. The same LLM (with language-specific prompts) was used to generate transformations for all 42 languages, which provides some consistency, but we did not perform equivalent human ratings or automated checks for non-English outputs. Consequently, it remains possible that differential artifact rates across language families or scripts contribute to the observed English sensitivity and the Language-axis scaling pattern. In the revised manuscript we will (a) explicitly state this limitation in both the abstract and the evaluation section, (b) add a dedicated paragraph in the Limitations section discussing the risk of language-specific artifacts and its implications for patterns (2) and (3), and (c) qualify the interpretation of the Language-axis results accordingly. We will not claim that the transformations have been shown to be free of systematic artifacts across all languages.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper introduces HTEB as a new dynamic benchmark using stochastic LLM transformations along three axes, with empirical patterns observed from evaluating 16 models on 32 datasets. The central claims rest on the external human validation (4,800 ratings on English subsample) and direct comparisons to the MTEB baseline, without any self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the result to its inputs by construction. No equations or uniqueness theorems are invoked that collapse the multidimensional robustness finding back to the benchmark definition itself. The derivation chain is independent and externally grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper; the central claim rests on the validity of the new evaluation framework and human validation rather than mathematical axioms or new entities. No free parameters or invented entities are detailed in the abstract.

axioms (1)
  • Domain assumption: human ratings on an English subsample validate the quality and relevance of LLM transformations across all languages and datasets.
    The paper relies on 4,800 human ratings to support the transformations.

pith-pipeline@v0.9.1-grok · 5734 in / 1330 out tokens · 50020 ms · 2026-06-29T12:40:19.908503+00:00 · methodology

Reference graph

Works this paper leans on

73 extracted references · 56 canonical work pages · 11 internal anchors

  1. [1]

    Mohamed Abdalla, Krishnapriya Vishnubhotla, and Saif Mohammad. 2023. https://doi.org/10.18653/v1/2023.eacl-main.55 What Makes Sentences Semantically Related ? A Textual Relatedness Dataset and Empirical Study . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages 782--796, Dubrovnik, Croati...

  2. [2]

    David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, and 46 others. ...

  3. [3]

    Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. 2026. https://doi.org/10.48550/arXiv.2602.15547 jina-embeddings-v5-text: Task - Targeted Embedding Distillation . ArXiv:2602.15547

  4. [4]

    Adnan El Assadi, Isaac Chung, Roman Solomatin, Niklas Muennighoff, and Kenneth Enevoldsen. 2026. https://openreview.net/forum?id=rcmfu1ydAf HUME : Measuring the Human - Model Performance Gap in Text Embedding Tasks . In International Conference on Learning Representations

  5. [5]

    Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, and Even Oldridge. 2025. https://arxiv.org/abs/2511.07025 Llama-embed-nemotron- 8B : a universal text embedding model for multilingual and cross-lingual tasks . ArXiv:2511.07025

  6. [6]

    Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. https://doi.org/10.18653/v1/2020.nlp4convai-1.5 Efficient Intent Detection with Dual Sentence Encoders . In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI , pages 38--45, Online. Association for Computational Linguistics

  7. [7]

    Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. https://doi.org/10.18653/v1/S17-2001 SemEval -2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation . In Proceedings of the 11th International Workshop on Semantic Evaluation ( SemEval -2017) , pages 1--14, Vancouver, Canada. Association fo...

  8. [8]

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. https://doi.org/10.18653/v1/2024.findings-acl.137 M3- Embedding : Multi - Linguality , Multi - Functionality , Multi - Granularity Text Embeddings Through Self - Knowledge Distillation . In Findings of the Association for Computational Linguistics : ACL 2024 , pages 2318--2...

  9. [9]

    Benjamin Clavié. 2024. https://doi.org/10.48550/arXiv.2312.16144 Towards Better Monolingual Japanese Retrievers with Multi - Vector Models . ArXiv:2312.16144

  10. [10]

    Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. https://doi.org/10.18653/v1/2020.acl-main.207 SPECTER : Document -level representation learning using citation-informed transformers . In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages 2270--2282. Association for Computational Linguistics

  11. [11]

    Ruchira Dhar and Anders Søgaard. 2026. https://arxiv.org/abs/2604.25923v1 Evaluation Revisited : A Taxonomy of Evaluation Concerns in Natural Language Processing . ArXiv:2604.25923

  12. [12]

    Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. 2017. https://doi.org/10.1162/tacl_a_00074 Replicability Analysis for Natural Language Processing : Testing Significance with Multiple Datasets . Transactions of the Association for Computational Linguistics, 5:471--486

  13. [13]

    Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. https://doi.org/10.18653/v1/P18-1128 The Hitchhiker 's Guide to Testing Statistical Significance in Natural Language Processing . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 1383--1392, Melbourne, Australia. Asso...

  14. [14]

    Bradley Efron and R. J. Tibshirani. 1994. https://doi.org/10.1201/9780429246593 An Introduction to the Bootstrap . Chapman and Hall/CRC, New York

  15. [15]

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Veysel Çağatan, and 63 others. 2025. https://openreview.net/forum...

  16. [16]

    Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. https://doi.org/10.1162/tacl_a_00373 SummEval : Re -evaluating summarization evaluation . Transactions of the Association for Computational Linguistics, 9:391--409

  17. [17]

    Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. 2023. https://doi.org/10.18653/v1/2023.acl-long.235 MASSIVE : A 1M - Example Multilingual Natural Language Understa...

  18. [18]

    Manuel Frank and Haithem Afli. 2026. https://doi.org/10.18653/v1/2026.eacl-long.130 PTEB : Towards robust text embedding evaluation via stochastic paraphrasing at evaluation time with LLMs . In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 2832--2851, Rabat, Mo...

  19. [19]

    Gemma Team , Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://doi.org/10.48550/arXiv....

  20. [20]

    Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. https://doi.org/10.1162/tacl_a_00370 Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies . Transactions of the Association for Computational Linguistics, 9:346--361

  21. [21]

    Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. https://aclanthology.org/W07-1401/ The third PASCAL recognizing textual entailment challenge . In Proceedings of the ACL - PASCAL Workshop on Textual Entailment and Paraphrasing , pages 1--9, Prague. Association for Computational Linguistics

  22. [22]

    Samarth Goel, Reagan Lee, and Kannan Ramchandran. 2025. https://openreview.net/forum?id=p4f2G2XXR4 SAGE : A Realistic Benchmark for Semantic Understanding . In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle : Benchmarks , Emergent Abilities , and Scaling

  23. [23]

    Kilem Li Gwet. 2008. https://doi.org/10.1348/000711006X126600 Computing inter-rater reliability and its variance in the presence of high agreement . British Journal of Mathematical and Statistical Psychology, 61(1):29--48

  24. [24]

    Tim C. Hesterberg. 2015. https://doi.org/10.1080/00031305.2015.1089789 What Teachers Should Know About the Bootstrap : Resampling in the Undergraduate Statistics Curriculum . The American Statistician, 69(4):371--386

  25. [25]

    Sture Holm. 1979. http://www.jstor.org/stable/4615733 A simple sequentially rejective multiple test procedure . Scandinavian Journal of Statistics, 6(2):65--70

  26. [26]

    Doris Hoogeveen, Karin M. Verspoor, and Timothy Baldwin. 2015. https://doi.org/10.1145/2838931.2838934 CQADupStack : a benchmark data set for community question-answering research . In Proceedings of the 20th Australasian Document Computing Symposium ( ADCS ) , pages 3:1--3:8, Parramatta, NSW, Australia. ACM

  27. [27]

    Yevhen Kostiuk and Kenneth Enevoldsen. 2026. https://doi.org/10.48550/arXiv.2605.22544 One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation . ArXiv:2605.22544

  28. [28]

    Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. https://doi.org/10.18653/v1/D17-1126 A continuously growing dataset of sentential paraphrases . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages 1224--1234, Copenhagen, Denmark. Association for Computational Linguistics

  29. [29]

    Ken Lang. 1995. https://doi.org/10.1016/B978-1-55860-377-6.50048-7 NewsWeeder : Learning to Filter Netnews . In Machine Learning Proceedings 1995 , pages 331--339, San Francisco (CA). Morgan Kaufmann

  30. [30]

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/hash/c4bf73386022473a652a18941e9ea6f8-Abstract-Conference.html NV - Embed : Improved Techniques for Training LLMs as Generalist Embedding Models . International Conference on Learning Representat...

  31. [31]

    Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. 2024. https://www.mixedbread.ai/blog/mxbai-embed-large-v1 Open source strikes bread -- new fluffy embeddings model . (accessed 2025-12-19)

  32. [32]

    Xianming Li and Jing Li. 2024. https://doi.org/10.18653/v1/2024.acl-long.101 AoE : Angle -optimized Embeddings for Semantic Textual Similarity . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 1825--1839, Bangkok, Thailand. Association for Computational Linguistics

  33. [33]

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html AWQ : Activation -aware Weight Quantization for On - Device LLM Compression and Acceleration . Proceed...

  34. [34]

    Xueqing Liu, Chi Wang, Yue Leng, and ChengXiang Zhai. 2018. https://doi.org/10.1145/3283812.3283815 LinkSO : a dataset for learning to retrieve similar question answer pairs on software development forums . In Proceedings of the 4th ACM SIGSOFT international workshop on NLP for software engineering , Nl4se 2018, pages 2--5, Lake Buena Vista, FL, USA. Asso...

  35. [35]

    Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, and Dieuwke Hupkes. 2024. https://doi.org/10.48550/arXiv.2406.10229 Quantifying Variance in Evaluation Benchmarks . ArXiv:2406.10229

  36. [36]

    Andani Madodonga, Vukosi Marivate, and Matthew Adendorff. 2023. https://doi.org/10.55492/dhasa.v4i01.4449 Izindaba- Tindzaba : Machine learning news categorisation for long and short text for isiZulu and Siswati . Journal of the Digital Humanities Association of Southern Africa (DHASA), 4(01)

  37. [37]

    Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. 2021. https://aclanthology.org/2021.emnlp-main.821 IndoNLI : a natural language inference dataset for Indonesian . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing , pages 10511--10527, Online and Punta Cana, Dominican Republic. As...

  38. [38]

    Microsoft. 2026. https://huggingface.co/microsoft/harrier-oss-v1-270m microsoft/harrier-oss-v1-270m · Hugging Face . (accessed 2026-05-13)

  39. [39]

    Mistral AI . 2025. https://mistral.ai/news/mistral-3 Introducing Mistral 3 Mistral AI . (accessed 2025-12-18)

  40. [40]

    MTEB. 2025. https://huggingface.co/datasets/mteb/PlscClusteringS2S.v2 Dataset PlscClusteringS2S .v2 . (accessed 2026-03-06)

  41. [41]

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://doi.org/10.18653/v1/2023.eacl-main.148 MTEB : Massive Text Embedding Benchmark . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics

  42. [42]

    James O'Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.568 I wish I would have loved this one, but I didn't – a multilingual dataset for counterfactual detection in product review . In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing ,...

  43. [43]

    Yulia Otmakhova, Thinh Hung Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, and Jey Han Lau. 2026. https://doi.org/10.18653/v1/2026.findings-eacl.269 FLUKE : a linguistically-driven and task-agnostic framework for robustness evaluation . In Findings of the Association for Computational Linguistics : EACL 2026 , pages 5103--5123, Rabat, Moro...

  44. [44]

    Nedjma Ousidhoum, Shamsuddeen Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Ahmad, Sanchit Ahuja, Alham Aji, Vladimir Araujo, Abinew Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine Kock, Genet Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, and 8 others. 2024. https://doi.org/10.18653/v1/2...

  45. [45]

    Shantipriya Parida, Sambit Sekhar, Soumendra Kumar Sahoo, Swateek Jena, Abhijeet Parida, Satya Ranjan Dash, and Guneet Singh Kohli. 2023. https://huggingface.co/datasets/OdiaGenAI/sentiment_analysis_hindi OdiaGenAI : Generative AI and LLM initiative for the odia language - Dataset Card Sentiment Analysis Hindi . (accessed 2026-05-13)

  46. [46]

    Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. 2022. https://doi.org/10.1162/tacl_a_00452 Sa...

  47. [47]

    Nils Reimers and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/D17-1035 Reporting Score Distributions Makes a Difference : Performance Study of LSTM -networks for Sequence Tagging . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages 338--348, Copenhagen, Denmark. Association for Computational Linguistics

  48. [48]

    Nils Reimers and Iryna Gurevych. 2018. https://arxiv.org/abs/1803.09578 Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches . ArXiv:1803.09578

  49. [49]

    Nils Reimers and Iryna Gurevych. 2019. https://doi.org/10.18653/v1/D19-1410 Sentence- BERT : Sentence Embeddings using Siamese BERT - Networks . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing ( EMNLP - IJCNLP ) , pages 3982--3992, Hong Kong...

  50. [50]

    Nils Reimers and Iryna Gurevych. 2020. https://arxiv.org/abs/2004.09813 Making monolingual sentence embeddings multilingual using knowledge distillation . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , pages 4512--4525. Association for Computational Linguistics

  51. [51]

    Darsh Shah, Tao Lei, Alessandro Moschitti, Salvatore Romeo, and Preslav Nakov. 2018. https://doi.org/10.18653/v1/D18-1131 Adversarial domain adaptation for duplicate question detection . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 1056--1063, Brussels, Belgium. Association for Computational Linguistics

  52. [52]

    Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. https://doi.org/10.1093/bioinformatics/btx238 BIOSSES : a semantic sentence similarity estimation system for the biomedical domain . Bioinformatics (Oxford, England), 33(14):i49--i58

  53. [53]

    Anders Søgaard. 2013. https://aclanthology.org/N13-1068/ Estimating effect size across datasets . In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics : Human Language Technologies , pages 607--611, Atlanta, Georgia. Association for Computational Linguistics

  54. [54]

    Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Hector Martínez Alonso. 2014. https://doi.org/10.3115/v1/W14-1601 What's in a p-value in NLP ? In Proceedings of the Eighteenth Conference on Computational Natural Language Learning , pages 1--10, Ann Arbor, Michigan. Association for Computational Linguistics

  55. [55]

    Team Olmo , Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, and 49 others. 2025. https://doi.org/10.48550/arXiv.2512.13961 Olmo 3 . A...

  56. [56]

    Søren Vejlgaard Holm, Lars Kai Hansen, and Martin Carsten Nielsen. 2025. https://aclanthology.org/2025.nodalida-1.78/ Danoliteracy of Generative Large Language Models. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 785--800, ...

  57. [57]

    Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. https://doi.org/10.18653/v1/P18-1023 Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 241--251, Melbourne, Australia. Association for Computational Linguistics.

  58. [58]

    David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.609 Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534--7550, Online. Association for...

  59. [59]

    Kexin Wang, Nils Reimers, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.59 TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 671--688, Punta Cana, Dominican Republic. Association fo...

  60. [60]

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024a. https://doi.org/10.48550/arXiv.2212.03533 Text Embeddings by Weakly-Supervised Contrastive Pre-training. ArXiv:2212.03533.

  61. [61]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024b. https://doi.org/10.18653/v1/2024.acl-long.642 Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897--11916, Bangkok, Thailand. Associa...

  62. [62]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024c. https://doi.org/10.48550/arXiv.2402.05672 Multilingual E5 Text Embeddings: A Technical Report. ArXiv:2402.05672.

  63. [63]

    John S. White, Theresa A. O'Connell, and Francis E. O'Mara. 1994. https://aclanthology.org/1994.amta-1.25/ The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, USA.

  64. [64]

    Frank Wilcoxon. 1945. https://doi.org/10.2307/3001968 Individual Comparisons by Ranking Methods. Biometrics Bulletin, 1(6):80--83.

  65. [65]

    Wei Xu, Chris Callison-Burch, and Bill Dolan. 2015. https://doi.org/10.18653/v1/S15-2001 SemEval-2015 task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1--11, Denver, Colorado. Association for Computational Linguistics.

  66. [66]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://doi.org/10.48550/arXiv.2505.09388 Qwen3 Technical Report. ArXiv:2505.09388.

  67. [67]

    Dun Zhang, Jiacheng Li, Ziyang Zeng, and Fulong Wang. 2025a. https://arxiv.org/abs/2412.19048 Jasper and Stella: distillation of SOTA embedding models. ArXiv:2412.19048.

  68. [68]

    Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu. 2025b. https://arxiv.org/abs/2511.14405 Jasper-token-compression-600M technical report. ArXiv:2511.14405.

  69. [69]

    Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.103 mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Proceedings of the 2024 Con...

  70. [70]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025c. https://doi.org/10.48550/arXiv.2506.05176 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. ArXiv:2506.05176.

  71. [71]

    Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, and Rui Wang. 2026. https://arxiv.org/abs/2603.19223 F2LLM-v2: Inclusive, performant, and efficient embeddings for a multilingual world. ArXiv:2603.19223.
