The Harder Text Embedding Benchmark (HTEB): Beyond One-dimensional Static Robustness

Haithem Afli; Manuel Frank

arxiv: 2605.28190 · v1 · pith:C46NGGTTnew · submitted 2026-05-27 · 💻 cs.CL

Pith reviewed 2026-06-29 12:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords text embeddings · robustness evaluation · dynamic benchmarks · multilingual embeddings · lexical variation · embedding models · benchmarking

The pith

Embedding robustness is multidimensional and requires dynamic evaluation to expose failures hidden by static benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that benchmarks reporting one score per embedding model treat robustness as a fixed scalar property, but models actually respond differently to different kinds of input change. It introduces HTEB to test robustness at evaluation time by using an LLM to apply random transformations along three axes: lexical and stylistic variation, length changes, and language shifts. Human ratings on an English sample validate the transformations. Results from 16 models across 32 datasets in 42 languages show three patterns: models have partly independent robustness profiles on the axes, scaling raises overall scores without closing most gaps to the transformed versions except on the language axis, and English datasets react more strongly to the changes than multilingual ones.

Core claim

Embedding robustness is multidimensional because models respond differently to variations along lexical/stylistic, length, and language axes, and dynamic evaluation with stochastic LLM transformations at test time is needed to reveal weaknesses that static benchmarks miss. HTEB applies these transformations to 32 datasets covering 42 languages, with the transformations validated by 4,800 human ratings on English data. It finds that models exhibit specific, partly decoupled robustness profiles; that scale increases absolute scores but does not close the gap between original and transformed evaluations except on the language axis; and that English datasets are more sensitive to the transformations than multilingual ones.

What carries the argument

HTEB, a dynamic evaluation framework that applies stochastic LLM transformations to inputs at evaluation time along the Lexical/Stylistic, Length, and Language axes.
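
The mechanism described above can be sketched as a small evaluation loop. This is a minimal illustration, not the authors' implementation: `shout_some_words` and `truncate_tail` are hypothetical rule-based stand-ins for the LLM rewrites, `toy_score` is a placeholder for a real task metric, and the Language axis is left out because it would need a translation model. The sketch only shows the protocol shape: score the original inputs once, then apply fresh seeded transformations on every run and average per axis.

```python
import random
from statistics import mean
from typing import Callable, Dict, List

# A transformation takes a text and a seeded RNG and returns a rewritten text.
Transform = Callable[[str, random.Random], str]


def shout_some_words(text: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a lexical/stylistic LLM rewrite."""
    return " ".join(w.upper() if rng.random() < 0.5 else w for w in text.split())


def truncate_tail(text: str, rng: random.Random) -> str:
    """Hypothetical stand-in for a length-changing LLM rewrite."""
    words = text.split()
    if not words:
        return text
    return " ".join(words[: rng.randint(1, len(words))])


# Assumed axis -> transformations mapping; the real benchmark uses LLM transformations.
AXES: Dict[str, List[Transform]] = {
    "lexical_stylistic": [shout_some_words],
    "length": [truncate_tail],
}


def toy_score(texts: List[str]) -> float:
    """Placeholder metric: share of inputs that are fully lowercase."""
    return mean(1.0 if t == t.lower() else 0.0 for t in texts)


def dynamic_eval(texts, score_fn, axes=AXES, n_runs=3, base_seed=0):
    """Score the original inputs once, then re-transform and re-score on every run."""
    results = {"original": score_fn(texts)}
    for axis, transforms in axes.items():
        run_scores = []
        for run in range(n_runs):
            rng = random.Random(base_seed + run)  # one seed per run, for reproducibility
            transformed = [rng.choice(transforms)(t, rng) for t in texts]
            run_scores.append(score_fn(transformed))
        results[axis] = mean(run_scores)
    return results


if __name__ == "__main__":
    data = ["the cat sat on the mat", "embeddings map text to vectors"]
    print(dynamic_eval(data, toy_score))
```

Because each run draws its own seeded transformations, the transformed score is an average over several stochastic rewrites rather than a single fixed perturbed test set, which is the contrast with static benchmarks that the summary draws.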

If this is right

Models exhibit specific, partly decoupled robustness profiles across the lexical/stylistic, length, and language axes.
Across model families, increasing scale raises absolute scores on both original and transformed inputs but does not close the performance gap except on the language axis.
English datasets prove more sensitive to the HTEB transformations than multilingual datasets.
Single-score embedding benchmarks miss distinctions in how models handle different types of variation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Multilingual training data may confer robustness advantages that could be tested by comparing matched English-only and multilingual model pairs.
Future benchmarks could report separate scores per axis rather than a single aggregate.
Training objectives might be adjusted to target specific axes independently to improve overall robustness.
The approach of on-the-fly stochastic transformations could extend to other NLP tasks such as classification or retrieval.

Load-bearing premise

The stochastic LLM transformations along the three axes represent practically relevant real-world variations without introducing artifacts, and the human ratings on an English subsample validate the transformations for all 42 languages and 32 datasets.

What would settle it

Finding that all tested models show identical robustness levels across the three axes or that increasing scale fully eliminates the performance difference between original and transformed inputs on every axis.
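
The two falsifying outcomes above can be phrased as simple checks on per-axis gaps. The sketch below is illustrative only: the function names, the tolerance `tol`, and every number in the usage example are assumptions made up for demonstration and are not results from the paper.

```python
from typing import Dict

# axis name -> gap (original score minus transformed score), in points
Profile = Dict[str, float]


def axes_identical(profiles: Dict[str, Profile], tol: float = 0.5) -> bool:
    """True if every model's gaps differ by at most `tol` across its axes."""
    return bool(profiles) and all(
        max(p.values()) - min(p.values()) <= tol for p in profiles.values()
    )


def scale_closes_gap(by_size: Dict[float, Profile], tol: float = 0.5) -> bool:
    """True if the largest model of a family has |gap| <= tol on every axis."""
    largest = by_size[max(by_size)]
    return all(abs(gap) <= tol for gap in largest.values())


def claim_is_challenged(profiles, families, tol: float = 0.5) -> bool:
    """Either outcome would undercut the multidimensional-robustness claim."""
    all_families_close = bool(families) and all(
        scale_closes_gap(f, tol) for f in families.values()
    )
    return axes_identical(profiles, tol) or all_families_close


if __name__ == "__main__":
    # Made-up gaps for two hypothetical models; not taken from the paper.
    profiles = {
        "model_a": {"lexical_stylistic": 1.0, "length": 4.0, "language": 6.0},
        "model_b": {"lexical_stylistic": 3.0, "length": 2.5, "language": 0.2},
    }
    # Family keyed by model size in billions of parameters (hypothetical).
    families = {"family_x": {0.6: profiles["model_a"], 8.0: profiles["model_b"]}}
    print(claim_is_challenged(profiles, families))  # prints False
```

A fixed tolerance is the crudest possible criterion; a real analysis would more plausibly rely on confidence intervals over runs and datasets.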

Figures

Figures reproduced from arXiv: 2605.28190 by Haithem Afli, Manuel Frank.

**Figure 1.** (M)MTEB datasets are transformed over n runs using eight transformations that generate lexical/stylistic-, length- or language-related variations. Recent work has begun to address these limitations. SAGE (Goel et al., 2025) tests robustness to controlled perturbations and noise injections on English datasets. PTEB (Frank and Afli, 2026) stochastically paraphrases 20 MTEB/MMTEB datasets including 25 l…

**Figure 2.** Four-step method to select the HTEB transformation model.

**Figure 3.** Per-item embedding drift versus mean human transformation quality rating, by transformation. Each dot …

**Figure 4.** Left: Embedding model performance on English data (19 datasets, 3 runs). Right: Embedding model performance on multilingual data (13 datasets, 3 runs). Both: Right columns report Original and total HTEB score with ø denoting the average. Exact scores per axis and model can be found in …

**Figure 5.** Size ablations for Qwen3-Embedding, …

**Figure 6.** Mean score delta (∆ = HTEB − Original) per embedding model (model size in parentheses) and transformation, averaged over datasets and seeds. Negative values (red) indicate degradation under HTEB's transformations; positive values (green) indicate improvement. […] transformations in the English setting (-4.5 pp Translation, -6.3 pp Cross-Translation). (3) LlamaNemotron-8B is the only model with positive deltas…

**Figure 7.** HL location shift with per-transformation 95% …

**Figure 8.** Google Form to collect the human quality …

**Figure 9.** HTEB performance drop vs. Original score

**Figure 10.** English benchmark (n = 11 models). Per-model |∆| score by robustness axis, stacked left-to-right in order of within-model disruption and grouped by axis ordering.

**Figure 11.** Multilingual benchmark (n = 9 models). Per-model |∆| score by robustness axis, stacked left-to-right in order of within-model disruption and grouped by axis ordering.
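
The Figure 6 caption defines the reported quantity as the mean score delta ∆ = HTEB − Original per model and transformation, averaged over datasets and seeds. A minimal sketch of that aggregation follows, assuming a flat list of per-run records; the `Record` fields, the model name, and the numbers in the example are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Record:
    model: str
    transformation: str
    dataset: str
    seed: int
    original: float  # score on untouched inputs
    hteb: float      # score on transformed inputs


def mean_delta(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Average (hteb - original) over datasets and seeds per (model, transformation)."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r.model, r.transformation)].append(r.hteb - r.original)
    return {key: mean(deltas) for key, deltas in buckets.items()}


if __name__ == "__main__":
    # Made-up scores for one hypothetical model; negative means degradation.
    recs = [
        Record("model_a", "Translation", "dataset_1", 0, 70.0, 66.0),
        Record("model_a", "Translation", "dataset_2", 1, 80.0, 74.0),
    ]
    print(mean_delta(recs))  # {('model_a', 'Translation'): -5.0}
```

Pooling all records equals averaging per dataset first only when every dataset contributes the same number of seeds; with unbalanced runs the two orders of averaging can differ.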

Original abstract

Embedding benchmarks like MTEB report a single score per model, implicitly treating robustness as a static, scalar property. We argue that embedding robustness is multidimensional, since models respond differently to different types of variation, and requires dynamic evaluation to expose failures hidden by static benchmarks. We introduce the Harder Text Embedding Benchmark (HTEB), a dynamic evaluation framework that challenges model robustness along three practically interpretable axes (Lexical/Stylistic, Length and Language) by stochastically transforming inputs at evaluation time with an LLM. Evaluating 16 open-weight embedding models on 32 datasets covering 42 languages under transformations validated by 4,800 human ratings on an English subsample, we find three patterns: (1) Models exhibit specific, partly decoupled robustness profiles across axes. (2) Across three model families, scale increases absolute scores but does not close the gap between original and transformed evaluations. Here, scaling tends to improve specifically the Language axis. (3) English datasets are more sensitive to HTEB transformations than multilingual datasets. This demonstrates that HTEB identifies strengths and weaknesses of models along deployment-relevant axes, challenging current embedding benchmarks and arguing for multidimensional, dynamic robustness evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

HTEB adds a dynamic three-axis robustness test for embeddings but its multilingual patterns rest on English-only human validation of the transformations.

The main takeaway is that this paper introduces HTEB, a framework that applies stochastic LLM transformations to test embedding models along lexical/stylistic, length, and language axes instead of relying on static single scores like MTEB.

It evaluates 16 open-weight models on 32 datasets across 42 languages and reports three patterns: partly decoupled robustness profiles per model, scaling that lifts absolute performance but rarely closes the original-to-transformed gap (except on the language axis), and greater sensitivity on English datasets than multilingual ones.

The work is new in its specific dynamic protocol and the breadth of the evaluation. It does a solid job making the case that robustness is not scalar and that static benchmarks can mask real deployment issues.

The soft spot is the validation. The abstract states that 4,800 human ratings were collected on an English subsample to check the transformations, but no equivalent check is described for the remaining 41 languages or for the full set of 32 datasets. If transformation fidelity or introduced artifacts differ by language family or script, the English sensitivity result and the scaling pattern on the language axis could be partly artifact-driven rather than pure robustness signals.

This paper is for researchers who build or select embedding models and want tests that better match varied real inputs. A reader focused on benchmark design would get value from the axes and protocol even if the results need tighter validation.

It deserves serious refereeing to examine the transformation details, statistical tests, and whether the patterns survive better multilingual checks.

Referee Report

1 major / 1 minor

Summary. The paper claims that embedding robustness is multidimensional (models respond differently to different variations) rather than the static scalar property implied by MTEB, and introduces the Harder Text Embedding Benchmark (HTEB) as a dynamic framework. HTEB applies stochastic LLM transformations along three axes (Lexical/Stylistic, Length, Language) to inputs from 32 datasets covering 42 languages at evaluation time; after validation via 4,800 human ratings on an English subsample, evaluation of 16 open-weight models reveals three patterns: (1) specific partly decoupled robustness profiles across axes, (2) scale increases absolute scores but does not close gaps (except on the Language axis), and (3) English datasets are more sensitive than multilingual ones.

Significance. If the transformations are shown to be free of systematic artifacts across languages, HTEB would offer a practically useful dynamic evaluation tool that exposes robustness failures hidden by static benchmarks and could inform model selection and development along deployment-relevant axes.

major comments (1)

[Abstract and evaluation description section] The central patterns (2) and (3) rest on the claim that LLM transformations produce comparable, practically relevant variations across 42 languages, yet the only external validation cited is 4,800 human ratings on an English subsample. No ratings, grammaticality checks, or artifact analysis are described for the remaining 41 languages or the full set of 32 datasets; if transformation quality (semantic fidelity, introduced bias, or grammaticality) varies systematically by language family or script, the reported higher English sensitivity and the Language-axis scaling effect could be driven by differential artifact rates rather than genuine robustness differences.

minor comments (1)

The abstract refers to 'three model families' without naming them; the main text should explicitly list the families and the specific models within each to allow readers to assess whether the scale and axis-specific patterns generalize.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their detailed and constructive feedback. The concern regarding the scope of human validation for the LLM transformations is well-taken and directly relevant to the strength of patterns (2) and (3). We address this point below and outline the revisions we will make.

Referee: [Abstract and evaluation description section] The central patterns (2) and (3) rest on the claim that LLM transformations produce comparable, practically relevant variations across 42 languages, yet the only external validation cited is 4,800 human ratings on an English subsample. No ratings, grammaticality checks, or artifact analysis are described for the remaining 41 languages or the full set of 32 datasets; if transformation quality (semantic fidelity, introduced bias, or grammaticality) varies systematically by language family or script, the reported higher English sensitivity and the Language-axis scaling effect could be driven by differential artifact rates rather than genuine robustness differences.

Authors: We agree that the human validation is limited to an English subsample and that this constitutes a genuine limitation for claims involving cross-lingual comparisons. The 4,800 ratings were obtained on a stratified English subsample drawn from the 32 datasets to assess semantic fidelity, grammaticality, and absence of introduced bias for the three transformation axes. The same LLM (with language-specific prompts) was used to generate transformations for all 42 languages, which provides some consistency, but we did not perform equivalent human ratings or automated checks for non-English outputs. Consequently, it remains possible that differential artifact rates across language families or scripts contribute to the observed English sensitivity and the Language-axis scaling pattern. In the revised manuscript we will (a) explicitly state this limitation in both the abstract and the evaluation section, (b) add a dedicated paragraph in the Limitations section discussing the risk of language-specific artifacts and its implications for patterns (2) and (3), and (c) qualify the interpretation of the Language-axis results accordingly. We will not claim that the transformations have been shown to be free of systematic artifacts across all languages.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper introduces HTEB as a new dynamic benchmark using stochastic LLM transformations along three axes, with empirical patterns observed from evaluating 16 models on 32 datasets. The central claims rest on the external human validation (4,800 ratings on English subsample) and direct comparisons to the MTEB baseline, without any self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that reduce the result to its inputs by construction. No equations or uniqueness theorems are invoked that collapse the multidimensional robustness finding back to the benchmark definition itself. The derivation chain is independent and externally grounded.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmark paper; the central claim rests on the validity of the new evaluation framework and human validation rather than mathematical axioms or new entities. No free parameters or invented entities are detailed in the abstract.

axioms (1)

Domain assumption: Human ratings on an English subsample validate the quality and relevance of LLM transformations across all languages and datasets.
The paper relies on 4,800 human ratings to support the transformations.

Reference graph

Works this paper leans on

73 extracted references · 56 canonical work pages · 11 internal anchors

[1]

Mohamed Abdalla, Krishnapriya Vishnubhotla, and Saif Mohammad. 2023. https://doi.org/10.18653/v1/2023.eacl-main.55 What Makes Sentences Semantically Related? A Textual Relatedness Dataset and Empirical Study. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 782--796, Dubrovnik, Croati...
[2]

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, and 46 others. ... doi:10.18653/v1/2023.ijcnlp-main.10 (2023)
[3]

Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. 2026. https://doi.org/10.48550/arXiv.2602.15547 jina-embeddings-v5-text: Task-Targeted Embedding Distillation. ArXiv:2602.15547
[4]

Adnan El Assadi, Isaac Chung, Roman Solomatin, Niklas Muennighoff, and Kenneth Enevoldsen. 2026. https://openreview.net/forum?id=rcmfu1ydAf HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks. In International Conference on Learning Representations
[5]

Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, and Even Oldridge. 2025. https://arxiv.org/abs/2511.07025 Llama-embed-nemotron-8B: a universal text embedding model for multilingual and cross-lingual tasks. ArXiv:2511.07025
[6]

Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. https://doi.org/10.18653/v1/2020.nlp4convai-1.5 Efficient Intent Detection with Dual Sentence Encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38--45, Online. Association for Computational Linguistics
[7]

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. https://doi.org/10.18653/v1/S17-2001 SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1--14, Vancouver, Canada. Association fo...
[8]

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. https://doi.org/10.18653/v1/2024.findings-acl.137 M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318--2...
[9]

Benjamin Clavié. 2024. https://doi.org/10.48550/arXiv.2312.16144 Towards Better Monolingual Japanese Retrievers with Multi-Vector Models. ArXiv:2312.16144
[10]

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. https://doi.org/10.18653/v1/2020.acl-main.207 SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270--2282. Association for Computational Linguistics
[11]

Ruchira Dhar and Anders Søgaard. 2026. https://arxiv.org/abs/2604.25923v1 Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing. ArXiv:2604.25923
[12]

Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. 2017. https://doi.org/10.1162/tacl_a_00074 Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets. Transactions of the Association for Computational Linguistics, 5:471--486
[13]

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. https://doi.org/10.18653/v1/P18-1128 The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383--1392, Melbourne, Australia. Asso...
[14]

Bradley Efron and R. J. Tibshirani. 1994. https://doi.org/10.1201/9780429246593 An Introduction to the Bootstrap. Chapman and Hall/CRC, New York
[15]

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Veysel Çağatan, and 63 others. 2025. https://openreview.net/forum...
[16]

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. https://doi.org/10.1162/tacl_a_00373 SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391--409
[17]

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. 2023. https://doi.org/10.18653/v1/2023.acl-long.235 MASSIVE: A 1M-Example Multilingual Natural Language Understa...
[18]

Manuel Frank and Haithem Afli. 2026. https://doi.org/10.18653/v1/2026.eacl-long.130 PTEB: Towards robust text embedding evaluation via stochastic paraphrasing at evaluation time with LLMs. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2832--2851, Rabat, Mo...
[19]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://doi.org/10.48550/arXiv.... doi:10.48550/arxiv.2503.19786
[20]

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. https://doi.org/10.1162/tacl_a_00370 Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346--361
[21]

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. https://aclanthology.org/W07-1401/ The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1--9, Prague. Association for Computational Linguistics
[22]

Samarth Goel, Reagan Lee, and Kannan Ramchandran. 2025. https://openreview.net/forum?id=p4f2G2XXR4 SAGE: A Realistic Benchmark for Semantic Understanding. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling
[23]

Kilem Li Gwet. 2008. https://doi.org/10.1348/000711006X126600 Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29--48
[24]

Tim C. Hesterberg. 2015. https://doi.org/10.1080/00031305.2015.1089789 What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum. The American Statistician, 69(4):371--386
[25]

Sture Holm. 1979. http://www.jstor.org/stable/4615733 A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65--70
[26]

Doris Hoogeveen, Karin M. Verspoor, and Timothy Baldwin. 2015. https://doi.org/10.1145/2838931.2838934 CQADupStack: a benchmark data set for community question-answering research. In Proceedings of the 20th Australasian Document Computing Symposium (ADCS), pages 3:1--3:8, Parramatta, NSW, Australia. ACM
[27]

Yevhen Kostiuk and Kenneth Enevoldsen. 2026. https://doi.org/10.48550/arXiv.2605.22544 One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation. ArXiv:2605.22544
[28]

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. https://doi.org/10.18653/v1/D17-1126 A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224--1234, Copenhagen, Denmark. Association for Computational Linguistics
[29]

Ken Lang. 1995. https://doi.org/10.1016/B978-1-55860-377-6.50048-7 NewsWeeder: Learning to Filter Netnews. In Machine Learning Proceedings 1995, pages 331--339, San Francisco (CA). Morgan Kaufmann
[30]

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/hash/c4bf73386022473a652a18941e9ea6f8-Abstract-Conference.html NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. International Conference on Learning Representat...
[31]

Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. 2024. https://www.mixedbread.ai/blog/mxbai-embed-large-v1 Open source strikes bread -- new fluffy embeddings model. (accessed 2025-12-19)
[32]

Xianming Li and Jing Li. 2024. https://doi.org/10.18653/v1/2024.acl-long.101 AoE: Angle-optimized Embeddings for Semantic Textual Similarity. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1825--1839, Bangkok, Thailand. Association for Computational Linguistics
[33]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceed...
[34]

Xueqing Liu, Chi Wang, Yue Leng, and ChengXiang Zhai. 2018. https://doi.org/10.1145/3283812.3283815 LinkSO: a dataset for learning to retrieve similar question answer pairs on software development forums. In Proceedings of the 4th ACM SIGSOFT international workshop on NLP for software engineering, NL4SE 2018, pages 2--5, Lake Buena Vista, FL, USA. Asso...
[35]

Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, and Dieuwke Hupkes. 2024. https://doi.org/10.48550/arXiv.2406.10229 Quantifying Variance in Evaluation Benchmarks. ArXiv:2406.10229
[36]

Andani Madodonga, Vukosi Marivate, and Matthew Adendorff. 2023. https://doi.org/10.55492/dhasa.v4i01.4449 Izindaba-Tindzaba: Machine learning news categorisation for long and short text for isiZulu and Siswati. Journal of the Digital Humanities Association of Southern Africa (DHASA), 4(01)
[37]

Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. 2021. https://aclanthology.org/2021.emnlp-main.821 IndoNLI: a natural language inference dataset for Indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10511--10527, Online and Punta Cana, Dominican Republic. As...
[38]

Microsoft. 2026. https://huggingface.co/microsoft/harrier-oss-v1-270m microsoft/harrier-oss-v1-270m · Hugging Face. (accessed 2026-05-13)
[39]

Mistral AI. 2025. https://mistral.ai/news/mistral-3 Introducing Mistral 3 Mistral AI. (accessed 2025-12-18)
[40]

MTEB. 2025. https://huggingface.co/datasets/mteb/PlscClusteringS2S.v2 Dataset PlscClusteringS2S.v2. (accessed 2026-03-06)
[41]

Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://doi.org/10.18653/v1/2023.eacl-main.148 MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics
[42]

James O'Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.568 I wish I would have loved this one, but I didn't – a multilingual dataset for counterfactual detection in product review. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,...
[43]

Yulia Otmakhova, Thinh Hung Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, and Jey Han Lau. 2026. https://doi.org/10.18653/v1/2026.findings-eacl.269 FLUKE: a linguistically-driven and task-agnostic framework for robustness evaluation. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5103--5123, Rabat, Moro...
[44]

Nedjma Ousidhoum, Shamsuddeen Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Ahmad, Sanchit Ahuja, Alham Aji, Vladimir Araujo, Abinew Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine Kock, Genet Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, and 8 others. 2024. https://doi.org/10.18653/v1/2... doi:10.18653/v1/2024.findings-acl.147
[45]

Shantipriya Parida, Sambit Sekhar, Soumendra Kumar Sahoo, Swateek Jena, Abhijeet Parida, Satya Ranjan Dash, and Guneet Singh Kohli. 2023. https://huggingface.co/datasets/OdiaGenAI/sentiment_analysis_hindi OdiaGenAI: Generative AI and LLM initiative for the odia language - Dataset Card Sentiment Analysis Hindi. (accessed 2026-05-13)
[46]

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. 2022. https://doi.org/10.1162/tacl_a_00452 Sa...
[47]

Nils Reimers and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/D17-1035 Reporting Score Distributions Makes a Difference : Performance Study of LSTM -networks for Sequence Tagging . In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing , pages 338--348, Copenhagen, Denmark. Association for Computational Linguistics

work page doi:10.18653/v1/d17-1035 2017
[48]

Nils Reimers and Iryna Gurevych. 2018. https://arxiv.org/abs/1803.09578 Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches . ArXiv:1803.09578

work page internal anchor Pith review Pith/arXiv arXiv 2018
[49]

Nils Reimers and Iryna Gurevych. 2019. https://doi.org/10.18653/v1/D19-1410 Sentence- BERT : Sentence Embeddings using Siamese BERT - Networks . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing ( EMNLP - IJCNLP ) , pages 3982--3992, Hong Kong...

work page doi:10.18653/v1/d19-1410 2019
[50]

Nils Reimers and Iryna Gurevych. 2020. https://arxiv.org/abs/2004.09813 Making monolingual sentence embeddings multilingual using knowledge distillation . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing , pages 4512--4525. Association for Computational Linguistics

work page arXiv 2020
[51]

Darsh Shah, Tao Lei, Alessandro Moschitti, Salvatore Romeo, and Preslav Nakov. 2018. https://doi.org/10.18653/v1/D18-1131 Adversarial domain adaptation for duplicate question detection . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 1056--1063, Brussels, Belgium. Association for Computational Linguistics

work page doi:10.18653/v1/d18-1131 2018
[52]

Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. https://doi.org/10.1093/bioinformatics/btx238 BIOSSES : a semantic sentence similarity estimation system for the biomedical domain . Bioinformatics (Oxford, England), 33(14):i49--i58

work page doi:10.1093/bioinformatics/btx238 2017
[53]

Anders Søgaard. 2013. https://aclanthology.org/N13-1068/ Estimating effect size across datasets . In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics : Human Language Technologies , pages 607--611, Atlanta, Georgia. Association for Computational Linguistics

2013
[54]

Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Hector Martínez Alonso. 2014. https://doi.org/10.3115/v1/W14-1601 What's in a p-value in NLP ? In Proceedings of the Eighteenth Conference on Computational Natural Language Learning , pages 1--10, Ann Arbor, Michigan. Association for Computational Linguistics

work page doi:10.3115/v1/w14-1601 2014
[55]

Team Olmo , Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, and 49 others. 2025. https://doi.org/10.48550/arXiv.2512.13961 Olmo 3 . A...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.13961 2025
[56]

Søren Vejlgaard Holm, Lars Kai Hansen, and Martin Carsten Nielsen. 2025. https://aclanthology.org/2025.nodalida-1.78/ Danoliteracy of Generative Large Language Models . In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies ( NoDaLiDa / Baltic - HLT 2025) , pages 785--800, ...

2025
[57]

Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. https://doi.org/10.18653/v1/P18-1023 Retrieval of the best counterargument without prior topic knowledge . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (volume 1: Long papers) , pages 241--251, Melbourne, Australia. Association for Computational Linguistics

work page doi:10.18653/v1/p18-1023 2018
[58]

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.609 Fact or Fiction : Verifying Scientific Claims . In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , pages 7534--7550, Online. Association for...

work page doi:10.18653/v1/2020.emnlp-main.609 2020
[59]

Kexin Wang, Nils Reimers, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.59 TSDAE : Using Transformer -based Sequential Denoising Auto - Encoder for Unsupervised Sentence Embedding Learning . In Findings of the Association for Computational Linguistics : EMNLP 2021 , pages 671--688, Punta Cana, Dominican Republic. Association fo...

work page doi:10.18653/v1/2021.findings-emnlp.59 2021
[60]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024 a . https://doi.org/10.48550/arXiv.2212.03533 Text Embeddings by Weakly - Supervised Contrastive Pre -training . ArXiv:2212.03533

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2212.03533 2024
[61]

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024 b . https://doi.org/10.18653/v1/2024.acl-long.642 Improving Text Embeddings with Large Language Models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 11897--11916, Bangkok, Thailand. Associa...

work page doi:10.18653/v1/2024.acl-long.642 2024
[62]

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024 c . https://doi.org/10.48550/arXiv.2402.05672 Multilingual E5 Text Embeddings : A Technical Report . ArXiv:2402.05672

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.05672 2024
[63]

White, Theresa A

John S. White, Theresa A. O'Connell, and Francis E. O'Mara. 1994. https://aclanthology.org/1994.amta-1.25/ The ARPA MT Evaluation Methodologies : Evolution , Lessons , and Future Approaches . In Proceedings of the First Conference of the Association for Machine Translation in the Americas , Columbia, Maryland, USA

1994
[64]

Frank Wilcoxon. 1945. https://doi.org/10.2307/3001968 Individual Comparisons by Ranking Methods . Biometrics Bulletin, 1(6):80--83

work page doi:10.2307/3001968 1945
[65]

Wei Xu, Chris Callison-Burch, and Bill Dolan. 2015. https://doi.org/10.18653/v1/S15-2001 SemEval -2015 task 1: Paraphrase and semantic similarity in Twitter ( PIT ) . In Proceedings of the 9th International Workshop on Semantic Evaluation ( SemEval 2015) , pages 1--11, Denver, Colorado. Association for Computational Linguistics

work page doi:10.18653/v1/s15-2001 2015
[66]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://doi.org/10.48550/arXiv.2505.09388 Qwen3 Technical Report . ArXiv:2505.09388

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
[67]

Dun Zhang, Jiacheng Li, Ziyang Zeng, and Fulong Wang. 2025 a . https://arxiv.org/abs/2412.19048 Jasper and Stella : distillation of SOTA embedding models . ArXiv:2412.19048

work page arXiv 2025
[68]

Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu. 2025 b . https://arxiv.org/abs/2511.14405 Jasper-token-compression- 600M technical report . ArXiv:2511.14405

work page arXiv 2025
[69]

Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.103 mGTE : Generalized long-context text representation and reranking models for multilingual text retrieval . In Proceedings of the 2024 Con...

work page doi:10.18653/v1/2024.emnlp-industry.103 2024
[70]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025 c . https://doi.org/10.48550/arXiv.2506.05176 Qwen3 Embedding : Advancing Text Embedding and Reranking Through Foundation Models . ArXiv:2506.05176

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2506.05176 2025

[1]

Mohamed Abdalla, Krishnapriya Vishnubhotla, and Saif Mohammad. 2023. https://doi.org/10.18653/v1/2023.eacl-main.55 What Makes Sentences Semantically Related? A Textual Relatedness Dataset and Empirical Study. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 782–796, Dubrovnik, Croati...

[2]

David Ifeoluwa Adelani, Marek Masiak, Israel Abebe Azime, Jesujoba Alabi, Atnafu Lambebo Tonja, Christine Mwase, Odunayo Ogundepo, Bonaventure F. P. Dossou, Akintunde Oladipo, Doreen Nixdorf, Chris Chinenye Emezue, Sana Al-azzawi, Blessing Sibanda, Davis David, Lolwethu Ndolela, Jonathan Mukiibi, Tunde Ajayi, Tatiana Moteu, Brian Odhiambo, and 46 others. ...


[3]

Mohammad Kalim Akram, Saba Sturua, Nastia Havriushenko, Quentin Herreros, Michael Günther, Maximilian Werk, and Han Xiao. 2026. https://doi.org/10.48550/arXiv.2602.15547 jina-embeddings-v5-text: Task-Targeted Embedding Distillation. ArXiv:2602.15547

[4]

Adnan El Assadi, Isaac Chung, Roman Solomatin, Niklas Muennighoff, and Kenneth Enevoldsen. 2026. https://openreview.net/forum?id=rcmfu1ydAf HUME: Measuring the Human-Model Performance Gap in Text Embedding Tasks. In International Conference on Learning Representations

[5]

Yauhen Babakhin, Radek Osmulski, Ronay Ak, Gabriel Moreira, Mengyao Xu, Benedikt Schifferer, Bo Liu, and Even Oldridge. 2025. https://arxiv.org/abs/2511.07025 Llama-embed-nemotron-8B: a universal text embedding model for multilingual and cross-lingual tasks. ArXiv:2511.07025

[6]

Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. 2020. https://doi.org/10.18653/v1/2020.nlp4convai-1.5 Efficient Intent Detection with Dual Sentence Encoders. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 38–45, Online. Association for Computational Linguistics

[7]

Daniel Cer, Mona Diab, Eneko Agirre, Iñigo Lopez-Gazpio, and Lucia Specia. 2017. https://doi.org/10.18653/v1/S17-2001 SemEval-2017 Task 1: Semantic Textual Similarity Multilingual and Crosslingual Focused Evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, Vancouver, Canada. Association fo...

[8]

Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. https://doi.org/10.18653/v1/2024.findings-acl.137 M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 2318–2...

[9]

Benjamin Clavié. 2024. https://doi.org/10.48550/arXiv.2312.16144 Towards Better Monolingual Japanese Retrievers with Multi-Vector Models. ArXiv:2312.16144

[10]

Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel Weld. 2020. https://doi.org/10.18653/v1/2020.acl-main.207 SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2270–2282. Association for Computational Linguistics

[11]

Ruchira Dhar and Anders Søgaard. 2026. https://arxiv.org/abs/2604.25923v1 Evaluation Revisited: A Taxonomy of Evaluation Concerns in Natural Language Processing. ArXiv:2604.25923

[12]

Rotem Dror, Gili Baumer, Marina Bogomolov, and Roi Reichart. 2017. https://doi.org/10.1162/tacl_a_00074 Replicability Analysis for Natural Language Processing: Testing Significance with Multiple Datasets. Transactions of the Association for Computational Linguistics, 5:471–486

[13]

Rotem Dror, Gili Baumer, Segev Shlomov, and Roi Reichart. 2018. https://doi.org/10.18653/v1/P18-1128 The Hitchhiker's Guide to Testing Statistical Significance in Natural Language Processing. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1383–1392, Melbourne, Australia. Asso...

[14]

Bradley Efron and R. J. Tibshirani. 1994. https://doi.org/10.1201/9780429246593 An Introduction to the Bootstrap. Chapman and Hall/CRC, New York

[15]

Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, Márton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemiński, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Diganta Misra, Shreeya Dhakal, Jonathan Rystrøm, Roman Solomatin, Ömer Veysel Çağatan, and 63 others. 2025. https://openreview.net/forum...


[16]

Alexander R. Fabbri, Wojciech Kryściński, Bryan McCann, Caiming Xiong, Richard Socher, and Dragomir Radev. 2021. https://doi.org/10.1162/tacl_a_00373 SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409

[17]

Jack FitzGerald, Christopher Hench, Charith Peris, Scott Mackie, Kay Rottmann, Ana Sanchez, Aaron Nash, Liam Urbach, Vishesh Kakarala, Richa Singh, Swetha Ranganath, Laurie Crist, Misha Britan, Wouter Leeuwis, Gokhan Tur, and Prem Natarajan. 2023. https://doi.org/10.18653/v1/2023.acl-long.235 MASSIVE: A 1M-Example Multilingual Natural Language Understa...

[18]

Manuel Frank and Haithem Afli. 2026. https://doi.org/10.18653/v1/2026.eacl-long.130 PTEB: Towards robust text embedding evaluation via stochastic paraphrasing at evaluation time with LLMs. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2832–2851, Rabat, Mo...

[19]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, and 197 others. 2025. https://doi.org/10.48550/arXiv....

[20]

Mor Geva, Daniel Khashabi, Elad Segal, Tushar Khot, Dan Roth, and Jonathan Berant. 2021. https://doi.org/10.1162/tacl_a_00370 Did Aristotle use a laptop? A question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361

[21]

Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. https://aclanthology.org/W07-1401/ The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing, pages 1–9, Prague. Association for Computational Linguistics

[22]

Samarth Goel, Reagan Lee, and Kannan Ramchandran. 2025. https://openreview.net/forum?id=p4f2G2XXR4 SAGE: A Realistic Benchmark for Semantic Understanding. In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling

[23]

Kilem Li Gwet. 2008. https://doi.org/10.1348/000711006X126600 Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1):29–48

[24]

Tim C. Hesterberg. 2015. https://doi.org/10.1080/00031305.2015.1089789 What Teachers Should Know About the Bootstrap: Resampling in the Undergraduate Statistics Curriculum. The American Statistician, 69(4):371–386

[25]

Sture Holm. 1979. http://www.jstor.org/stable/4615733 A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70

[26]

Doris Hoogeveen, Karin M. Verspoor, and Timothy Baldwin. 2015. https://doi.org/10.1145/2838931.2838934 CQADupStack: a benchmark data set for community question-answering research. In Proceedings of the 20th Australasian Document Computing Symposium (ADCS), pages 3:1–3:8, Parramatta, NSW, Australia. ACM

[27]

Yevhen Kostiuk and Kenneth Enevoldsen. 2026. https://doi.org/10.48550/arXiv.2605.22544 One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation. ArXiv:2605.22544

[28]

Wuwei Lan, Siyu Qiu, Hua He, and Wei Xu. 2017. https://doi.org/10.18653/v1/D17-1126 A continuously growing dataset of sentential paraphrases. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 1224–1234, Copenhagen, Denmark. Association for Computational Linguistics

[29]

Ken Lang. 1995. https://doi.org/10.1016/B978-1-55860-377-6.50048-7 NewsWeeder: Learning to Filter Netnews. In Machine Learning Proceedings 1995, pages 331–339, San Francisco (CA). Morgan Kaufmann

[30]

Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/hash/c4bf73386022473a652a18941e9ea6f8-Abstract-Conference.html NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. International Conference on Learning Representat...

[31]

Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. 2024. https://www.mixedbread.ai/blog/mxbai-embed-large-v1 Open source strikes bread – new fluffy embeddings model. (accessed 2025-12-19)

[32]

Xianming Li and Jing Li. 2024. https://doi.org/10.18653/v1/2024.acl-long.101 AoE: Angle-optimized Embeddings for Semantic Textual Similarity. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1825–1839, Bangkok, Thailand. Association for Computational Linguistics

[33]

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. https://proceedings.mlsys.org/paper_files/paper/2024/hash/42a452cbafa9dd64e9ba4aa95cc1ef21-Abstract-Conference.html AWQ: Activation-aware Weight Quantization for On-Device LLM Compression and Acceleration. Proceed...

[34]

Xueqing Liu, Chi Wang, Yue Leng, and ChengXiang Zhai. 2018. https://doi.org/10.1145/3283812.3283815 LinkSO: a dataset for learning to retrieve similar question answer pairs on software development forums. In Proceedings of the 4th ACM SIGSOFT international workshop on NLP for software engineering, Nl4se 2018, pages 2–5, Lake Buena Vista, FL, USA. Asso...

[35]

Lovish Madaan, Aaditya K. Singh, Rylan Schaeffer, Andrew Poulton, Sanmi Koyejo, Pontus Stenetorp, Sharan Narang, and Dieuwke Hupkes. 2024. https://doi.org/10.48550/arXiv.2406.10229 Quantifying Variance in Evaluation Benchmarks. ArXiv:2406.10229

[36]

Andani Madodonga, Vukosi Marivate, and Matthew Adendorff. 2023. https://doi.org/10.55492/dhasa.v4i01.4449 Izindaba-Tindzaba: Machine learning news categorisation for long and short text for isiZulu and Siswati. Journal of the Digital Humanities Association of Southern Africa (DHASA), 4(01)

[37]

Rahmad Mahendra, Alham Fikri Aji, Samuel Louvan, Fahrurrozi Rahman, and Clara Vania. 2021. https://aclanthology.org/2021.emnlp-main.821 IndoNLI: a natural language inference dataset for Indonesian. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10511–10527, Online and Punta Cana, Dominican Republic. As...

[38]

Microsoft. 2026. https://huggingface.co/microsoft/harrier-oss-v1-270m microsoft/harrier-oss-v1-270m · Hugging Face. (accessed 2026-05-13)

[39]

Mistral AI. 2025. https://mistral.ai/news/mistral-3 Introducing Mistral 3 Mistral AI. (accessed 2025-12-18)

[40]

MTEB. 2025. https://huggingface.co/datasets/mteb/PlscClusteringS2S.v2 Dataset PlscClusteringS2S.v2. (accessed 2026-03-06)

[41]

Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://doi.org/10.18653/v1/2023.eacl-main.148 MTEB: Massive Text Embedding Benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 2014–2037, Dubrovnik, Croatia. Association for Computational Linguistics

[42]

James O'Neill, Polina Rozenshtein, Ryuichi Kiryo, Motoko Kubota, and Danushka Bollegala. 2021. https://doi.org/10.18653/v1/2021.emnlp-main.568 I wish I would have loved this one, but I didn't – a multilingual dataset for counterfactual detection in product review. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,...

[43]

Yulia Otmakhova, Thinh Hung Truong, Rahmad Mahendra, Zenan Zhai, Rongxin Zhu, Daniel Beck, and Jey Han Lau. 2026. https://doi.org/10.18653/v1/2026.findings-eacl.269 FLUKE: a linguistically-driven and task-agnostic framework for robustness evaluation. In Findings of the Association for Computational Linguistics: EACL 2026, pages 5103–5123, Rabat, Moro...

[44]

Nedjma Ousidhoum, Shamsuddeen Muhammad, Mohamed Abdalla, Idris Abdulmumin, Ibrahim Ahmad, Sanchit Ahuja, Alham Aji, Vladimir Araujo, Abinew Ayele, Pavan Baswani, Meriem Beloucif, Chris Biemann, Sofia Bourhim, Christine Kock, Genet Dekebo, Oumaima Hourrane, Gopichand Kanumolu, Lokesh Madasu, Samuel Rutunda, and 8 others. 2024. https://doi.org/10.18653/v1/2...


[45]

Shantipriya Parida, Sambit Sekhar, Soumendra Kumar Sahoo, Swateek Jena, Abhijeet Parida, Satya Ranjan Dash, and Guneet Singh Kohli. 2023. https://huggingface.co/datasets/OdiaGenAI/sentiment_analysis_hindi OdiaGenAI: Generative AI and LLM initiative for the odia language - Dataset Card Sentiment Analysis Hindi. (accessed 2026-05-13)

[46]

Gowtham Ramesh, Sumanth Doddapaneni, Aravinth Bheemaraj, Mayank Jobanputra, Raghavan AK, Ajitesh Sharma, Sujit Sahoo, Harshita Diddee, Mahalakshmi J, Divyanshu Kakwani, Navneet Kumar, Aswin Pradeep, Srihari Nagaraj, Kumar Deepak, Vivek Raghavan, Anoop Kunchukuttan, Pratyush Kumar, and Mitesh Shantadevi Khapra. 2022. https://doi.org/10.1162/tacl_a_00452 Sa...


[47]

Nils Reimers and Iryna Gurevych. 2017. https://doi.org/10.18653/v1/D17-1035 Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 338–348, Copenhagen, Denmark. Association for Computational Linguistics

[48]

Nils Reimers and Iryna Gurevych. 2018. https://arxiv.org/abs/1803.09578 Why Comparing Single Performance Scores Does Not Allow to Draw Conclusions About Machine Learning Approaches. ArXiv:1803.09578

[49]

Nils Reimers and Iryna Gurevych. 2019. https://doi.org/10.18653/v1/D19-1410 Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong...

[50]

Nils Reimers and Iryna Gurevych. 2020. https://arxiv.org/abs/2004.09813 Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 4512–4525. Association for Computational Linguistics

[51]

Darsh Shah, Tao Lei, Alessandro Moschitti, Salvatore Romeo, and Preslav Nakov. 2018. https://doi.org/10.18653/v1/D18-1131 Adversarial domain adaptation for duplicate question detection. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1056–1063, Brussels, Belgium. Association for Computational Linguistics

[52]

Gizem Soğancıoğlu, Hakime Öztürk, and Arzucan Özgür. 2017. https://doi.org/10.1093/bioinformatics/btx238 BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics (Oxford, England), 33(14):i49–i58

[53]

Anders Søgaard. 2013. https://aclanthology.org/N13-1068/ Estimating effect size across datasets. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 607–611, Atlanta, Georgia. Association for Computational Linguistics

[54]

Anders Søgaard, Anders Johannsen, Barbara Plank, Dirk Hovy, and Hector Martínez Alonso. 2014. https://doi.org/10.3115/v1/W14-1601 What's in a p-value in NLP? In Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pages 1–10, Ann Arbor, Michigan. Association for Computational Linguistics

[55]

Team Olmo, Allyson Ettinger, Amanda Bertsch, Bailey Kuehl, David Graham, David Heineman, Dirk Groeneveld, Faeze Brahman, Finbarr Timbers, Hamish Ivison, Jacob Morrison, Jake Poznanski, Kyle Lo, Luca Soldaini, Matt Jordan, Mayee Chen, Michael Noukhovitch, Nathan Lambert, Pete Walsh, and 49 others. 2025. https://doi.org/10.48550/arXiv.2512.13961 Olmo 3. A...

[56]

Søren Vejlgaard Holm, Lars Kai Hansen, and Martin Carsten Nielsen. 2025. https://aclanthology.org/2025.nodalida-1.78/ Danoliteracy of Generative Large Language Models. In Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025), pages 785–800, ...

[57]

Henning Wachsmuth, Shahbaz Syed, and Benno Stein. 2018. https://doi.org/10.18653/v1/P18-1023 Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (volume 1: Long papers), pages 241–251, Melbourne, Australia. Association for Computational Linguistics

[58]

David Wadden, Shanchuan Lin, Kyle Lo, Lucy Lu Wang, Madeleine van Zuylen, Arman Cohan, and Hannaneh Hajishirzi. 2020. https://doi.org/10.18653/v1/2020.emnlp-main.609 Fact or Fiction: Verifying Scientific Claims. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7534–7550, Online. Association for...

[59]

Kexin Wang, Nils Reimers, and Iryna Gurevych. 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.59 TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 671–688, Punta Cana, Dominican Republic. Association fo...

[60]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. 2024a. https://doi.org/10.48550/arXiv.2212.03533 Text Embeddings by Weakly-Supervised Contrastive Pre-training. ArXiv:2212.03533

[61]

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024b. https://doi.org/10.18653/v1/2024.acl-long.642 Improving Text Embeddings with Large Language Models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11897–11916, Bangkok, Thailand. Associa...

[62]

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024c. https://doi.org/10.48550/arXiv.2402.05672 Multilingual E5 Text Embeddings: A Technical Report. ArXiv:2402.05672

[63]

John S. White, Theresa A. O'Connell, and Francis E. O'Mara. 1994. https://aclanthology.org/1994.amta-1.25/ The ARPA MT Evaluation Methodologies: Evolution, Lessons, and Future Approaches. In Proceedings of the First Conference of the Association for Machine Translation in the Americas, Columbia, Maryland, USA

[64] [64]

Frank Wilcoxon. 1945. https://doi.org/10.2307/3001968 Individual Comparisons by Ranking Methods . Biometrics Bulletin, 1(6):80--83

[65]

Wei Xu, Chris Callison-Burch, and Bill Dolan. 2015. https://doi.org/10.18653/v1/S15-2001 SemEval-2015 task 1: Paraphrase and semantic similarity in Twitter (PIT). In Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), pages 1–11, Denver, Colorado. Association for Computational Linguistics.

[66]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, and 41 others. 2025. https://doi.org/10.48550/arXiv.2505.09388 Qwen3 Technical Report. ArXiv:2505.09388.

[67]

Dun Zhang, Jiacheng Li, Ziyang Zeng, and Fulong Wang. 2025a. https://arxiv.org/abs/2412.19048 Jasper and Stella: distillation of SOTA embedding models. ArXiv:2412.19048.

[68]

Dun Zhang, Ziyang Zeng, Yudong Zhou, and Shuyang Lu. 2025b. https://arxiv.org/abs/2511.14405 Jasper-token-compression-600M technical report. ArXiv:2511.14405.

[69]

Xin Zhang, Yanzhao Zhang, Dingkun Long, Wen Xie, Ziqi Dai, Jialong Tang, Huan Lin, Baosong Yang, Pengjun Xie, Fei Huang, Meishan Zhang, Wenjie Li, and Min Zhang. 2024. https://doi.org/10.18653/v1/2024.emnlp-industry.103 mGTE: Generalized long-context text representation and reranking models for multilingual text retrieval. In Proceedings of the 2024 Con...

[70]

Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025c. https://doi.org/10.48550/arXiv.2506.05176 Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models. ArXiv:2506.05176.

[71]

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, and Rui Wang. 2026. https://arxiv.org/abs/2603.19223 F2LLM-v2: Inclusive, performant, and efficient embeddings for a multilingual world. ArXiv:2603.19223.
