pith. sign in

arxiv: 2605.27168 · v1 · pith:5TMKQGSOnew · submitted 2026-05-26 · 💻 cs.CL · cs.AI· cs.CY

Grounding Text Embeddings in Stakeholder Associations

Pith reviewed 2026-06-29 18:48 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CY
keywords text embeddingssemantic alignmentstakeholder groundingclusteringpolicy analysishuman-AI comparisonreliability gap
0
0 comments X

The pith

Neural text embeddings lag human experts by 19-26 points when associating meanings in policy texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the Stakeholder Grounding Exercise as a way to make explicit the associations that domain experts hold and to test whether text embeddings match those associations. In a Danish policy case study the embeddings trail experts by 19-26 percentage points in reliability, and the same gap appears in an English-language study of US Federal AI use cases. The misalignment is not isolated; it tracks directly with poorer performance when the embeddings are used for downstream clustering, with a Spearman correlation of 0.9 between exercise scores and cluster quality. The method therefore supplies a concrete test for whether an embedding model captures the semantic distinctions that matter to the people who will interpret its output.

Core claim

The Stakeholder Grounding Exercise shows that neural text embeddings are substantially less reliable than human experts at reproducing expert semantic associations, producing gaps of 19-26 percentage points on Danish policy texts and 16 points on US Federal AI use cases. This gap propagates to clustering: models that rank higher on the exercise also produce higher-quality clusters, with Spearman ρ = 0.9.

What carries the argument

The Stakeholder Grounding Exercise, a protocol that elicits explicit pairwise associations from domain experts and compares them to embedding similarity scores.

If this is right

  • Embedding models intended for policy or domain analysis should be screened against expert associations before deployment.
  • Clustering quality on complex texts improves when the embedding model better matches expert semantic distinctions.
  • The exercise can be run with digital protocols across languages and expert communities without changing the observed gap.
  • Downstream tasks that rely on semantic distances inherit the reliability gap between embeddings and experts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Selection or fine-tuning of embeddings for a given domain could be guided by running the exercise on a small expert panel rather than relying on general benchmarks.
  • The same grounding approach could be adapted to test embeddings on other tasks such as classification or retrieval where human intent matters.
  • If the gap persists across many domains, it suggests that current training objectives for text embeddings do not prioritise the fine-grained distinctions that experts use.

Load-bearing premise

The associations that human experts produce in the exercise are the right target for what embeddings should represent.

What would settle it

A new embedding model that matches or exceeds expert reliability on the Stakeholder Grounding Exercise yet still yields low-quality clusters on the same texts, or a model that scores low on the exercise but produces high-quality clusters.

Figures

Figures reproduced from arXiv: 2605.27168 by Chris Russell, Johan Irving S{\o}ltoft, Jonathan Rystr{\o}m, Kenneth C. Enevoldsen, Sofie Burgos-Thorsen, Zihao Fu.

Figure 1
Figure 1. Figure 1: The construct gap. Experts (right) ar￾range three concepts on a 2D canvas using the Spa￾tial Arrangement Method (SpAM; Goldstone, 1994), a validated similarity-elicitation paradigm; a neural model embeds them in high-dimensional space (left, schematic). Dashed circles mark closest point (blue) to pivot (green). Current embedding benchmarks cannot detect mismatches; the Stakeholder Grounding Exercise (§ 3) … view at source ↗
Figure 2
Figure 2. Figure 2: Stakeholder Grounding Exercise. Partici￾pants conducting the exercise in-person (left) and online (right). Each round, participants lay out statements on their canvas to represent their notion of similarity. et al., 2024). In contrast, our exercise specifically targets these domain-specific associations, enabling more fine-grained contextual comparisons. Construct Validity in NLP: In social science, ‘const… view at source ↗
Figure 3
Figure 3. Figure 3: Human reliability. Relationship between the distance ratio, d, and inter-rater reliability (α). The reliability increases approximately logarithmically with 80% reliability at d ≈ 3.1. distance ratio is at least d, and Oi(t) is a binary scoring function indicating which item in triplet t rater i judges as closer. We primarily compute between-rater reliabil￾ity. To analyse fairness, we further restrict the … view at source ↗
Figure 4
Figure 4. Figure 4: Performance gap and disparities. The best embedding models substantially underperform stake￾holder grounding across all datasets. There are also disparities between demographic groups. erarchical bootstrap that resamples raters and then triplets within raters, holding the rater resample fixed across d-thresholds for each AUC replicate. We summarise each model’s performance by tak￾ing the area under the α(d… view at source ↗
Figure 6
Figure 6. Figure 6: Rank comparison for stakeholder ground￾ing and MMTEB. There are substantial differences in the best performing models for our stakeholder ground￾ing exercise and Danish subset of MMTEB (Enevold￾sen et al., 2025). The Spearman correlation is 0.49 (p = 0.015). 4.3 RQ3: Downstream clustering 4.3.1 Methods Finally, we analyse the downstream differences be￾tween neural embeddings and human embeddings. Specifica… view at source ↗
Figure 7
Figure 7. Figure 7: Clustering performance. Comparison of clustering performance (Adjusted Rand Index) for hu￾man experts versus the top embedding models. Most embedding models underperform humans. ways significantly. For the Gov-AI experiment, humans are mediocre, though not significantly out￾performed by any embedding model. Overall, hu￾mans rank second on a Borda-rank (McLean, 1990). We find a high Spearman correlation bet… view at source ↗
Figure 8
Figure 8. Figure 8: Pipeline from raw images to coordinates. Top: [PITH_FULL_IMAGE:figures/full_fig_p022_8.png] view at source ↗
Figure 12
Figure 12. Figure 12: English performance gap. We observe qualitatively similar results to the Danish experiment ( [PITH_FULL_IMAGE:figures/full_fig_p023_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: English MMTEB vs stakeholder￾grounding ranks. Human EmbeddingGemma-Scandi multilingual-e5-large-instruct mxbai-embed-large-v1 mmBERTscandi-base-embedding potion-base-8M text-embedding-3-large text-embedding-ada-002 potion-base-4M potion-base-2M nb-sbert-base snowflake-arctic-embed-l-v2.0 bge-m3 paraphrase-multilingual-mpnet Qwen3-Embedding-Scandi-0.6B text-embedding-3-small multilingual-e5-large potion-mu… view at source ↗
Figure 11
Figure 11. Figure 11: Context-drift relative to rater noise. Context-drift (operationalised as within-rater, between￾round pairwise-distance variance) relative to variance between raters within a round. Context-drift is generally lower than noise, indicating no substantial influence on the results. .00 .25 .50 .75 Human EmbeddingGemma-Scandi multilingual-e5-large-instruct paraphrase-multilingual-mpnet text-embedding-ada-002 mx… view at source ↗
Figure 15
Figure 15. Figure 15: Sensitivity to expected disagreement. Com￾parison of the main binary-De = 0.5 α(d) curves with curves using empirical marginal estimates of De. 0.0 0.5 Spearman Human paraphrase-multilingual-mpnet multilingual-e5-large-instruct paraphrase-multilingual-MiniLM text-embedding-3-large nb-sbert-base EmbeddingGemma-Scandi text-embedding-3-small bge-m3 snowflake-arctic-embed-l-v2.0 text-embedding-ada-002 dataset… view at source ↗
Figure 16
Figure 16. Figure 16: Spearman metric. We show qualitatively similar results while replacing our AUC-based align￾ment metric with a raw Spearman correlation of pair￾wise distances. xlm-roberta-base M2V_base_output M2V_base_glove_subword static-similarity-mrl-multilingual-v1 potion-base-2M xlm-roberta-large paraphrase-multilingual-MiniLM nb-sbert-base potion-base-4M mmBERTscandi-base-embedding dfm-sentence-encoder-large Qwen3-E… view at source ↗
Figure 18
Figure 18. Figure 18: Data-prompts degrade performance. Prepending data-specific prompts seems to deteriorate performance. 24 [PITH_FULL_IMAGE:figures/full_fig_p024_18.png] view at source ↗
Figure 17
Figure 17. Figure 17: Model-model reliability. Embedding mod￾els have varying inter-model reliability from 0 to 0.9. Typically, they’re in a similar range to model-human reliability (see [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗
read the original abstract

Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $\rho=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Stakeholder Grounding Exercise as a method to elicit explicit expert associations and uses it to evaluate whether neural text embeddings capture the same semantic distinctions as domain experts. In a primary Danish policy-issues case study, it reports that embeddings are substantially less reliable than human experts (19-26 percentage point gap) and that this misalignment propagates to downstream clustering (Spearman ρ=0.9 between exercise ranking and cluster quality). A secondary replication on US Federal AI use cases confirms a 16pp gap using a digital protocol and different experts, arguing the gap is not an artifact of a single instrument or domain.

Significance. If the Stakeholder Grounding Exercise validly operationalizes the semantic distinctions that matter to experts, the work supplies a practical, replicable protocol for diagnosing embedding misalignment in applied text-analysis settings and demonstrates that such misalignment can materially degrade clustering quality. The cross-domain replication is a strength that increases the result's robustness.

major comments (2)
  1. [Abstract (method and primary case study)] The central empirical claim (19-26 pp reliability gap and its propagation to clustering) treats the associations elicited by the Stakeholder Grounding Exercise as the reference ground truth. The abstract provides no independent validation that these elicited associations recover the distinctions that embeddings are optimized to represent (i.e., corpus-level co-occurrence patterns), raising the possibility that the measured gap reflects elicitation mismatch rather than embedding deficiency.
  2. [Abstract (secondary study)] The secondary replication uses the same exercise protocol (albeit digitized) on a different expert community; because it does not introduce an independent criterion for what embeddings ought to capture, it cannot break the dependence on the exercise as ground truth and therefore does not fully address the concern that the gap may be protocol-specific.
minor comments (2)
  1. [Abstract] The abstract states the reliability gap and Spearman correlation but does not report sample sizes, number of experts, number of items, statistical procedures, or exclusion criteria; these details are required to evaluate whether the reported gaps and correlation are supported by the data.
  2. [Abstract] The phrase 'substantially less reliable' is used without defining the reliability metric or providing the raw agreement rates; a table or equation showing how the percentage-point gap is computed would improve transparency.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive comments on the validation of the Stakeholder Grounding Exercise. We address each major comment below, clarifying the intended role of expert associations as the reference and noting revisions we will make to improve transparency.

read point-by-point responses
  1. Referee: [Abstract (method and primary case study)] The central empirical claim (19-26 pp reliability gap and its propagation to clustering) treats the associations elicited by the Stakeholder Grounding Exercise as the reference ground truth. The abstract provides no independent validation that these elicited associations recover the distinctions that embeddings are optimized to represent (i.e., corpus-level co-occurrence patterns), raising the possibility that the measured gap reflects elicitation mismatch rather than embedding deficiency.

    Authors: The Stakeholder Grounding Exercise is explicitly designed to operationalize the semantic distinctions that matter to domain experts in applied text-analysis settings, rather than to recover corpus-level co-occurrence patterns. Embeddings are trained on co-occurrence, yet the paper's motivating question is whether they align with the distinctions experts use when interpreting results; misalignment with experts can invalidate downstream analyses even if co-occurrence is captured. The reported Spearman ρ=0.9 correlation between exercise-derived rankings and clustering quality supplies direct evidence that the measured gap has practical consequences for a standard embedding use case. We did not include a direct comparison against corpus co-occurrence statistics because that is not the reference the exercise targets. In revision we will expand the abstract and methods to state this rationale more explicitly and add a limitations paragraph acknowledging that an auxiliary co-occurrence analysis could further contextualize the results. revision: partial

  2. Referee: [Abstract (secondary study)] The secondary replication uses the same exercise protocol (albeit digitized) on a different expert community; because it does not introduce an independent criterion for what embeddings ought to capture, it cannot break the dependence on the exercise as ground truth and therefore does not fully address the concern that the gap may be protocol-specific.

    Authors: The secondary study employs a distinct expert population (US Federal AI policy experts versus Danish policy experts) and a fully digital protocol, thereby replicating the reliability gap across domains, languages, and elicitation formats. This provides evidence that the gap is not an artifact of one specific group or paper-based instrument. We agree, however, that it retains dependence on the exercise as the reference criterion and does not introduce an orthogonal benchmark such as corpus co-occurrence. In the revised manuscript we will update the abstract and discussion to more precisely describe the replication's contribution while acknowledging this remaining dependence. revision: partial

standing simulated objections not resolved
  • Direct, independent validation of the elicited associations against corpus-level co-occurrence statistics that does not rely on the Stakeholder Grounding Exercise itself.

Circularity Check

0 steps flagged

Empirical comparison with independent replication; no derivation chain or fitted inputs presented as predictions

full rationale

The paper introduces the Stakeholder Grounding Exercise to elicit explicit human associations and reports an empirical reliability gap (19-26 pp) between embeddings and experts, plus a Spearman correlation to downstream clustering. No equations, parameter fits, or first-principles derivations are described. The secondary replication uses a different protocol, language, and expert community, supplying independent support rather than a self-citation chain. The central claim is a measured performance difference against the elicited associations; this is a methodological benchmark choice, not a reduction of outputs to inputs by construction. Concerns about whether the exercise supplies the 'correct' ground truth fall under validity, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no information on free parameters, axioms, or invented entities is provided.

pith-pipeline@v0.9.1-grok · 5727 in / 1149 out tokens · 52754 ms · 2026-06-29T18:48:16.904635+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 52 canonical work pages · 4 internal anchors

  1. [1]

    Aggarwal and ChengXiang Zhai

    Charu C. Aggarwal and ChengXiang Zhai. 2012. https://doi.org/10.1007/978-1-4614-3223-4_4 A survey of text clustering algorithms . In Charu C. Aggarwal and ChengXiang Zhai, editors, Mining Text Data, pages 77--128. Springer US, Boston, MA

  2. [2]

    Adnan El Assadi, Isaac Chung, Roman Solomatin, Niklas Muennighoff, and Kenneth Enevoldsen. 2025. https://openreview.net/forum?id=rcmfu1ydAf HUME : Measuring the human-model performance gap in text embedding tasks . In The Fourteenth International Conference on Learning Representations

  3. [3]

    Andrew M. Bean, Ryan Othniel Kearns, Angelika Romanou, Franziska Sofia Hafner, Harry Mayne, Jan Batzner, Negar Foroutan, Chris Schmitz, Karolina Korgul, Hunar Batra, Oishi Deb, Emma Beharry, Cornelius Emde, Thomas Foster, Anna Gausen, Mar \'i a Grandury, Simeng Han, Valentin Hofmann, Lujain Ibrahim, and 23 others. 2025. https://openreview.net/forum?id=mdA...

  4. [4]

    Blanchard and Ishani Banerji

    Simon J. Blanchard and Ishani Banerji. 2016. https://doi.org/10.3758/s13428-015-0644-6 Evidence-based recommendations for designing free-sorting experiments . Behavior Research Methods, 48(4):1318--1336

  5. [5]

    Tolga Bolukbasi, Kai-Wei Chang, James Zou, Venkatesh Saligrama, and Adam Kalai. 2016. https://dl.acm.org/doi/10.5555/3157382.3157584 Man is to computer programmer as woman is to homemaker? debiasing word embeddings . In Proceedings of the 30th International Conference on Neural Information Processing Systems , NIPS '16, pages 4356--4364, Red Hook, NY, USA...

  6. [6]

    Bowman and George Dahl

    Samuel R. Bowman and George Dahl. 2021. https://doi.org/10.18653/v1/2021.naacl-main.385 What will it take to fix benchmarking in natural language understanding? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics : Human Language Technologies , pages 4843--4855, Online. Association for Compu...

  7. [7]

    Gary Bradski. 2000. https://elibrary.ru/item.asp?id=4934581 The opencv library . Dr. Dobb's Journal: Software Tools for the Professional Programmer, 25(11):120--123

  8. [8]

    Mikael Brunila. 2025. https://doi.org/10.1177/20539517251386055 Cosine capital: Large language models and the embedding of all things . Big Data & Society, 12(4):20539517251386055

  9. [9]

    Jianlyu Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2024. https://doi.org/10.18653/v1/2024.findings-acl.137 M3-embedding: Multi-linguality, multi-functionality, multi-granularity text embeddings through self-knowledge distillation . In Findings of the Association for Computational Linguistics : ACL 2024 , pages 2318--2335, Bangkok...

  10. [10]

    Isaac Chung, Imene Kerboua, M \'a rton Kardos, Roman Solomatin, and Kenneth Enevoldsen. 2025. https://openreview.net/forum?id=qcPJs0KRZW Maintaining MTEB : Towards long term usability and reproducibility of embedding benchmarks . In Championing Open-source DEvelopment in ML Workshop @ ICML25

  11. [11]

    Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzm \'a n, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. https://doi.org/10.18653/v1/2020.acl-main.747 Unsupervised cross-lingual representation learning at scale . In Proceedings of the 58th Annual Meeting of the Association for Comp...

  12. [12]

    Psychological Bulletin , author =

    Lee J. Cronbach and Paul E. Meehl. 1955. https://doi.org/10.1037/h0040957 Construct validity in psychological tests . Psychological bulletin, 52(4):281

  13. [13]

    Kenneth Enevoldsen, Isaac Chung, Imene Kerboua, M \'a rton Kardos, Ashwin Mathur, David Stap, Jay Gala, Wissam Siblini, Dominik Krzemi \'n ski, Genta Indra Winata, Saba Sturua, Saiteja Utpala, Mathieu Ciancone, Marion Schaeffer, Diganta Misra, Shreeya Dhakal, Jonathan Rystr m, Roman Solomatin, \"O mer Veysel C a g atan, and 63 others. 2025. https://openre...

  14. [14]

    Kenneth Enevoldsen, M \'a rton Kardos, Niklas Muennighoff, and Kristoffer Nielbo. 2024. https://openreview.net/forum?id=2WbuKAfOxP#discussion The scandinavian embedding benchmarks: Comprehensive assessment of multilingual and monolingual text embedding . In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track

  15. [15]

    Fangxiaoyu Feng, Yinfei Yang, Daniel Cer, Naveen Arivazhagan, and Wei Wang. 2022. https://doi.org/10.18653/v1/2022.acl-long.62 Language-agnostic BERT sentence embedding . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 878--891, Dublin, Ireland. Association for Computational Linguistics

  16. [16]

    Natalia Flechas Manrique, Wanqian Bao, Aurelie Herbelot, and Uri Hasson. 2023. https://doi.org/10.18653/v1/2023.blackboxnlp-1.13 Enhancing interpretability using human similarity judgements to prune word embeddings . In Proceedings of the 6th BlackboxNLP Workshop : Analyzing and Interpreting Neural Networks for NLP , pages 169--179, Singapore. Association...

  17. [17]

    Miranda Fricker. 2007. https://doi.org/10.1093/acprof:oso/9780198237907.003.0008 Hermeneutical Injustice , 1 edition, pages 147--175. Oxford University PressOxford

  18. [18]

    Miranda Fricker. 2017. Evolving concepts of epistemic injustice. In The Routledge Handbook of Epistemic Injustice. Routledge

  19. [19]

    Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian

    Sorelle A. Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2021. https://doi.org/10.1145/3433949 The (im)possibility of fairness: Different value systems require different mechanisms for fair decision making . Commun. ACM, 64(4):136--143

  20. [20]

    Vahid Ghafouri, Jose Such, and Guillermo Suarez-Tangil . 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1171 I love pineapple on pizza != I hate pineapple on pizza: Stance-aware sentence transformers for opinion mining . In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , pages 21046--21058, Miami, Florida, USA. ...

  21. [21]

    Robert Goldstone. 1994. https://doi.org/10.3758/BF03204653 An efficient method for obtaining similarity data . Behavior Research Methods, Instruments, & Computers, 26(4):381--386

  22. [22]

    Casper Hansen, Christian Hansen, Lucas Maystre, Rishabh Mehrotra, Brian Brost, Federico Tomasi, and Mounia Lalmas. 2020. https://doi.org/10.1145/3383313.3412248 Contextual and sequential user embeddings for large-scale music recommendation . In Fourteenth ACM Conference on Recommender Systems , pages 53--62, Virtual Event Brazil. ACM

  23. [23]

    Heine, and Ara Norenzayan

    Joseph Henrich, Steven J. Heine, and Ara Norenzayan. 2010. https://doi.org/10.1038/466029a Most people are not WEIRD . Nature, 466(7302):29--29

  24. [24]

    Hout and Stephen D

    Michael C. Hout and Stephen D. Goldinger. 2016. https://doi.org/10.1037/xge0000144 SpAM is convenient but also satisfying: Reply to verheyen et al. (2016) . Journal of Experimental Psychology: General, 145(3):383--387

  25. [25]

    Hout, Stephen D

    Michael C. Hout, Stephen D. Goldinger, and Ryan W. Ferguson. 2013. https://doi.org/10.1037/a0028860 The versatility of SpAM : A fast, efficient, spatial method of data collection for multidimensional scaling . Journal of Experimental Psychology: General, 142(1):256--281

  26. [26]

    Anastasiia Hrytsyna and Rodrigo Alves. 2025. https://doi.org/10.1145/3709148 From representation to response: Assessing the alignment of large language models with human judgment patterns . ACM Trans. Intell. Syst. Technol., 16(6):136:1--136:23

  27. [27]

    Iskandarli

    Gunay Y. Iskandarli. 2020. https://doi.org/10.5815/ijitcs.2020.06.01 Applying clustering and topic modeling to automatic analysis of citizens' comments in EGovernment . International Journal of Information Technology and Computer Science, 12(6):1--10

  28. [28]

    Jacobs and Hanna Wallach

    Abigail Z. Jacobs and Hanna Wallach. 2021. https://doi.org/10.1145/3442188.3445901 Measurement and fairness . In Proceedings of the 2021 ACM Conference on Fairness , Accountability , and Transparency , FAccT '21, pages 375--385, New York, NY, USA. Association for Computing Machinery

  29. [29]

    Yanrong Ji, Zhihan Zhou, Han Liu, and Ramana V Davuluri. 2021. https://doi.org/10.1093/bioinformatics/btab083 DNABERT : Pre-trained bidirectional encoder representations from transformers model for DNA-language in genome . Bioinformatics, 37(15):2112--2120

  30. [30]

    Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. 2020. https://proceedings.neurips.cc/paper/2020/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html Supervised contrastive learning . In Advances in Neural Information Processing Systems , volume 33, pages 18661--18673. Curran...

  31. [31]

    Nikolaus Kriegeskorte and Marieke Mur. 2012. https://doi.org/10.3389/fpsyg.2012.00245 Inverse MDS : Inferring dissimilarity structure from multiple item arrangements . Frontiers in Psychology, 3

  32. [32]

    Klaus Krippendorff. 2011. https://repository.upenn.edu/handle/20.500.14332/2089 Computing krippendorff's alpha-reliability

  33. [33]

    Ida Marie S Lassen, Jens Christian Bjerring, and Kristoffer L Nielbo. 2025. https://doi.org/10.1177/20539517251365228 Silencing in data science practices . Big Data & Society, 12(3):20539517251365228

  34. [34]

    Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, and Furu Wei. 2023. https://doi.org/10.1609/aaai.v37i11.26538 TrOCR : Transformer-based optical character recognition with pre-trained models . In Proceedings of the Thirty-seventh AAAI Conference on Artificial Intelligence and Thirty-fifth Conference on Inno...

  35. [35]

    Yu, and Lifang He

    Qian Li, Hao Peng, Jianxin Li, Congying Xia, Renyu Yang, Lichao Sun, Philip S. Yu, and Lifang He. 2022. https://doi.org/10.1145/3495162 A survey on text classification: From traditional to deep learning . ACM Trans. Intell. Syst. Technol., 13(2):31:1--31:41

  36. [36]

    Tao Li, Daniel Khashabi, Tushar Khot, Ashish Sabharwal, and Vivek Srikumar. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.311 UNQOVERing stereotyping biases via underspecified questions . In Findings of the Association for Computational Linguistics : EMNLP 2020 , pages 3475--3489, Online. Association for Computational Linguistics

  37. [37]

    Xianming Li and Jing Li. 2024. https://doi.org/10.18653/v1/2024.acl-long.101 AoE : Angle-optimized embeddings for semantic textual similarity . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics ( Volume 1: Long Papers ) , pages 1825--1839, Bangkok, Thailand. Association for Computational Linguistics

  38. [38]

    Yanni Alexander Loukissas. 2022. https://mitpress.mit.edu/9780262545174/all-data-are-local/ All Data Are Local: Thinking Critically in a Data-Driven Society . MIT Press, Cambridge, MA, USA

  39. [39]

    Anders Koed Madsen, Johan Irving S ltoft, and Anne-Sofie Klitgaard-Sofie . 2025. Datafantasi: Fra styring til l ring i en verden af vilde problemer. In Organisatorisk L ring Og Forandring: Individer, Grupper Og Processer, 1 edition, pages 137--155. Samfundslitteratur, K benhavn

  40. [40]

    I. McLean. 1990. https://doi.org/10.1007/BF01560577 The borda and condorcet principles: Three medieval applications . Social Choice and Welfare, 7(2):99--108

  41. [41]

    John Xavier Morris and Alexander M. Rush. 2024. https://openreview.net/forum?id=Wqsk3FbD6D Contextual document embeddings . In The Thirteenth International Conference on Learning Representations

  42. [42]

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. 2023. https://doi.org/10.18653/v1/2023.eacl-main.148 MTEB : Massive text embedding benchmark . In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics , pages 2014--2037, Dubrovnik, Croatia. Association for Computational Linguistics

  43. [43]

    GPT-4o System Card

    OpenAI, Aaron Hurst, Adam Lerer, Adam P. Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, A. J. Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, Aleksander M a dry, Alex Baker-Whitcomb , Alex Beutel, Alex Borzunov, Alex Carney, Alex Chow, Alex Kirillov, Alex Nichol, and 400 others. 2024. https://doi.org/10.48550/arXiv.2410.21276 GPT-4o system card . ...

  44. [44]

    Bender, Emily Denton, and Alex Hanna

    Amandalynne Paullada, Inioluwa Deborah Raji, Emily M. Bender, Emily Denton, and Alex Hanna. 2021. https://doi.org/10.1016/j.patter.2021.100336 Data and its (dis)contents: A survey of dataset development and use in machine learning research . Patterns, 2(11)

  45. [45]

    Uwe Peters. 2022. https://doi.org/10.1007/s13347-022-00512-8 Algorithmic political bias in artificial intelligence systems . Philosophy & Technology, 35(2):25

  46. [46]

    William M. Rand. 1971. https://doi.org/10.1080/01621459.1971.10482356 Objective criteria for the evaluation of clustering methods . Journal of the American Statistical Association, 66(336):846--850

  47. [47]

    Maribeth Rauh, John Mellor, Jonathan Uesato, Po-Sen Huang, Johannes Welbl, Laura Weidinger, Sumanth Dathathri, Amelia Glaese, Geoffrey Irving, Iason Gabriel, William Isaac, and Lisa Anne Hendricks. 2022. Characteristics of harmful text: Towards rigorous benchmarking of language models. In Proceedings of the 36th International Conference on Neural Informat...

  48. [48]

    Nils Reimers and Iryna Gurevych. 2019. https://doi.org/10.18653/v1/D19-1410 Sentence- BERT : Sentence embeddings using siamese BERT-networks . In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing ( EMNLP-IJCNLP ) , pages 3982--3992, Hong Kong, Ch...

  49. [49]

    Philip Resnik and Jimmy Lin. 2010. https://doi.org/10.1002/9781444324044.ch11 Evaluation of NLP systems . In The Handbook of Computational Linguistics and Natural Language Processing, chapter 11, pages 271--295. John Wiley & Sons, Ltd

  50. [50]

    Russell Richie, Bryan White, Sudeep Bhatia, and Michael C. Hout. 2020. https://doi.org/10.3758/s13428-020-01362-y The spatial arrangement method of measuring similarity can capture high-dimensional semantic structures . Behavior Research Methods, 52(5):1906--1928

  51. [51]

    Paul R \"o ttger, Valentin Hofmann, Valentina Pyatkin, Musashi Hinck, Hannah Kirk, Hinrich Schuetze, and Dirk Hovy. 2024. https://doi.org/10.18653/v1/2024.acl-long.816 Political compass or spinning arrow? Towards more meaningful evaluations for values and opinions in large language models . In Proceedings of the 62nd Annual Meeting of the Association for ...

  52. [52]

    Jonathan Hvithamar Rystr m, Hannah Rose Kirk, and Scott Hale. 2025. https://aclanthology.org/2025.ommm-1.9/ Multilingual != multicultural: Evaluating gaps between multilingual capabilities and cultural alignment in LLMs . In Proceedings of Interdisciplinary Workshop on Observations of Misunderstood , Misguided and Malicious Use of Language Models , pages ...

  53. [53]

    Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Benjamin W

    Olawale Elijah Salaudeen, Anka Reuel, Ahmed M. Ahmed, Suhana Bedi, Zachary Robertson, Sudharsan Sundar, Benjamin W. Domingue, Angelina Wang, and Sanmi Koyejo. 2025. https://openreview.net/forum?id=2Bw6uC49QF&referrer= In NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle : Benchmarks , Emergent Abilities , and Scaling

  54. [54]

    James C. Scott. 2020. https://books.google.com/books?hl=en&lr=&id=Qe_RDwAAQBAJ&oi=fnd&pg=PP1&dq=seeing+like+a+state+jc+scott&ots=FA5E2LC-8p&sig=S64GLxV1q6dxcdtD9YHbFXz_io0 Seeing like a State: How Certain Schemes to Improve the Human Condition Have Failed . yale university Press

  55. [55]

    Mona Sloane, Emanuel Moss, Olaitan Awomolo, and Laura Forlano. 2022. https://doi.org/10.1145/3551624.3555285 Participation is not a design fix for machine learning . In Equity and Access in Algorithms , Mechanisms , and Optimization , pages 1--6, Arlington VA USA. ACM

  56. [56]

    Spearman

    C. Spearman. 1904. https://doi.org/10.2307/1412159 The proof and measurement of association between two things . The American Journal of Psychology, 15(1):72--101

  57. [57]

    Smith, Luke Zettlemoyer, and Tao Yu

    Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. 2023. https://doi.org/10.18653/v1/2023.findings-acl.71 One embedder, any task: Instruction-finetuned text embeddings . In Findings of the Association for Computational Linguistics : ACL 2023 , pages 1102--1121, Toronto, Ca...

  58. [58]

    Stephan Tulkens and Thomas van Dongen . 2024. https://doi.org/10.5281/zenodo.17270888 Model2Vec : Fast state-of-the-art static embeddings

  59. [59]

    Henrique Schechter Vera, Sahil Dua, Biao Zhang, Daniel Salz, Ryan Mullins, Sindhu Raghuram Panyam, Sara Smoot, Iftekhar Naim, Joe Zou, Feiyang Chen, Daniel Cer, Alice Lisak, Min Choi, Lucas Gonzalez, Omar Sanseviero, Glenn Cameron, Ian Ballantyne, Kat Black, Kaifeng Chen, and 70 others. 2025. https://doi.org/10.48550/arXiv.2509.20354 EmbeddingGemma : Powe...

  60. [60]

    Daniela Vianna, Edleno Silva de Moura , and Altigran Soares da Silva . 2024. https://doi.org/10.1007/s10506-023-09371-w A topic discovery approach for unsupervised organization of legal document collections . Artificial Intelligence and Law, 32(4):1045--1074

  61. [61]

    Voorhees

    Ellen M. Voorhees. 1999. http://trec.nist.gov/pubs/trec8/papers/qa_report.pdf The trec-8 question answering track report . In Trec, volume 99, pages 77--82

  62. [62]

    Voorhees and Donna K

    Ellen M. Voorhees and Donna K. Harman. 2005. https://aclanthology.org/anthology-files/anthology-files/pdf/J/J06/J06-4008.pdf TREC : Experiment and Evaluation in Information Retrieval , volume 63. MIT press Cambridge

  63. [63]

    Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2021. https://doi.org/10.2139/ssrn.3792772 Bias preservation in machine learning: The legality of fairness metrics under EU non-discrimination law . West Virginia Law Review

  64. [64]

    Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P

    Hanna Wallach, Meera Desai, A. Feder Cooper, Angelina Wang, Chad Atalla, Solon Barocas, Su Lin Blodgett, Alexandra Chouldechova, Emily Corvi, P. Alex Dow, Jean Garcia-Gathright , Alexandra Olteanu, Nicholas J. Pangakis, Stefanie Reed, Emily Sheng, Dan Vann, Jennifer Wortman Vaughan, Matthew Vogel, Hannah Washington, and Abigail Z. Jacobs. 2025. https://op...

  65. [65]

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. 2024. https://doi.org/10.48550/arXiv.2402.05672 Multilingual E5 text embeddings: A technical report . Preprint, arXiv:2402.05672

  66. [66]

    Joe H. Ward. 1963. https://doi.org/10.1080/01621459.1963.10500845 Hierarchical grouping to optimize an objective function . Journal of the American Statistical Association, 58(301):236--244

  67. [67]

    Sigge Winther Nielsen. 2025. The Puzzle State: How to Govern Wicked Problems in Western Democracies, 1. printing edition. Gad Publishers, Copenhagen

  68. [68]

    Ledell Wu, Adam Fisch, Sumit Chopra, Keith Adams, Antoine Bordes, and Jason Weston. 2018. https://doi.org/10.1609/aaai.v32i1.11996 StarSpace : Embed all the things! Proceedings of the AAAI Conference on Artificial Intelligence, 32(1)

  69. [69]

    Graham, Devon Greyson, Noni E

    Weiai Wayne Xu, Jean Marie Tshimula, \`E ve Dub \'e , Janice E. Graham, Devon Greyson, Noni E. MacDonald, and Samantha B. Meyer. 2022. https://doi.org/10.2196/41198 Unmasking the Twitter discourses on masks during the COVID-19 pandemic: User cluster--based BERT topic modeling approach . JMIR Infodemiology, 2(2):e41198

  70. [70]

    Puxuan Yu, Luke Merrick, Gaurav Nuti, and Daniel Campos. 2024. https://doi.org/10.48550/arXiv.2412.04506 Arctic-embed 2.0: Multilingual retrieval without compromise . Preprint, arXiv:2412.04506

  71. [71]

    Yanzhao Zhang, Mingxin Li, Dingkun Long, Xin Zhang, Huan Lin, Baosong Yang, Pengjun Xie, An Yang, Dayiheng Liu, Junyang Lin, Fei Huang, and Jingren Zhou. 2025. https://doi.org/10.48550/arXiv.2506.05176 Qwen3 embedding: Advancing text embedding and reranking through foundation models . Preprint, arXiv:2506.05176

  72. [72]

    online" 'onlinestring :=

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint eprinttype howpublished institution journal key month note number organization pages publisher school series title type volume year doi pubmed url lastchecked label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block STRING...

  73. [73]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...