When Similar Means Different: Evaluating LLMs on Arabic-Hebrew Cognates
Pith reviewed 2026-06-27 06:39 UTC · model grok-4.3
The pith
Large language models rely on surface-form similarity for Arabic-Hebrew pairs and lose accuracy on false friends and loanwords.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Arabic and Hebrew share a lexicon of true cognates, false friends, and modern loanwords that creates form-meaning conflicts for LLMs. Testing on SemCog Bench shows models reach high accuracy on true cognates yet drop sharply on false friends and loanwords, driven by surface-form similarity; sentence context yields only modest gains and does not overcome the misleading signals.
What carries the argument
SemCog Bench: a curated benchmark of 1,858 annotated Arabic-Hebrew word pairs, evaluated under four input representations (raw, diacritized, Romanized, and phonetic) to separate reliance on surface similarity from genuine semantic disambiguation.
If this is right
- Current LLMs exhibit a limitation in separating form from meaning across related languages.
- Surface similarity overrides contextual cues when the two conflict.
- Sentence-level information alone does not resolve form-meaning mismatches.
- SemCog Bench supplies a concrete test set for measuring progress on cross-lingual semantic reasoning.
Where Pith is reading between the lines
- The same surface-form bias may appear in other language pairs that share cognates and loanwords.
- Explicit training signals that penalize false-friend errors could reduce the observed gap.
- Extending the benchmark to spoken or additional written varieties would check whether the pattern holds beyond the current sample.
Load-bearing premise
The manually created labels for cognate status and semantic disambiguation accurately represent the Arabic-Hebrew lexicon and the chosen input formats expose the relevant conflicts.
What would settle it
Re-running the same models on a fresh, independently verified collection of Arabic-Hebrew pairs and finding no accuracy gap between true cognates and false friends would falsify the surface-similarity reliance claim.
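The test described above comes down to comparing per-category accuracy. A minimal sketch follows, assuming each evaluation record is a hypothetical `(category, gold, predicted)` triple; the record layout, the label strings, and the toy data are illustrative assumptions, not the paper's data format or results.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for (category, gold, predicted) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, gold, predicted in records:
        total[category] += 1
        correct[category] += int(gold == predicted)
    return {category: correct[category] / total[category] for category in total}


# Toy records invented for illustration only (not results from the paper).
toy_records = [
    ("true_cognate", "same", "same"),
    ("true_cognate", "same", "same"),
    ("false_friend", "different", "same"),
    ("false_friend", "different", "different"),
]

acc = accuracy_by_category(toy_records)
# A gap near zero on a fresh, independently verified sample would count
# against the surface-similarity claim; a large positive gap would support it.
gap = acc["true_cognate"] - acc["false_friend"]
print(acc, gap)
```

On real data the gap would need an uncertainty estimate before it could be read as evidence in either direction.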
Original abstract
Arabic and Hebrew, as closely related Semitic languages, share a substantial lexicon of true cognates, misleading false friends, and modern loanwords. This overlap poses a challenge for cross-lingual semantic understanding in large language models (LLMs). To evaluate this capability, we introduce SemCog Bench, a curated benchmark of 1,858 Arabic-Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. We evaluate open-source and commercial LLMs across multiple input representations (raw, diacritized, Romanized, and phonetic) and reveal a critical gap in cross-lingual reasoning. While models achieve high accuracy on true cognates, performance drops sharply on false friends and loanwords, reflecting a strong reliance on surface-form similarity. Furthermore, sentence-level context yields only modest improvements, suggesting that contextual cues alone are insufficient to overcome misleading form-based signals. These findings reveal a fundamental limitation of current LLMs in resolving cross-lingual form-meaning conflicts and establish SemCog Bench as a rigorous benchmark for multilingual semantic reasoning. Our code and data are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SemCog Bench, a manually curated dataset of 1,858 Arabic-Hebrew word pairs with sentence-level annotations for cognate identification and semantic disambiguation. It evaluates open-source and commercial LLMs across raw, diacritized, Romanized, and phonetic input representations, claiming high accuracy on true cognates but sharp performance drops on false friends and loanwords (indicating surface-form reliance), with only modest gains from sentence context.
Significance. If the benchmark labels prove reliable, the work identifies a concrete limitation in LLMs' cross-lingual semantic reasoning for related languages and provides a public benchmark plus code for future evaluation; the public data release is a clear strength.
Major comments (2)
- [§3] §3 (SemCog Bench construction): the 1,858 pairs were produced by manual curation, yet no inter-annotator agreement figures, adjudication protocol, or comparison to an independent lexicon are reported. Because all headline performance gaps (true cognates vs. false friends/loanwords) are measured against these labels, label noise would directly distort the central claim of surface-form reliance.
- [§4] §4 (experimental setup): exact model versions/checkpoints, prompt templates, and any data-leakage checks are not specified. Without these, it is impossible to verify whether the reported accuracy patterns reflect genuine form-meaning conflicts or artifacts of training-data overlap.
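The inter-annotator agreement figure requested in the first comment could be computed along the following lines. This is a minimal sketch of Cohen's kappa for two annotators; the label strings and the toy annotations are hypothetical, and nothing here reflects which agreement statistic the authors actually use.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    labels = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations, for illustration only.
annotator_1 = ["true_cognate", "true_cognate", "false_friend", "false_friend"]
annotator_2 = ["true_cognate", "true_cognate", "false_friend", "true_cognate"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

Kappa is sensitive to skewed label distributions, so a report would ideally state the per-label counts alongside it.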
Minor comments (2)
- [§4.2] The description of how sentence-level context is concatenated with word pairs could be clarified with an explicit example prompt.
- [§5] Table 2 (or equivalent results table) would benefit from explicit statistical significance markers for the reported accuracy drops.
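One way to attach significance markers to an accuracy drop between two categories is a two-proportion z-test, sketched below. The choice of test is an assumption made here for illustration, and the counts are invented; the paper's actual results table is not reproduced.

```python
import math


def two_proportion_z_test(correct_1, n_1, correct_2, n_2):
    """Two-sided z-test for a difference between two independent accuracies."""
    p_1 = correct_1 / n_1
    p_2 = correct_2 / n_2
    pooled = (correct_1 + correct_2) / (n_1 + n_2)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    if std_err == 0:
        raise ValueError("pooled accuracy is 0 or 1; the test is undefined")
    z = (p_1 - p_2) / std_err
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts: 90/100 correct on one category, 60/100 on another.
z, p_value = two_proportion_z_test(90, 100, 60, 100)
print(z, p_value)
```

This unpaired test suits comparisons across categories, which contain different items. Comparing two input representations on the same items would call for a paired test such as McNemar's instead.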
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the two major comments point by point below and will revise the manuscript to improve transparency on dataset construction and experimental details.
- Referee: [§3] §3 (SemCog Bench construction): the 1,858 pairs were produced by manual curation, yet no inter-annotator agreement figures, adjudication protocol, or comparison to an independent lexicon are reported. Because all headline performance gaps (true cognates vs. false friends/loanwords) are measured against these labels, label noise would directly distort the central claim of surface-form reliance.
  Authors: We acknowledge that the original manuscript does not report inter-annotator agreement figures or an explicit adjudication protocol. The 1,858 pairs were curated by a small team of linguists with expertise in both languages through iterative review and consensus. In revision we will expand §3 with a detailed description of the curation and adjudication process and will add a comparison of the labels against available bilingual lexicons where such resources exist. These additions will allow readers to evaluate label reliability directly. Revision: yes.
- Referee: [§4] §4 (experimental setup): exact model versions/checkpoints, prompt templates, and any data-leakage checks are not specified. Without these, it is impossible to verify whether the reported accuracy patterns reflect genuine form-meaning conflicts or artifacts of training-data overlap.
  Authors: We agree that precise experimental specifications are required for reproducibility. The revised version will list the exact model checkpoints (Hugging Face identifiers for open-source models and API versions for commercial models), include the complete prompt templates in an appendix, and report data-leakage checks such as n-gram overlap analysis between the benchmark and publicly known training corpora. These changes will let readers assess whether the performance patterns arise from form-meaning conflicts rather than memorization. Revision: yes.
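The n-gram overlap analysis mentioned in the second response could look roughly like the sketch below: the fraction of a benchmark sentence's word n-grams that also appear in a reference corpus. Whitespace tokenization, the value of `n`, and the toy strings are assumptions made for illustration; the authors' actual leakage check is not specified.

```python
def word_ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(benchmark_text, corpus_text, n=3):
    """Fraction of the benchmark's n-grams that also occur in the corpus."""
    benchmark = word_ngrams(benchmark_text, n)
    if not benchmark:
        return 0.0  # text shorter than n tokens: nothing to compare
    corpus = word_ngrams(corpus_text, n)
    return len(benchmark & corpus) / len(benchmark)


# Toy English strings for readability; real use would pass the Arabic or
# Hebrew benchmark sentences and text sampled from a candidate training corpus.
overlap = ngram_overlap(
    "the cat sat on the mat",
    "yesterday the cat sat on a chair",
)
print(overlap)
```

A high overlap flags possible contamination but cannot prove memorization, and a low overlap against public corpora says nothing about undisclosed training data.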
Circularity Check
No circularity: empirical evaluation against external annotations
The paper introduces SemCog Bench as a manually curated collection of 1,858 Arabic-Hebrew pairs and reports LLM accuracies on cognate identification and semantic disambiguation by direct comparison of model outputs to those annotations. No equations, parameter fits, or derivations appear; performance figures are not constructed from the paper's own inputs but measured against held-out labels. No self-citation chains, uniqueness theorems, or ansatzes are invoked to support the central claims. The evaluation is therefore self-contained against an external benchmark and receives the default non-circularity finding.
[6]
These words were initially classified using LLMs
Task Overview The objective of this annotation task is to classify Arabic-Hebrew word pairs as either True Cognates or False Friends based on semantic equivalence. These words were initially classified using LLMs. Your role as an annotator to verify this classification. After that, you will be required to classify whether Arabic-Hebrew word pairs are loan...
-
[7]
At least one core meaning must be present in both languages
Annotation Criteria 2.1 Stage 1: Cognate Annotation Label Definition True Cognate Words which share a common Semitic root AND have semantically overlapping meanings (share a meaning). At least one core meaning must be present in both languages. (ex. The words חלב and حليب which mean milk are true cognates) False Friend Words share similar form BUT have co...
-
[8]
Read the two words and their meanings, is the meaning correct?
-
[9]
Check if any meaning overlaps between the two words from the two languages
-
[10]
If they have one shared meaning, then the words are True Cognate
-
[11]
If you accept the classification in column H (type), be it True cognate or False friends, then in annotator_type (column K), choose Accept, otherwise, choose Reject
-
[12]
words that are not commonly used in the two languages), only keep words that are used in MSA and Modern Hebrew
Please Reject the borrowed word from English, archaic words (i.e. words that are not commonly used in the two languages), only keep words that are used in MSA and Modern Hebrew. Also reject, and names of places/locations, animals etc. As they are not considered cognates
-
[13]
If you have any comments, add a note in column L
-
[14]
reasoning
You will also see a “reasoning” column; this is generated by Gemini-3.1-pro-preview. This is only for researchers’ references; please use your own judgement when providing your annotation! Google Sheet Columns to Fill: Column Required Description annotator_type (col. K) Yes Accept/Reject note Optional Brief explanation or comment (if needed) 2.2 Stage 2: ...
-
[15]
Read the Arabic sentence (and its translation if not sure about meaning)
-
[16]
Natural" or
Evaluate naturalness: Does it sound "Natural" or "Unnatural" based on the definition above?
-
[17]
Add comment if needed
-
[18]
Read the Hebrew sentence and its translation
-
[19]
Natural" or
Evaluate naturalness: "Natural" or "Unnatural" based on the definition above of naturalness
-
[20]
Add optional note if needed
-
[21]
There may be issues with translations
You don't need to verify the correctness of English translation; it is used for easier understanding. There may be issues with translations
-
[22]
The English meaning of Arabic and Hebrew words is attached also for reference
-
[23]
If you marked the words as cognates and you find that the sentences are awkward (unnatural), please suggest a new sentence without changing the word form. 2.3 Stage 3: Loanword word-level annotation Loanwords: the word that has been borrowed from one language (the donor language) and incorporated into the vocabulary of another language (the recipient lang...
-
[24]
The word should originate from a non-Semitic source language, Both Arabic and Hebrew forms should be phonologically similar
Verify the word is a genuine loanword, The word should be commonly used in Modern Standard Arabic (MSA) and Modern Hebrew. The word should originate from a non-Semitic source language, Both Arabic and Hebrew forms should be phonologically similar
-
[25]
The provided meaning should be correct for BOTH languages
-
[26]
If meanings differ significantly between Arabic and Hebrew, note this in the comments
-
[27]
iv)Words that are completely different in meaning between the two languages
Words to REJECT: i) Words that are NOT loanwords (native Semitic words) ii) Archaic words that are rarely used in modern contexts Proper nouns (names of people, places, brands) iii) Words that are only used in dialectal Arabic, not MSA. iv)Words that are completely different in meaning between the two languages
-
[28]
please use your own judgement when providing your annotation!
You will also see other columns such as entry, id, arabic_ipa, hebrew_ipa, phonetic_similairty, loan_source, those columns are only used for easier understanding by researchers. please use your own judgement when providing your annotation!
-
[29]
2.4 Stage 4: Loanword Sentence-level annotation The requirement is same as stage 2
If you have any comments, add a note in column M. 2.4 Stage 4: Loanword Sentence-level annotation The requirement is same as stage 2
-
[30]
Strict Guidelines & Critical Rules Here are some important rules to follow:
-
[31]
NO use of LLMs or AI tools (ChatGPT, Claude, Gemini, etc.)
-
[32]
NO use of machine translation
-
[33]
NO guessing — consult a dictionary if uncertain
-
[34]
Verify meanings independently using trusted dictionaries
-
[35]
Maintain consistency throughout the dataset
-
[36]
Add notes to explain unclear or borderline cases
-
[37]
slave/servant
Additional examples Example 1: True Cognate (Stage 1) [Input Variables] arabic_undiac hebrew_undiac arabic_meaning hebrew_meaning عبد עבד slave; servant; worship slave; to work [Expected Annotation] annotator_type note True Cognate Both share core meaning "slave/servant" from root ʕ-b-d Example 2: False Friend (Stage 1) [Input Variables] arabic_undiac heb...
-
[38]
Common Mistakes to Avoid
-
[39]
Over-reliance on form — Similar-looking words may be False Friends if meanings differ
-
[40]
Ignoring polysemy — One shared meaning is sufficient for True Cognate
-
[41]
Translation bias — Verify meanings in original languages
-
[42]
Conjugation differences are expected and should not be a factor in the annotation decision. Final Submission Checklist • All entries have annotator_type filled (Stage 1) • All entries have arabic_sentence_natural filled (Stage 2) • All entries have hebrew_sentence_natural filled (Stage 2) • Notes added for unclear or borderline cases • No entries skipped ...
2026
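The Stage 1 rule in the guidelines above (one shared meaning is sufficient for True Cognate) can be sketched as a set intersection over semicolon-separated glosses. This is a deliberate simplification under stated assumptions: exact string matching stands in for an annotator's semantic judgement, the common-root requirement is not checked, and the function names are hypothetical.

```python
def parse_meanings(gloss):
    """Split a semicolon-separated gloss string into a set of meanings."""
    return {part.strip().lower() for part in gloss.split(";") if part.strip()}


def stage1_label(arabic_gloss, hebrew_gloss):
    """Return (label, shared meanings) under the one-shared-meaning rule.

    Assumes the pair already shares a root or similar form, which a real
    annotator would verify separately.
    """
    shared = parse_meanings(arabic_gloss) & parse_meanings(hebrew_gloss)
    label = "True Cognate" if shared else "False Friend"
    return label, shared


# Glosses from Example 1 in the guidelines.
label, shared = stage1_label("slave; servant; worship", "slave; to work")
print(label, shared)

# Invented glosses with no overlap, for illustration only.
other_label, _ = stage1_label("to sit", "to return")
print(other_label)
```

Exact matching would miss near-synonymous glosses such as "servant" and "worker", which is one reason the guidelines require human verification against dictionaries.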