Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

Diaa M. Fayed; Laurent Romary

arxiv: 2606.18205 · v2 · pith:VOQW6FRRnew · submitted 2026-06-16 · 💻 cs.CL


Pith reviewed 2026-06-27 01:09 UTC · model grok-4.3

classification 💻 cs.CL

keywords Arabic dictionary · digitization · LMF · TEI Lex-0 · lexical encoding · information extraction · bilingual lexicon · LLOD


The pith

The Al-Mawrid Arabic-English dictionary is encoded into LMF and TEI Lex-0 with 91% structural parsing accuracy on a sample.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a methodology to convert a legacy print Arabic-English dictionary into a machine-readable lexicon by aligning two encoding standards. It examines the dictionary's structure and punctuation on a sample of entries and tests rules that extract synonyms and other morpho-semantic features. The work also proposes a prefix-based referencing system for linking the result into Linguistic Linked Open Data (LLOD). A sympathetic reader would care because it turns an old resource into one that computers can use directly and offers a process others can repeat for similar dictionaries.

Core claim

By applying an editorial view to the dictionary's macro- and microstructure, the study resolves structural ambiguities and punctuation inconsistencies. On the letter Ayn sample that makes up 4.6 percent of the volume, information extraction rules reach 91 percent structural parsing accuracy, 85 percent precision and 98 percent recall for synonyms, and 88 percent precision for other morpho-semantic features. The paper compares the result to existing Arabic resources, notes limits of TEI Lex-0 for certain Arabic phenomena, and introduces a prefix-based referencing system to support integration into Linguistic Linked Open Data.
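The synonym precision and recall can be folded into a single score. The following is a minimal sketch, assuming the standard harmonic-mean F1 definition; the abstract reports precision and recall only, so the F1 value is derived here for illustration and is not a figure from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Synonym extraction figures reported for the letter Ayn sample.
synonym_precision = 0.85
synonym_recall = 0.98

# Derived here for illustration; the paper itself does not report F1.
synonym_f1 = f1_score(synonym_precision, synonym_recall)
print(f"Synonym F1 on the sample: {synonym_f1:.3f}")
```

The score lands at roughly 0.91, pulled toward the lower of the two inputs, which is the usual behaviour of a harmonic mean.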

What carries the argument

The dual-standard encoding that aligns ISO LMF with TEI Lex-0 guidelines, backed by empirical analysis of lexical knowledge density and rule-based information extraction.

If this is right

The resulting resource is interoperable and machine-tractable for computational use.
A scalable prefix-based referencing system allows inclusion in the semantic web.
The workflow offers a reproducible method for retro-digitization of other complex bilingual lexicons.
The approach reveals specific limits of TEI Lex-0 in modeling implicit semantic relations and scattered morphological cues in Arabic.
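How a prefix-based identifier could be turned into a linkable URI can be sketched as follows. This is an illustration under stated assumptions, not the authors' scheme: the base URI is hypothetical, and the prefix meanings (`me` for a main entry, `se` for a sense) are inferred from identifiers visible in the figure excerpts (`xml:id="me-036984"` on a `mainEntry`, `n="se-01"` on a sense).

```python
import re

# Hypothetical base URI; the material shown here does not state one.
BASE_URI = "https://example.org/almawrid/"

# Assumed prefix meanings, inferred from the figure excerpts.
PREFIX_KINDS = {"me": "entry", "se": "sense"}

ID_PATTERN = re.compile(r"^(?P<prefix>[a-z]+)-(?P<number>\d+)$")


def parse_id(identifier: str) -> tuple[str, int]:
    """Split an identifier such as 'me-036984' into (prefix, number)."""
    match = ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a prefixed identifier: {identifier!r}")
    prefix = match["prefix"]
    if prefix not in PREFIX_KINDS:
        raise ValueError(f"unknown prefix: {prefix!r}")
    return prefix, int(match["number"])


def to_uri(identifier: str) -> str:
    """Map a prefixed identifier to a URI under the hypothetical base."""
    prefix, _ = parse_id(identifier)
    return f"{BASE_URI}{PREFIX_KINDS[prefix]}/{identifier}"


print(to_uri("me-036984"))
```

Keeping the original identifier intact in the URI path means the XML `xml:id` and the linked-data address stay trivially convertible in both directions.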

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the accuracy holds on the full dictionary, the encoded version could support new Arabic NLP tools that rely on synonym and feature data.
The prefix-based system could connect this lexicon to other lexical resources beyond those discussed in the paper.
Limitations noted for Arabic phenomena might prompt targeted extensions to TEI Lex-0 or LMF in future work.

Load-bearing premise

That the sample consisting of the letter Ayn is representative of the full dictionary and that the extraction rules will maintain similar accuracy when applied to the entire resource.

What would settle it

Running the same extraction rules on a second letter such as Ba and checking whether structural parsing accuracy stays near 91 percent and synonym precision near 85 percent.
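One way to judge whether a second letter "stays near" the reported rate is to compare confidence intervals for the two accuracies. The sketch below uses the Wilson score interval; all counts are hypothetical placeholders, since the page does not give the number of entries evaluated, and overlapping intervals are only a rough screen, not a formal significance test.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Hypothetical counts, chosen only to mirror a 91% rate; not from the paper.
ayn = wilson_interval(successes=910, total=1000)
ba = wilson_interval(successes=880, total=1000)

print("Ayn interval:", tuple(round(x, 3) for x in ayn))
print("Ba interval: ", tuple(round(x, 3) for x in ba))
print("Overlap:", intervals_overlap(ayn, ba))
```

With real per-letter counts in place of the placeholders, a second letter whose interval sits clearly below the Ayn interval would be evidence that the rules do not transfer at the same rate.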

Figures

Figures reproduced from arXiv: 2606.18205 by Diaa M. Fayed, Laurent Romary.

**Figure 2.** Simplest entry: single-word lemma and single-word translation. For a comprehensive collection of entry types, including multi-lemma headers and prepositional phrases, refer to Appendix A: Entry Representation and Lemma Modelling. Adjacent text, 4.1.1 Grammatical Mapping and Header Typology: A fundamental challenge in Al-Mawrid is the near-total lack of formal part-of-speech (POS) tags, which occur in less than 1% of entrie…

**Figure 3.** Domain label encoding. A complete registry of Arabic labels, abbreviations, and their functional structural roles is provided in Appendix B: Labels, Punctuation, and Semantic Marking. Adjacent text, 4.2.2 Punctuation and Interpretive Boundaries: Punctuation in Al-Mawrid is more than a typographic feature; it defines the functional relationship between linguistic objects. We employ the <pc> (punctuation) element to formalis…

**Figure 4.** Macrostructure, Microstructure and Referencing system.

**Figure 5.** A Multiword Expression (MWE) headword. For further variations, including transparent collocations and non-transparent idiomatic gemstone names, refer to Appendix D: Multiword Expressions. Adjacent text, 4.4.1 Typology and Identification in Al-Mawrid: To resolve the structural inconsistencies of the source material, we categorise MWEs based on their placement (macrostructural vs. microstructural) and semantic transparency.…

**Figure 6.** Cross-Reference phrase resolving multiple targets. Additional examples illustrating complete versus partial scoping and cross-references to idiomatic phrases are located in Appendix E: Cross-Referencing. Adjacent text, 4.5.1 Methodological Motivations for Encoding Strategies: To ensure structural rigour and interoperability, our encoding choices are governed by the specific constraints of the ISO-LMF and TEI Lex-0 framewo…

**Figure 7.** Unstructured parenthetical information. A systematic presentation of structured versus unstructured parenthetical data types is provided in Appendix F: Parenthesis Information. Adjacent text, 4.6.1 Typographic Classification and Logic: We distinguish between two types of brackets used in the source material: square brackets [...] and round brackets (...). This distinction is central to our encoding strategy: • Square Brackets…
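The bracket distinction described in the Figure 7 excerpt can be sketched as a first-pass segmentation step. This is a minimal illustration, not the authors' extraction rules: the excerpt is cut off before it states what each bracket type maps to, so the function only separates the two kinds, and the sample line is a hypothetical Latin-script placeholder.

```python
import re

# Non-nested square-bracket and round-bracket spans.
SQUARE = re.compile(r"\[([^\[\]]*)\]")
ROUND = re.compile(r"\(([^()]*)\)")


def split_delimited(entry_text: str) -> dict[str, list[str]]:
    """Collect the contents of square and round brackets in an entry line."""
    return {
        "square": [s.strip() for s in SQUARE.findall(entry_text)],
        "round": [s.strip() for s in ROUND.findall(entry_text)],
    }


# Hypothetical entry line; real entries are Arabic and more varied.
sample = "headword [med.] (plural form) translation"
print(split_delimited(sample))
```

Keeping the two bracket types in separate lists lets later rules treat them differently, which is the point the excerpt makes about the distinction being central to the encoding.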

**Figure 8.** (Partial) Definitional Synonyms. Methodological proofs for header-level synonyms and explicitly marked equivalence are detailed in Appendix G: Synonymy.

**Figure 9.** Homonymic nesting for fine-grained analysis. The full XML serialisation of complex homonymic entries is illustrated in Appendix I: Homonym Modelling.

**Figure 10.** Functional separation of Illustrative Usage.

**Figure 11.** Hierarchical Translation Equivalence. Additional tiered translation models and free-text equivalents are available in Appendix K: Translation Equivalence and Semantic Granularity. Adjacent text, 4.11.1 Structural Modelling of Equivalents: To maintain interoperability within the TEI Lex-0 framework, we employ the <cit> (cited quotation) element as the formal container for all target-language data. This decision is motivat…
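The use of <cit> as the container for target-language data can be sketched with Python's standard library. This is an illustration under assumptions: the `translationEquivalent` type value and the `<form><orth>` nesting follow a common TEI Lex-0 pattern rather than the paper's exact serialisation (which is truncated in the excerpt), the TEI namespace is omitted for brevity, and the identifier, lemma and translation are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_entry(entry_id: str, lemma: str, translation: str) -> ET.Element:
    """Build a minimal entry whose translation sits inside a <cit> element."""
    entry = ET.Element(
        "entry",
        {f"{{{XML_NS}}}id": entry_id, "type": "mainEntry", f"{{{XML_NS}}}lang": "ar"},
    )
    form = ET.SubElement(entry, "form", {"type": "lemma"})
    ET.SubElement(form, "orth").text = lemma

    sense = ET.SubElement(entry, "sense")
    # Assumed TEI Lex-0 pattern: the equivalent is wrapped in <cit>.
    cit = ET.SubElement(
        sense, "cit", {"type": "translationEquivalent", f"{{{XML_NS}}}lang": "en"}
    )
    cit_form = ET.SubElement(cit, "form")
    ET.SubElement(cit_form, "orth").text = translation
    return entry


# Hypothetical identifier and word pair, for illustration only.
entry = build_entry("me-000001", "بيت", "house")
print(ET.tostring(entry, encoding="unicode"))
```

Wrapping every equivalent in the same container means a consumer can collect all target-language data with a single path expression such as `./sense/cit/form/orth`.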

**Figure 12.** A multiword expression header.

**Figure 13.** A multi-lemma header.

**Figure 14.** A multi-lemma header. Entry, Example 5: The sub-entry header (صَرْحَة، آخِرُ صَرْحَة) is composed of the lemma and a phrase of similar meaning. In addition, when there is more than one defining word or phrase, we encode each defining phrase in a <gloss> element.

**Figure 15.** A multi-lemma or a multi-phrase header. Entry, Example 6: The sub-entry header (صَرَّحَ بِـ or صَرَّحَ عَن) is composed of the lemma and a preposition. The headword (صَرَّحَ) is collocated with the prepositions (بِـ) and (عَن). The parenthesis information is encoded as a <gloss> element.

**Figure 16.** A prepositional phrase header.

**Figure 17.** Macrostructure, Microstructure and Referencing system.

**Figure 18.** An MWE lemma. MWEs, Example 2: The lemma of the entry is an MWE of two words. The header contains a semantic field or domain label. Full/Partial Entry: عَمَهٌ حَرَكِيّ [طب] apraxia. LMF/TEI Lex-0 Encoding: <entry xml:id="me-036984" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>عَمَهٌ حَرَكِيّ</orth></form> <gramGrp resp="#DiaaFayed #LaurentRomary"> <gram type="mwe" value="NNJJ"></gra…

**Figure 19.** An MWE lemma with Semantic Field Label.

**Figure 20.** A transparent MWE. MWEs, Example 4: The two subentries are MWEs. The meaning of the collocation can be directly derived from its individual words. These are called transparent polylexical units. Full/Partial Entry: … عَيْنِيّ: ذُو عَلاقَةٍ بِالعَيْن; حَقٌّ عَيْنِيّ، حُقُوقٌ عَيْنِيَّة right(s) in rem; مُساهَمَةٌ عَيْنِيَّة contributions in kind. LMF/T…

**Figure 21.** A transparent MWE. MWEs, Example 5: transparent polylexical unit.

**Figure 22.** LMF/TEI-Lex0 code of a transparent MWE with semantic field label. MWEs, Example 6: Encoding of a transparent MWE and semantic field.

**Figure 23.** LMF/TEI-Lex0 code of a transparent MWE with semantic field label. MWEs, Example 7: Encoding a non-transparent sense-related MWE.

**Figure 24.** LMF/TEI-Lex0 code of non-transparent sense-related MWEs. MWEs, Example 8: Encoding a non-transparent sense-related MWE.

**Figure 25.** LMF/TEI-Lex0 code of non-transparent sense-related MWEs. MWEs, Example 9: Encoding a lexicographically non-transparent entry-related MWE.

**Figure 26.** LMF/TEI-Lex0 code of non-transparent entry-related MWEs. MWEs, Example 10: Encoding a lexicographically non-transparent entry-related MWE.

**Figure 27.** LMF/TEI-Lex0 code of non-transparent entry-related MWEs. Adjacent text, Appendix E: 4.5 Cross-Referencing: Resolving Implicit Links, 4.5.1 Example 1: Cross-Reference Phrase without Targets. ● The dash symbol "ـ" is encoded using the <pc> (punctuation) element. ● The keyphrase used for the cross-reference, such as "راجعها في أماكنها", is encoded using the <lbl> (label) element. ● The keyphrase "راجعها في أماكنها" is not f…

**Figure 28.** Cross-Reference phrase without targets.

**Figure 29.** Cross-reference phrase for more than one referent. Adjacent text, 4.5.3 Example 3: Cross-Reference to a Multiword Expression (MWE). ● In this example, the keyword "راجع" refers to the phrase "على أَثَرِ كَذا", which is treated as a multiword expression (MWE) and appears as a subentry under the headword "أَثَر". ● Occasionally, the author assists users by including the headword of the referred entry in parentheses, e.g…

**Figure 30.** Cross-Reference to a Multiword Expression (MWE). Adjacent text, 4.5.4 Example 4: Complete Cross-Reference.

**Figure 31.** Complete Cross-Reference (refers to all the meanings). Adjacent text, 4.5.5 Example 5: Partial Cross-Reference.

**Figure 32.** Partial Cross-reference (only one meaning). Adjacent text, Appendix F: 4.6 Parenthesis Information: Resolving Structural Ambiguity. Parenthesis Information, Example 1.

**Figure 33.** Code of unstructured parenthetical information.

**Figure 34.** Code of unstructured parenthetical information.

**Figure 35.** Code of unstructured parenthetical information.

**Figure 36.** Code of unstructured parenthetical information.

**Figure 37.** Code of unstructured parenthetical information.

**Figure 38.** Code of unstructured parenthetical information.

**Figure 39.** Code of unstructured parenthetical information.

**Figure 40.** Syntactic and Semantic Knowledge. Example 2. Full/Partial Entry: عَجائِب (مفردها عَجِيبَة) - راجع عَجِيبَة. LMF/TEI Lex-0 Encoding: <entry xml:id="me-035317" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>عَجائِب</orth></form><form type="inflected"> <pc>(</pc><lbl>مفردها</lbl><orth>عَجِيبَة</orth><pc>)</pc> <gramGrp><gram type="number" value="s…

**Figure 41.** Syntactic and Semantic Knowledge.

**Figure 42.** Syntactic and Semantic Knowledge. Example 4. Full/Partial Entry: وَسَط (أَوْساط): دائِرَة (دَوائِر)، مَحْفِل (مَحافِل). LMF/TEI Lex-0 Encoding: <entry xml:id="me-057506" type="mainEntry" xml:lang="ar"><form type="lemma"><orth>وَسَط</orth></form><form type="inflected"><pc>(</pc><orth>أَوْساط</orth><pc>)</pc> <gramGrp resp="#DF #LR"><gram type="number" value="pl…

**Figure 43.** Syntactic and Semantic Knowledge. Example 5. Full/Partial Entry: راهَقَ: صارَ مُراهِقاً. LMF/TEI Lex-0 Encoding: <entry xml:id="me-26200" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>راهَقَ</orth> <gramGrp resp="#DF #LR"><gram type="pos" value="V"/></gramGrp></form> <form type="derivative"><lbl>صارَ</lbl><orth>مُراهِقاً</orth> <gramGrp resp="#DF #LR"…

**Figure 44.** Syntactic and Semantic Knowledge. Example 6. Full/Partial Entry: خِلْطِيّ: مَنْسُوبٌ إلى الخِلْط. LMF/TEI Lex-0 Encoding: <entry xml:id="me-23559" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>خِلْطِيّ</orth> <gramGrp resp="#DF #LR"><gram type="pos" value="ADJ"/></gramGrp></form> <form type="derivative"><lbl>مَنْسُوبٌ إلى</lbl><orth>الخِلْط</orth…

**Figure 45.** Syntactic and Semantic Knowledge. Example 7. LMF/TEI Lex-0 Encoding: <entry xml:id="me-000777" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>…</orth></form>

**Figure 46.** Syntactic and Semantic Knowledge. Example 8. Full/Partial Entry: ساوَقَ: ناغَمَ، جَعَلَهُ مُتَساوِقاً. LMF/TEI Lex-0 Encoding: <entry xml:id="me-028719" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>ساوَقَ</orth><gramGrp resp="#DF #LR"> <gram type="pos" value="V"/></gramGrp></form><form type="derived"><lbl>جَعَلَهُ</lbl><orth>…

**Figure 47.** Syntactic and Semantic Knowledge. Example 9. Full/Partial Entry: مُبَلِّط: فاعِل بَلَّطَ. LMF/TEI Lex-0 Encoding: <entry xml:id="me-045287" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>مُبَلِّط</orth></form> <gramGrp resp="#DF #LR"><gram type="pos" value="N"/></gramGrp> <form type="derived"><gramGrp resp="#DF #LR"><gram type="pos" value="V"/></gra…

**Figure 48.** Syntactic and Semantic Knowledge. Example 10. Full/Partial Entry: اِبْتِغاء: مَصْدَر اِبْتَغَى. LMF/TEI Lex-0 Encoding: <entry xml:id="me-000197" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>اِبْتِغاء</orth></form> <gramGrp resp="#DF #LR"><gram type="pos" value="N"/></gramGrp> <form type="derived"><gramGrp resp="#DF #LR"><gram type="pos" value="V"/></…

**Figure 49.** Syntactic and Semantic Knowledge.

**Figure 50.** Syntactic and Semantic Knowledge.

**Figure 51.** Syntactic and Semantic Knowledge.

**Figure 52.** Syntactic and Semantic Knowledge. Example 14. LMF/TEI Lex-0 Encoding: <entry xml:id="me-034962" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>…</orth></form> <sense n="se-01"><pc>:</pc><def>طائِفَةٌ مِنَ النَّباتات</def><xr type="hypernymy"><lbl>من …

**Figure 53.** Syntactic and Semantic Knowledge.

**Figure 54.** Syntactic and Semantic Knowledge. Example 17. Full/Partial Entry: زُعْرُور (نبات). LMF/TEI Lex-0 Encoding: <entry type="mainEntry"><form type="lemma"><orth>زُعْرُور</orth></form><xr type="hypernymy"><ref type="entry"><gloss>(نبات)</gloss></ref></xr></entry>

**Figure 55.** Syntactic and Semantic Knowledge.

**Figure 56.** Examples and Context Words. LMF/TEI Lex-0 Encoding: <entry xml:id="me-056491" type="mainEntry" xml:lang="ar"> <form type="lemma"><orth>…</orth></form> <cit type="example"><quote><pc>(</pc>…<pc>)</pc></quote></cit></entry>

**Figure 58.** Translations.

Original abstract

This paper presents a robust methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary, transforming it from a legacy print resource into a standardized computational lexicon. Addressing a significant gap in Arabic lexical infrastructure, the study adopts a dual-standard framing that aligns the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative TEI Lex-0 guidelines. By applying an editorial view to the dictionary's macro- and microstructure, the research resolves the structural ambiguities and punctuation inconsistencies typical of 20th-century bilingual dictionaries. The methodology is grounded in an empirical analysis of the dictionary's lexical knowledge density. Drawing on a representative sample (the letter Ayn, comprising 4.6% of the total volume), the study provides scientific weight to the encoding process, demonstrating a structural parsing accuracy of 91%. Quantitative evaluation of the information extraction rules reveals high performance, with 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. Beyond technical description, the paper provides a critical comparison with existing Arabic lexical resources and discusses the limitations of TEI Lex-0 when modelling specific Arabic phenomena, such as implicit "open set" semantic relations and scattered morphological cues. Furthermore, the study explores the potential for Linguistic Linked Open Data (LLOD) integration by establishing a scalable prefix-based referencing system that facilitates the resource's inclusion in the semantic web. The result is an interoperable, machine-tractable resource that provides a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons within the Arabic NLP and Digital Humanities communities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper encodes Al-Mawrid into LMF and TEI Lex-0 on a 4.6% sample from one letter and reports extraction numbers, but the sample may not represent the full dictionary.


The main takeaway is a practical encoding of the Al-Mawrid Arabic-English dictionary into ISO LMF and TEI Lex-0. The authors describe an editorial view of the macro- and microstructure, resolve punctuation issues common in older bilingual works, and test information extraction rules on the letter Ayn.

What is new is the specific workflow for this resource plus the prefix-based referencing system they propose for LLOD integration. They also compare the result to existing Arabic lexicons and flag places where TEI Lex-0 falls short on implicit semantic relations and scattered morphological cues in Arabic. That limitations section is useful and shows they are not overselling the standards.

The evaluation numbers (91% structural parsing, 85% precision/98% recall on synonyms, 88% on other features) come from a single contiguous sample that is 4.6% of the volume. The paper calls the slice representative on volume percentage alone. Arabic root-based dictionaries often differ by letter in entry density and cue patterns, so the figures may not transfer at the same rates to the rest of the book. The abstract gives no variance checks or details on how accuracy was measured.

This is for Arabic NLP and digital humanities groups that need machine-readable bilingual lexicons. Readers working on retro-digitization of legacy dictionaries will get a reproducible workflow they can adapt.

Send it for peer review. The encoded resource and the workflow are the core value; referees can address the sample limitation without undermining the contribution.

Referee Report

1 major / 0 minor

Summary. The paper presents a methodology for the systematic digitization and encoding of the Al-Mawrid Arabic-English dictionary using the ISO Lexical Markup Framework (LMF) aligned with TEI Lex-0 guidelines. It resolves structural ambiguities in the print resource via an editorial view of macro- and microstructure, evaluates information extraction rules on a sample from the letter Ayn (4.6% of volume) reporting 91% structural parsing accuracy, 85% precision/98% recall for synonyms and 88% precision for morpho-semantic features, compares the result to existing Arabic lexical resources, discusses TEI Lex-0 limitations for Arabic phenomena such as implicit semantic relations, and proposes a prefix-based system for Linguistic Linked Open Data (LLOD) integration.

Significance. If the encoding workflow proves robust and the performance metrics generalize, the work supplies a much-needed interoperable computational lexicon for Arabic NLP, fills a documented gap in standardized Arabic lexical infrastructure, and offers a reproducible retro-digitization pipeline for other complex legacy bilingual dictionaries in the Digital Humanities.

major comments (1)

[Evaluation / quantitative results (abstract and corresponding section)] The central empirical claims rest on a single contiguous sample from the letter Ayn. The manuscript asserts representativeness solely on the basis of volume share (4.6%) without reporting any cross-letter statistics on entry complexity, punctuation density, frequency of implicit relations, or morphological cue distribution. Because Arabic root-based dictionaries commonly exhibit letter-specific microstructural variation, the reported 91% parsing accuracy and extraction metrics (85% precision/98% recall for synonyms; 88% precision for other features) cannot be taken as evidence that the rules will transfer at the claimed rates to the full dictionary.
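The cross-letter statistics the comment asks for could be gathered with a simple per-letter profile. This is a minimal sketch of one possible measurement, not something the paper reports: the punctuation set, the chosen statistics, and the toy data are all assumptions made for illustration.

```python
from statistics import mean

# Assumed set of structurally relevant marks (Latin and Arabic punctuation).
PUNCTUATION = set(":;,،؛()[]-")


def punctuation_density(entry_text: str) -> float:
    """Share of characters in an entry that are punctuation marks."""
    if not entry_text:
        return 0.0
    return sum(ch in PUNCTUATION for ch in entry_text) / len(entry_text)


def per_letter_profile(entries_by_letter: dict[str, list[str]]) -> dict[str, dict[str, float]]:
    """Summarise entry count, mean length and mean punctuation density per letter."""
    profile = {}
    for letter, entries in entries_by_letter.items():
        if not entries:
            continue  # skip letters with no sampled entries
        profile[letter] = {
            "entries": len(entries),
            "mean_length": mean(len(e) for e in entries),
            "mean_punct_density": mean(punctuation_density(e) for e in entries),
        }
    return profile


# Toy placeholder data; real input would be raw entry strings grouped by letter.
toy = {"ayn": ["a: b, c", "d (e)"], "ba": ["f"]}
profile = per_letter_profile(toy)
print(profile)
```

Comparing such profiles across letters would show whether Ayn is typical in entry length and punctuation load before any claim of representativeness is made.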

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address the major comment on the evaluation sample below.


Referee: The central empirical claims rest on a single contiguous sample from the letter Ayn. The manuscript asserts representativeness solely on the basis of volume share (4.6%) without reporting any cross-letter statistics on entry complexity, punctuation density, frequency of implicit relations, or morphological cue distribution. Because Arabic root-based dictionaries commonly exhibit letter-specific microstructural variation, the reported 91% parsing accuracy and extraction metrics (85% precision/98% recall for synonyms; 88% precision for other features) cannot be taken as evidence that the rules will transfer at the claimed rates to the full dictionary.

Authors: We acknowledge the validity of this observation. Our sample selection was based on the letter's proportional volume and its inclusion of varied entry structures, but we did not conduct or report cross-letter analyses. Consequently, the performance figures should be viewed as preliminary indicators of the method's viability rather than validated generalization rates. In the revised manuscript, we will reword the abstract, introduction, and evaluation sections to delimit the scope of the reported metrics more clearly, add a limitations subsection discussing potential letter-specific variation in Arabic dictionaries, and propose cross-letter validation as future work. This revision addresses the concern without altering the described methodology or the value of the encoding approach for the sampled data.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical encoding and evaluation with no derivations or self-referential predictions


The paper reports an applied digitization workflow for a legacy dictionary, using LMF/TEI standards and direct quantitative evaluation (91% parsing accuracy, precision/recall on synonyms and features) on a single contiguous sample (letter Ayn). No equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear in the described methodology. Performance figures are straightforward measurements on the chosen sample rather than outputs forced by construction from inputs. Representativeness of the 4.6% sample is an external validity assumption, not a circular reduction. The work is self-contained as an empirical project.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the suitability of existing ISO and TEI standards for this dictionary without introducing new free parameters or invented entities.

axioms (1)

Domain assumption: ISO LMF and TEI Lex-0 frameworks are appropriate for modeling the macro- and microstructure of bilingual Arabic-English dictionaries. The methodology is grounded in these standards as the basis for resolving structural ambiguities.


