arxiv: 2605.21227 · v1 · pith:EZGAHWMBnew · submitted 2026-05-20 · 💻 cs.CL

Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

Pith reviewed 2026-05-21 04:28 UTC · model grok-4.3

classification 💻 cs.CL
keywords lexical borrowing · Luxembourgish · large language models · knowledge graphs · low-resource languages · neology detection · prompt engineering

The pith

Knowledge-graph prompts raise LLM borrowing classification accuracy in Luxembourgish from 25-35% to 71-81%

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well large language models handle lexical borrowing and neology in Luxembourgish, a low-resource contact language. It creates a benchmark from news corpus data labeling words as native or borrowed from French, German, or English. Models struggle without help, performing only slightly above chance. Adding a knowledge graph with details on donor languages and patterns into the prompts dramatically improves results, especially for smaller models. This approach shows promise for making LLMs respect community norms on word use in multilingual settings.

Core claim

The central claim is that without external context, multilingual LLMs perform only slightly above chance on classifying borrowings in Luxembourgish. Constructing a linguistic knowledge graph encoding donor language, morphological patterns, and lexical analogues, and injecting instance-specific subgraphs into prompts, raises accuracy from 25-35% to 71-81%. This largely closes the gap between small and large models, though detecting lexical innovation remains difficult and sensitive to few-shot examples. The results indicate that lexicon-aware prompting aids robust borrowing judgments in low-resource contact languages.

What carries the argument

The linguistic knowledge graph encoding donor language, morphological patterns, and lexical analogues, with instance-specific subgraphs injected into prompts to provide structured context for classification.
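
A minimal sketch of how an instance-specific subgraph might be serialized into a classification prompt. The triple format, the relation names (`donor_language`, `morph_pattern`, `analogue`), the prompt wording, and the example sentence are illustrative assumptions, not the paper's template. `NATIVE` and `EN_LOAN` appear in the paper's figures; `FR_LOAN` and `DE_LOAN` are assumed by analogy.

```python
from dataclasses import dataclass

# NATIVE and EN_LOAN are label names from the paper's figures;
# FR_LOAN and DE_LOAN are assumed by analogy.
LABELS = ["NATIVE", "FR_LOAN", "DE_LOAN", "EN_LOAN"]


@dataclass(frozen=True)
class Triple:
    head: str
    relation: str  # hypothetical relation name
    tail: str


def serialize_subgraph(triples):
    """Render each triple as one 'head --relation--> tail' line."""
    return "\n".join(f"{t.head} --{t.relation}--> {t.tail}" for t in triples)


def build_prompt(sentence, token, triples):
    """Assemble a prompt; an empty triple list gives the zero-shot condition."""
    parts = [
        "Classify the target token in the Luxembourgish sentence.",
        f"Allowed labels: {', '.join(LABELS)}",
    ]
    if triples:
        parts.append("Knowledge graph context:\n" + serialize_subgraph(triples))
    parts.append(f"Sentence: {sentence}")
    parts.append(f"Target token: {token}")
    parts.append("Label:")
    return "\n".join(parts)


# Hypothetical instance, not taken from the benchmark.
example_triples = [
    Triple("downloaden", "donor_language", "English"),
    Triple("downloaden", "morph_pattern", "verb stem + -en"),
    Triple("downloaden", "analogue", "download (en)"),
]
prompt = build_prompt("Ech wëll de Fichier downloaden.", "downloaden", example_triples)
print(prompt)
```

Keeping the zero-shot and KG conditions in one function means the only difference between them is the context block, which is what makes the accuracy comparison attributable to the injected subgraph.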

Load-bearing premise

The manual or semi-automatic labels in the benchmark correctly reflect community norms for distinguishing borrowings from native forms.

What would settle it

A new evaluation on Luxembourgish text with independently verified labels from native speakers or linguists would reveal whether the reported accuracy improvements hold or if they depend on the specific labeling process.

Figures

Figures reproduced from arXiv: 2605.21227 by Nina Hosseini-Kivanani.

Figure 1. Overall evaluation pipeline, from LuxBorrow-derived benchmark construction and LKG retrieval to prompt assembly and multilingual LLM evaluation.
Figure 2. (a) Per-class F1 for zero-shot and KG-graph by model (GE12, GE27, and LL70 denote Gemma 3 12B, Gemma 3 27B, and Llama 3.3 70B, respectively). (b) Top donor confusion pairs. (c) False-positive rates on NATIVE: proportion of NATIVE tokens predicted as loans. From the accompanying text: KG-graph prompting reduces the EN_LOAN rate by 78–87% for the Gemma models (from 32.8% to 4.2% for Gemma 12B, and from 1…
Figure 3. Effect of removing individual components from the KG-graph prompt. The full KG-graph condition reaches 81.0%, 71.4%, and 71.3% accuracy for Gemma 3 12B, Gemma 3 27B, and Llama 3.3 70B, respectively. Removing etymological information (No Etymology) reduces accuracy to 78.9% for Gemma 3 12B, 58.4% for Gemma 3 27B, and 69.2% for Llama 3.3 70B. Dropping analogical examples (No Analogues) has a…
Figure 4. Supplementary visualization of KG gain ∆KG by model size on a log-scaled x-axis.
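
The ablation figures quoted in the Figure 3 caption can be turned into per-model drops with a few lines of arithmetic. This is a sketch for reading the numbers, not the paper's analysis code; the dictionary layout and function name are assumptions, while the accuracies are the ones stated in the caption.

```python
# Accuracies (%) as quoted in the Figure 3 caption.
full_kg = {"Gemma 3 12B": 81.0, "Gemma 3 27B": 71.4, "Llama 3.3 70B": 71.3}
no_etymology = {"Gemma 3 12B": 78.9, "Gemma 3 27B": 58.4, "Llama 3.3 70B": 69.2}


def drop_in_points(full, ablated):
    """Accuracy lost (percentage points) when a prompt component is removed."""
    return {model: round(full[model] - ablated[model], 1) for model in full}


etymology_drop = drop_in_points(full_kg, no_etymology)
for model, drop in etymology_drop.items():
    print(f"{model}: -{drop:.1f} points without etymology")
```

On these numbers the etymology component matters most for Gemma 3 27B (13.0 points), against 2.1 points for each of the other two models.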
Original abstract

Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3,050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25-35% up to 71-81% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.
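
To make the relation between the two tasks concrete, the sketch below collapses four-way labels into the binary borrowing-versus-native proxy and scores both views. It is one plausible reading only: the paper may prompt the binary task separately rather than derive it from four-way predictions. The label strings `FR_LOAN` and `DE_LOAN` and the toy gold/predicted lists are assumptions.

```python
# FR_LOAN and DE_LOAN are assumed label names; EN_LOAN and NATIVE appear in the paper.
LOAN_LABELS = {"FR_LOAN", "DE_LOAN", "EN_LOAN"}


def to_binary(label):
    """Collapse a donor-specific label into BORROWING vs NATIVE."""
    return "BORROWING" if label in LOAN_LABELS else "NATIVE"


def accuracy(gold, pred):
    """Fraction of positions where prediction equals gold."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy data for illustration, not benchmark instances.
gold = ["NATIVE", "FR_LOAN", "EN_LOAN", "DE_LOAN"]
pred = ["NATIVE", "DE_LOAN", "EN_LOAN", "NATIVE"]

four_way = accuracy(gold, pred)
binary = accuracy([to_binary(g) for g in gold], [to_binary(p) for p in pred])
print(f"four-way: {four_way:.2f}, binary: {binary:.2f}")
```

A donor confusion (French predicted as German) costs accuracy on the four-way task but not on the binary one, so the two scores can diverge on the same predictions.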

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LexNeo-Bench, a 3,050-token benchmark derived from the LuxBorrow Luxembourgish news corpus, with tokens labeled as native or as borrowings from French, German, or English. It evaluates three multilingual LLMs across 34 prompt settings on borrowing-type classification and a binary neology proxy. Without external context, models perform near chance (25-35%); injecting instance-specific subgraphs from a constructed linguistic knowledge graph (encoding donor language, morphology, and analogues) raises borrowing classification accuracy to 71-81% and largely closes the gap between small and large models, while neology detection remains difficult and few-shot sensitive. The work argues that lexicon-aware prompting benefits low-resource contact languages and that lexical resources provide useful structured context for LLM evaluation.

Significance. If the benchmark labels accurately capture Luxembourgish community norms, the results provide concrete evidence that structured linguistic knowledge can be injected via prompts to improve model respect for lexical borrowing conventions in low-resource settings. The large, consistent accuracy gains (roughly doubling performance) and the closing of model-size gaps are noteworthy empirical findings for multilingual and low-resource NLP. The introduction of LexNeo-Bench itself is a useful contribution for future work on lexical innovation in contact languages.

major comments (2)
  1. [§3] §3 (LexNeo-Bench construction): the paper states that the 3,050 labels were obtained from LuxBorrow via manual or semi-automatic annotation, yet provides no inter-annotator agreement figures, annotation guidelines, or resolution procedure for ambiguous cases. Because the central claim equates the jump from 25-35% to 71-81% accuracy with genuine linguistic knowledge, the absence of these details leaves open the possibility that measured gains reflect alignment to annotation artifacts rather than community norms.
  2. [§4.1] §4.1 (knowledge-graph construction): the subgraphs injected into prompts are built from related lexical resources; the manuscript does not state whether these resources overlap with those underlying the LuxBorrow labels or the annotation process. Any such overlap would create a circularity risk that undermines the interpretation of the 71-81% figures as evidence of model improvement independent of the evaluation target.
minor comments (2)
  1. [Abstract / §4] The abstract and §4 mention 34 prompt settings but do not enumerate the main axes of variation (e.g., number of few-shot examples, exact KG injection format, or presence/absence of donor-language hints). A short table or bullet list would improve reproducibility.
  2. [Results section] Table or figure captions for the accuracy results should explicitly state the random baseline and the number of runs or seeds used to compute the reported percentages.
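
The reporting requested in the second minor comment could look like the sketch below, which states the chance level next to a mean and standard deviation over runs. The run accuracies are placeholder values, not results from the paper, and the four-label setup (native plus three donor languages) follows the benchmark description.

```python
from statistics import mean, stdev

NUM_LABELS = 4  # native, French, German, English
chance = 100.0 / NUM_LABELS  # uniform-guess accuracy in percent


def summarize(run_accuracies):
    """Mean, sample standard deviation, run count, and margin over chance."""
    m = mean(run_accuracies)
    s = stdev(run_accuracies) if len(run_accuracies) > 1 else 0.0
    return {
        "mean": m,
        "std": s,
        "n_runs": len(run_accuracies),
        "above_chance": m - chance,
    }


# Placeholder accuracies (%) for three seeds; NOT values from the paper.
hypothetical_runs = [70.0, 72.0, 71.0]
summary = summarize(hypothetical_runs)
print(summary)
```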

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We respond to each major comment in turn and indicate the revisions we have made or plan to make in the updated manuscript.

Point-by-point responses
  1. Referee: [§3] §3 (LexNeo-Bench construction): the paper states that the 3,050 labels were obtained from LuxBorrow via manual or semi-automatic annotation, yet provides no inter-annotator agreement figures, annotation guidelines, or resolution procedure for ambiguous cases. Because the central claim equates the jump from 25-35% to 71-81% accuracy with genuine linguistic knowledge, the absence of these details leaves open the possibility that measured gains reflect alignment to annotation artifacts rather than community norms.

    Authors: We agree that the lack of reported inter-annotator agreement and detailed guidelines is a limitation in the current manuscript. The labels were produced using a semi-automatic approach based on existing etymological resources for Luxembourgish, with manual adjudication for uncertain cases by experts. However, formal IAA was not calculated. In the revision, we will expand §3 to include the annotation guidelines and the procedure for resolving ambiguities. We will also note the absence of IAA as a limitation of the benchmark. This addresses the concern about potential annotation artifacts by providing more context on how the labels were derived. revision: partial

  2. Referee: [§4.1] §4.1 (knowledge-graph construction): the subgraphs injected into prompts are built from related lexical resources; the manuscript does not state whether these resources overlap with those underlying the LuxBorrow labels or the annotation process. Any such overlap would create a circularity risk that undermines the interpretation of the 71-81% figures as evidence of model improvement independent of the evaluation target.

    Authors: The referee correctly identifies that the manuscript does not explicitly address potential overlap. We can confirm that the knowledge graph draws on distinct lexical resources, including independent morphological and etymological databases not involved in the LuxBorrow annotation. To prevent any misinterpretation, we will revise the description in §4.1 to specify the sources used and state their independence from the benchmark construction. This clarification will support the interpretation of the results as demonstrating the value of injecting structured linguistic knowledge. revision: yes
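
The independence claim in the second response could be checked mechanically if provenance were recorded per instance. The sketch below assumes a hypothetical schema in which every benchmark instance names the resource that produced its label and the resources behind its injected triples; the field names and resource names are invented for illustration.

```python
def leakage_rate(instances):
    """Share of instances whose subgraph draws on the resource that produced the label."""
    if not instances:
        return 0.0
    flagged = sum(
        1 for inst in instances if inst["label_source"] in inst["triple_sources"]
    )
    return flagged / len(instances)


# Hypothetical provenance records; resource names are placeholders.
instances = [
    {"token": "tok_a", "label_source": "resource_A",
     "triple_sources": {"resource_B", "resource_C"}},
    {"token": "tok_b", "label_source": "resource_A",
     "triple_sources": {"resource_A", "resource_B"}},
]
rate = leakage_rate(instances)
print(f"instances sharing a source with their label: {rate:.0%}")
```

A rate of zero would support the authors' statement; any nonzero rate identifies the specific instances on which the circularity concern applies.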

standing simulated objections not resolved
  • Inter-annotator agreement figures for the LexNeo-Bench labels, which were not computed during the original annotation process.
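
For reference, the unresolved agreement figure could be computed with Cohen's kappa once a second annotation pass exists. This is a generic sketch of the standard two-annotator statistic, not a procedure described in the paper; the annotation lists are toy data and the label strings other than `NATIVE` and `EN_LOAN` are assumed.

```python
from collections import Counter


def cohens_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and equally long")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Agreement expected if both annotators labelled independently
    # according to their own label frequencies.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(counts_a) | set(counts_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Toy annotations for illustration only.
annotator_1 = ["NATIVE", "FR_LOAN", "EN_LOAN", "DE_LOAN", "NATIVE", "FR_LOAN"]
annotator_2 = ["NATIVE", "FR_LOAN", "EN_LOAN", "NATIVE", "NATIVE", "DE_LOAN"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```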

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain or definitional reduction

full rationale

This is an empirical probing paper that reports measured accuracies on a held-out token benchmark (LexNeo-Bench) under different prompt conditions. No equations, derivations, or first-principles claims exist that could reduce a 'prediction' to its own inputs by construction. The KG injection and label construction are methodological choices whose validity is an external assumption, but the reported accuracy deltas (25-35% to 71-81%) are observed outcomes on independent test instances rather than quantities defined inside the paper. Self-citation load-bearing and ansatz smuggling patterns are absent. The work is therefore self-contained as standard benchmarking against an external corpus-derived gold standard.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the correctness of the benchmark labels and the assumption that the constructed knowledge graph faithfully encodes relevant linguistic patterns; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption: The LuxBorrow corpus provides representative samples of contemporary Luxembourgish usage for labeling borrowings.
    The benchmark is derived from this corpus; if the corpus is biased toward certain registers or time periods, the results may not generalize.

pith-pipeline@v0.9.0 · 5771 in / 1318 out tokens · 40040 ms · 2026-05-21T04:28:34.260935+00:00 · methodology
