Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

Nina Hosseini-Kivanani

arxiv: 2605.21227 · v1 · pith:EZGAHWMBnew · submitted 2026-05-20 · 💻 cs.CL

Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

Nina Hosseini-Kivanani This is my paper

Pith reviewed 2026-05-21 04:28 UTC · model grok-4.3

classification 💻 cs.CL

keywords lexical borrowingLuxembourgishlarge language modelsknowledge graphslow-resource languagesneology detectionprompt engineering

0 comments

The pith

Knowledge-graph prompts raise LLM borrowing classification accuracy in Luxembourgish from 25-35% to 71-81%

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how well large language models handle lexical borrowing and neology in Luxembourgish, a low-resource contact language. It creates a benchmark from news corpus data labeling words as native or borrowed from French, German, or English. Models struggle without help, performing only slightly above chance. Adding a knowledge graph with details on donor languages and patterns into the prompts dramatically improves results, especially for smaller models. This approach shows promise for making LLMs respect community norms on word use in multilingual settings.

Core claim

The central claim is that without external context, multilingual LLMs perform only slightly above chance on classifying borrowings in Luxembourgish. Constructing a linguistic knowledge graph encoding donor language, morphological patterns, and lexical analogues, and injecting instance-specific subgraphs into prompts, raises accuracy from 25-35% to 71-81%. This largely closes the gap between small and large models, though detecting lexical innovation remains difficult and sensitive to few-shot examples. The results indicate that lexicon-aware prompting aids robust borrowing judgments in low-resource contact languages.

What carries the argument

The linguistic knowledge graph encoding donor language, morphological patterns, and lexical analogues, with instance-specific subgraphs injected into prompts to provide structured context for classification.

Load-bearing premise

The manual or semi-automatic labels in the benchmark correctly reflect community norms for distinguishing borrowings from native forms.

What would settle it

A new evaluation on Luxembourgish text with independently verified labels from native speakers or linguists would reveal whether the reported accuracy improvements hold or if they depend on the specific labeling process.

Figures

Figures reproduced from arXiv: 2605.21227 by Nina Hosseini-Kivanani.

**Figure 1.** Figure 1: summarizes the overall evaluation pipeline, from LuxBorrow-derived benchmark construction and LKG retrieval to prompt assembly and multilingual LLM evaluation [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: (a) Per-class F1 for zero-shot and KG-graph by model (GE12, GE27, & LL70 denote Gemma 3 12B, Gemma 3 27B, & Llama 3.3 70B, respectively.). (b) Top donor confusion pairs. (c) False-positive rates on NATIVE: proportion of NATIVE tokens predicted as loans. tokens under zero-shot prompting. KG-graph prompting reduces the EN_LOAN rate by 78– 87% for the Gemma models (from 32.8% to 4.2% for Gemma 12B, and from 1… view at source ↗

**Figure 3.** Figure 3: shows the effect of removing individual components from the KG-graph prompt. The full KG-graph condition reaches 81.0%, 71.4%, and 71.3% accuracy for the three models (Gemma 3 12B, Gemma 3 27B, and Llama 3.3 70B). Removing etymological information (No Etymology) reduces accuracy to 78.9% for Gemma 3 12B, 58.4% for Gemma 3 27B, and 69.2% for Llama 3.3 70B. Dropping analogical examples (No Analogues) has a… view at source ↗

**Figure 4.** Figure 4: Supplementary visualization of KG gain ∆KG by model size on a log-scaled x-axis [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

read the original abstract

Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3{,}050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25 -- 35\% up to 71 -- 81\% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

KG prompts lift borrowing classification to 71-81% on the new LexNeo-Bench and shrink the small-vs-large model gap, but label quality is the main unverified piece.

read the letter

The main takeaway is that injecting instance-specific subgraphs from a linguistic knowledge graph raises borrowing-type classification accuracy from the 25-35% range to 71-81% across the tested models, and it largely closes the performance difference between smaller and larger LLMs. Neology detection stays harder and more sensitive to few-shot choices. They built LexNeo-Bench with 3,050 tokens drawn from the LuxBorrow Luxembourgish news corpus and labeled each as native or as a borrowing from French, German, or English. They then ran three multilingual LLMs through 34 prompt configurations on the two tasks. Without the graph, results sit near chance; with it, the gains are consistent in the numbers reported. The paper does a clean job showing that lexicon-aware prompting can help models respect community norms in a low-resource contact language. The experimental scope is reasonable for this kind of targeted study, and the practical angle for writing assistance tools is clear. The soft spot is the benchmark labels. They come from manual or semi-automatic annotation on the corpus, and any systematic mismatch with actual speaker norms would turn the accuracy gains into alignment with the annotation scheme rather than evidence of linguistic insight. The graph itself is built from related lexical resources, so a small risk of overlap or selection effects exists. If the full methods include inter-annotator agreement or external validation against speakers, that would tighten things up. This is for researchers working on multilingual LLMs, prompting with external knowledge, or lexical phenomena in small contact languages. A reader who needs a concrete benchmark and prompting results for borrowing classification would get direct value. It deserves a serious referee because the new benchmark and the reported improvements are concrete enough to review. I would send it for peer review and ask for more detail on how the labels were produced and checked.

Referee Report

2 major / 2 minor

Summary. The paper introduces LexNeo-Bench, a 3,050-token benchmark derived from the LuxBorrow Luxembourgish news corpus, with tokens labeled as native or as borrowings from French, German, or English. It evaluates three multilingual LLMs across 34 prompt settings on borrowing-type classification and a binary neology proxy. Without external context, models perform near chance (25-35%); injecting instance-specific subgraphs from a constructed linguistic knowledge graph (encoding donor language, morphology, and analogues) raises borrowing classification accuracy to 71-81% and largely closes the gap between small and large models, while neology detection remains difficult and few-shot sensitive. The work argues that lexicon-aware prompting benefits low-resource contact languages and that lexical resources provide useful structured context for LLM evaluation.

Significance. If the benchmark labels accurately capture Luxembourgish community norms, the results provide concrete evidence that structured linguistic knowledge can be injected via prompts to improve model respect for lexical borrowing conventions in low-resource settings. The large, consistent accuracy gains (roughly doubling performance) and the closing of model-size gaps are noteworthy empirical findings for multilingual and low-resource NLP. The introduction of LexNeo-Bench itself is a useful contribution for future work on lexical innovation in contact languages.

major comments (2)

[§3] §3 (LexNeo-Bench construction): the paper states that the 3,050 labels were obtained from LuxBorrow via manual or semi-automatic annotation, yet provides no inter-annotator agreement figures, annotation guidelines, or resolution procedure for ambiguous cases. Because the central claim equates the jump from 25-35% to 71-81% accuracy with genuine linguistic knowledge, the absence of these details leaves open the possibility that measured gains reflect alignment to annotation artifacts rather than community norms.
[§4.1] §4.1 (knowledge-graph construction): the subgraphs injected into prompts are built from related lexical resources; the manuscript does not state whether these resources overlap with those underlying the LuxBorrow labels or the annotation process. Any such overlap would create a circularity risk that undermines the interpretation of the 71-81% figures as evidence of model improvement independent of the evaluation target.

minor comments (2)

[Abstract / §4] The abstract and §4 mention 34 prompt settings but do not enumerate the main axes of variation (e.g., number of few-shot examples, exact KG injection format, or presence/absence of donor-language hints). A short table or bullet list would improve reproducibility.
[Results section] Table or figure captions for the accuracy results should explicitly state the random baseline and the number of runs or seeds used to compute the reported percentages.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We respond to each major comment in turn and indicate the revisions we have made or plan to make in the updated manuscript.

read point-by-point responses

Referee: [§3] §3 (LexNeo-Bench construction): the paper states that the 3,050 labels were obtained from LuxBorrow via manual or semi-automatic annotation, yet provides no inter-annotator agreement figures, annotation guidelines, or resolution procedure for ambiguous cases. Because the central claim equates the jump from 25-35% to 71-81% accuracy with genuine linguistic knowledge, the absence of these details leaves open the possibility that measured gains reflect alignment to annotation artifacts rather than community norms.

Authors: We agree that the lack of reported inter-annotator agreement and detailed guidelines is a limitation in the current manuscript. The labels were produced using a semi-automatic approach based on existing etymological resources for Luxembourgish, with manual adjudication for uncertain cases by experts. However, formal IAA was not calculated. In the revision, we will expand §3 to include the annotation guidelines and the procedure for resolving ambiguities. We will also note the absence of IAA as a limitation of the benchmark. This addresses the concern about potential annotation artifacts by providing more context on how the labels were derived. revision: partial
Referee: [§4.1] §4.1 (knowledge-graph construction): the subgraphs injected into prompts are built from related lexical resources; the manuscript does not state whether these resources overlap with those underlying the LuxBorrow labels or the annotation process. Any such overlap would create a circularity risk that undermines the interpretation of the 71-81% figures as evidence of model improvement independent of the evaluation target.

Authors: The referee correctly identifies that the manuscript does not explicitly address potential overlap. We can confirm that the knowledge graph draws on distinct lexical resources, including independent morphological and etymological databases not involved in the LuxBorrow annotation. To prevent any misinterpretation, we will revise the description in §4.1 to specify the sources used and state their independence from the benchmark construction. This clarification will support the interpretation of the results as demonstrating the value of injecting structured linguistic knowledge. revision: yes

standing simulated objections not resolved

Inter-annotator agreement figures for the LexNeo-Bench labels, which were not computed during the original annotation process.

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivation chain or definitional reduction

full rationale

This is an empirical probing paper that reports measured accuracies on a held-out token benchmark (LexNeo-Bench) under different prompt conditions. No equations, derivations, or first-principles claims exist that could reduce a 'prediction' to its own inputs by construction. The KG injection and label construction are methodological choices whose validity is an external assumption, but the reported accuracy deltas (25-35% to 71-81%) are observed outcomes on independent test instances rather than quantities defined inside the paper. Self-citation load-bearing and ansatz smuggling patterns are absent. The work is therefore self-contained as standard benchmarking against an external corpus-derived gold standard.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the correctness of the benchmark labels and the assumption that the constructed knowledge graph faithfully encodes relevant linguistic patterns; no free parameters or invented entities are described in the abstract.

axioms (1)

domain assumption The LuxBorrow corpus provides representative samples of contemporary Luxembourgish usage for labeling borrowings.
The benchmark is derived from this corpus; if the corpus is biased toward certain registers or time periods, the results may not generalize.

pith-pipeline@v0.9.0 · 5771 in / 1318 out tokens · 40040 ms · 2026-05-21T04:28:34.260935+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

[1]

Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

Introduction Neology, the creation and diffusion of new lexi- cal items, has long been central to lexicography, corpus linguistics, and sociolinguistics. With the emergence of large language models (LLMs), ne- ology enters a new phase. LLMs are trained on massive multilingual corpora, absorb existing neologisms, and can themselves generate novel forms, bl...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Simple View

Related Work 2.1. Borrowing, code-switching, and neology Contact linguistics distinguishes lexical borrow- ing, items integrated into the recipient language’s lexicon and grammar, from code-switching, that is, spontaneous alternation between languages within discourse. Classic accounts emphasize that entrenched borrowings are morphologically and phonologi...

work page 2025
[3]

-en" and spelling

Experiments Figure 1 summarizes the overall evaluation pipeline, from LuxBorrow-derived benchmark con- struction and LKG retrieval to prompt assembly and multilingual LLM evaluation. Figure 1: LexNeo-Bench pipeline. 3.1. Benchmark construction We construct LexNeo-Bench, a token-level evalu- ation benchmark derived from the LuxBorrow cor- pus of profession...

work page 1999
[4]

neologism

Results 4.1. RQ1. Borrowing classification performance T able1 summarizes three-way borrowing classifi- cation accuracy and macro F1 across models and prompt strategies. Without a structured linguis- tic context, performance remains modest. In the zero-shot baseline, accuracy ranges from 24.5% for Gemma 3 12B to 34.7% for Llama 3.3 70B, and more elaborate...

work page 2021
[5]

Simple View

Discussion Our results show that off-the-shelf multilingual LLMs have limited awareness of how a small contact language integrates lexical borrowings, even when trained on large multilingual corpora. With four possible output labels, a random base- line yields 25% accuracy; zero-shot performance ranges from 24.5% to 34.7%, indicating that para- metric kno...

work page 2025
[6]

Conclusion and Future Work We introduced LexNeo-Bench, a token-level benchmark derived from a borrowing-annotated Luxembourgish news corpus to probe how mul- tilingual LLMs treat morphologically adapted bor- rowings. Across three models and 34 prompt con- figurations, zero-shot parametric knowledge stays near chance, whereas instance-specific linguistic k...

work page
[7]

Acknowledgements We thank RTL Luxembourg and Tom Weber for providing access to the news archive and for sup- porting its use for research purposes. This work highly benefited from the collaborative network fos- tered by the ENEOLI COST Action (CA22126) , supported by COST (European Cooperation in Sci- ence and Technology), and also within the project LuxV...

work page
[8]

The under- lying corpus consists of online news articles pub- lished between 1999 and 2025 by a major Luxem- bourgish media outlet (RTL)

Ethical and legal aspects Data provenance and legal basis. The under- lying corpus consists of online news articles pub- lished between 1999 and 2025 by a major Luxem- bourgish media outlet (RTL). The data were ob- tained under a formal research collaboration and processed under the outlet’s terms of use and the applicable EU text and data mining provisio...

work page 1999
[9]

Bibliographical References Martine Adda-Decker, Thomas Pellegrini, Eric Bilinski, and Gilles Adda. 2008. Develop- ments of” lëtzebuergesch” resources for auto- matic speech processing and linguistic studies. In LREC. Elena Alvarez-Mellado. 2020. An annotated cor- pus of emerging anglicisms in spanish news- paper headlines. In Proceedings of the 4th worksh...

work page arXiv 2008
[10]

In Proceedings of the 22nd Workshop on T reebanks and Linguistic Theories (TL T 2024), pages 30–39

Luxbank: The first universal dependency treebank for luxembourgish. In Proceedings of the 22nd Workshop on T reebanks and Linguistic Theories (TL T 2024), pages 30–39. Zeqi T an, Shen Huang, Zixia Jia, Jiong Cai, Yinghui Li, Weiming Lu, Yueting Zhuang, Kewei Tu, Pengjun Xie, Fei Huang, et al. 2023. Damo- nlp at semeval-2023 task 2: A unified retrieval- au...

work page 2024
[11]

Hu- manities and Social Sciences Communications , 10(1):1–10

Tracking the acceptance of neologisms in german: Psycholinguistic factors and their cor- respondence with corpus-linguistic findings. Hu- manities and Social Sciences Communications , 10(1):1–10. Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conf...

work page 2024
[12]

Language Resource References Zenter fir d’Lëtzebuerger Sprooch. 2025. Lëtze- buerger Online Dictionnaire (LOD) . Oﬀicial ref- erence dictionary for Luxembourgish

work page 2025
[13]

Supplementary visualization of KG gain KG gain is defined as the accuracy difference be- tween KG-graph and zero-shot prompting

Appendices 11.1. Supplementary visualization of KG gain KG gain is defined as the accuracy difference be- tween KG-graph and zero-shot prompting. The gain decreases monotonically with scale, from +56.5 percentage points for Gemma 3 12B to +40.1 for Gemma 3 27B and +36.6 for Llama 3.3 70B. Figure 4: Supplementary visualization of KG gain ∆KG by model size ...

work page

[1] [1]

Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models

Introduction Neology, the creation and diffusion of new lexi- cal items, has long been central to lexicography, corpus linguistics, and sociolinguistics. With the emergence of large language models (LLMs), ne- ology enters a new phase. LLMs are trained on massive multilingual corpora, absorb existing neologisms, and can themselves generate novel forms, bl...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Simple View

Related Work 2.1. Borrowing, code-switching, and neology Contact linguistics distinguishes lexical borrow- ing, items integrated into the recipient language’s lexicon and grammar, from code-switching, that is, spontaneous alternation between languages within discourse. Classic accounts emphasize that entrenched borrowings are morphologically and phonologi...

work page 2025

[3] [3]

-en" and spelling

Experiments Figure 1 summarizes the overall evaluation pipeline, from LuxBorrow-derived benchmark con- struction and LKG retrieval to prompt assembly and multilingual LLM evaluation. Figure 1: LexNeo-Bench pipeline. 3.1. Benchmark construction We construct LexNeo-Bench, a token-level evalu- ation benchmark derived from the LuxBorrow cor- pus of profession...

work page 1999

[4] [4]

neologism

Results 4.1. RQ1. Borrowing classification performance T able1 summarizes three-way borrowing classifi- cation accuracy and macro F1 across models and prompt strategies. Without a structured linguis- tic context, performance remains modest. In the zero-shot baseline, accuracy ranges from 24.5% for Gemma 3 12B to 34.7% for Llama 3.3 70B, and more elaborate...

work page 2021

[5] [5]

Simple View

Discussion Our results show that off-the-shelf multilingual LLMs have limited awareness of how a small contact language integrates lexical borrowings, even when trained on large multilingual corpora. With four possible output labels, a random base- line yields 25% accuracy; zero-shot performance ranges from 24.5% to 34.7%, indicating that para- metric kno...

work page 2025

[6] [6]

Conclusion and Future Work We introduced LexNeo-Bench, a token-level benchmark derived from a borrowing-annotated Luxembourgish news corpus to probe how mul- tilingual LLMs treat morphologically adapted bor- rowings. Across three models and 34 prompt con- figurations, zero-shot parametric knowledge stays near chance, whereas instance-specific linguistic k...

work page

[7] [7]

Acknowledgements We thank RTL Luxembourg and Tom Weber for providing access to the news archive and for sup- porting its use for research purposes. This work highly benefited from the collaborative network fos- tered by the ENEOLI COST Action (CA22126) , supported by COST (European Cooperation in Sci- ence and Technology), and also within the project LuxV...

work page

[8] [8]

The under- lying corpus consists of online news articles pub- lished between 1999 and 2025 by a major Luxem- bourgish media outlet (RTL)

Ethical and legal aspects Data provenance and legal basis. The under- lying corpus consists of online news articles pub- lished between 1999 and 2025 by a major Luxem- bourgish media outlet (RTL). The data were ob- tained under a formal research collaboration and processed under the outlet’s terms of use and the applicable EU text and data mining provisio...

work page 1999

[9] [9]

Bibliographical References Martine Adda-Decker, Thomas Pellegrini, Eric Bilinski, and Gilles Adda. 2008. Develop- ments of” lëtzebuergesch” resources for auto- matic speech processing and linguistic studies. In LREC. Elena Alvarez-Mellado. 2020. An annotated cor- pus of emerging anglicisms in spanish news- paper headlines. In Proceedings of the 4th worksh...

work page arXiv 2008

[10] [10]

In Proceedings of the 22nd Workshop on T reebanks and Linguistic Theories (TL T 2024), pages 30–39

Luxbank: The first universal dependency treebank for luxembourgish. In Proceedings of the 22nd Workshop on T reebanks and Linguistic Theories (TL T 2024), pages 30–39. Zeqi T an, Shen Huang, Zixia Jia, Jiong Cai, Yinghui Li, Weiming Lu, Yueting Zhuang, Kewei Tu, Pengjun Xie, Fei Huang, et al. 2023. Damo- nlp at semeval-2023 task 2: A unified retrieval- au...

work page 2024

[11] [11]

Hu- manities and Social Sciences Communications , 10(1):1–10

Tracking the acceptance of neologisms in german: Psycholinguistic factors and their cor- respondence with corpus-linguistic findings. Hu- manities and Social Sciences Communications , 10(1):1–10. Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conf...

work page 2024

[12] [12]

Language Resource References Zenter fir d’Lëtzebuerger Sprooch. 2025. Lëtze- buerger Online Dictionnaire (LOD) . Oﬀicial ref- erence dictionary for Luxembourgish

work page 2025

[13] [13]

Supplementary visualization of KG gain KG gain is defined as the accuracy difference be- tween KG-graph and zero-shot prompting

Appendices 11.1. Supplementary visualization of KG gain KG gain is defined as the accuracy difference be- tween KG-graph and zero-shot prompting. The gain decreases monotonically with scale, from +56.5 percentage points for Gemma 3 12B to +40.1 for Gemma 3 27B and +36.6 for Llama 3.3 70B. Figure 4: Supplementary visualization of KG gain ∆KG by model size ...

work page