Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models
Pith reviewed 2026-05-21 04:28 UTC · model grok-4.3
The pith
Knowledge-graph prompts raise LLM borrowing classification accuracy in Luxembourgish from 25-35% to 71-81%
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that without external context, multilingual LLMs perform only slightly above chance on classifying borrowings in Luxembourgish. Constructing a linguistic knowledge graph encoding donor language, morphological patterns, and lexical analogues, and injecting instance-specific subgraphs into prompts, raises accuracy from 25-35% to 71-81%. This largely closes the gap between small and large models, though detecting lexical innovation remains difficult and sensitive to few-shot examples. The results indicate that lexicon-aware prompting aids robust borrowing judgments in low-resource contact languages.
What carries the argument
The linguistic knowledge graph encoding donor language, morphological patterns, and lexical analogues, with instance-specific subgraphs injected into prompts to provide structured context for classification.
Load-bearing premise
The manual or semi-automatic labels in the benchmark correctly reflect community norms for distinguishing borrowings from native forms.
What would settle it
A new evaluation on Luxembourgish text with independently verified labels from native speakers or linguists would reveal whether the reported accuracy improvements hold or if they depend on the specific labeling process.
Figures
read the original abstract
Large language models (LLMs) are increasingly used for writing assistance in small contact languages, yet it is unclear whether they respect community norms around lexical borrowing and neology. We introduce LexNeo-Bench, a 3{,}050-instance token-level benchmark derived from LuxBorrow, a large-scale Luxembourgish news corpus, where target tokens are labelled as native or as French, German, or English borrowings. Using this benchmark, we probe three multilingual LLMs across 34 prompt settings on two tasks: borrowing type classification and a binary lexical-innovation proxy (borrowing versus native). Without external context, models perform only slightly above chance on borrowing classification, so we construct a linguistic knowledge graph that encodes donor language, morphological patterns, and lexical analogues, and inject instance-specific subgraphs into the prompt. Knowledge-graph prompts raise borrowing classification accuracy from 25 -- 35\% up to 71 -- 81\% and largely close the gap between small and large models, while leaving neology detection difficult and sensitive to few-shot design. Our results show that lexicon-aware prompting is highly beneficial for robust borrowing judgments in low-resource contact languages and that lexical resources can serve as structured context for LLM evaluation. This study was carried out within the ENEOLI COST Action and examines borrowing as a form of lexical innovation in multilingual Luxembourgish data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LexNeo-Bench, a 3,050-token benchmark derived from the LuxBorrow Luxembourgish news corpus, with tokens labeled as native or as borrowings from French, German, or English. It evaluates three multilingual LLMs across 34 prompt settings on borrowing-type classification and a binary neology proxy. Without external context, models perform near chance (25-35%); injecting instance-specific subgraphs from a constructed linguistic knowledge graph (encoding donor language, morphology, and analogues) raises borrowing classification accuracy to 71-81% and largely closes the gap between small and large models, while neology detection remains difficult and few-shot sensitive. The work argues that lexicon-aware prompting benefits low-resource contact languages and that lexical resources provide useful structured context for LLM evaluation.
Significance. If the benchmark labels accurately capture Luxembourgish community norms, the results provide concrete evidence that structured linguistic knowledge can be injected via prompts to improve model respect for lexical borrowing conventions in low-resource settings. The large, consistent accuracy gains (roughly doubling performance) and the closing of model-size gaps are noteworthy empirical findings for multilingual and low-resource NLP. The introduction of LexNeo-Bench itself is a useful contribution for future work on lexical innovation in contact languages.
major comments (2)
- [§3] §3 (LexNeo-Bench construction): the paper states that the 3,050 labels were obtained from LuxBorrow via manual or semi-automatic annotation, yet provides no inter-annotator agreement figures, annotation guidelines, or resolution procedure for ambiguous cases. Because the central claim equates the jump from 25-35% to 71-81% accuracy with genuine linguistic knowledge, the absence of these details leaves open the possibility that measured gains reflect alignment to annotation artifacts rather than community norms.
- [§4.1] §4.1 (knowledge-graph construction): the subgraphs injected into prompts are built from related lexical resources; the manuscript does not state whether these resources overlap with those underlying the LuxBorrow labels or the annotation process. Any such overlap would create a circularity risk that undermines the interpretation of the 71-81% figures as evidence of model improvement independent of the evaluation target.
minor comments (2)
- [Abstract / §4] The abstract and §4 mention 34 prompt settings but do not enumerate the main axes of variation (e.g., number of few-shot examples, exact KG injection format, or presence/absence of donor-language hints). A short table or bullet list would improve reproducibility.
- [Results section] Table or figure captions for the accuracy results should explicitly state the random baseline and the number of runs or seeds used to compute the reported percentages.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We respond to each major comment in turn and indicate the revisions we have made or plan to make in the updated manuscript.
read point-by-point responses
-
Referee: [§3] §3 (LexNeo-Bench construction): the paper states that the 3,050 labels were obtained from LuxBorrow via manual or semi-automatic annotation, yet provides no inter-annotator agreement figures, annotation guidelines, or resolution procedure for ambiguous cases. Because the central claim equates the jump from 25-35% to 71-81% accuracy with genuine linguistic knowledge, the absence of these details leaves open the possibility that measured gains reflect alignment to annotation artifacts rather than community norms.
Authors: We agree that the lack of reported inter-annotator agreement and detailed guidelines is a limitation in the current manuscript. The labels were produced using a semi-automatic approach based on existing etymological resources for Luxembourgish, with manual adjudication for uncertain cases by experts. However, formal IAA was not calculated. In the revision, we will expand §3 to include the annotation guidelines and the procedure for resolving ambiguities. We will also note the absence of IAA as a limitation of the benchmark. This addresses the concern about potential annotation artifacts by providing more context on how the labels were derived. revision: partial
-
Referee: [§4.1] §4.1 (knowledge-graph construction): the subgraphs injected into prompts are built from related lexical resources; the manuscript does not state whether these resources overlap with those underlying the LuxBorrow labels or the annotation process. Any such overlap would create a circularity risk that undermines the interpretation of the 71-81% figures as evidence of model improvement independent of the evaluation target.
Authors: The referee correctly identifies that the manuscript does not explicitly address potential overlap. We can confirm that the knowledge graph draws on distinct lexical resources, including independent morphological and etymological databases not involved in the LuxBorrow annotation. To prevent any misinterpretation, we will revise the description in §4.1 to specify the sources used and state their independence from the benchmark construction. This clarification will support the interpretation of the results as demonstrating the value of injecting structured linguistic knowledge. revision: yes
- Inter-annotator agreement figures for the LexNeo-Bench labels, which were not computed during the original annotation process.
Circularity Check
Empirical benchmarking study with no derivation chain or definitional reduction
full rationale
This is an empirical probing paper that reports measured accuracies on a held-out token benchmark (LexNeo-Bench) under different prompt conditions. No equations, derivations, or first-principles claims exist that could reduce a 'prediction' to its own inputs by construction. The KG injection and label construction are methodological choices whose validity is an external assumption, but the reported accuracy deltas (25-35% to 71-81%) are observed outcomes on independent test instances rather than quantities defined inside the paper. Self-citation load-bearing and ansatz smuggling patterns are absent. The work is therefore self-contained as standard benchmarking against an external corpus-derived gold standard.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The LuxBorrow corpus provides representative samples of contemporary Luxembourgish usage for labeling borrowings.
Reference graph
Works this paper leans on
-
[1]
Do LLMs Know What Luxembourgish Borrows? Probing Lexical Neology in Low-Resource Multilingual Models
Introduction Neology, the creation and diffusion of new lexi- cal items, has long been central to lexicography, corpus linguistics, and sociolinguistics. With the emergence of large language models (LLMs), ne- ology enters a new phase. LLMs are trained on massive multilingual corpora, absorb existing neologisms, and can themselves generate novel forms, bl...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Related Work 2.1. Borrowing, code-switching, and neology Contact linguistics distinguishes lexical borrow- ing, items integrated into the recipient language’s lexicon and grammar, from code-switching, that is, spontaneous alternation between languages within discourse. Classic accounts emphasize that entrenched borrowings are morphologically and phonologi...
work page 2025
-
[3]
Experiments Figure 1 summarizes the overall evaluation pipeline, from LuxBorrow-derived benchmark con- struction and LKG retrieval to prompt assembly and multilingual LLM evaluation. Figure 1: LexNeo-Bench pipeline. 3.1. Benchmark construction We construct LexNeo-Bench, a token-level evalu- ation benchmark derived from the LuxBorrow cor- pus of profession...
work page 1999
-
[4]
Results 4.1. RQ1. Borrowing classification performance T able1 summarizes three-way borrowing classifi- cation accuracy and macro F1 across models and prompt strategies. Without a structured linguis- tic context, performance remains modest. In the zero-shot baseline, accuracy ranges from 24.5% for Gemma 3 12B to 34.7% for Llama 3.3 70B, and more elaborate...
work page 2021
-
[5]
Discussion Our results show that off-the-shelf multilingual LLMs have limited awareness of how a small contact language integrates lexical borrowings, even when trained on large multilingual corpora. With four possible output labels, a random base- line yields 25% accuracy; zero-shot performance ranges from 24.5% to 34.7%, indicating that para- metric kno...
work page 2025
-
[6]
Conclusion and Future Work We introduced LexNeo-Bench, a token-level benchmark derived from a borrowing-annotated Luxembourgish news corpus to probe how mul- tilingual LLMs treat morphologically adapted bor- rowings. Across three models and 34 prompt con- figurations, zero-shot parametric knowledge stays near chance, whereas instance-specific linguistic k...
-
[7]
Acknowledgements We thank RTL Luxembourg and Tom Weber for providing access to the news archive and for sup- porting its use for research purposes. This work highly benefited from the collaborative network fos- tered by the ENEOLI COST Action (CA22126) , supported by COST (European Cooperation in Sci- ence and Technology), and also within the project LuxV...
-
[8]
Ethical and legal aspects Data provenance and legal basis. The under- lying corpus consists of online news articles pub- lished between 1999 and 2025 by a major Luxem- bourgish media outlet (RTL). The data were ob- tained under a formal research collaboration and processed under the outlet’s terms of use and the applicable EU text and data mining provisio...
work page 1999
-
[9]
Bibliographical References Martine Adda-Decker, Thomas Pellegrini, Eric Bilinski, and Gilles Adda. 2008. Develop- ments of” lëtzebuergesch” resources for auto- matic speech processing and linguistic studies. In LREC. Elena Alvarez-Mellado. 2020. An annotated cor- pus of emerging anglicisms in spanish news- paper headlines. In Proceedings of the 4th worksh...
-
[10]
In Proceedings of the 22nd Workshop on T reebanks and Linguistic Theories (TL T 2024), pages 30–39
Luxbank: The first universal dependency treebank for luxembourgish. In Proceedings of the 22nd Workshop on T reebanks and Linguistic Theories (TL T 2024), pages 30–39. Zeqi T an, Shen Huang, Zixia Jia, Jiong Cai, Yinghui Li, Weiming Lu, Yueting Zhuang, Kewei Tu, Pengjun Xie, Fei Huang, et al. 2023. Damo- nlp at semeval-2023 task 2: A unified retrieval- au...
work page 2024
-
[11]
Hu- manities and Social Sciences Communications , 10(1):1–10
Tracking the acceptance of neologisms in german: Psycholinguistic factors and their cor- respondence with corpus-linguistic findings. Hu- manities and Social Sciences Communications , 10(1):1–10. Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, and Yu Su. Adaptive chameleon or stubborn sloth: Revealing the behavior of large language models in knowledge conf...
work page 2024
-
[12]
Language Resource References Zenter fir d’Lëtzebuerger Sprooch. 2025. Lëtze- buerger Online Dictionnaire (LOD) . Official ref- erence dictionary for Luxembourgish
work page 2025
-
[13]
Appendices 11.1. Supplementary visualization of KG gain KG gain is defined as the accuracy difference be- tween KG-graph and zero-shot prompting. The gain decreases monotonically with scale, from +56.5 percentage points for Gemma 3 12B to +40.1 for Gemma 3 27B and +36.6 for Llama 3.3 70B. Figure 4: Supplementary visualization of KG gain ∆KG by model size ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.