Recognition: unknown
Resource-Lean Lexicon Induction for German Dialects
Pith reviewed 2026-05-08 05:59 UTC · model grok-4.3
The pith
Random forests trained on string similarity features induce effective lexicons for German dialects and outperform large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Statistical models such as random forests trained on string similarity features are effective for inducing German dialect lexicons. On bilingual lexicon induction they surpass Mistral-123b while remaining resource-lean. When the resulting dictionaries expand queries for BM25-based dialect information retrieval they deliver relative gains of up to 28.9 percent in nDCG@10 and 50.7 percent in Recall@100. The same models transfer across different German dialects and retain performance even with reduced training data.
What carries the argument
A random forest classifier trained on string similarity features extracted from limited seed bilingual lexicons, applied to bilingual lexicon induction and to query expansion for dialect IR.
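A minimal sketch of that machinery, assuming a scikit-learn random forest over a toy feature set (sequence similarity, character-bigram overlap, relative length difference); the Bavarian-German pairs and the exact features below are illustrative assumptions rather than the paper's configuration.

```python
# Sketch: train a random forest on string-similarity features computed from a
# small seed lexicon of (dialect word, Standard German word) pairs, then use it
# to score candidate translations for an unseen dialect word.
import random
from difflib import SequenceMatcher

from sklearn.ensemble import RandomForestClassifier

def char_ngrams(word, n=2):
    return {word[i:i + n] for i in range(max(len(word) - n + 1, 1))}

def features(src, tgt):
    # Three simple string-similarity features; the paper's exact feature set
    # is not enumerated in this summary.
    seq_sim = SequenceMatcher(None, src, tgt).ratio()
    a, b = char_ngrams(src), char_ngrams(tgt)
    jaccard = len(a & b) / len(a | b)
    len_diff = abs(len(src) - len(tgt)) / max(len(src), len(tgt))
    return [seq_sim, jaccard, len_diff]

# Hypothetical Bavarian -> Standard German seed pairs (positives).
seed = [("guad", "gut"), ("schee", "schön"), ("i", "ich"), ("ned", "nicht")]
targets = [t for _, t in seed]

random.seed(0)
X, y = [], []
for src, tgt in seed:
    X.append(features(src, tgt))                             # true translation
    y.append(1)
    wrong = random.choice([t for t in targets if t != tgt])  # sampled negative
    X.append(features(src, wrong))
    y.append(0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Induction: rank every candidate target for an unseen dialect spelling and
# keep the highest-scoring one as its translation.
scores = {t: clf.predict_proba([features("guat", t)])[0][1] for t in targets}
print(max(scores, key=scores.get))
```

Scaling this up amounts to larger seed lexicons, richer similarity features, and a full candidate vocabulary on the Standard German side.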
If this is right
- Random forests outperform Mistral-123b on bilingual lexicon induction for German dialects.
- Induced dialect lexicons raise BM25 retrieval quality by up to 28.9% nDCG@10 and 50.7% Recall@100 (a query-expansion sketch follows this list).
- Models trained on one German dialect transfer to others with limited additional data.
- Performance remains stable across different quantities of seed training examples.
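The retrieval gain in the second bullet comes from plain dictionary-based query expansion. A sketch under simple assumptions: each dialect query term is kept and, where the induced lexicon covers it, its Standard German translation is appended before BM25 scoring. The lexicon, query, and corpus are toy examples, and the rank_bm25 package only stands in here to keep the snippet self-contained (the paper's IR experiments use standard BM25 tooling).

```python
# Sketch: expand a dialect query with Standard German translations from an
# induced lexicon, then rank documents with BM25.
from rank_bm25 import BM25Okapi

# Hypothetical induced dialect -> Standard German lexicon.
lexicon = {"guad": "gut", "schee": "schön", "weda": "wetter"}

corpus = [
    "das wetter ist heute gut",
    "der film war nicht schön",
    "morgen wird das wetter schön",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

def expand(query):
    # Keep the original dialect term and add its translation if we have one.
    terms = []
    for tok in query.split():
        terms.append(tok)
        if tok in lexicon:
            terms.append(lexicon[tok])
    return terms

query = "schee weda"
scores = bm25.get_scores(expand(query))
ranking = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
print([corpus[i] for i in ranking])  # the 'wetter ... schön' document ranks first
```

Without expansion the dialect terms match nothing in the Standard German collection; with it, the relevant document surfaces, which is the mechanism behind the reported nDCG@10 and Recall@100 gains.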
Where Pith is reading between the lines
- The same string-similarity approach may succeed for other low-resource language varieties that exhibit high orthographic variation.
- Lightweight statistical models could remain preferable to scaled LLMs whenever annotation budgets are small and spelling differences dominate.
- Cross-dialect transfer implies that German dialects share enough surface-level regularities for simple similarity measures to generalize.
Load-bearing premise
String similarity features computed from limited seed data are enough to capture the spelling and lexical differences that define German dialects.
What would settle it
The claim would fall if random forests using these features achieved lower BLI accuracy than fine-tuned large models, or if their induced lexicons produced no measurable improvement in nDCG or Recall when used for query expansion on held-out dialect collections.
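One concrete way to run the intrinsic half of that test is precision@1 on held-out lexicon pairs. The sketch below assumes a generic ranking function, into which either the random-forest scorer or an LLM baseline could be plugged; the held-out pairs and toy ranker are placeholders.

```python
# Sketch: precision@1 over a held-out seed lexicon, i.e. the fraction of
# dialect words whose top-ranked candidate equals the gold translation.
def precision_at_1(gold_pairs, rank_candidates):
    """gold_pairs: list of (dialect_word, gold_translation);
    rank_candidates: callable returning candidate translations, best first."""
    hits = sum(1 for src, tgt in gold_pairs if rank_candidates(src)[0] == tgt)
    return hits / len(gold_pairs)

# Toy usage with a hypothetical ranker; a real comparison would evaluate the
# random-forest scorer and an LLM-based system on the same held-out pairs.
held_out = [("guat", "gut"), ("nix", "nichts")]
toy_ranker = lambda src: ["gut", "nichts"] if src == "guat" else ["nichts", "gut"]
print(precision_at_1(held_out, toy_ranker))  # 1.0
```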
Original abstract
Automatic induction of high-quality dictionaries is essential for building lexical resources, yet low-resource languages and dialects pose several challenges: limited access to annotators, high degree of spelling variations, and poor performance of large language models (LLMs). We empirically show that statistical models (random forests) trained on string similarity features are surprisingly effective for inducing German dialect lexicons. They outperform LLMs, enable cross-dialect transfer, and offer a lightweight data-driven alternative. We evaluate our models intrinsically on bilingual lexicon induction (BLI) and extrinsically on dialect information retrieval (IR). On BLI, random forests outperform Mistral-123b while being more resource-lean. On dialect IR with BM25, using our dialect dictionaries for query expansion yields relative improvements of up to 28.9% in nDCG@10 and 50.7% in Recall@100. Motivated by the resource scarcity in dialects, we further investigate the extent to which models transfer across different German dialects, and their performance under varying amounts of training data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that random forests trained on string similarity features can effectively induce lexicons for German dialects in a resource-lean manner. These models outperform LLMs such as Mistral-123b on bilingual lexicon induction (BLI), enable cross-dialect transfer, and yield relative gains of up to 28.9% nDCG@10 and 50.7% Recall@100 when used for BM25 query expansion in dialect IR tasks. The work further examines performance under varying training data amounts, positioning the approach as a lightweight data-driven alternative for low-resource dialects with high spelling variation.
Significance. If the empirical results hold, the paper makes a useful contribution to low-resource NLP by demonstrating that simple statistical models on string features can surpass much larger LLMs for dialect lexicon induction while supporting transfer and downstream IR improvements. This offers a practical, reproducible alternative for building lexical resources where annotators and pretraining data are scarce, with clear value for German dialect processing and similar settings.
Minor comments (3)
- [Method] The specific string similarity features (e.g., which edit distances, n-gram overlaps, or phonetic measures) and their exact computation should be enumerated, ideally in a dedicated table or subsection, to support full reproducibility of the random forest inputs.
- [Experiments] Data splits, seed lexicon sizes, and any statistical significance testing (e.g., p-values or bootstrap intervals for the reported BLI and IR gains) are not fully detailed; adding these would strengthen confidence in the outperformance claims over Mistral-123b and BM25 baselines (a bootstrap sketch follows this list).
- [Cross-dialect Transfer] The cross-dialect transfer section would benefit from explicitly naming the dialects involved and discussing their linguistic similarities or differences, as this directly supports the transferability claims.
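On the significance point, a per-query bootstrap over paired nDCG@10 scores is one standard way to produce the requested intervals. The per-query values below are placeholders purely to illustrate the procedure, not numbers from the paper.

```python
# Sketch: 95% bootstrap confidence interval for the mean per-query nDCG@10
# improvement of query expansion over the BM25 baseline.
import random

baseline = [0.31, 0.42, 0.18, 0.55, 0.27, 0.40, 0.22, 0.49]  # per-query nDCG@10
expanded = [0.39, 0.44, 0.25, 0.58, 0.36, 0.41, 0.30, 0.52]
diffs = [e - b for e, b in zip(expanded, baseline)]

random.seed(0)
means = []
for _ in range(10_000):
    resample = random.choices(diffs, k=len(diffs))  # resample queries with replacement
    means.append(sum(resample) / len(resample))
means.sort()
low, high = means[int(0.025 * len(means))], means[int(0.975 * len(means))]
print(f"mean gain {sum(diffs) / len(diffs):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support the reported gains; the same resampling applies to Recall@100.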
Simulated Author's Rebuttal
We thank the referee for their positive assessment of our work and the recommendation for minor revision. Their summary correctly captures the core claims regarding the effectiveness of random forests trained on string similarity features for resource-lean German dialect lexicon induction, including outperformance of LLMs like Mistral-123b on BLI, cross-dialect transfer, and downstream gains in dialect IR.
Circularity Check
No significant circularity; empirical evaluation is self-contained
Full rationale
The paper describes an empirical methodology: training random forests on string similarity features from seed data, then evaluating on held-out bilingual lexicon induction (BLI) and dialect IR tasks with external baselines (Mistral, BM25). No equations, derivations, or self-referential definitions appear; performance claims rest on standard cross-validation, transfer experiments, and data-varying ablations rather than reducing to fitted parameters renamed as predictions or self-citation chains. The central result (outperformance on BLI/IR) is directly falsifiable against the reported metrics and does not import uniqueness theorems or ansatzes from prior author work.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1]
- [2]
- [3] George W. Adamson and Jillian Boreham. 1974. https://www.sciencedirect.com/science/article/abs/pii/0020027174900205 The use of an association measure based on character structure to identify semantically related pairs of words and document titles. Information Storage and Retrieval, 10(7-8):253--260.
- [4] Mirna Adriani and C. J. van Rijsbergen. 1999. https://link.springer.com/chapter/10.1007/3-540-48155-9_20 Term similarity-based query expansion for cross-language information retrieval. In Research and Advanced Technology for Digital Libraries, pages 311--322, Berlin, Heidelberg. Springer Berlin Heidelberg.
- [5] Ekaterina Artemova and Barbara Plank. 2023. https://aclanthology.org/2023.nodalida-1.39/ Low-resource bilingual dialect lexicon induction with large language models. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 371--385.
- [6] Timothy Baldwin, Jonathan Pool, and Susan Colowick. 2010. https://aclanthology.org/C10-3010/ PanLex and LEXTRACT: Translating all words of all languages of the world. In Coling 2010: Demonstrations, pages 37--40, Beijing, China. Coling 2010 Organizing Committee.
- [7] Lisa Ballesteros and Bruce Croft. 1996. https://dl.acm.org/doi/10.5555/648309.754278 Dictionary methods for cross-lingual information retrieval. In Database and Expert Systems Applications, pages 791--801, Berlin, Heidelberg. Springer Berlin Heidelberg.
- [8] Verena Blaschke, Hinrich Schütze, and Barbara Plank. 2023. https://doi.org/10.18653/v1/2023.vardial-1.5 Does manipulating tokenization aid cross-lingual transfer? A study on POS tagging for non-standardized languages. In Tenth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2023), pages 40--54, Dubrovnik, Croatia. Association for Computational Linguistics.
- [9] Leo Breiman. 2001. https://link.springer.com/article/10.1023/A:1010933404324 Random forests. Machine Learning, 45(1):5--32.
- [10] Chris Brew, David McKelvie, et al. 1996. Word-pair extraction for lexicography. In Proceedings of the 2nd International Conference on New Methods in Language Processing, pages 45--55. Citeseer.
- [11] Andreas Chari, Sean MacAvaney, and Iadh Ounis. 2023. https://dl.acm.org/doi/10.1145/3539618.3592030 On the effects of regional spelling conventions in retrieval models. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2220--2224.
- [12]
- [13]
- [14] Geert Heyman, Ivan Vulić, and Marie-Francine Moens. 2017. https://aclanthology.org/E17-1102/ Bilingual lexicon induction by learning to combine word-level and character-level representations. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 1085--1095, Valencia, Spain. Association for Computational Linguistics.
- [15] Diana Inkpen, Oana Frunza, and Grzegorz Kondrak. 2005. Automatic identification of cognates and false friends in French and English. In Proceedings of the International Conference Recent Advances in Natural Language Processing, volume 9, pages 251--257.
- [16] Ann Irvine and Chris Callison-Burch. 2017. https://doi.org/10.1162/COLI_a_00284 A comprehensive analysis of bilingual lexicon induction. Computational Linguistics, 43(2):273--310.
- [17] Mihir Kale, Sreyashi Nag, Varun Lakshinarasimhan, and Swapnil Singhavi. 2020. https://openreview.net/forum?id=B1ecYsqSuN Incorporating bilingual dictionaries for low resource semi-supervised neural machine translation. In International Conference on Learning Representations, Learning with Limited Labeled Data.
- [18] David Kamholz, Jonathan Pool, and Susan M. Colowick. 2014. https://aclanthology.org/L14-1023/ PanLex: Building a resource for panlingual lexical translation. In LREC, volume 14, pages 3145--3150.
- [19] Grzegorz Kondrak and Bonnie Dorr. 2004. https://aclanthology.org/C04-1137/ Identification of confusable drug names: A new approach and evaluation methodology. In COLING 2004: Proceedings of the 20th International Conference on Computational Linguistics, pages 952--958, Geneva, Switzerland. COLING.
- [20] Vladimir Levenshtein. 1966. Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 10(8):707--710. [Russian original (1965) in Doklady Akademii Nauk SSSR, 163(4):845--848.]
- [21] Yaoyiran Li, Anna Korhonen, and Ivan Vulić. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.595 On bilingual lexicon induction with large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9577--9599, Singapore. Association for Computational Linguistics.
- [22] Jimmy Lin, Xueguang Ma, Sheng-Chieh Lin, Jheng-Hong Yang, Ronak Pradeep, and Rodrigo Nogueira. 2021. https://dl.acm.org/doi/10.1145/3404835.3463238 Pyserini: A Python toolkit for reproducible information retrieval research with sparse and dense representations. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval.
- [23] Robert Litschko, Verena Blaschke, Diana Burkhardt, Barbara Plank, and Diego Frassinelli. 2025a. https://aclanthology.org/2025.findings-emnlp.762/ Make every letter count: Building dialect variation dictionaries from monolingual corpora. In Findings of the Association for Computational Linguistics: EMNLP 2025, Suzhou, China. Association for Computational Linguistics.
- [24] Robert Litschko, Goran Glavaš, Simone Paolo Ponzetto, and Ivan Vulić. 2018. https://dl.acm.org/doi/10.1145/3209978.3210157 Unsupervised cross-lingual information retrieval using monolingual data only. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 1253--1256.
- [25] Robert Litschko, Oliver Kraus, Verena Blaschke, and Barbara Plank. 2025b. https://aclanthology.org/2025.coling-main.678/ Cross-dialect information retrieval: Information access in low-resource and high-variance languages. In Proceedings of the 31st International Conference on Computational Linguistics, pages 10158--10171, Abu Dhabi, UAE. Association for Computational Linguistics.
- [26] Hongyuan Lu, Haoran Yang, Haoyang Huang, Dongdong Zhang, Wai Lam, and Furu Wei. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.55 Chain-of-dictionary prompting elicits translation in large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 958--976, Miami, Florida, USA. Association for Computational Linguistics.
- [27] I. Dan Melamed. 1999. https://aclanthology.org/J99-1003 Bitext maps and alignment via pattern recognition. Computational Linguistics, 25(1):107--130.
- [28] Raphael Merx, Ekaterina Vylomova, and Kemal Kurniawan. 2024. https://aclanthology.org/2024.alta-1.5/ Generating bilingual example sentences with large language models as lexicography assistants. In Proceedings of the 22nd Annual Workshop of the Australasian Language Technology Association, pages 64--74, Canberra, Australia. Association for Computational Linguistics.
- [29] Alberto Muñoz-Ortiz, Verena Blaschke, and Barbara Plank. 2025. https://aclanthology.org/2025.coling-main.427/ Evaluating pixel language models on non-standardized languages. In Proceedings of the 31st International Conference on Computational Linguistics, pages 6412--6419, Abu Dhabi, UAE. Association for Computational Linguistics.
- [30] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. https://dl.acm.org/doi/10.5555/1953048.2078195 Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825--2830.
- [31] Hans Joachim Postel. 1969. Die Kölner Phonetik. Ein Verfahren zur Identifizierung von Personennamen auf der Grundlage der Gestaltanalyse. IBM-Nachrichten, 19:925--931.
- [32] Stephen Robertson, Hugo Zaragoza, et al. 2009. https://dl.acm.org/doi/abs/10.1561/1500000019 The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4):333--389.
- [33] Aarohi Srivastava and David Chiang. 2025. https://doi.org/10.18653/v1/2025.wnut-1.6 We're calling an intervention: Exploring fundamental hurdles in adapting language models to nonstandard text. In Proceedings of the Tenth Workshop on Noisy and User-generated Text, pages 45--56, Albuquerque, New Mexico, USA. Association for Computational Linguistics.
- [34]
- [35] Robert A. Wagner and Michael J. Fischer. 1974. The string-to-string correction problem. Journal of the ACM (JACM), 21(1):168--173.
- [36] Jonas Waldendorf, Alexandra Birch, Barry Haddow, and Antonio Valerio Miceli Barone. 2022. https://aclanthology.org/2022.amta-research.11/ Improving translation of out of vocabulary words using bilingual lexicon induction in low-resource machine translation. In Proceedings of the 15th biennial conference of the Association for Machine Translation in the Americas.
- [37] Xinyi Wang, Sebastian Ruder, and Graham Neubig. 2022. https://aclanthology.org/2022.acl-long.61/ Expanding pretrained models to thousands more languages via lexicon-based adaptation. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 863--877.
- [38] Yongjing Yin, Jiali Zeng, Yafu Li, Fandong Meng, and Yue Zhang. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.866 LexMatcher: Dictionary-centric data curation for LLM-based machine translation. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 14767--14779, Miami, Florida, USA. Association for Computational Linguistics.
- [39] Caleb Ziems, William Held, Jingfeng Yang, Jwala Dhamala, Rahul Gupta, and Diyi Yang. 2023. https://doi.org/10.18653/v1/2023.acl-long.44 Multi-VALUE: A framework for cross-dialectal English NLP. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 744--768, Toronto, Canada. Association for Computational Linguistics.