CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios
Pith reviewed 2026-06-28 01:20 UTC · model grok-4.3
The pith
A new benchmark shows language identification systems struggle with cousin languages and orthographic noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CHALIS contains two sections: one with sentences shared between the languages of mutually intelligible pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian), and another with text subjected to orthographic noise through transliteration across scripts, diacritic removal, homoglyph substitution, and Internet slang. Evaluating four widely used language identification systems on this data reveals consistent, substantial performance drops, most pronounced for the lower-resource languages inside the cousin pairs and for transliterated input.
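Two of the perturbations named above, diacritic removal and homoglyph substitution, can be illustrated with a short Python sketch. This is not the authors' pipeline: the function names and the five-entry homoglyph table are assumptions chosen for illustration, and the summary does not say which mappings CHALIS actually uses.

```python
import unicodedata

# Hypothetical homoglyph table (an assumption, not the CHALIS list):
# Latin letters mapped to visually similar Cyrillic code points.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}


def strip_diacritics(text: str) -> str:
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [ch for ch in decomposed if not unicodedata.combining(ch)]
    return unicodedata.normalize("NFC", "".join(kept))


def substitute_homoglyphs(text: str) -> str:
    """Replace every character found in the table with its look-alike."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
```

Letters such as "ø" or "đ" have no canonical decomposition, so this NFD-based approach leaves them unchanged; a fuller pipeline would need an explicit mapping for them.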
What carries the argument
The CHALIS dataset, which supplies paired examples of shared sentences across cousin languages together with controlled orthographic noise variants.
If this is right
- Language identification models require targeted improvements to distinguish lower-resource languages within close pairs.
- Handling of orthographic noise such as transliteration and diacritic loss must become a standard robustness goal.
- Existing evaluation sets miss these edge cases and therefore overestimate system reliability.
- Public benchmarks that isolate these failure modes can direct future model development more precisely.
Where Pith is reading between the lines
- Applications that process user-generated web content may encounter higher language detection error rates than standard benchmarks suggest.
- The dataset offers a direct way to measure and train for script-invariant language detection without requiring new labeled data collection.
- Treating cousin languages as a single modeling problem rather than separate classes could reduce the observed confusions.
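The last point can be probed without retraining anything by scoring the same predictions twice: once strictly, and once with each cousin pair collapsed into a single label. The gap between the two scores is the share of errors that are within-pair confusions. The sketch below assumes ISO 639-1 codes for the four pairs and uses made-up predictions; neither comes from the paper.

```python
# Assumed ISO 639-1 codes for the four cousin pairs named in the paper.
COUSIN_PAIRS = [("cs", "sk"), ("es", "ca"), ("pt", "gl"), ("da", "no")]

# Map each language code to a pair-level label, e.g. "cs" -> "cs/sk".
GROUP_OF = {lang: "/".join(pair) for pair in COUSIN_PAIRS for lang in pair}


def accuracy(gold, pred, collapse_pairs=False):
    """Fraction of matching labels, optionally scored at the pair level."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    def norm(label):
        return GROUP_OF.get(label, label) if collapse_pairs else label

    hits = sum(norm(g) == norm(p) for g, p in zip(gold, pred))
    return hits / len(gold)


# Illustrative (invented) labels and predictions.
gold = ["sk", "cs", "gl", "da"]
pred = ["cs", "cs", "pt", "es"]
strict = accuracy(gold, pred)                           # 0.25
pair_level = accuracy(gold, pred, collapse_pairs=True)  # 0.75
```

This is only a scoring view; whether a model trained on merged classes would actually reduce the confusions is a separate question that the summary leaves open.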
Load-bearing premise
The chosen language pairs and the specific noise types created in the dataset correspond to the actual difficult cases language identification systems meet in practice.
What would settle it
A replication in which the four tested systems exceed 90 percent accuracy on the CHALIS examples, or evidence that real-world error rates on comparable cousin-language and noisy text are far lower than the reported figures.
Original abstract
We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts: First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian). The second part tests for orthography noise: we transliterate text across multiple scripts, remove diacritics, simulate homoglyph attacks, and use Internet slang. We evaluate four widely used language identification systems on CHALIS and demonstrate that all struggle substantially in these scenarios, especially on lower-resource languages within cousin pairs and on transliterated input. The resource is publicly available at https://huggingface.co/datasets/michal-tichy/CHALIS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CHALIS, a new benchmark dataset for language identification targeting difficult cases: mutually intelligible cousin language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian) and orthographic noise (transliteration across scripts, diacritic removal, homoglyph attacks, Internet slang). It evaluates four widely used LID systems, claims they all struggle substantially (especially on lower-resource languages in cousin pairs and on transliterated input), and releases the dataset publicly on Hugging Face.
Significance. A publicly released dataset focused on these specific failure modes could support development of more robust LID systems if the instances are shown to be representative of real production difficulties. The absence of empirical grounding for the noise simulations, however, reduces the likelihood that CHALIS will become a durable standard benchmark.
major comments (2)
- [§3–4] The orthographic noise perturbations are constructed via fixed, hand-specified rules (script mapping tables, uniform diacritic stripping, predefined homoglyph lists, slang substitutions) without reference to empirical frequency distributions drawn from actual noisy user-generated text. Consequently, the observed performance drops on the four evaluated systems may be artifacts of these arbitrary simulation parameters rather than evidence that the systems fail on the noise types they actually encounter.
- [Evaluation] The central claim that the four systems 'struggle substantially' (particularly on lower-resource languages within cousin pairs and transliterated input) is load-bearing for the paper's contribution, yet the manuscript provides no quantitative details on dataset cardinality, per-category error rates, or comparison against standard LID benchmarks, preventing assessment of effect size or practical significance.
minor comments (1)
- The abstract would be strengthened by reporting basic dataset statistics (number of sentences per category, total size) so readers can immediately gauge scale.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.
- Referee: [§3–4] The orthographic noise perturbations are constructed via fixed, hand-specified rules (script mapping tables, uniform diacritic stripping, predefined homoglyph lists, slang substitutions) without reference to empirical frequency distributions drawn from actual noisy user-generated text. Consequently, the observed performance drops on the four evaluated systems may be artifacts of these arbitrary simulation parameters rather than evidence that the systems fail on the noise types they actually encounter.
  Authors: We acknowledge that the perturbations rely on fixed, hand-specified rules rather than frequency distributions extracted from real-world noisy corpora. These rules were chosen to instantiate well-documented failure modes (transliteration in social media, diacritic omission, homoglyph substitution, and slang) that appear repeatedly in production LID error analyses. We will revise Sections 3–4 to cite supporting linguistic and NLP literature on these phenomena, add an explicit limitations paragraph on the simulation approach, and clarify that CHALIS is offered as a controlled diagnostic benchmark rather than a statistically representative sample of production noise. This addresses the concern without requiring new data collection.
  Revision: partial
- Referee: [Evaluation] The central claim that the four systems 'struggle substantially' (particularly on lower-resource languages within cousin pairs and transliterated input) is load-bearing for the paper's contribution, yet the manuscript provides no quantitative details on dataset cardinality, per-category error rates, or comparison against standard LID benchmarks, preventing assessment of effect size or practical significance.
  Authors: We agree that the submitted manuscript under-reports quantitative details. The full evaluation section contains per-system accuracies and notes on lower-resource languages, but we will expand it to report exact dataset cardinalities (number of instances per language pair and per noise type), per-category error rates with confidence intervals, and head-to-head comparisons against the systems' published performance on standard benchmarks such as WiLI-2018 and the original evaluation sets used for each LID model. These additions will allow readers to judge effect size and practical relevance.
  Revision: yes
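The per-category error rates with confidence intervals promised in the rebuttal could be computed along the following lines. This is a minimal sketch, assuming a Wilson score interval at roughly 95 percent coverage (z = 1.96); the rebuttal does not say which interval the authors intend to use, and the record layout and category names are hypothetical.

```python
import math
from collections import defaultdict


def wilson_interval(successes: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def per_category_error(records):
    """records: iterable of (category, gold_label, predicted_label) tuples.

    Returns {category: (error_rate, (ci_low, ci_high))}.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [errors, total]
    for category, gold, pred in records:
        counts[category][0] += int(gold != pred)
        counts[category][1] += 1
    return {
        cat: (errors / total, wilson_interval(errors, total))
        for cat, (errors, total) in counts.items()
    }
```

The Wilson interval is used here because it stays within [0, 1] and behaves sensibly for small categories, which matters if some language pairs or noise types contain few instances.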
Circularity Check
No significant circularity in empirical dataset paper
The paper constructs and releases an empirical benchmark dataset by collecting sentences shared across cousin-language pairs and applying fixed rule-based perturbations (transliteration tables, diacritic removal, homoglyph lists, slang). No derivations, fitted parameters, predictions of held-out quantities, or self-citation chains appear; the central claim is simply that four off-the-shelf LID systems perform poorly on these constructed cases. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.