CHALIS: A Challenge Dataset for Language Identification in Difficult Scenarios

Jindřich Libovický; Michal Tichý

arxiv: 2606.06088 · v1 · pith:DPN3QAYPnew · submitted 2026-06-04 · 💻 cs.CL

Pith reviewed 2026-06-28 01:20 UTC · model grok-4.3

keywords: language identification, benchmark dataset, cousin languages, orthographic noise, transliteration, mutually intelligible languages, NLP evaluation

The pith

A new benchmark shows language identification systems struggle with cousin languages and orthographic noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CHALIS, a dataset built to test language identification on hard cases: mutually intelligible language pairs and orthographic disruptions. One part gathers sentences shared across pairs such as Czech and Slovak or Portuguese and Galician; the other covers orthographic noise, including script transliteration, diacritic removal, homoglyphs, and Internet slang. Four widely used systems are evaluated and all struggle substantially, most of all on the lower-resource language in each pair and on transliterated material. Accurate language identification underpins many text-processing pipelines, so failures in these settings limit downstream reliability.

Core claim

CHALIS contains two sections: one with sentences shared by both languages of a mutually intelligible pair (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian) and another with text subjected to orthographic noise through transliteration across scripts, diacritic removal, homoglyph substitution, and Internet slang. Evaluation of four common language identification systems on this data reveals consistent, substantial performance drops, most pronounced for the lower-resource language in each cousin pair and for transliterated input.
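Two of the noise types named here are simple enough to sketch. The following is a minimal illustration, not the authors' pipeline: `strip_diacritics` relies on Unicode decomposition from the Python standard library, and `HOMOGLYPHS` is a hypothetical five-letter Latin-to-Cyrillic table chosen only for this example.

```python
import unicodedata

# Hypothetical homoglyph table (Latin -> visually similar Cyrillic letters).
# Illustration only; this is not the substitution list used in CHALIS.
HOMOGLYPHS = {
    "a": "\u0430",  # CYRILLIC SMALL LETTER A
    "c": "\u0441",  # CYRILLIC SMALL LETTER ES
    "e": "\u0435",  # CYRILLIC SMALL LETTER IE
    "o": "\u043e",  # CYRILLIC SMALL LETTER O
    "p": "\u0440",  # CYRILLIC SMALL LETTER ER
}


def strip_diacritics(text: str) -> str:
    """Decompose characters (NFD), then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def swap_homoglyphs(text: str, table: dict = HOMOGLYPHS) -> str:
    """Replace every character that has an entry in the table."""
    return "".join(table.get(ch, ch) for ch in text)


if __name__ == "__main__":
    print(strip_diacritics("Příliš žluťoučký kůň"))
    # The swapped string looks the same on screen but compares unequal.
    print(swap_homoglyphs("pes") == "pes")
```

Letters without a canonical decomposition, such as Danish ø, pass through `strip_diacritics` unchanged, so a rule set aimed at Danish/Norwegian text would need explicit mappings on top of this.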

What carries the argument

The CHALIS dataset, which supplies paired examples of shared sentences across cousin languages together with controlled orthographic noise variants.

If this is right

Language identification models require targeted improvements to distinguish lower-resource languages within close pairs.
Handling of orthographic noise such as transliteration and diacritic loss must become a standard robustness goal.
Existing evaluation sets miss these edge cases and therefore overestimate system reliability.
Public benchmarks that isolate these failure modes can direct future model development more precisely.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Applications that process user-generated web content may encounter higher language detection error rates than standard benchmarks suggest.
The dataset offers a direct way to measure and train for script-invariant language detection without requiring new labeled data collection.
Treating cousin languages as a single modeling problem rather than separate classes could reduce the observed confusions.
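The last point concerns modeling, but its evaluation-side analogue is easy to sketch: score predictions once per language and once per cousin pair, and read the gap as within-pair confusion. The two-letter codes and the `COUSIN_GROUP` table below are assumptions made for illustration; the label scheme used by CHALIS or by any particular LID system may differ.

```python
# Assumed ISO 639-1 codes for the four pairs named in the paper.
# Hypothetical mapping: each language points to a shared pair label.
COUSIN_GROUP = {
    "cs": "cs/sk", "sk": "cs/sk",
    "es": "es/ca", "ca": "es/ca",
    "pt": "pt/gl", "gl": "pt/gl",
    "da": "da/no", "no": "da/no",
}


def accuracy(gold, pred, collapse_pairs=False):
    """Fraction of matching labels, optionally after merging cousin pairs."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one example")
    if collapse_pairs:
        gold = [COUSIN_GROUP.get(g, g) for g in gold]
        pred = [COUSIN_GROUP.get(p, p) for p in pred]
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


if __name__ == "__main__":
    gold = ["cs", "sk", "gl", "da"]
    pred = ["cs", "cs", "pt", "sv"]  # toy predictions, not real system output
    print(accuracy(gold, pred))                       # language-level score
    print(accuracy(gold, pred, collapse_pairs=True))  # pair-level score
```

A large difference between the two scores would suggest that most errors stay inside a cousin pair; a small one would point to confusions with unrelated languages.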

Load-bearing premise

The chosen language pairs and the specific noise types created in the dataset correspond to the actual difficult cases language identification systems meet in practice.

What would settle it

A replication that finds the four tested systems reach above 90 percent accuracy on the CHALIS examples or that shows real-world error rates on comparable cousin-language and noisy text to be far lower than the reported figures.
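A replication check of this kind could be scripted roughly as follows. This is a sketch under stated assumptions: the record layout `(category, gold, predicted)` and the reading that every category must exceed the threshold are choices made here, not details taken from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: accuracy} for (category, gold, predicted) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for category, gold, predicted in records:
        totals[category] += 1
        correct[category] += int(gold == predicted)
    return {category: correct[category] / totals[category] for category in totals}


def all_above(per_category, threshold=0.90):
    """True only if every category is strictly above the threshold."""
    return all(acc > threshold for acc in per_category.values())


if __name__ == "__main__":
    # Toy records for illustration; these are not CHALIS results.
    records = [
        ("transliteration", "cs", "ru"),
        ("transliteration", "cs", "cs"),
        ("diacritics", "sk", "sk"),
    ]
    scores = accuracy_by_category(records)
    print(scores)
    print(all_above(scores))
```

Reporting the per-category dictionary, rather than one pooled number, keeps a strong result on one noise type from masking a weak one on another.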

Original abstract

We present CHALIS (Challenging Language Identification Samples), a new benchmark dataset explicitly designed to address difficult cases in language identification: cousin languages and orthographic noise. Our dataset has two parts: First, we collected sentences shared across mutually intelligible language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian). The second part tests for orthography noise: we transliterate text across multiple scripts, remove diacritics, simulate homoglyph attacks, and use Internet slang. We evaluate four widely used language identification systems on CHALIS and demonstrate that all struggle substantially in these scenarios, especially on lower-resource languages within cousin pairs and on transliterated input. The resource is publicly available at https://huggingface.co/datasets/michal-tichy/CHALIS.
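The abstract does not say how the transliteration is implemented. As a rough picture of what a table-driven approach could look like, the sketch below applies a greedy longest-match lookup. `LATIN_TO_CYRILLIC` is a hypothetical and deliberately incomplete table for lowercase input, not the mapping used to build CHALIS.

```python
# Hypothetical, incomplete Latin -> Cyrillic table (lowercase only).
LATIN_TO_CYRILLIC = {
    "ch": "\u0445",  # digraph -> CYRILLIC SMALL LETTER HA
    "a": "\u0430",
    "b": "\u0431",
    "d": "\u0434",
    "e": "\u0435",
    "n": "\u043d",
    "o": "\u043e",
    "r": "\u0440",
    "y": "\u044b",
}


def transliterate(text: str, table: dict) -> str:
    """Greedy longest-match replacement; unknown characters pass through."""
    keys = sorted(table, key=len, reverse=True)  # try digraphs first
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No table entry starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("dobry den", LATIN_TO_CYRILLIC))
```

Sorting keys by length matters for digraphs such as "ch": without it, a single-letter entry could consume the first letter before the two-letter entry is tried.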

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CHALIS supplies a new targeted dataset for cousin-language and orthographic-noise cases in LID, with evaluations showing system struggles, though the noise rules are not validated against real distributions.

The main point is that this paper releases CHALIS, a dataset built around cousin-language sentence pairs and four kinds of orthographic noise, then shows that four standard LID systems drop in performance on it, especially for the lower-resource language in each pair and on transliterated input.

The work fills a clear gap by collecting sentences shared across Czech/Slovak, Spanish/Catalan, Portuguese/Galician, and Danish/Norwegian, then applying transliteration, diacritic stripping, homoglyph swaps, and slang. Making the resource public on Hugging Face is straightforward and useful. The evaluations are run on existing systems rather than new models, which keeps the focus on the test cases themselves.

The soft spot is the construction of the noise. The perturbations rely on fixed rule sets and substitution lists without any reported comparison to the actual distribution of noisy text that real systems see. If those rules do not match production failure modes, the measured drops could be artifacts of the simulation choices rather than evidence of the problems the authors intend to highlight. The abstract gives no numbers on dataset size or exact scores, so the full paper needs to supply those details and any additional checks to make the claims stick.

No load-bearing math or self-referential predictions appear, and the citation pattern is not an issue here. The paper engages honestly with a known limitation in current benchmarks.

This is for people who build or evaluate language identification systems and want test cases beyond clean, high-resource data. A reader working on robustness would find the resource directly usable. It deserves peer review because it delivers a concrete, shareable dataset that targets documented weaknesses, even if the noise validation could be tightened.

Referee Report

2 major / 1 minor

Summary. The paper presents CHALIS, a new benchmark dataset for language identification targeting difficult cases: mutually intelligible cousin language pairs (Czech/Slovak, Spanish/Catalan, Portuguese/Galician, Danish/Norwegian) and orthographic noise (transliteration across scripts, diacritic removal, homoglyph attacks, Internet slang). It evaluates four widely used LID systems, claims they all struggle substantially (especially on lower-resource languages in cousin pairs and on transliterated input), and releases the dataset publicly on Hugging Face.

Significance. A publicly released dataset focused on these specific failure modes could support development of more robust LID systems if the instances are shown to be representative of real production difficulties. The absence of empirical grounding for the noise simulations, however, reduces the likelihood that CHALIS will become a durable standard benchmark.

major comments (2)

[§3–4] The orthographic noise perturbations are constructed via fixed, hand-specified rules (script mapping tables, uniform diacritic stripping, predefined homoglyph lists, slang substitutions) without reference to empirical frequency distributions drawn from actual noisy user-generated text. Consequently, the observed performance drops on the four evaluated systems may be artifacts of these arbitrary simulation parameters rather than evidence that the systems fail on the noise types they actually encounter.
[Evaluation] The central claim that the four systems 'struggle substantially' (particularly on lower-resource languages within cousin pairs and transliterated input) is load-bearing for the paper's contribution, yet the manuscript provides no quantitative details on dataset cardinality, per-category error rates, or comparison against standard LID benchmarks, preventing assessment of effect size or practical significance.

minor comments (1)

The abstract would be strengthened by reporting basic dataset statistics (number of sentences per category, total size) so readers can immediately gauge scale.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and outline the revisions we will make to strengthen the manuscript.

Referee: [§3–4] The orthographic noise perturbations are constructed via fixed, hand-specified rules (script mapping tables, uniform diacritic stripping, predefined homoglyph lists, slang substitutions) without reference to empirical frequency distributions drawn from actual noisy user-generated text. Consequently, the observed performance drops on the four evaluated systems may be artifacts of these arbitrary simulation parameters rather than evidence that the systems fail on the noise types they actually encounter.

Authors: We acknowledge that the perturbations rely on fixed, hand-specified rules rather than frequency distributions extracted from real-world noisy corpora. These rules were chosen to instantiate well-documented failure modes (transliteration in social media, diacritic omission, homoglyph substitution, and slang) that appear repeatedly in production LID error analyses. We will revise Sections 3–4 to cite supporting linguistic and NLP literature on these phenomena, add an explicit limitations paragraph on the simulation approach, and clarify that CHALIS is offered as a controlled diagnostic benchmark rather than a statistically representative sample of production noise. This addresses the concern without requiring new data collection.
Revision: partial

Referee: [Evaluation] The central claim that the four systems 'struggle substantially' (particularly on lower-resource languages within cousin pairs and transliterated input) is load-bearing for the paper's contribution, yet the manuscript provides no quantitative details on dataset cardinality, per-category error rates, or comparison against standard LID benchmarks, preventing assessment of effect size or practical significance.

Authors: We agree that the submitted manuscript under-reports quantitative details. The full evaluation section contains per-system accuracies and notes on lower-resource languages, but we will expand it to report exact dataset cardinalities (number of instances per language pair and per noise type), per-category error rates with confidence intervals, and head-to-head comparisons against the systems' published performance on standard benchmarks such as WiLI-2018 and the original evaluation sets used for each LID model. These additions will allow readers to judge effect size and practical relevance.
Revision: yes
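The rebuttal promises per-category error rates with confidence intervals but does not name a method. One common option for a proportion is the Wilson score interval; the sketch below assumes that choice and a 95% level (z = 1.96), neither of which comes from the paper.

```python
import math


def wilson_interval(errors: int, n: int, z: float = 1.96):
    """Wilson score interval for an error rate of `errors` out of `n` items."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= errors <= n:
        raise ValueError("errors must lie between 0 and n")
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Toy counts for illustration; these are not CHALIS results.
    low, high = wilson_interval(errors=50, n=100)
    print(f"error rate 0.50, 95% interval [{low:.3f}, {high:.3f}]")
```

Unlike the plain normal approximation, this interval stays inside [0, 1] and remains informative for small categories or for error counts of zero, which matters when some noise types have few instances.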

Circularity Check

0 steps flagged

No significant circularity in empirical dataset paper

The paper constructs and releases an empirical benchmark dataset by collecting sentences shared across cousin-language pairs and applying fixed rule-based perturbations (transliteration tables, diacritic removal, homoglyph lists, slang). No derivations, fitted parameters, predictions of held-out quantities, or self-citation chains appear; the central claim is simply that four off-the-shelf LID systems perform poorly on these constructed cases. The result therefore does not depend on the paper's own outputs, and the work receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset introduction paper; abstract describes no mathematical models, fitted parameters, background axioms, or postulated entities.

[1] [1]

Automatic Language Identification in Texts:

Tommi Jauhiainen and Marco Lui and Marcos Zampieri and Timothy Baldwin and Krister Lind. Automatic Language Identification in Texts:. CoRR , volume=. 2018 , url=. 1804.08186 , timestamp=

Pith/arXiv arXiv 2018

[2] [2]

McNamee, Paul , title =. J. Comput. Sci. Coll. , month = feb, pages =. 2005 , issue_date =

2005

[3] [3]

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus , doi =

Caswell, Isaac and Breiner, Theresa and Esch, Daan and Bapna, Ankur , year =. Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus , doi =

[4] [4]

Kreutzer, Julia and Caswell, Isaac and Wang, Lisa and Wahab, Ahsan and van Esch, Daan and Ulzii-Orshikh, Nasanbayar and Tapo, Allahsera and Subramani, Nishant and Sokolov, Artem and Sikasote, Claytone and Setyawan, Monang and Sarin, Supheakmungkol and Samb, Sokhar and Sagot, Benoît and Rivera, Clara and Rios, Annette and Papadimitriou, Isabel and Osei, Sa...

work page doi:10.1162/tacl_a_00447 2022

[5] [5]

When Theory is a Joke: The Weinreich Witticism in Linguistics

Maxwell, Alexander , year =. When Theory is a Joke: The Weinreich Witticism in Linguistics. , volume =

[6] [6]

Chambers, J. K. and Trudgill, Peter , year=. Dialectology , publisher=

[7] [7]

American Anthropologist , volume =

HAUGEN, EINAR , title =. American Anthropologist , volume =. doi:https://doi.org/10.1525/aa.1966.68.4.02a00040 , url =. https://anthrosource.onlinelibrary.wiley.com/doi/pdf/10.1525/aa.1966.68.4.02a00040 , abstract =

work page doi:10.1525/aa.1966.68.4.02a00040 1966

[8] [8]

and Gijsbert Rutten and Rik Vosters

Joseph, John E. and Gijsbert Rutten and Rik Vosters. Dialect, language, nation: 50 years on. Language Policy. 2020. doi:10.1007/s10993-020-09549-x

work page doi:10.1007/s10993-020-09549-x 2020

[9] [9]

2013 , eprint=

Efficient Estimation of Word Representations in Vector Space , author=. 2013 , eprint=

2013

[10] [10]

2013 , eprint=

Distributed Representations of Words and Phrases and their Compositionality , author=. 2013 , eprint=

2013

[11] [11]

and Hyv\"

Gutmann, Michael U. and Hyv\". Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics , year =. J. Mach. Learn. Res. , month = 02, pages =

[12] [12]

Conference on Empirical Methods in Natural Language Processing (EMNLP) , address =

Learning Language Representations for Typology Prediction , author =. Conference on Empirical Methods in Natural Language Processing (EMNLP) , address =

[13] [13]

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , volume=

Uriel and lang2vec: Representing languages as typological, geographical, and phylogenetic vectors , author=. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , volume=

[14] [14]

2013 , doi =

WALS Online (v2020.4) , type =. 2013 , doi =

2013

[15] [15]

Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , month=

Bag of Tricks for Efficient Text Classification , author=. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers , month=. 2017 , publisher=

2017

[16] [16]

An Open Dataset and Model for Language Identification

Burchell, Laurie and Birch, Alexandra and Bogoychev, Nikolay and Heafield, Kenneth. An Open Dataset and Model for Language Identification. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 2023. doi:10.18653/v1/2023.acl-short.75

work page doi:10.18653/v1/2023.acl-short.75 2023

[17] [17]

2016 , eprint=

Neural Machine Translation of Rare Words with Subword Units , author=. 2016 , eprint=

2016

[18] [18]

2020 , eprint=

Adv-BERT: BERT is not robust on misspellings! Generating nature adversarial samples on BERT , author=. 2020 , eprint=

2020

[19] [19]

Proceedings of the International AAAI Conference on Web and Social Media , author=

TweetMotif: Exploratory Search and Topic Summarization for Twitter , volume=. Proceedings of the International AAAI Conference on Web and Social Media , author=. 2010 , month=. doi:10.1609/icwsm.v4i1.14008 , abstractNote=

work page doi:10.1609/icwsm.v4i1.14008 2010

[20] [20]

2020 , eprint=

Detect Language of Transliterated Texts , author=. 2020 , eprint=

2020

[21] [21]

Guillaume Ayoub , title =

[22] [22]

Clark, Dan Garrette, Iulia Turc, and John Wieting

Clark, Jonathan H. and Garrette, Dan and Turc, Iulia and Wieting, John , year=. <scp>Canine</scp>: Pre-training an Efficient Tokenization-Free Encoder for Language Representation , volume=. doi:10.1162/tacl_a_00448 , journal=

work page doi:10.1162/tacl_a_00448

[23]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2019.

[24]

Daniel Jurafsky and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models. 2025.

[25]

Francois Chaubard and Michael Fang and Guillaume Genthial and Rohit Mundra and Richard Socher and Christopher Manning and Richard Socher. 2019.

[26]

NLLB Team and Marta R. Costa-jussà and James Cross and Onur Çelebi and Maha Elbayad and Kenneth Heafield and Kevin Heffernan and Elahe Kalbassi and Janice Lam and Daniel Licht and Jean Maillard and Anna Sun and Skyler Wang and Guillaume Wenzek and Al Youngblood and Bapi Akula and Loic Barrault and Gabriel Mejia Gonzalez and Prangthip Hansanti and John Hof...

[27]

The FLORES-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation.

[28]

Two New Evaluation Datasets for Low-Resource Machine Translation: Nepali-English and Sinhala-English. arXiv preprint arXiv:1902.01382.

[29]

Findings of the WMT 24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet

Kocmi, Tom and Avramidis, Eleftherios and Bawden, Rachel and Bojar, Ondřej and Dvorkovich, Anton and Federmann, Christian and Fishel, Mark and Freitag, Markus and Gowda, Thamme and Grundkiewicz, Roman and Haddow, Barry and Karpinska, Marzena and Koehn, Philipp and Marie, Benjamin and Monz, Christof and Murray, Kenton and Nagata, Masaaki and Popel, Marti...

doi:10.18653/v1/2024.wmt-1.1 2024

[30]

Leipzig Corpora Collection.

[31]

Goldhahn, Dirk and Eckart, Thomas and Quasthoff, Uwe.

[32]

Attention Is All You Need. 2023.

[33]

Heafield, Kenneth. 2011.

[34]

Heafield, Kenneth and Pouzyrevsky, Ivan and Clark, Jonathan H. and Koehn, Philipp. Scalable Modified. 2013.

[35]

Kneser, R. and Ney, H. Improved backing-off for M-gram language modeling.

[36]

Zaratiana, Urchade and Tomeh, Nadi and Holat, Pierre and Charnois, Thierry. GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.naacl-long.300

[37]

Daniele Faraglia.

[38]

Botha, Jan A. and Pitler, Emily and Ma, Ji and Bakalov, Anton and Salcianu, Alex and Weiss, David and McDonald, Ryan and Petrov, Slav. Natural Language Processing with Small Feed-Forward Networks. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. doi:10.18653/v1/D17-1309

[39]

Compact Language Detector v3 (CLD3). GitHub.

[40]

Álvaro Huertas García.

[41]

Georges Labrèche.

[42]

Jindřich Libovický.

[43]

Popel, Martin and Polakova, Lucie and Novák, Michal and Helcl, Jindřich and Libovický, Jindřich and Straňák, Pavel and Krabac, Tomas and Hlavacova, Jaroslava and Anisimova, Mariia and Chlanova, Tereza. Charles Translator: A Machine Translation System between Ukrainian and Czech. Proceedings of the 2024 Joint International Conference on...

2024

[44]

Zhang, Chen and Tao, Mingxu and Huang, Quzhe and Lin, Jiuheng and Chen, Zhibin and Feng, Yansong. MC^2: Towards Transparent and Culturally-Aware NLP for Minority Languages in China. Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.acl-long.479

[45]

Identification of languages with short sample texts. Library and Information Science.

[46]

Language Identifier: A Computer Program for Automatic Natural Language Identification of On-line Text. Proceedings of the 29th Annual Conference of the American Translators Association: Languages at Crossroads.

[47]

Cavnar, William and Trenkle, John. N-Gram-Based Text Categorization.

[48]

Automatic Language Identification in Texts: A Survey. 2018.

[49]

Recurrent-Neural-Network for Language Detection on Twitter Code-Switching Corpus. 2014.

[50]

Jaech, Aaron and Mulcaire, George and Hathi, Shobhit and Ostendorf, Mari and Smith, Noah A. Hierarchical Character- Models for Language Identification. Proceedings of the Fourth International Workshop on Natural Language Processing for Social Media. 2016. doi:10.18653/v1/W16-6212

[51]

Foundations of JSON schema. Proceedings of the 25th International Conference on World Wide Web. 2016.

[52]

Soso Dzamukashvili.

[53]

Fedorova, Mariia and Frydenberg, Jonas Sebulon and Handford, Victoria and Langø, Victoria Ovedie Chruickshank and Willoch, Solveig Helene and Midtgaard, Marthe Løken and Scherrer, Yves and Mæhlum, Petter and Samuel, David. Multi-label Scandinavian Language Identification (SLIDE). Proceedings of the Third Workshop on Resources and Representations for Un...

2025

[54]

Kargaran, Amir Hossein and Imani, Ayyoob and Yvon, Fran. The 2023 Conference on Empirical Methods in Natural Language Processing. 2023.

[55]

Gebre, Binyam Gebrekidan and Zampieri, Marcos and Wittenburg, Peter and Heskes, Tom. Improving Native Language Identification with TF-IDF Weighting. Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications. 2013.

[56]

Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V. and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P. and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E. Scikit-learn: Machine Learning in

[57]

RoBERTa: A Robustly Optimized BERT Pretraining Approach. 2019.

[58]

Goot, Rob Van Der. Identifying Open Challenges in Language Identification. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.891

[59]

Feng, Yi and Li, Chuanyi and He, Jiatong and Hou, Zhenyu and Ng, Vincent. Multimodal Neural Machine Translation: A Survey of the State of the Art. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.1125

[60]

Koehn, Philipp and Hoang, Hieu and Birch, Alexandra and Callison-Burch, Chris and Federico, Marcello and Bertoldi, Nicola and Cowan, Brooke and Shen, Wade and Moran, Christine and Zens, Richard and Dyer, Chris and Bojar, Ondřej and Constantin, Alexandra and Herbst, Evan. Moses: Open Source Toolkit for Statistical Machine Translation. Proceedings of the...

2007

[61]

Burchell, Laurie and de Gibert, Ona and Arefyev, Nikolay and Aulamo, Mikko and Ba. An Expanded Massive Multilingual Dataset for High-Performance Language Technologies (HPLT). Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.854

[62]

Guilherme Penedo and Hynek Kydl. FineWeb2: One Pipeline to Scale Them All. Second Conference on Language Modeling.