LangMAP: A Language-Adaptive Approach to Tokenization

Andrzej Szablewski; Clara Meister; Paula Buttery; Pietro Lesci; Suchir Salhan; Tiago Pimentel

arxiv: 2606.23566 · v2 · pith:JNSEKN7Pnew · submitted 2026-06-22 · 💻 cs.CL

Pith reviewed 2026-06-26 08:24 UTC · model grok-4.3

classification 💻 cs.CL

keywords tokenization, multilingual, UnigramLM, language adaptation, shared vocabulary, morphological alignment, AST alignment

The pith

LangMAP extends UnigramLM to produce language-specific tokenization from a single shared vocabulary, improving boundary alignment across languages without language identifiers at inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes LangMAP, a method to adapt tokenization to individual languages while using one vocabulary for all. It extends the UnigramLM algorithm by incorporating language information during training to learn better segmentations for each language. This matters because language-specific tokenizers usually require separate models or vocabulary changes, which is costly. LangMAP allows this adaptation either when training from scratch or when adapting existing models. It shows improvements in aligning tokens to morphological boundaries in natural languages and to code structure in programming languages.

Core claim

LangMAP Tokenization extends the UnigramLM algorithm to the multilingual setting, producing language-specific tokenization from a single shared vocabulary. Language labels are used only at training time, after which the method performs language-specific segmentation at inference without knowledge of the input language. Across multiple tokenizers and languages, it improves alignment with morphological boundaries and, for coding languages, with abstract syntax tree leaf boundaries.

What carries the argument

Language-adaptive Maximum a Posteriori (LangMAP) Tokenization, which modifies UnigramLM to use language-specific posterior probabilities during training for shared-vocabulary multilingual tokenization.
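
To make the inference step concrete, here is a minimal sketch of Viterbi segmentation under a unigram token model, the decoding procedure UnigramLM-style tokenizers use. It is not the authors' implementation. The toy vocabulary, the helper `make_scores`, and both score tables are made up for illustration, and the text above does not say how LangMAP estimates its scores. The sketch only shows that, with the vocabulary held fixed, the token scores alone decide where boundaries fall.

```python
import math

def viterbi_segment(text, log_probs):
    """Return the highest-scoring segmentation of `text` under a unigram model.

    `log_probs` maps each vocabulary token to its log-probability.
    """
    n = len(text)
    max_len = max(len(tok) for tok in log_probs)
    best = [-math.inf] * (n + 1)  # best[i]: best score of any segmentation of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last token in that segmentation
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs and best[start] + log_probs[piece] > best[end]:
                best[end] = best[start] + log_probs[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    tokens = []
    i = n
    while i > 0:  # follow back-pointers from the end of the string
        tokens.append(text[back[i]:i])
        i = back[i]
    return tokens[::-1]

# Hypothetical shared vocabulary; the scores below are unnormalized toy values.
TOY_VOCAB = ["unable", "un", "able", "u", "n", "a", "b", "l", "e"]

def make_scores(probs, default=0.01):
    return {tok: math.log(probs.get(tok, default)) for tok in TOY_VOCAB}

scores_a = make_scores({"unable": 0.1, "un": 0.05, "able": 0.05})
scores_b = make_scores({"unable": 0.001, "un": 0.2, "able": 0.2})

print(viterbi_segment("unable", scores_a))  # ['unable']
print(viterbi_segment("unable", scores_b))  # ['un', 'able']
```

Both calls use the same vocabulary; only the scores differ, and the second table moves the best path onto the morpheme boundary.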

If this is right

Improves morphological boundary alignment for 9 natural languages across 14 tokenizers.
Improves alignment with AST leaf boundaries for all 9 programming languages tested.
Can be used to adapt a pretrained model's tokenizer to languages without changing the vocabulary.
Improves grammatical acceptability on target languages in fine-tuning experiments.
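
As an illustration of what "morphological boundary alignment" can mean in practice, the sketch below computes a simple boundary recall: the share of gold morpheme boundaries that coincide with token boundaries. This is an assumed, simplified stand-in; the exact definition of the MorphScore recall shown in Figure 2 may differ. The function names and the example words are hypothetical.

```python
def boundaries(pieces):
    """Internal boundary offsets of a segmentation, e.g. ["un", "able"] -> {2}."""
    offsets, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        offsets.add(pos)
    return offsets

def boundary_recall(token_segs, morph_segs):
    """Fraction of gold morpheme boundaries that are also token boundaries."""
    hit = total = 0
    for tokens, morphs in zip(token_segs, morph_segs):
        if "".join(tokens) != "".join(morphs):
            raise ValueError("segmentations must cover the same string")
        gold = boundaries(morphs)
        hit += len(gold & boundaries(tokens))
        total += len(gold)
    return hit / total if total else 0.0

# Made-up gold morpheme segmentations and two made-up tokenizer outputs.
gold = [["un", "able"], ["walk", "ing"]]
tokenizer_a = [["unable"], ["walk", "ing"]]
tokenizer_b = [["un", "able"], ["walk", "ing"]]

print(boundary_recall(tokenizer_a, gold))  # 0.5
print(boundary_recall(tokenizer_b, gold))  # 1.0
```

Recall ignores extra token boundaries, so a tokenizer that splits every character would score 1.0; a precision term would be needed to penalize over-segmentation.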

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models using LangMAP tokenization might achieve better performance on language-specific tasks without needing language-specific models.
Extending this to more languages or tasks could reveal whether the alignment improvements translate to consistent gains in downstream performance.

Load-bearing premise

Language labels provided only during training are enough to learn segmentation rules that stay language-specific at inference time without any language information.

What would settle it

An experiment showing that LangMAP tokenization does not produce higher alignment scores with morphological or AST boundaries than standard UnigramLM on the same languages and data.
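
For the code side of such an experiment, one possible proxy is sketched below. It uses Python's standard `ast` module to find leaf nodes and reports the share of leaves that no subword boundary cuts through. Both the metric (`intact_leaf_rate`) and the choice of parser are assumptions made for illustration; the text above does not specify how the paper extracts AST leaves or scores alignment. The sketch assumes single-line ASCII source so that column offsets equal string indices.

```python
import ast

def ast_leaf_spans(source):
    """Column spans of AST nodes that carry a position and have no positioned children."""
    spans = []
    for node in ast.walk(ast.parse(source)):
        if not hasattr(node, "col_offset"):
            continue  # nodes such as Module, Load, Add carry no position
        if any(hasattr(c, "col_offset") for c in ast.iter_child_nodes(node)):
            continue  # interior node, not a leaf
        spans.append((node.col_offset, node.end_col_offset))
    return sorted(spans)

def intact_leaf_rate(source, pieces):
    """Fraction of AST leaves with no subword boundary strictly inside them."""
    if "".join(pieces) != source:
        raise ValueError("pieces must concatenate to the source")
    cuts, pos = set(), 0
    for piece in pieces[:-1]:
        pos += len(piece)
        cuts.add(pos)
    leaves = ast_leaf_spans(source)
    intact = sum(1 for start, end in leaves
                 if not any(start < c < end for c in cuts))
    return intact / len(leaves) if leaves else 0.0

source = "total = count + 1"
aligned = ["total", " =", " count", " +", " 1"]       # made-up segmentation
misaligned = ["tot", "al =", " cou", "nt +", " 1"]    # made-up segmentation

print(ast_leaf_spans(source))                 # [(0, 5), (8, 13), (16, 17)]
print(intact_leaf_rate(source, aligned))      # 1.0
print(intact_leaf_rate(source, misaligned))   # 0.3333333333333333
```

Operators and keywords have no position in Python's `ast`, so this proxy only sees identifiers and literals; a concrete-syntax parser would cover more leaves.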

Figures

Figures reproduced from arXiv: 2606.23566 by Andrzej Szablewski, Clara Meister, Paula Buttery, Pietro Lesci, Suchir Salhan, Tiago Pimentel.

**Figure 2.** LangMAP MorphScore recall as a function of the original tokenizer’s MorphScore recall.

**Figure 3.** Performance on downstream tasks of the base model vs. models fine-tuned on text data for ...

Original abstract

Language-specific tokenizers improve tokenization quality and the downstream performance of models on those languages. However, using such a tokenizer comes at a cost: either a new model must be trained from scratch, or the vocabulary of an existing pretrained model must be adapted. We propose Language-adaptive Maximum a Posteriori (LangMAP) Tokenization, a tokenization scheme that extends the UnigramLM algorithm to the multilingual setting, producing language-specific tokenization from a single shared vocabulary. Notably, LangMAP can be used when training a multilingual language model from scratch or to adapt a pretrained model's tokenizer to individual languages without changing its vocabulary. While language labels are required at training time, a key feature of the algorithm is that it then performs language-specific tokenization at inference without knowledge of the input's language. Across 14 open-source tokenizers, 9 natural languages, and 9 programming languages, LangMAP improves morphological boundary alignment and, for all coding languages tested, alignment with abstract syntax tree (AST) leaf boundaries. In fine-tuning experiments, results are mixed: LangMAP improves target-language grammatical acceptability (MultiBLiMP) on the languages tested; its benefits are less consistent on knowledge-related tasks (Global-PIQA, Belebele).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LangMAP gives a clean way to train once and segment language-specifically at inference from a shared vocab, with solid alignment gains, but the mixed downstream results and the implicit disambiguation claim are the parts that need checking.

LangMAP extends UnigramLM by adding language-specific MAP estimation during training so a single model produces different segmentations at test time without any language ID. That is the main new piece.

The work does a few things right. It tests the method on 14 existing tokenizers, 9 natural languages, and 9 programming languages, measuring both morphological boundaries and AST leaf alignment for code. The gains on boundary alignment look consistent in the abstract. Using language labels only at training time and then dropping them at inference is a practical constraint that matches real deployment, and the paper shows this can be done without retraining the whole model or swapping vocabularies.

The soft spots are in the downstream picture and in how much the inference-time behavior actually depends on the input string alone. Fine-tuning results are mixed: clear lift on MultiBLiMP grammatical acceptability but less consistent on Global-PIQA and Belebele. That suggests the tokenization improvement does not always translate to model quality. The central assumption—that the learned probabilities will automatically route to the right language mode from raw text—looks plausible for clean monolingual inputs but is less secure for code-switched or ambiguous cases; the abstract does not report targeted tests on those.

The paper is aimed at people who build or adapt multilingual tokenizers and want to avoid vocabulary surgery. A reader working on tokenizer design or efficiency would find the algorithmic extension and the broad evaluation useful. It is coherent on its own terms and engages the existing UnigramLM literature directly, so it deserves a serious referee even though some claims will need tighter evidence on the inference mechanism and on statistical controls.

Referee Report

2 major / 1 minor

Summary. The manuscript presents LangMAP, an extension of the UnigramLM tokenization algorithm to the multilingual setting via language-adaptive maximum a posteriori estimation. Language labels are used only at training time, with a single shared vocabulary; the resulting tokenizer then performs language-specific segmentation at inference time without any language identifier. The authors evaluate morphological boundary alignment for 9 natural languages and AST leaf boundary alignment for 9 programming languages, reporting improvements across 14 open-source tokenizers. Downstream fine-tuning shows gains on grammatical acceptability (MultiBLiMP) but mixed outcomes on knowledge-related tasks (Global-PIQA, Belebele).

Significance. Should the central claims be verified, this work would represent a meaningful advance in multilingual tokenization by reducing the need for language-specific models or vocabulary adaptations. The approach maintains a single vocabulary while achieving adaptive behavior, which could streamline multilingual model development. The extensive evaluation across numerous tokenizers and languages provides a solid basis for assessing generalizability, although the mixed fine-tuning results indicate that benefits may vary by task type. The lack of new free parameters is a positive aspect of the derivation.

major comments (2)

[Abstract] The claim that language labels supplied only at training time suffice for language-specific segmentation at inference without any language identifier is load-bearing for the central contribution. The description provides no mechanism details on how the language-adaptive MAP estimation encodes distinct language modes in the shared probability distributions such that Viterbi segmentation reliably selects language-appropriate boundaries rather than a compromise solution.
[Fine-tuning experiments] The reported mixed results (improvement on MultiBLiMP but less consistent on Global-PIQA and Belebele) weaken the downstream performance claim. Without statistical tests, data split details, or controls for post-hoc choices, it is not possible to determine whether the alignment gains translate to reliable task improvements.

minor comments (1)

The abstract would benefit from explicitly stating the total number of languages (9 natural + 9 programming) and confirming whether the 14 tokenizers include both natural and code variants.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract] The claim that language labels supplied only at training time suffice for language-specific segmentation at inference without any language identifier is load-bearing for the central contribution. The description provides no mechanism details on how the language-adaptive MAP estimation encodes distinct language modes in the shared probability distributions such that Viterbi segmentation reliably selects language-appropriate boundaries rather than a compromise solution.

Authors: We agree that the mechanism requires clearer exposition. LangMAP performs language-conditioned MAP estimation during training: for each language, the posterior over token probabilities is updated using language-specific data likelihoods while sharing the vocabulary across languages. This produces a single set of token scores in which language-specific segmentation preferences are embedded as higher probabilities for language-appropriate subwords. At inference, standard Viterbi decoding on these scores selects boundaries that empirically match language-specific patterns without an explicit language ID, as the optimization avoids compromise solutions by construction. We will revise the abstract to include a concise statement of this process and add a dedicated paragraph in Section 3 explaining the encoding of language modes. revision: yes
Referee: [Fine-tuning experiments] The reported mixed results (improvement on MultiBLiMP but less consistent on Global-PIQA and Belebele) weaken the downstream performance claim. Without statistical tests, data split details, or controls for post-hoc choices, it is not possible to determine whether the alignment gains translate to reliable task improvements.

Authors: The referee correctly identifies that the mixed downstream results and lack of statistical reporting limit the strength of any performance claims. The primary contribution of the work is the tokenization alignment improvements, which are consistent across languages and tokenizers; downstream fine-tuning serves as a secondary evaluation. We will revise the fine-tuning section to add statistical significance tests, specify data splits and random seeds, include controls for post-hoc analysis, and rephrase claims to accurately reflect task-dependent variability rather than implying broad gains. revision: yes
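
One way to supply the kind of statistical test the referee asks for is a paired sign-flip permutation test on per-item scores from two systems evaluated on the same items. The sketch below is a generic illustration under that assumption, not a procedure taken from the paper; the function name, the resample count, and the example scores are all made up.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Two-sided p-value for the mean paired difference, via random sign flips."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired scores must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    extreme = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped / len(diffs)) >= observed:
            extreme += 1
    # Add-one smoothing keeps the estimated p-value strictly positive.
    return (extreme + 1) / (n_resamples + 1)

# Hypothetical per-item correctness (1 = correct) for two systems on 12 shared items.
adapted = [1] * 12
baseline = [0] * 12
print(paired_permutation_test(adapted, baseline) < 0.05)  # True
print(paired_permutation_test(adapted, adapted))          # 1.0
```

Pairing by item matters here: both systems see the same test items, so an unpaired test would discard the item-level correlation and lose power.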

Circularity Check

0 steps flagged

No significant circularity in LangMAP derivation

The paper presents LangMAP as an algorithmic extension of the established UnigramLM tokenization method, incorporating language labels solely during training to produce a shared vocabulary whose inference-time segmentation (via Viterbi or equivalent) yields language-specific boundaries without explicit language IDs. All reported gains are measured on external, independent benchmarks: morphological boundary alignment across 14 tokenizers and 9 languages, AST leaf alignment for code, and downstream task performance on MultiBLiMP, Global-PIQA, and Belebele. No derivation step equates a claimed output to a fitted parameter or self-citation by construction; the central inference behavior is an empirical property of the learned model rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.
