Multilingual Tokenization through the Lens of Indian Languages: Challenges and Insights

Ganesh Ramakrishnan; Maharaj Brahma; Maunendra Sankar Desarkar; Nagasai Saketh Naidu; N J Karthika; Rajat Verma; Rohit Saluja

arxiv: 2506.17789 · v3 · pith:V4AU7NUPnew · submitted 2025-06-21 · 💻 cs.CL

Maharaj Brahma, N J Karthika, Rajat Verma, Nagasai Saketh Naidu, Rohit Saluja, Maunendra Sankar Desarkar, Ganesh Ramakrishnan

classification 💻 cs.CL

keywords languages, tokenization, multilingual, vocabulary, construction, indian, language, linguistically

Tokenization plays a pivotal role in NLP and is fundamental to training language models. However, existing tokenizers are often skewed towards high-resource languages, limiting their effectiveness for linguistically diverse and morphologically rich languages such as those in the Indian subcontinent. In this work, we present a comprehensive empirical study of multilingual tokenization across 17 Indic languages spanning 11 scripts and two language families. We systematically evaluate the effects of (i) widely used subword algorithms: BPE and Unigram LM, (ii) script and orthography-aware normalization, (iii) vocabulary size, and (iv) multilingual vocabulary construction strategies. We use a combination of intrinsic and extrinsic evaluations to obtain the following observations: (i) script-specific normalization improves tokenization quality, (ii) Unigram LM better preserves morphological boundaries than BPE, (iii) cluster-based vocabulary construction shows improvement in downstream tasks compared to the joint method. Our findings highlight the importance of linguistically informed design choices in multilingual tokenization and offer practical guidance for building effective tokenizers for low-resource and morphologically complex languages.
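
As a rough illustration of the kind of intrinsic evaluation the abstract refers to, the sketch below applies Unicode NFC normalization and then computes fertility (average tokens per whitespace-delimited word). This is a minimal sketch under stated assumptions: the abstract does not say which normalization rules or intrinsic metrics the authors use, NFC is only a generic stand-in for script-specific normalization, and the names `normalize_text`, `fertility`, and `char_tokenize` are hypothetical helpers introduced here.

```python
import unicodedata
from typing import Callable, Iterable, List


def normalize_text(text: str) -> str:
    """Map canonically equivalent Unicode sequences to a single form (NFC).

    Assumption: a generic stand-in for script-aware normalization,
    not the normalization pipeline used in the paper.
    """
    return unicodedata.normalize("NFC", text)


def fertility(sentences: Iterable[str],
              tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for sentence in sentences:
        sentence = normalize_text(sentence)
        total_words += len(sentence.split())
        total_tokens += len(tokenize(sentence))
    if total_words == 0:
        raise ValueError("no words found in the input sentences")
    return total_tokens / total_words


def char_tokenize(text: str) -> List[str]:
    """Toy character-level tokenizer, used only to exercise the metric."""
    return [ch for ch in text if not ch.isspace()]


if __name__ == "__main__":
    # Devanagari QA written as one code point vs. KA + NUKTA:
    # both spellings compare equal after normalization.
    print(normalize_text("\u0958") == normalize_text("\u0915\u093c"))
    print(fertility(["ab cd"], char_tokenize))
```

In practice `tokenize` would wrap a trained BPE or Unigram LM tokenizer, so the same function can compare algorithms or vocabulary sizes on identical text. Lower fertility means words are split into fewer pieces, although it says nothing by itself about whether the splits fall on morphological boundaries.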

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Pāṇinian Foundation for Indic Language Processing
cs.CL 2026-06 unverdicted novelty 5.0

Proposes treating Pāṇini's Aṣṭādhyāyī as a unifying computational architecture and benchmark foundation for Indic language NLP to improve accuracy, data efficiency, and transfer.