Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
4 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.CL (4)
Years: 2026 (4)
Citing papers:
- LangMAP: A Language-Adaptive Approach to Tokenization
  LangMAP adapts UnigramLM for multilingual use to deliver language-specific tokenization from a shared vocabulary, boosting boundary alignment metrics across natural and programming languages with mixed downstream fine-tuning gains.
- Modular Monolingual Adaptation using Pretrained Language Models
  Replacing tokens, freezing the corresponding embeddings, and tuning the rest of the model improves NLU performance on low-resource languages compared to full fine-tuning.
- Compute Optimal Tokenization
- Copy First, Translate Later: Interpreting Translation Dynamics in Multilingual Pretraining