arxiv: 2606.06738 · v1 · pith:OLSVQJKInew · submitted 2026-06-04 · 💻 cs.CL

Modular Monolingual Adaptation using Pretrained Language Models

Pith reviewed 2026-06-28 01:08 UTC · model grok-4.3

classification 💻 cs.CL
keywords: monolingual adaptation · low-resource languages · pretrained language models · modular approach · token replacement · embedding freezing · NLU tasks

The pith

Replacing tokens, freezing embeddings, and tuning the rest adapts pretrained models better to low-resource languages than full finetuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether full model finetuning is needed to adapt pretrained language models to low-resource languages. Instead, it proposes a modular method: replace the language-specific tokens, freeze the new embeddings, and tune only the remaining parts of the model. Tests on Scottish Gaelic, Irish, and Quechua demonstrate improved performance on mask filling, named entity recognition, and part-of-speech tagging tasks. This suggests more efficient knowledge transfer for languages with scarce data.

Core claim

Replacing tokens, freezing the corresponding embeddings, and tuning the rest of the model, rather than tuning the entire model, yields better results on natural language understanding tasks when adapting to low-resource languages.

What carries the argument

Modular adaptation through token replacement and embedding freezing while selectively tuning model parameters.
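
The frozen-versus-tuned split described above can be sketched as a partition over parameter names. This is a minimal illustration, not the authors' code: the parameter names, the sizes, and the choice to freeze only the token-embedding matrix are all assumptions made for the example, and `partition` is a hypothetical helper.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical parameter names and sizes for a small encoder (assumption:
# a real pretrained model exposes its own names and shapes).
PARAMS: Dict[str, int] = {
    "embeddings.word_embeddings.weight": 30000 * 768,
    "embeddings.position_embeddings.weight": 512 * 768,
    "encoder.layer.0.attention.weight": 768 * 768,
    "encoder.layer.0.ffn.weight": 768 * 3072,
    "lm_head.bias": 30000,
}


def partition(
    params: Dict[str, int], frozen_prefixes: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split parameter names into (frozen, tuned) by name prefix."""
    prefixes = tuple(frozen_prefixes)
    frozen = sorted(name for name in params if name.startswith(prefixes))
    tuned = sorted(name for name in params if name not in frozen)
    return frozen, tuned


# Assumption: only the replaced token embeddings are frozen; everything
# else (position embeddings, encoder layers, head) is tuned.
frozen, tuned = partition(PARAMS, ["embeddings.word_embeddings"])
tuned_share = sum(PARAMS[n] for n in tuned) / sum(PARAMS.values())
print("frozen:", frozen)
print("tuned:", tuned)
print(f"share of parameters still trained: {tuned_share:.2%}")
```

In a real training loop the frozen names would be the ones excluded from gradient updates; the prefix list is the only knob that distinguishes the modular setup from full finetuning in this sketch.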

If this is right

  • The modular approach can be more effective than full tuning for low-resource language adaptation.
  • It works for very low-resource cases such as Quechua with 8.5k training instances.
  • Analysis shows the importance of training strategies and pretrained embedding choices.
  • Performance gains are observed on mask filling, NER, and POS tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such modularity may reduce computational costs for adapting models to many languages.
  • The results imply that preserving original embeddings helps retain cross-lingual knowledge.
  • Similar freezing strategies could be explored for other layers in future adaptations.

Load-bearing premise

Freezing the embeddings after token replacement is enough to retain useful knowledge from the original model, without needing embedding updates and without introducing new interference.

What would settle it

If experiments show that updating all parameters including embeddings leads to higher accuracy on the NLU tasks for these languages, the modular claim would be challenged.
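
One way to make such a comparison concrete is a paired bootstrap over per-example correctness of the two systems on the same test set. The sketch below is an assumption about how the test could be run, not something the paper reports; `paired_bootstrap` is a hypothetical helper, and the 0/1 lists in the usage line are synthetic placeholders, not results.

```python
import random
from typing import Sequence


def paired_bootstrap(
    correct_a: Sequence[int],
    correct_b: Sequence[int],
    n_resamples: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of resamples in which system A is strictly more accurate than B.

    Both inputs are 0/1 correctness indicators over the same test examples.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same examples")
    rng = random.Random(seed)
    n = len(correct_a)
    wins = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(correct_a[i] for i in idx) > sum(correct_b[i] for i in idx):
            wins += 1
    return wins / n_resamples


# Synthetic placeholder indicators: A is right everywhere, B nowhere.
print(paired_bootstrap([1, 1, 1, 1], [0, 0, 0, 0]))  # prints 1.0
```

Here A would be the full-finetuning run and B the modular run; a win fraction close to 1.0 across tasks and languages would count against the modular claim.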

Figures

Figures reproduced from arXiv: 2606.06738 by Nalin Kumar, Ondřej Dušek.

Figure 1: Overview of our proposed method. We first ...
Figure 2: The graph shows the weight differences be...

Abstract

Building monolingual language models (LMs) for low-resource languages typically relies on adapting pretrained language models (PLMs) by finetuning the whole model on the target language. This approach is widely favored over training from scratch, as it enables effective knowledge transfer. Additionally, prior work has shown that using a language-specific tokenizer can enhance the adaptability. In this work, we hypothesize that full model tuning is often unnecessary and propose a more modular approach. Specifically, we replace the tokens, freeze the corresponding embeddings, and tune the rest of the model. We use Scottish Gaelic, Irish, and Quechua for our experiments, with Quechua being a very low-resource language (8.5k training instances). Evaluation on natural language understanding (NLU) tasks -- mask filling, NER, and POS -- shows that our proposed approach improves performance when adapting models to low-resource languages. Additionally, we provide a comprehensive analysis of the effectiveness of training strategies, the choice of pretrained embeddings, and models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a modular approach to adapting pretrained language models (PLMs) to low-resource languages by replacing tokens with language-specific ones, freezing the corresponding embeddings, and fine-tuning only the remaining model parameters. Experiments are conducted on Scottish Gaelic, Irish, and Quechua (the latter with only 8.5k training instances), evaluating on NLU tasks including mask filling, NER, and POS tagging. The central claim is that this method improves performance over full-model fine-tuning, with additional analysis of training strategies, embedding choices, and model selections.

Significance. If the results hold after addressing the noted gaps, the work would indicate that full fine-tuning is often unnecessary for monolingual PLM adaptation in low-resource settings, potentially offering efficiency gains while better preserving pretrained knowledge. The focus on a very low-resource case (Quechua) and multiple tasks provides a relevant testbed for modular adaptation techniques.

major comments (2)
  1. [analysis of training strategies and embedding choices] The central claim depends on the assumption that freezing new embeddings after token replacement is sufficient to preserve transfer without harmful interference or the need for language-specific updates. However, the analysis of training strategies and embedding choices does not include a direct frozen-vs-unfrozen ablation under identical token replacement, which is required to validate this for Quechua's 8.5k-instance regime where upper layers may not compensate.
  2. [abstract and experimental evaluation] The abstract asserts that the proposed approach 'improves performance' on the NLU tasks but supplies no quantitative results, baselines, effect sizes, or statistical tests. If the results section lacks these controls and comparisons to full fine-tuning (and to token replacement without freezing), the empirical support for the central claim cannot be assessed.
minor comments (1)
  1. [method] The method description would benefit from a diagram or pseudocode illustrating the token replacement and which parameters are frozen vs. tuned.
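
The frozen-versus-unfrozen ablation requested in major comment 1 can be laid out as a run grid in which token replacement is held fixed and only the freezing flag varies. This is a sketch of the experimental design only; the dictionary keys and the `ablation_grid` helper are hypothetical, and no results are implied.

```python
from itertools import product
from typing import Dict, List, Union

# Languages and tasks as named in the paper's abstract.
LANGUAGES = ["Scottish Gaelic", "Irish", "Quechua"]
TASKS = ["mask filling", "NER", "POS"]

Run = Dict[str, Union[str, bool]]


def ablation_grid() -> List[Run]:
    """Enumerate runs that differ only in whether the new embeddings are frozen."""
    runs: List[Run] = []
    for language, task, freeze in product(LANGUAGES, TASKS, (True, False)):
        runs.append(
            {
                "language": language,
                "task": task,
                "token_replacement": True,  # held fixed across the grid
                "freeze_new_embeddings": freeze,  # the single factor under test
            }
        )
    return runs


for run in ablation_grid():
    print(run)
```

Each language and task pair then contributes exactly two runs, so any difference between them can be attributed to freezing rather than to the tokenizer.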

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. We address each major comment below, clarifying the manuscript's content and outlining planned revisions where appropriate.

  1. Referee: [analysis of training strategies and embedding choices] The central claim depends on the assumption that freezing new embeddings after token replacement is sufficient to preserve transfer without harmful interference or the need for language-specific updates. However, the analysis of training strategies and embedding choices does not include a direct frozen-vs-unfrozen ablation under identical token replacement, which is required to validate this for Quechua's 8.5k-instance regime where upper layers may not compensate.

    Authors: We appreciate the referee's emphasis on this distinction. The manuscript's analysis of training strategies explicitly compares the proposed modular approach (token replacement followed by freezing the new embeddings while tuning the remainder) against full fine-tuning after identical token replacement. The latter case updates the new embeddings and thus serves as the unfrozen counterpart. These comparisons are reported for all languages, including the 8.5k-instance Quechua setting. To make the frozen-vs-unfrozen contrast even more explicit, we will add a dedicated ablation subsection isolating this factor in the revised version. revision: yes

  2. Referee: [abstract and experimental evaluation] The abstract asserts that the proposed approach 'improves performance' on the NLU tasks but supplies no quantitative results, baselines, effect sizes, or statistical tests. If the results section lacks these controls and comparisons to full fine-tuning (and to token replacement without freezing), the empirical support for the central claim cannot be assessed.

    Authors: We agree that the abstract would be strengthened by including concrete quantitative support. In revision we will update the abstract to report key performance deltas versus full fine-tuning on mask filling, NER, and POS tagging, along with the primary baselines. The results section already presents direct comparisons to full fine-tuning (which encompasses token replacement without freezing the new embeddings) across the three languages and tasks; we will ensure effect sizes are highlighted and will add statistical significance markers where they are missing. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical adaptation method with independent experimental validation

The paper proposes replacing tokens, freezing the new embeddings, and tuning the rest of the model for PLM adaptation to low-resource languages, then evaluates this on mask filling, NER, and POS tasks for Scottish Gaelic, Irish, and Quechua. No mathematical derivation chain, equations, or fitted parameters renamed as predictions exist. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim rests on direct empirical comparisons rather than any self-referential construction, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and introduces no mathematical axioms, free parameters, or new entities; the central claim rests entirely on the experimental comparison described in the abstract.
