From Curated Data to Scalable Models: Continual Pre-training of Dense and MoE Large Language Models for Tibetan
Pith reviewed 2026-05-19 04:52 UTC · model grok-4.3
The pith
New Tibetan LLMs from 72 GB data curation surpass existing ones
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a newly constructed 72 GB high-quality Tibetan corpus enables effective continual pre-training of Qwen2.5-7B in a balanced multilingual setting with Chinese and English, followed by instruction tuning. It further claims that the resulting dense model and its 50B-A10B MoE extension consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks, as measured on newly constructed evaluation datasets built via high-quality translation and human verification.
What carries the argument
The balanced multilingual continual pre-training strategy that integrates the large Tibetan corpus with Chinese and English data, and the extension of the dense model to a Mixture-of-Experts architecture for efficient scaling.
If this is right
- The new models achieve better results on Tibetan language tasks than prior systems.
- This method can be transferred to develop LLMs for other low-resource languages.
- Releasing the model weights, benchmarks, and data documentation will facilitate community progress in Tibetan language modeling.
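- MoE scaling increases total model capacity while keeping per-token compute well below that of an equally large dense model (the 50B-A10B name suggests that roughly 10B of the 50B parameters are active per token).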
- Custom benchmarks fill the gap left by the lack of standardized Tibetan evaluations.
Where Pith is reading between the lines
- Data curation at this scale may be the key bottleneck for low-resource language modeling rather than model size alone.
- The approach highlights the value of multilingual mixing to prevent catastrophic forgetting during adaptation.
- Future work could test this pipeline on additional model bases or incorporate more languages.
- Improved Tibetan LLMs could support applications in education, translation, and cultural preservation for Tibetan speakers.
Load-bearing premise
The 72 GB high-quality Tibetan corpus is representative and sufficiently clean to support continual pre-training without introducing harmful biases or noise that would hurt model performance.
What would settle it
If the continually pre-trained dense and MoE models fail to show higher performance than existing Tibetan-focused models on the translated and human-verified evaluation datasets, the central claim would be falsified.
Original abstract
Large language models (LLMs) have achieved remarkable success across a wide range of natural language processing tasks, yet their performance remains heavily biased toward high-resource languages. Tibetan, despite its cultural significance and large speaker population, is still substantially underrepresented. In this work, we present a comprehensive pipeline for advancing Tibetan language modeling through large-scale data curation and continual pre-training. We construct a 72 GB high-quality Tibetan corpus, the largest to date, and adapt Qwen2.5-7B through balanced multilingual continual pre-training with Tibetan, Chinese, and English, followed by multilingual instruction tuning. To further scale capacity efficiently, we extend the dense model to a 50B-A10B Mixture-of-Experts architecture. Due to the absence of standardized Tibetan benchmarks, we build multiple evaluation datasets via high-quality translation and human verification. Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models of similar scale across diverse tasks. Our work advances Tibetan-centric LLM research and provides transferable insights for extending LLMs to other low-resource languages. We will release the model weights, evaluation benchmarks, and detailed data processing documentation in the follow-up.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a pipeline for Tibetan LLM development: curation of a 72 GB high-quality Tibetan corpus (largest to date), continual pre-training of Qwen2.5-7B with balanced Tibetan/Chinese/English data followed by multilingual instruction tuning, and extension to a 50B-A10B MoE model. Due to lack of standardized benchmarks, custom evaluation sets are created via high-quality translation plus human verification. The central empirical claim is that both the resulting dense and MoE models consistently outperform prior open-source and Tibetan-focused models of similar scale on diverse tasks.
Significance. If the outperformance claims hold under rigorous scrutiny, the work is significant for low-resource language modeling: it supplies the largest public Tibetan corpus, demonstrates scalable MoE adaptation for Tibetan, and releases models, benchmarks, and data-processing documentation. These contributions could serve as a template for other low-resource languages and improve reproducibility in the subfield.
major comments (2)
- [Evaluation datasets] Evaluation datasets section: the claim of consistent outperformance rests on newly constructed benchmarks produced by translation plus human verification. The manuscript provides no quantitative details on translation quality (e.g., BLEU or human preference scores against references), verifier qualifications (native-speaker status, number of annotators), or inter-annotator agreement. Without these, it is impossible to rule out systematic artifacts that could favor models trained on similarly translated or curated data, directly undermining the headline result.
- [Continual pre-training and MoE extension] Continual pre-training and MoE extension sections: the balanced multilingual mixing strategy and the precise upcycling procedure from the 7B dense checkpoint to the 50B-A10B MoE are described at a high level only. Missing are the exact token ratios, learning-rate schedules, and expert-routing hyperparameters that would allow assessment of whether the reported gains are attributable to the Tibetan data or to other training choices.
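The inter-annotator agreement requested in the first major comment can be reported with a chance-corrected statistic. The block below is a minimal sketch of Fleiss' kappa, one common choice when more than two annotators label each item. The function name, the two-category accept/reject labelling, and the toy counts are illustrative assumptions and do not come from the paper.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    annotators who assigned item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two ratings per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_cats = len(counts[0])

    # Share of all ratings that fell into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        # Degenerate case: all ratings in one category, agreement is total.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical example: 4 translated items, 3 annotators, labels (accept, reject).
toy_counts = [[3, 0], [0, 3], [3, 0], [0, 3]]
kappa = fleiss_kappa(toy_counts)
```

In this toy table every annotator agrees on every item, so the statistic is 1. Values near 0 would indicate agreement no better than chance, which is the situation the referee wants the authors to rule out.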
minor comments (2)
- [Abstract / Conclusion] The abstract states that model weights, benchmarks, and documentation will be released, yet no dedicated section or appendix specifies the exact release contents, licensing, or hosting location.
- [Model architecture] Notation for the MoE model (50B-A10B) is introduced without an explicit definition of active vs. total parameters or a comparison table against the dense baseline.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify areas for improvement in our manuscript. We address each major comment below and indicate planned revisions.
- Referee: [Evaluation datasets] Evaluation datasets section: the claim of consistent outperformance rests on newly constructed benchmarks produced by translation plus human verification. The manuscript provides no quantitative details on translation quality (e.g., BLEU or human preference scores against references), verifier qualifications (native-speaker status, number of annotators), or inter-annotator agreement. Without these, it is impossible to rule out systematic artifacts that could favor models trained on similarly translated or curated data, directly undermining the headline result.
  Authors: We agree that additional quantitative details on the evaluation dataset construction would improve transparency and help rule out potential artifacts. In the revised manuscript, we will expand the Evaluation datasets section to report translation quality metrics (including BLEU scores against reference translations and human preference scores), verifier qualifications (all native Tibetan speakers with relevant annotation experience, with three annotators per sample), and inter-annotator agreement (e.g., Fleiss' kappa). These additions will directly address the concern and strengthen the evaluation claims.
  Revision: yes
- Referee: [Continual pre-training and MoE extension] Continual pre-training and MoE extension sections: the balanced multilingual mixing strategy and the precise upcycling procedure from the 7B dense checkpoint to the 50B-A10B MoE are described at a high level only. Missing are the exact token ratios, learning-rate schedules, and expert-routing hyperparameters that would allow assessment of whether the reported gains are attributable to the Tibetan data or to other training choices.
  Authors: We appreciate the referee's call for greater specificity to support reproducibility and attribution of results. While the original manuscript prioritized a high-level description of the pipeline, we will revise the Continual pre-training and MoE extension sections to include the exact token ratios (50% Tibetan, 30% Chinese, 20% English), detailed learning-rate schedules (including peak rate, warmup, and cosine decay), and expert-routing hyperparameters (top-2 routing with 8 experts and capacity factor of 1.25). This will clarify the training choices and better isolate the contribution of the Tibetan corpus.
  Revision: yes
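To make the hyperparameters in the second response concrete, the block below is a rough sketch of how a 50/30/20 language mixture and a per-expert token capacity could be expressed. The numbers are the ones quoted in the simulated rebuttal above, not values confirmed by the paper. The function names, the document-level sampling, and the capacity formula (capacity factor times top-k times tokens divided by experts, one common convention in MoE implementations) are assumptions made for illustration.

```python
import math
import random
from collections import Counter

# Ratios quoted in the simulated rebuttal; the paper itself may differ.
MIX = {"tibetan": 0.5, "chinese": 0.3, "english": 0.2}


def sample_languages(n_docs: int, seed: int = 0) -> Counter:
    """Draw the source language of n_docs training documents according to MIX.

    Hypothetical helper: a real pipeline would more likely mix by token
    count than by document count.
    """
    rng = random.Random(seed)
    langs = list(MIX)
    weights = [MIX[lang] for lang in langs]
    return Counter(rng.choices(langs, weights=weights, k=n_docs))


def expert_capacity(
    tokens_per_batch: int,
    num_experts: int = 8,
    top_k: int = 2,
    capacity_factor: float = 1.25,
) -> int:
    """Maximum number of token slots each expert accepts per batch.

    Assumed convention: each token is routed to top_k experts, the load is
    split evenly in the ideal case, and capacity_factor adds headroom.
    """
    return math.ceil(capacity_factor * top_k * tokens_per_batch / num_experts)


draws = sample_languages(10_000)
slots = expert_capacity(4096)
```

With 4096 tokens per batch, 8 experts, top-2 routing, and a capacity factor of 1.25, this convention gives 1280 slots per expert. Tokens routed to an expert beyond that limit would be dropped or rerouted, depending on the implementation.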
Circularity Check
No circularity: empirical pipeline with independent evaluation
The paper reports data curation of a 72 GB Tibetan corpus, continual pre-training of Qwen2.5-7B (dense and MoE variants), and results on newly translated evaluation sets. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation. Claims rest on external training runs and human-verified translations rather than reducing to the authors' own inputs by construction. This is the standard non-circular outcome for an empirical low-resource language modeling paper.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Continual pre-training on domain-specific data improves model performance on that domain.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We construct a 72 GB high-quality Tibetan corpus... adapt Qwen2.5-7B through balanced multilingual continual pre-training... build multiple evaluation datasets via high-quality translation and human verification.
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Experimental results show that both dense and MoE models consistently outperform existing open-source and Tibetan-focused models
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
- From Insight to Action: A Novel Framework for Interpretability-Guided Data Selection in Large Language Models
  IGDS uses sparse autoencoders to find internal task features in LLMs and selects data that maximally activates them, yielding better math reasoning performance than full-dataset fine-tuning with only half the data.
- LLiMba: Sardinian on a Single GPU -- Adapting a 3B Language Model to a Vanishing Romance Language
  Qwen2.5-3B was continued-pretrained and then fine-tuned with rsLoRA r256 on Sardinian data to reach 28.5 BLEU into the language, outperforming full fine-tuning and other LoRA variants.