Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
Pith reviewed 2026-05-15 00:44 UTC · model grok-4.3
The pith
Fine-tuning rescues three LLMs from failing at Romanized Nepali generation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-shot prompting produces architecture-specific failures in generating Romanized Nepali for Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. After QLoRA fine-tuning with rsLoRA on a 10,000-sample bilingual dataset, all three models resolve these failures and reach a BERTScore of approximately 0.75 and chrF++ above 23. Qwen3-8B is the only model that yields semantically relevant zero-shot output and leads all structural alignment metrics after supervised fine-tuning, while Llama-3.1-8B shows the largest absolute gains, confirming the adaptation-headroom hypothesis for weaker baselines.
What carries the argument
QLoRA with rsLoRA at rank 32 applied to the curated 10,000-sample bilingual transliterated instruction-following dataset, evaluated across perplexity, BERTScore, chrF++, ROUGE variants, and BLEU to compare zero-shot and post-tuning outputs.
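The scaling change that rsLoRA introduces is small enough to state inline. A minimal sketch (not the paper's code; the LoRA alpha value below is an illustrative assumption, since it is not reported here):

```python
import math

r, alpha = 32, 16  # r=32 matches the study; alpha is an assumed placeholder

# Classic LoRA multiplies the low-rank update B @ A by alpha / r, so the
# effective update shrinks as rank grows. Rank-stabilized LoRA (rsLoRA)
# rescales by alpha / sqrt(r) instead, keeping update magnitudes stable
# at higher ranks such as r=32.
lora_scale = alpha / r                 # 0.5
rslora_scale = alpha / math.sqrt(r)    # ~2.83

print(round(lora_scale, 2), round(rslora_scale, 2))
```

In the Hugging Face peft library this corresponds to setting `use_rslora=True` in `LoraConfig`; the paper's actual training stack is only named, not shown, so this is a sketch of the scaling rule rather than of their pipeline.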
If this is right
- All three models overcome their distinct zero-shot failure modes and reach similar usable performance after fine-tuning.
- Qwen3-8B supplies the strongest zero-shot semantic relevance and the best post-tuning structural scores, making it the recommended default choice.
- Llama-3.1-8B delivers the largest metric gains and is therefore the preferred base model when the goal is iterative low-resource development.
- The entire adaptation process updates only about 1 percent of each model's parameters and completes in under 27 GPU-hours, showing practical efficiency.
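The "about 1 percent of parameters" figure is easy to sanity-check: a rank-r adapter on a d × k weight matrix adds r·(d + k) trainable parameters against d·k frozen ones. The dimensions below are typical for an 8B-scale transformer, not values taken from the paper:

```python
# Back-of-envelope check of the ~1% trainable-parameter claim.
# d and k are typical attention-projection sizes for an 8B-class model,
# not the paper's reported configuration.
d = k = 4096
r = 32  # LoRA rank used in the study

added_per_matrix = r * (d + k)   # 262,144 trainable adapter weights
frozen_per_matrix = d * k        # 16,777,216 frozen base weights
fraction = added_per_matrix / frozen_per_matrix

print(f"{fraction:.2%} of each adapted matrix trains")
```

Roughly 1.6% of each adapted matrix trains; averaged over the full network, where many weights carry no adapter, the trainable share plausibly lands near the reported ~1%.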
Where Pith is reading between the lines
- The headroom pattern suggests that base models with weaker zero-shot performance on a new script variant may still be the better starting point for fine-tuning pipelines.
- The same benchmarking method could be applied directly to other languages that rely heavily on romanization for informal digital use.
- Real deployment would require checking whether the observed metric levels persist on user-generated content outside the instruction-following format used in training.
Load-bearing premise
The 10,000 curated transliterated instruction-following samples represent typical real-world Romanized Nepali usage and are sufficient to support the reported performance numbers and model rankings.
What would settle it
If the fine-tuned models fall short of the reported BERTScore near 0.75 and chrF++ above 23 when tested on an independent collection of real-world Romanized Nepali text such as social-media posts or chat logs, the adaptation claims would not hold.
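chrF-family scores are straightforward to reproduce on such an external test set. The sketch below is a simplified character n-gram F-score in the spirit of chrF (chrF++ additionally mixes in word 1- and 2-grams); for numbers comparable to the paper's, the sacrebleu implementation should be used instead:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string, whitespace removed (as in chrF)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over character n-grams, averaged over orders
    1..max_n. beta=2 weights recall twice as heavily as precision."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * prec * rec
                          / (beta ** 2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

# Identical strings score 100; fully disjoint strings score 0.
print(round(chrf("ramro cha", "ramro cha"), 1))  # 100.0
```

Running this over an independent corpus of social-media Romanized Nepali against the reported chrF++ > 23 threshold is exactly the falsification check described above.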
Original abstract
Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model's parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks linguistic adaptation of three comparable-sized open-weight LLMs (Llama-3.1-8B, Mistral-7B-v0.1, Qwen3-8B) to Romanized Nepali. It evaluates zero-shot and QLoRA fine-tuned performance on a curated bilingual dataset of 10,000 transliterated instruction-following samples using PPL, BERTScore, chrF++, ROUGE-1/2/L, and BLEU. Each model fails in a distinct way at zero-shot, but all converge post-fine-tuning to BERTScore ≈0.75 and chrF++ >23; Qwen3-8B is recommended overall for semantic relevance and structural metrics, while Llama-3.1-8B shows the largest adaptation gains (PPL Δ = −49.77, BERTScore Δ = +0.3287).
Significance. If the dataset is representative, this establishes the first rigorous baseline for Romanized Nepali adaptation in open-weight LLMs, confirming adaptation headroom and providing practical guidance for low-resource fine-tuning with QLoRA/rsLoRA. The multi-metric, dimension-wise comparison across models is a useful empirical contribution to multilingual NLP for under-resourced scripts.
major comments (2)
- [Dataset and Experimental Setup] The headline results (post-SFT BERTScore ≈0.75, chrF++ >23, specific deltas such as Llama PPL Δ=-49.77 and BERTScore Δ=+0.3287, and the Qwen3-8B recommendation) all rest on performance measured against a single curated set of 10k transliterated samples. No evidence is provided that this set reflects real-world Romanized Nepali distributions (e.g., informal social-media orthography, code-mixing patterns, domain coverage). Without a held-out real-world test partition, diversity statistics, or inter-annotator agreement on the transliterations, the observed convergence and relative rankings could be artifacts of curation.
- [Evaluation Metrics and Results] The evaluation protocol is insufficiently specified to support the comparative claims. The manuscript does not report the size or construction of any held-out test set, whether metrics were computed on the same samples used for fine-tuning, or any statistical significance testing for the reported deltas and model rankings.
minor comments (2)
- [Abstract] The abstract reports 'chrF++ greater than 23' without mean, variance, or exact values; report full statistics (means ± std) for all metrics in both zero-shot and fine-tuned settings.
- [Methods] Clarify the exact composition of the 10,000 samples (e.g., number of instruction vs. response pairs, sources of the original Nepali text) and any preprocessing or validation steps applied during transliteration.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where revisions are needed to strengthen the manuscript.
Point-by-point responses
Referee: [Dataset and Experimental Setup] The headline results (post-SFT BERTScore ≈0.75, chrF++ >23, specific deltas such as Llama PPL Δ=-49.77 and BERTScore Δ=+0.3287, and the Qwen3-8B recommendation) all rest on performance measured against a single curated set of 10k transliterated samples. No evidence is provided that this set reflects real-world Romanized Nepali distributions (e.g., informal social-media orthography, code-mixing patterns, domain coverage). Without a held-out real-world test partition, diversity statistics, or inter-annotator agreement on the transliterations, the observed convergence and relative rankings could be artifacts of curation.
Authors: We agree that the representativeness of the curated 10k-sample dataset requires further substantiation to support the headline claims. In the revised manuscript we will expand the Dataset section with details on sample selection criteria, domain coverage, and any code-mixing patterns included. We will also report basic diversity statistics (vocabulary size, n-gram overlap) and explicitly state that evaluation metrics were computed on a held-out 20% test partition (2,000 samples) never seen during fine-tuning. A limitations paragraph will be added discussing the gap to informal social-media orthography and outlining future validation on external corpora. These changes directly address the concern without altering the reported numerical results. revision: yes
Referee: [Evaluation Metrics and Results] The evaluation protocol is insufficiently specified to support the comparative claims. The manuscript does not report the size or construction of any held-out test set, whether metrics were computed on the same samples used for fine-tuning, or any statistical significance testing for the reported deltas and model rankings.
Authors: We accept that the evaluation protocol description is incomplete. The revised Experimental Setup section will specify the random 80/20 train/test split, confirm that all metrics (PPL, BERTScore, chrF++, ROUGE, BLEU) were calculated exclusively on the held-out test set, and add statistical significance testing (bootstrap resampling with 1,000 iterations and paired t-tests) for all reported deltas and model rankings. These clarifications will be accompanied by the exact test-set size and construction method. revision: yes
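The promised bootstrap test can be sketched directly; the per-sample scores below are synthetic stand-ins, since the paper's per-sample metric values are not available here:

```python
import random

def paired_bootstrap(scores_a, scores_b, iters=1000, seed=0):
    """Fraction of bootstrap resamples (sampling test items with
    replacement) in which system A's mean score beats system B's.
    Values near 1.0 or 0.0 indicate a ranking that is stable under
    resampling of the test set."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        wins += mean_a > mean_b
    return wins / iters

# Synthetic per-sample BERTScore-like values for two fine-tuned models.
gen = random.Random(42)
model_a = [0.75 + gen.gauss(0, 0.05) for _ in range(200)]
model_b = [0.72 + gen.gauss(0, 0.05) for _ in range(200)]
p_a_wins = paired_bootstrap(model_a, model_b)
```

A win fraction above, say, 0.95 would support the claimed model ranking; the rebuttal's paired t-tests are a complementary check (e.g., via `scipy.stats.ttest_rel`).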
Circularity Check
No circularity: purely empirical benchmarking with external metrics
full rationale
The paper conducts a standard empirical evaluation of three LLMs under zero-shot and QLoRA fine-tuning on a fixed 10k-sample dataset, reporting performance via off-the-shelf metrics (PPL, BERTScore, chrF++, ROUGE variants, BLEU). No derivation chain, equations, fitted parameters renamed as predictions, or self-citations appear in the abstract or described methodology. Central claims (post-SFT convergence to ~0.75 BERTScore, model rankings, adaptation headroom deltas) rest directly on measured values against the curated set and can be externally replicated or falsified without reference to any internal construction. This is the expected non-finding for a benchmarking study.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank r = 32
axioms (1)
- Domain assumption: QLoRA with rsLoRA at r = 32 can adapt LLMs to new language variants while updating only about 1% of their parameters
Reference graph
Works this paper leans on
- [1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
- [2] D. Kalajdzievski, "A rank stabilization scaling factor for fine-tuning with LoRA," arXiv preprint arXiv:2312.03732, 2023.
- [3] T. B. Shahi and C. Sitaula, "Natural language processing for Nepali text: A review," Artificial Intelligence Review, vol. 55, no. 5, pp. 3401–3429, 2022.
- [4] National Statistics Office (formerly Central Bureau of Statistics), National Population and Housing Census 2021, Government of Nepal, Kathmandu, 2021.
- [5] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, "On the dangers of stochastic parrots: Can language models be too big?," in Proc. ACM Conference on Fairness, Accountability, and Transparency (FAccT), pp. 610–623, 2021.
- [6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
- [7] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
- [8] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, et al., "Qwen technical report," arXiv preprint arXiv:2309.16609, 2023.
- [9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008, 2017.
- [10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.
- [11] S. Bam and T. B. Shahi, "Named entity recognition for Nepali text using support vector machines," in Proc. International Conference on Communication and Information Technology (ICCIT), 2014.
- [12] P. Koirala and N. Niraula, "NPVec1: Word embeddings for Nepali — construction and evaluation," in Proc. 6th Workshop on Representation Learning for NLP (RepL4NLP), pp. 174–184, 2021.
- [13] S. Timilsina, M. Gautam, and B. Bhattarai, "NepBERTa: Nepali language model trained in a large corpus," in Proc. 2nd Conf. Asia-Pacific Chapter of ACL (AACL-IJCNLP), pp. 273–284, 2022.
- [14] S. Pudasaini, A. Dangol, and S. Shakya, "NepaliGPT: A generative language model for the Nepali language," arXiv preprint arXiv:2506.16399, 2025.
- [15] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, and I. Gurevych, "How good is your tokenizer? On the monolingual performance of multilingual language models," in Proc. 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3118–3135, 2021.
- [16] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan, "Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP," arXiv preprint arXiv:2112.10508, 2021.
- [17] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proc. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 66–71, 2018.
- [18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations (ICLR), 2022.
- [19] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," in International Conference on Learning Representations (ICLR), 2022.
- [20] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Stanford Alpaca: An instruction-following LLaMA model," GitHub repository, Stanford University, 2023. https://github.com/tatsu-lab/stanford_alpaca
- [21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318, 2002.
- [22] M. Post, "A call for clarity in reporting BLEU scores," in Proc. Third Conference on Machine Translation (WMT), pp. 186–191, 2018.
- [23] M. Popović, "chrF: Character n-gram F-score for automatic MT evaluation," in Proc. Tenth Workshop on Statistical Machine Translation (WMT), pp. 392–395, 2015.
- [24] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in International Conference on Learning Representations (ICLR), 2020.
- [25] D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed. (draft), Stanford University, 2024. https://web.stanford.edu/~jurafsky/slp3/
- [26] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. ACL Workshop on Text Summarization Branches Out, pp. 74–81, 2004.
- [27] S. Kafley, "Alpaca Nepali SFT," Hugging Face Datasets, 2024. https://huggingface.co/datasets/Saugatkafley/alpaca-nepali-sft
- [28]
- [29] AI4Bharat, "IndicTransliteration: Transliteration library for Indic scripts," GitHub repository, 2024.
- [30] Unsloth AI, "Unsloth: 2x faster, 70% less memory LLM finetuning," GitHub repository, 2024.
- [31] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec, "TRL: Transformers reinforcement learning," GitHub repository, Hugging Face, 2020. https://github.com/huggingface/trl
- [32] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "LLM.int8(): 8-bit matrix multiplication for transformers at scale," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 30318–30332, 2022.
- [33] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," in International Conference on Learning Representations (ICLR), 2017.
- [34] T. Pires, E. Schlinger, and D. Garrette, "How multilingual is multilingual BERT?," in Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4996–5001, 2019.
- [35] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, "Direct preference optimization: Your language model is secretly a reward model," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 53728–53741, 2023.
Appendix fragments (qualitative golden-questions analysis)
- Table 9: Llama-3.1-8B base-model responses (zero-shot); [no response] on 6 of 10 instructions.
- A.2: Llama-3.1-8B fine-tuned (post-SFT) responses, answering the golden questions in Romanized Nepali (e.g., defining the Internet as a worldwide computer network) and writing a short story on the topic "Chalakh Shyal".
- One annotated semantic error in a post-SFT output: the model answers only "ho" ("yes") without identifying the requested item.