Benchmarking Linguistic Adaptation in Comparable-Sized LLMs: A Study of Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B on Romanized Nepali
Pith reviewed 2026-05-15 00:44 UTC · model grok-4.3
The pith
Fine-tuning rescues three LLMs from failing at Romanized Nepali generation
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-shot prompting produces architecture-specific failures in generating Romanized Nepali for Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. After QLoRA fine-tuning with rsLoRA on a 10,000-sample bilingual dataset, all three models resolve these failures and reach a BERTScore of approximately 0.75 and chrF++ above 23. Qwen3-8B is the only model that yields semantically relevant zero-shot output and leads all structural alignment metrics after supervised fine-tuning, while Llama-3.1-8B shows the largest absolute gains, confirming the adaptation-headroom hypothesis for weaker baselines.
What carries the argument
QLoRA with rsLoRA at rank 32 applied to the curated 10,000-sample bilingual transliterated instruction-following dataset, evaluated across perplexity, BERTScore, chrF++, ROUGE variants, and BLEU to compare zero-shot and post-tuning outputs.
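The scaling change that rsLoRA introduces is small enough to state inline. A minimal sketch (not the paper's code; the LoRA alpha value below is an illustrative assumption, since it is not reported here):

```python
import math

r, alpha = 32, 16  # r=32 matches the study; alpha is an assumed placeholder

# Classic LoRA multiplies the low-rank update B @ A by alpha / r, so the
# effective update shrinks as rank grows. Rank-stabilized LoRA (rsLoRA)
# rescales by alpha / sqrt(r) instead, keeping update magnitudes stable
# at higher ranks such as r=32.
lora_scale = alpha / r                 # 0.5
rslora_scale = alpha / math.sqrt(r)    # ~2.83

print(round(lora_scale, 2), round(rslora_scale, 2))
```

In the Hugging Face peft library this corresponds to setting `use_rslora=True` in `LoraConfig`; the paper's actual training stack is only named, not shown, so this is a sketch of the scaling rule rather than of their pipeline.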
If this is right
- All three models overcome their distinct zero-shot failure modes and reach similar usable performance after fine-tuning.
- Qwen3-8B supplies the strongest zero-shot semantic relevance and the best post-tuning structural scores, making it the recommended default choice.
- Llama-3.1-8B delivers the largest metric gains and is therefore the preferred base model when the goal is iterative low-resource development.
- The entire adaptation process updates only about 1 percent of each model's parameters and completes in under 27 GPU-hours, showing practical efficiency.
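The "about 1 percent of parameters" figure is easy to sanity-check: a rank-r adapter on a d × k weight matrix adds r·(d + k) trainable parameters against d·k frozen ones. The dimensions below are typical for an 8B-scale transformer, not values taken from the paper:

```python
# Back-of-envelope check of the ~1% trainable-parameter claim.
# d and k are typical attention-projection sizes for an 8B-class model,
# not the paper's reported configuration.
d = k = 4096
r = 32  # LoRA rank used in the study

added_per_matrix = r * (d + k)   # 262,144 trainable adapter weights
frozen_per_matrix = d * k        # 16,777,216 frozen base weights
fraction = added_per_matrix / frozen_per_matrix

print(f"{fraction:.2%} of each adapted matrix trains")
```

Roughly 1.6% of each adapted matrix trains; averaged over the full network, where many weights carry no adapter, the trainable share plausibly lands near the reported ~1%.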
Where Pith is reading between the lines
- The headroom pattern suggests that base models with weaker zero-shot performance on a new script variant may still be the better starting point for fine-tuning pipelines.
- The same benchmarking method could be applied directly to other languages that rely heavily on romanization for informal digital use.
- Real deployment would require checking whether the observed metric levels persist on user-generated content outside the instruction-following format used in training.
Load-bearing premise
The 10,000 curated transliterated instruction-following samples represent typical real-world Romanized Nepali usage and are sufficient to support the reported performance numbers and model rankings.
What would settle it
If the fine-tuned models fall short of the reported BERTScore near 0.75 and chrF++ above 23 when tested on an independent collection of real-world Romanized Nepali text such as social-media posts or chat logs, the adaptation claims would not hold.
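chrF-family scores are straightforward to reproduce on such an external test set. The sketch below is a simplified character n-gram F-score in the spirit of chrF (chrF++ additionally mixes in word 1- and 2-grams); for numbers comparable to the paper's, the sacrebleu implementation should be used instead:

```python
from collections import Counter

def char_ngrams(text, n):
    """Character n-grams of a string, whitespace removed (as in chrF)."""
    s = text.replace(" ", "")
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over character n-grams, averaged over orders
    1..max_n. beta=2 weights recall twice as heavily as precision."""
    scores = []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # string too short for this n-gram order
        overlap = sum((hyp & ref).values())
        prec = overlap / sum(hyp.values())
        rec = overlap / sum(ref.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * prec * rec
                          / (beta ** 2 * prec + rec))
    return 100 * sum(scores) / len(scores) if scores else 0.0

# Identical strings score 100; fully disjoint strings score 0.
print(round(chrf("ramro cha", "ramro cha"), 1))  # 100.0
```

Running this over an independent corpus of social-media Romanized Nepali against the reported chrF++ > 23 threshold is exactly the falsification check described above.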
Original abstract
Romanized Nepali, the Nepali language written in the Latin alphabet, is the dominant medium for informal digital communication in Nepal, yet it remains critically underresourced in the landscape of Large Language Models (LLMs). This study presents a systematic benchmarking of linguistic adaptation across three comparable-sized open-weight models: Llama-3.1-8B, Mistral-7B-v0.1, and Qwen3-8B. We evaluate these architectures under zero-shot and fine-tuned settings using a curated bilingual dataset of 10,000 transliterated instruction-following samples. Performance is quantified across five metrics spanning seven measurement dimensions: Perplexity (PPL), BERTScore, chrF++, ROUGE-1, ROUGE-2, ROUGE-L, and BLEU, capturing fluency, phonetic consistency, and semantic integrity. Models were fine-tuned using Quantized Low-Rank Adaptation (QLoRA) with Rank-Stabilized LoRA (rsLoRA) at rank r=32 on dual NVIDIA Tesla T4 GPUs, training only approximately 1% of each model's parameters in under 27 total GPU-hours. At zero-shot, all three models fail to generate Romanized Nepali, each exhibiting a distinct architecture-specific failure mode. Following fine-tuning, all three resolve these failures and converge to BERTScore approximately 0.75 and chrF++ greater than 23. Overall dimension-wise assessment across ten criteria identifies Qwen3-8B as the overall recommended architecture, being the only model to produce semantically relevant zero-shot output and leading all structural alignment metrics post-SFT. The adaptation headroom hypothesis is confirmed: Llama-3.1-8B, despite its weakest zero-shot baseline, achieves the largest absolute fine-tuning gains in PPL (Delta = -49.77) and BERTScore (Delta = +0.3287), making it the preferred choice for iterative low-resource development pipelines. This work establishes the first rigorous baseline for Romanized Nepali adaptation in comparable-sized open-weight LLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks linguistic adaptation of three comparable-sized open-weight LLMs (Llama-3.1-8B, Mistral-7B-v0.1, Qwen3-8B) to Romanized Nepali. It evaluates zero-shot and QLoRA fine-tuned performance on a curated bilingual dataset of 10,000 transliterated instruction-following samples using PPL, BERTScore, chrF++, ROUGE-1/2/L, and BLEU. Each model fails in a distinct way at zero-shot, but all converge post-fine-tuning to BERTScore ≈0.75 and chrF++ >23; Qwen3-8B is recommended overall for semantic relevance and structural metrics, while Llama-3.1-8B shows the largest adaptation gains (PPL Δ = −49.77, BERTScore Δ = +0.3287).
Significance. If the dataset is representative, this establishes the first rigorous baseline for Romanized Nepali adaptation in open-weight LLMs, confirming adaptation headroom and providing practical guidance for low-resource fine-tuning with QLoRA/rsLoRA. The multi-metric, dimension-wise comparison across models is a useful empirical contribution to multilingual NLP for under-resourced scripts.
major comments (2)
- [Dataset and Experimental Setup] The headline results (post-SFT BERTScore ≈0.75, chrF++ >23, specific deltas such as Llama PPL Δ=-49.77 and BERTScore Δ=+0.3287, and the Qwen3-8B recommendation) all rest on performance measured against a single curated set of 10k transliterated samples. No evidence is provided that this set reflects real-world Romanized Nepali distributions (e.g., informal social-media orthography, code-mixing patterns, domain coverage). Without a held-out real-world test partition, diversity statistics, or inter-annotator agreement on the transliterations, the observed convergence and relative rankings could be artifacts of curation.
- [Evaluation Metrics and Results] The evaluation protocol is insufficiently specified to support the comparative claims. The manuscript does not report the size or construction of any held-out test set, whether metrics were computed on the same samples used for fine-tuning, or any statistical significance testing for the reported deltas and model rankings.
minor comments (2)
- [Abstract] The abstract reports 'chrF++ greater than 23' without mean, variance, or exact values; report full statistics (means ± std) for all metrics in both zero-shot and fine-tuned settings.
- [Methods] Clarify the exact composition of the 10,000 samples (e.g., number of instruction vs. response pairs, sources of the original Nepali text) and any preprocessing or validation steps applied during transliteration.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, acknowledging where revisions are needed to strengthen the manuscript.
Point-by-point responses
Referee: [Dataset and Experimental Setup] The headline results (post-SFT BERTScore ≈0.75, chrF++ >23, specific deltas such as Llama PPL Δ=-49.77 and BERTScore Δ=+0.3287, and the Qwen3-8B recommendation) all rest on performance measured against a single curated set of 10k transliterated samples. No evidence is provided that this set reflects real-world Romanized Nepali distributions (e.g., informal social-media orthography, code-mixing patterns, domain coverage). Without a held-out real-world test partition, diversity statistics, or inter-annotator agreement on the transliterations, the observed convergence and relative rankings could be artifacts of curation.
Authors: We agree that the representativeness of the curated 10k-sample dataset requires further substantiation to support the headline claims. In the revised manuscript we will expand the Dataset section with details on sample selection criteria, domain coverage, and any code-mixing patterns included. We will also report basic diversity statistics (vocabulary size, n-gram overlap) and explicitly state that evaluation metrics were computed on a held-out 20% test partition (2,000 samples) never seen during fine-tuning. A limitations paragraph will be added discussing the gap to informal social-media orthography and outlining future validation on external corpora. These changes directly address the concern without altering the reported numerical results. revision: yes
Referee: [Evaluation Metrics and Results] The evaluation protocol is insufficiently specified to support the comparative claims. The manuscript does not report the size or construction of any held-out test set, whether metrics were computed on the same samples used for fine-tuning, or any statistical significance testing for the reported deltas and model rankings.
Authors: We accept that the evaluation protocol description is incomplete. The revised Experimental Setup section will specify the random 80/20 train/test split, confirm that all metrics (PPL, BERTScore, chrF++, ROUGE, BLEU) were calculated exclusively on the held-out test set, and add statistical significance testing (bootstrap resampling with 1,000 iterations and paired t-tests) for all reported deltas and model rankings. These clarifications will be accompanied by the exact test-set size and construction method. revision: yes
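The promised bootstrap test can be sketched directly; the per-sample scores below are synthetic stand-ins, since the paper's per-sample metric values are not available here:

```python
import random

def paired_bootstrap(scores_a, scores_b, iters=1000, seed=0):
    """Fraction of bootstrap resamples (sampling test items with
    replacement) in which system A's mean score beats system B's.
    Values near 1.0 or 0.0 indicate a ranking that is stable under
    resampling of the test set."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(iters):
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        wins += mean_a > mean_b
    return wins / iters

# Synthetic per-sample BERTScore-like values for two fine-tuned models.
gen = random.Random(42)
model_a = [0.75 + gen.gauss(0, 0.05) for _ in range(200)]
model_b = [0.72 + gen.gauss(0, 0.05) for _ in range(200)]
p_a_wins = paired_bootstrap(model_a, model_b)
```

A win fraction above, say, 0.95 would support the claimed model ranking; the rebuttal's paired t-tests are a complementary check (e.g., via `scipy.stats.ttest_rel`).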
Circularity Check
No circularity: purely empirical benchmarking with external metrics
full rationale
The paper conducts a standard empirical evaluation of three LLMs under zero-shot and QLoRA fine-tuning on a fixed 10k-sample dataset, reporting performance via off-the-shelf metrics (PPL, BERTScore, chrF++, ROUGE variants, BLEU). No derivation chain, equations, fitted parameters renamed as predictions, or self-citations appear in the abstract or described methodology. Central claims (post-SFT convergence to ~0.75 BERTScore, model rankings, adaptation headroom deltas) rest directly on measured values against the curated set and can be externally replicated or falsified without reference to any internal construction. This is the expected non-finding for a benchmarking study.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank r = 32
axioms (1)
- Domain assumption: QLoRA with rsLoRA at r = 32 can adapt LLMs to new language variants while updating only about 1% of their parameters
Reference graph
Works this paper leans on
- [1] T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, "QLoRA: Efficient finetuning of quantized LLMs," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, 2023.
- [2] D. Kalajdzievski, "A rank stabilization scaling factor for fine-tuning with LoRA," arXiv preprint arXiv:2312.03732, 2023.
- [3] T. B. Shahi and C. Sitaula, "Natural language processing for Nepali text: A review," Artificial Intelligence Review, vol. 55, no. 5, pp. 3401–3429, 2022.
- [4] National Statistics Office (formerly Central Bureau of Statistics), National Population and Housing Census 2021, Government of Nepal, Kathmandu, 2021.
- [5] E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell, "On the dangers of stochastic parrots: Can language models be too big?," in Proc. ACM Conference on Fairness, Accountability, and Transparency (FAccT), pp. 610–623, 2021.
- [6] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, "LLaMA: Open and efficient foundation language models," arXiv preprint arXiv:2302.13971, 2023.
- [7] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed, "Mistral 7B," arXiv preprint arXiv:2310.06825, 2023.
- [8] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, et al., "Qwen technical report," arXiv preprint arXiv:2309.16609, 2023.
- [9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008, 2017.
- [10] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., "Language models are few-shot learners," in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, pp. 1877–1901, 2020.
- [11] S. Bam and T. B. Shahi, "Named entity recognition for Nepali text using support vector machines," in Proc. International Conference on Communication and Information Technology (ICCIT), 2014.
- [12] P. Koirala and N. Niraula, "NPVec1: Word embeddings for Nepali — construction and evaluation," in Proc. 6th Workshop on Representation Learning for NLP (RepL4NLP), pp. 174–184, 2021.
- [13] S. Timilsina, M. Gautam, and B. Bhattarai, "NepBERTa: Nepali language model trained in a large corpus," in Proc. 2nd Conf. Asia-Pacific Chapter of ACL (AACL-IJCNLP), pp. 273–284, 2022.
- [14] S. Pudasaini, A. Dangol, and S. Shakya, "NepaliGPT: A generative language model for the Nepali language," arXiv preprint arXiv:2506.16399, 2025.
- [15] P. Rust, J. Pfeiffer, I. Vulić, S. Ruder, and I. Gurevych, "How good is your tokenizer? On the monolingual performance of multilingual language models," in Proc. 59th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 3118–3135, 2021.
- [16] S. J. Mielke, Z. Alyafeai, E. Salesky, C. Raffel, M. Dey, M. Gallé, A. Raja, C. Si, W. Y. Lee, B. Sagot, and S. Tan, "Between words and characters: A brief history of open-vocabulary modeling and tokenization in NLP," arXiv preprint arXiv:2112.10508, 2021.
- [17] T. Kudo and J. Richardson, "SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing," in Proc. 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations, pp. 66–71, 2018.
- [18] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations (ICLR), 2022.
- [19] J. Wei, M. Bosma, V. Y. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," in International Conference on Learning Representations (ICLR), 2022.
- [20] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, "Stanford Alpaca: An instruction-following LLaMA model," GitHub repository, Stanford University, 2023. https://github.com/tatsu-lab/stanford_alpaca
- [21] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in Proc. 40th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 311–318, 2002.
- [22] M. Post, "A call for clarity in reporting BLEU scores," in Proc. Third Conference on Machine Translation (WMT), pp. 186–191, 2018.
- [23] M. Popović, "chrF: Character n-gram F-score for automatic MT evaluation," in Proc. Tenth Workshop on Statistical Machine Translation (WMT), pp. 392–395, 2015.
- [24] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, "BERTScore: Evaluating text generation with BERT," in International Conference on Learning Representations (ICLR), 2020.
- [25] D. Jurafsky and J. H. Martin, Speech and Language Processing, 3rd ed. (draft), Stanford University, 2024. https://web.stanford.edu/~jurafsky/slp3/
- [26] C.-Y. Lin, "ROUGE: A package for automatic evaluation of summaries," in Proc. ACL Workshop on Text Summarization Branches Out, pp. 74–81, 2004.
- [27] S. Kafley, "Alpaca Nepali SFT," Hugging Face Datasets, 2024. https://huggingface.co/datasets/Saugatkafley/alpaca-nepali-sft
- [28]
- [29] AI4Bharat, "IndicTransliteration: Transliteration library for Indic scripts," GitHub repository, 2024.
- [30] Unsloth AI, "Unsloth: 2x faster, 70% less memory LLM finetuning," GitHub repository, 2024.
- [31] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul, and Q. Gallouédec, "TRL: Transformers reinforcement learning," GitHub repository, Hugging Face, 2020. https://github.com/huggingface/trl
- [32] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "LLM.int8(): 8-bit matrix multiplication for transformers at scale," in Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 30318–30332, 2022.
- [33] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," in International Conference on Learning Representations (ICLR), 2017.
- [34] T. Pires, E. Schlinger, and D. Garrette, "How multilingual is multilingual BERT?," in Proc. 57th Annual Meeting of the Association for Computational Linguistics (ACL), pp. 4996–5001, 2019.
- [35] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, and C. Finn, "Direct preference optimization: Your language model is secretly a reward model," in Advances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 53728–53741, 2023.
Appendix fragments (qualitative golden-questions analysis)
- Table 9: Llama-3.1-8B base-model responses (zero-shot); [no response] on 6 of 10 instructions.
- A.2: Llama-3.1-8B fine-tuned (post-SFT) responses, answering the golden questions in Romanized Nepali (e.g., defining the Internet as a worldwide computer network) and writing a short story on the topic "Chalakh Shyal".
- One annotated semantic error in a post-SFT output: the model answers only "ho" ("yes") without identifying the requested item.