pith. sign in

arxiv: 2606.28843 · v1 · pith:AT3JX4HGnew · submitted 2026-06-27 · 💻 cs.CL · cs.AI

The Heterogeneous Safety Impacts of Benign Multilingual Fine-Tuning

Pith reviewed 2026-06-30 09:50 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multilingual fine-tuningLLM safetyadversarial compliancesafety driftcross-lingual evaluationbenign datacompliance ratesmultilingual models
0
0 comments X

The pith

Fine-tuning LLMs with benign data in different languages leads to highly variable safety outcomes depending on the languages chosen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how fine-tuning large language models on translated benign datasets across nine languages affects their safety. It finds that the risk of complying with unsafe prompts can increase significantly, up to four times, based on which language is used for fine-tuning and which for evaluation. This effect is separate from improvements in general capabilities. A sympathetic reader would care because many models are deployed in multilingual settings, and relying on English safety tests may miss serious issues in other languages.

Core claim

When Llama-3.2, Qwen3, and Gemma-3 models are fine-tuned on benign data translated into nine languages, safety outcomes measured by adversarial compliance rates prove highly sensitive to both the fine-tuning language and the evaluation language, with rates increasing four-fold in some settings. Multilingual safety drift occurs heterogeneously, is decoupled from general capability metrics, and fine-tuning in non-English languages often produces smaller internal representational drifts but results in models defaulting to exaggerated compliance or refusal.

What carries the argument

The heterogeneous safety drift from benign multilingual fine-tuning, which varies with language choice for training and testing.

If this is right

  • Assessing safety solely in English fails to assure safe deployment in other languages.
  • Models may show increased compliance to adversarial prompts after fine-tuning in certain non-English languages.
  • Safety changes do not track with gains in general task performance.
  • Releasing the Multilingual-Benign-Tune dataset enables further study of these effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Language-specific safety fine-tuning may be required rather than a one-size-fits-all approach.
  • Translation processes could subtly alter the adversarial nature of data in ways not captured by capability metrics.
  • Extending the study to additional languages or model scales might identify which language pairs pose the highest risk.

Load-bearing premise

The translated versions of the benign datasets stay equally non-adversarial and without unintended safety signals in every language.

What would settle it

Finding that adversarial compliance rates stay similar across all language combinations or that they strongly correlate with capability improvements would challenge the claim of heterogeneous decoupled safety impacts.

Figures

Figures reproduced from arXiv: 2606.28843 by Brent Mittelstadt, Chris Russell, Eoin Delaney, Greta Warren, Jonathan Rystr{\o}m, Kaivalya Rawal, Ryan Brown, Sandra Wachter, Stratis Tsirtsis, Will Hawkins, Zihao Fu.

Figure 1
Figure 1. Figure 1: Differential impacts of benign multilingual fine-tuning on safety. SORRY-Bench compliance rates measured in both English (EN) and the ”Local” language which a model was fine-tuned in. Hollow markers indicate compliance rates before fine-tuning, and solid markers indicating compliance rates after one epoch of fine-tuning in a specified language, with values averaged across three seeds. Benign fine-tuning in… view at source ↗
Figure 2
Figure 2. Figure 2: Relative change in compliance rate after fine-tuning. Across SORRY-Bench evaluations, the impact of fine-tuning in a local language usually amplifies the safety impact of fine-tuning in English. This effect is stronger when comparing the base compliance rate versus the fine-tuned compliance rate in the language which fine-tuning was conducted - though non-EN languages usually have lower base compliance rat… view at source ↗
Figure 3
Figure 3. Figure 3: Capability vs. Compliance Change Change in TinyMMLU scores plotted against EN SORRY-Bench compliance following benign multilingual fine-tuning. Colours and symbols denote different base models fine-tuned in different languages. Safety drift is decoupled from capability drift, with adversarial compliance exhibiting high variance even when model capability remains stable or slightly decreases. suggesting cap… view at source ↗
Figure 4
Figure 4. Figure 4: Mechanistic analysis of safety drift. We plot the change in adversarial compliance (∆ Compliance) against the L2 Euclidean distance of fine-tuned models’ internal representations from the base model at the 80th percentile layer. Distances are calculated in a standardised principal component space for cross-model comparability. Colours represent resource level per (Joshi et al., 2020). Statistical significa… view at source ↗
Figure 5
Figure 5. Figure 5: shows that adversarial compliance rates within English evaluations are not often materially shifted by increasing fine-tuning epochs. This demonstrates that a small amount of fine-tuning can undo most of the alignment fine-tuning for some models, such as Gemma-3-1B-IT or Llama-3.2 models [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Bar chart showing safety impact of fine-tuning in different languages on Gemma-3, Qwen3, and Llama-3.2 0.6B-4B models when evaluated in the ”Local” language, or language in which the model was fine-tuned, using the translated SORRY-Bench evaluation protocol. A.2. Supporting capability evaluations To determine the relationship between model capability and safety we compare compliance rates for the adversari… view at source ↗
Figure 7
Figure 7. Figure 7: Adversarial vs. Non-Adversarial Compliance. Across most models we do not see a clear relationship between adversarial and non-adversarial compliance rates following benign fine-tuning in different languages. Qwen3-0.6B appears an outlier, where compliance reduces across both axes after one epoch of LoRA tuning, suggesting the model is suffering from collapse. 1. Sample of fine-tuning data. Each native ling… view at source ↗
Figure 8
Figure 8. Figure 8: English-language Perplexity results for single-epoch fine-tuning using the WikiText2 dataset (average of three seeds) [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Perplexity results for single-epoch fine-tuning with Wikipedia datasets for each local language used (average of three seeds). A.3.1. FINE-TUNING DATA TRANSLATION PIPELINE During the manual review process of the Hindi fine-tuning data translated from English, a reviewer noted that the quality of translations was particularly low due to English words being retained within translations. This was improved be … view at source ↗
Figure 10
Figure 10. Figure 10: Evaluation results for larger models, demonstrating that heterogeneous impacts of multilingual fine-tuning is exhibited for larger sizes). The x-axis indicates the language in which a model was fine-tuned. A.5. Layer-wise vector drift To determine which layer to conduct the directional mechanistic analysis on, we assess the cosine similarity of fine-tuned models compared to their base every tenth percenti… view at source ↗
Figure 11
Figure 11. Figure 11: Cosine similarity of vectors assessed every ten percentiles. a low-dimensional subspace. We see that English-tuned models are consistently separate from other languages. We also note that Irish models show distinct trajectories, which may be accounted for by the low quality of fine-tuning data, and lower resource level, of this language. We observe partial clustering by language family (e.g. Romance langu… view at source ↗
Figure 12
Figure 12. Figure 12: Directional vector drift across different models, assessed as the 80th percentile layer. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
read the original abstract

Fine-tuning a large language model is a ubiquitous method for enhancing its capability on a specific downstream task. However, prior work has shown that this increase in capability comes with a cost: it can increase a model's tendency to respond to unsafe adversarial prompts, even when fine-tuning with non-adversarial data. We present the first comprehensive empirical study of this phenomenon in multilingual settings by fine-tuning Llama-3.2, Qwen3, and Gemma-3 models using benign data translated across nine languages. We find that safety outcomes are highly sensitive to both the choice of fine-tuning language and the evaluation language, with adversarial compliance rates increasing four-fold in some settings. Multilingual safety drift is decoupled from general capability metrics, and occurs heterogeneously across languages and models. Fine-tuning in non-English languages often induces smaller internal representational drifts than English, but these shifts lead models to default to either exaggerated compliance or refusal. As such, assessing fine-tuning impacts solely in English provides inadequate assurance for deployment. To facilitate further research into these cross-lingual safety blind spots, we release the Multilingual-Benign-Tune dataset and the SORRY-Bench-Multilingual evaluation suite.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first comprehensive empirical study of safety impacts from fine-tuning LLMs (Llama-3.2, Qwen3, Gemma-3) on benign data translated across nine languages. It reports that adversarial compliance rates are highly sensitive to both fine-tuning language and evaluation language, with increases up to four-fold in some settings; that multilingual safety drift is decoupled from general capability metrics and occurs heterogeneously across languages and models; that non-English fine-tuning often induces smaller internal representational drifts but leads to exaggerated compliance or refusal; and that English-only assessments are inadequate. The authors release the Multilingual-Benign-Tune dataset and SORRY-Bench-Multilingual evaluation suite.

Significance. If the central empirical findings hold after addressing verification concerns, the work is significant for demonstrating that benign multilingual fine-tuning produces heterogeneous safety outcomes not captured by capability metrics or English-only evaluation. The release of the two datasets is a clear strength that supports reproducibility and further research into cross-lingual safety.

major comments (2)
  1. [Methodology / Dataset Construction] The experimental methodology provides no reported verification (human review, safety classifier scores, or back-translation consistency) that the translated benign datasets remain equivalently non-adversarial and free of unintended safety signals across the nine languages. This assumption is load-bearing for attributing the four-fold compliance increases and heterogeneous drift to multilingual fine-tuning rather than translation artifacts.
  2. [Results] The claim of decoupling between safety drift and capability metrics (abstract and results) requires explicit reporting of the capability metrics used, the statistical tests applied, and per-language/model correlations; without these, the heterogeneity finding cannot be fully assessed for robustness.
minor comments (2)
  1. The abstract lists nine languages but does not name them; adding the list would improve clarity for readers.
  2. [Figures] Ensure all figures reporting compliance rates include error bars or confidence intervals and clearly label the fine-tuning vs. evaluation language axes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and will make the indicated revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Methodology / Dataset Construction] The experimental methodology provides no reported verification (human review, safety classifier scores, or back-translation consistency) that the translated benign datasets remain equivalently non-adversarial and free of unintended safety signals across the nine languages. This assumption is load-bearing for attributing the four-fold compliance increases and heterogeneous drift to multilingual fine-tuning rather than translation artifacts.

    Authors: The referee correctly notes that the manuscript does not report verification procedures for the translated datasets. We will revise the methodology section to include explicit verification details, such as safety classifier scores and back-translation consistency metrics applied to samples across languages, to better support attribution of the observed effects to multilingual fine-tuning. revision: yes

  2. Referee: [Results] The claim of decoupling between safety drift and capability metrics (abstract and results) requires explicit reporting of the capability metrics used, the statistical tests applied, and per-language/model correlations; without these, the heterogeneity finding cannot be fully assessed for robustness.

    Authors: We agree that explicit reporting of the supporting analyses is needed for full assessment. We will revise the results section to include the specific capability metrics, the statistical tests performed, and per-language/model correlation results to substantiate the decoupling claim. revision: yes

Circularity Check

0 steps flagged

Empirical measurement study with no derivation chain or fitted predictions

full rationale

The paper conducts an empirical study by fine-tuning models on translated benign data and measuring adversarial compliance rates across languages and models. No equations, self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Central results (four-fold compliance changes, heterogeneous drift) are direct experimental outcomes rather than reductions to inputs by construction. The analysis is self-contained against external benchmarks of model behavior.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical study based on abstract; no free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5768 in / 963 out tokens · 21900 ms · 2026-06-30T09:50:19.729127+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 21 canonical work pages · 8 internal anchors

  1. [1]

    Abdin, M., Aneja, J., Awadalla, H., Awadallah, A., Awan, A. A., Bach, N., Bahree, A., Bakhtiari, A., Bao, J., Behl, H., Benhaim, A., Bilenko, M., Bjorck, J., Bubeck, S., Cai, M., Cai, Q., Chaudhary, V., Chen, D., Chen, D., Chen, W., Chen, Y.-C., Chen, Y.-L., Cheng, H., Chopra, P., Dai, X., Dixon, M., Eldan, R., Fragoso, V., Gao, J., Gao, M., Gao, M., Garg...

  2. [2]

    I., Mosbach, M., and Klakow, D

    Alabi, J., Adelani, D. I., Mosbach, M., and Klakow, D. Adapting pre-trained language models to african languages via multilingual adaptive fine-tuning. In Proceedings of the 29th international conference on computational linguistics, pp.\ 4336--4349, 2022

  3. [3]

    Refusal in language models is mediated by a single direction

    Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. In Proceedings of the 38th International Conference on Neural Information Processing Systems , volume 37 of NIPS '24 , pp.\ 136037--136083, Red Hook, NY, USA, December 2024. Curran Associates Inc. ISBN 979-8...

  4. [4]

    M., and Tran, T

    Bach, T., Nguyen-Tang, T., Nguyen, D., Le, T. M., and Tran, T. Curvature-aware safety restoration in llms fine-tuning, 2025. URL https://arxiv.org/abs/2511.18039

  5. [5]

    Safety-tuned LLaMAs : Lessons from improving the safety of large language models that follow instructions

    Bianchi, F., Suzgun, M., Attanasio, G., Rottger, P., Jurafsky, D., Hashimoto, T., and Zou, J. Safety-tuned LLaMAs : Lessons from improving the safety of large language models that follow instructions. In The Twelfth International Conference on Learning Representations , October 2023. URL https://openreview.net/forum?id=gT5hALch9z

  6. [6]

    Conover, M. et al. Free dolly: Introducing the world's first truly open instruction-tuned llm. https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm, 2023. Accessed: 2023-06-30

  7. [7]

    Daniel Han, M. H. and team, U. Unsloth, 2023. URL http://github.com/unslothai/unsloth

  8. [8]

    C., Dawkins, H., Nejadgholi, I., and Kiritchenko, S

    Fraser, K. C., Dawkins, H., Nejadgholi, I., and Kiritchenko, S. Fine-tuning lowers safety and disrupts evaluation consistency. In Derczynski, L., Novikova, J., and Chen, M. (eds.), Proceedings of the The First Workshop on LLM Security ( LLMSEC ) , pp.\ 129--141, Vienna, Austria, August 2025. Association for Computational Linguistics. ISBN 979-8-89176-279-...

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gemini-Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025. URL https://arxiv.org/abs/2507.06261

  10. [10]

    Gemma 3 Technical Report

    Gemma-Team. Gemma 3 technical report, 2025. URL https://arxiv.org/abs/2503.19786

  11. [11]

    The effect of fine-tuning on language model toxicity

    Hawkins, W., Mittelstadt, B., and Russell, C. The effect of fine-tuning on language model toxicity. In Neurips Safe Generative AI Workshop 2024 , October 2024. URL https://openreview.net/forum?id=YXaFxrMbVk

  12. [12]

    Refusal behavior in large language models: A nonlinear perspective, 2025

    Hildebrandt, F., Maier, A., Krauss, P., and Schilling, A. Refusal behavior in large language models: A nonlinear perspective, 2025. URL https://arxiv.org/abs/2501.08145

  13. [13]

    J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W

    Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  14. [15]

    Samuel Marks and Max Tegmark

    Lermen, S., Rogers-Smith, C., and Ladish, J. Lora fine-tuning efficiently undoes safety training in llama 2-chat 70b, 2024. URL https://arxiv.org/abs/2310.20624

  15. [16]

    MobileLLM : Optimizing sub-billion parameter language models for on-device use cases

    Liu, Z., Zhao, C., Iandola, F., Lai, C., Tian, Y., Fedorov, I., Xiong, Y., Chang, E., Shi, Y., Krishnamoorthi, R., Lai, L., and Chandra, V. MobileLLM : Optimizing sub-billion parameter language models for on-device use cases. In Proceedings of the 41st International Conference on Machine Learning , volume 235 of ICML '24 , pp.\ 32431--32454, Vienna, Austr...

  16. [17]

    The Llama 3 Herd of Models

    Llama-Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  17. [19]

    Pointer sentinel mixture models

    Merity, S., Xiong, C., Bradbury, J., and Socher, R. Pointer sentinel mixture models. In International Conference on Learning Representations , February 2017. URL https://openreview.net/forum?id=Byj72udxe

  18. [20]

    NLLB Team , Costa-jussà, M. R., Cross, J., Çelebi, O., Elbayad, M., Heafield, K., Heffernan, K., Kalbassi, E., Lam, J., Licht, D., Maillard, J., Sun, A., Wang, S., Wenzek, G., Youngblood, A., Akula, B., Barrault, L., Gonzalez, G. M., Hansanti, P., Hoffman, J., Jarrett, S., Sadagopan, K. R., Rowe, D., Spruit, S., Tran, C., Andrews, P., Ayan, N. F., Bhosale...

  19. [21]

    GPT-4 Technical Report

    OpenAI-Team. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774

  20. [22]

    M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M

    Polo, F. M., Weber, L., Choshen, L., Sun, Y., Xu, G., and Yurochkin, M. tinyBenchmarks : Evaluating LLMs with fewer examples. In Proceedings of the 41st International Conference on Machine Learning , pp.\ 34303--34326. PMLR, July 2024. URL https://proceedings.mlr.press/v235/maia-polo24a.html

  21. [24]

    Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations , October 2023

    Qi, X., Zeng, Y., Xie, T., Chen, P.-Y., Jia, R., Mittal, P., and Henderson, P. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations , October 2023. URL https://openreview.net/forum?id=hTEGyKf0dZ

  22. [25]

    H., Kirk, H

    Rystrøm, J. H., Kirk, H. R., and Hale, S. Multilingual != multicultural: evaluating gaps between multilingual capabilities and cultural alignment in LLMs . In Przybyła, P., Shardlow, M., Colombatto, C., and Inie, N. (eds.), Proceedings of Interdisciplinary Workshop on Observations of Misunderstood , Misguided and Malicious Use of Language Models , pp.\ 74...

  23. [26]

    and Thinking Machines Lab

    Schulman, J. and Thinking Machines Lab . Lora without regret. Thinking Machines Lab: Connectionism, 2025. doi:10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/

  24. [27]

    Multilingual translation with extensible multilingual pretraining and finetuning, 2020

    Tang, Y., Tran, C., Li, X., Chen, P.-J., Goyal, N., Chaudhary, V., Gu, J., and Fan, A. Multilingual translation with extensible multilingual pretraining and finetuning, 2020. URL https://arxiv.org/abs/2008.00401

  25. [28]

    Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  26. [29]

    a ger, T., Elstner, J., Geisler, S., Cohen-Addad , V., G \

    Wollschl \"a ger, T., Elstner, J., Geisler, S., Cohen-Addad , V., G \"u nnemann, S., and Gasteiger, J. The geometry of refusal in large language models: Concept cones and representational independence. In Forty-Second International Conference on Machine Learning , June 2025. URL https://openreview.net/forum?id=80IwJqlXs8

  27. [30]

    M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P

    Xie, T., Qi, X., Zeng, Y., Huang, Y., Sehwag, U. M., Huang, K., He, L., Wei, B., Li, D., Sheng, Y., Jia, R., Li, B., Li, K., Chen, D., Henderson, P., and Mittal, P. SORRY-bench : Systematically evaluating large language model safety refusal. In The Thirteenth International Conference on Learning Representations , October 2024. URL https://openreview.net/f...

  28. [31]

    On-device language models: A comprehensive review, 2024

    Xu, J., Li, Z., Chen, W., Wang, Q., Gao, X., Cai, Q., and Ling, Z. On-device language models: A comprehensive review, 2024. URL https://arxiv.org/abs/2409.00088

  29. [32]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

  30. [33]

    Yee, J. S. G., Ng, P. C., Wang, Z., McLoughlin, I., Ng, A. B., and See, S. On-device llms for smes: Challenges and opportunities, 2024. URL https://arxiv.org/abs/2410.16070

  31. [34]

    Understanding and preserving safety in fine-tuned llms, 2026

    Zhang, J., Hu, Y., Chen, K., He, L., Ma, J., Lou, J., Li, D., Liu, J., Yang, X., and Jia, R. Understanding and preserving safety in fine-tuned llms, 2026. URL https://arxiv.org/abs/2601.10141

  32. [35]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., Goel, S., Li, N., Byun, M. J., Wang, Z., Mallen, A., Basart, S., Koyejo, S., Song, D., Fredrikson, M., Kolter, J. Z., and Hendrycks, D. Representation engineering: A top-down approach to ai transparency, 2023. URL https://arxiv.org/abs/2310.01405

  33. [36]

    2021 , eprint=

    The Low-Resource Double Bind: An Empirical Study of Pruning for Low-Resource Machine Translation , author=. 2021 , eprint=

  34. [37]

    International Conference on Learning Representations , year=

    LoRA: Low-Rank Adaptation of Large Language Models , author=. International Conference on Learning Representations , year=

  35. [38]

    The State and Fate of Linguistic Diversity and Inclusion in the

    Joshi, Pratik and Santy, Sebastin and Budhiraja, Amar and Bali, Kalika and Choudhury, Monojit , editor =. The State and Fate of Linguistic Diversity and Inclusion in the. Proceedings of the 58th. doi:10.18653/v1/2020.acl-main.560 , url =

  36. [39]

    Xie, Tinghao and Qi, Xiangyu and Zeng, Yi and Huang, Yangsibo and Sehwag, Udari Madhushani and Huang, Kaixuan and He, Luxi and Wei, Boyi and Li, Dacheng and Sheng, Ying and Jia, Ruoxi and Li, Bo and Li, Kai and Chen, Danqi and Henderson, Peter and Mittal, Prateek , year = 2024, month = oct, url =. The

  37. [40]

    2022 , eprint=

    No Language Left Behind: Scaling Human-Centered Machine Translation , author=. 2022 , eprint=

  38. [41]

    Pointer Sentinel Mixture Models , booktitle =

    Merity, Stephen and Xiong, Caiming and Bradbury, James and Socher, Richard , year = 2017, month = feb, url =. Pointer Sentinel Mixture Models , booktitle =

  39. [42]

    Proceedings of the 41st

    Polo, Felipe Maia and Weber, Lucas and Choshen, Leshem and Sun, Yuekai and Xu, Gongjun and Yurochkin, Mikhail , year = 2024, month = jul, pages =. Proceedings of the 41st

  40. [43]

    Biometrics , year=

    Individual Comparisons by Ranking Methods , author=. Biometrics , year=

  41. [44]

    Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , booktitle =

    Qi, Xiangyu and Zeng, Yi and Xie, Tinghao and Chen, Pin-Yu and Jia, Ruoxi and Mittal, Prateek and Henderson, Peter , year = 2023, month = oct, url =. Fine-Tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! , booktitle =

  42. [45]

    The Effect of Fine-Tuning on Language Model Toxicity , booktitle =

    Hawkins, Will and Mittelstadt, Brent and Russell, Chris , year = 2024, month = oct, url =. The Effect of Fine-Tuning on Language Model Toxicity , booktitle =

  43. [46]

    All Languages Matter: On the Multilingual Safety of LLMs , author=

  44. [47]

    Towards Understanding the Fragility of Multilingual

    Poppi, Samuele and Yong, Zheng Xin and He, Yifei and Chern, Bobbie and Zhao, Han and Yang, Aobo and Chi, Jianfeng , editor =. Towards Understanding the Fragility of Multilingual. Findings of the. doi:10.18653/v1/2025.findings-naacl.126 , url =

  45. [48]

    Multilingual Jailbreak Challenges in Large Language Models , booktitle =

    Deng, Yue and Zhang, Wenxuan and Pan, Sinno Jialin and Bing, Lidong , year = 2023, month = oct, url =. Multilingual Jailbreak Challenges in Large Language Models , booktitle =

  46. [49]

    The Language Barrier: Dissecting Safety Challenges of

    Shen, Lingfeng and Tan, Weiting and Chen, Sihao and Chen, Yunmo and Zhang, Jingyu and Xu, Haoran and Zheng, Boyuan and Koehn, Philipp and Khashabi, Daniel , editor =. The Language Barrier: Dissecting Safety Challenges of. Findings of the. doi:10.18653/v1/2024.findings-acl.156 , url =

  47. [50]

    and Dawkins, Hillary and Nejadgholi, Isar and Kiritchenko, Svetlana , editor =

    Fraser, Kathleen C. and Dawkins, Hillary and Nejadgholi, Isar and Kiritchenko, Svetlana , editor =. Fine-Tuning Lowers Safety and Disrupts Evaluation Consistency , booktitle =

  48. [51]

    2025 , eprint=

    Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation , author=. 2025 , eprint=

  49. [52]

    2024 , eprint=

    LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B , author=. 2024 , eprint=

  50. [53]

    2025 , eprint=

    LoRA is All You Need for Safety Alignment of Reasoning LLMs , author=. 2025 , eprint=

  51. [54]

    2025 , eprint=

    Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models , author=. 2025 , eprint=

  52. [55]

    Machine Translation with Large Language Models: Prompting, Few-shot Learning, and Fine-tuning with QL o RA

    Zhang, Xuan and Rajabi, Navid and Duh, Kevin and Koehn, Philipp. Machine Translation with Large Language Models: Prompting, Few-shot Learning, and Fine-tuning with QL o RA. Proceedings of the Eighth Conference on Machine Translation. 2023. doi:10.18653/v1/2023.wmt-1.43

  53. [56]

    2025 , eprint=

    Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates , author=. 2025 , eprint=

  54. [57]

    2023 , eprint=

    Safe RLHF: Safe Reinforcement Learning from Human Feedback , author=. 2023 , eprint=

  55. [58]

    2023 , howpublished =

    Mike Conover and others , title =. 2023 , howpublished =

  56. [59]

    2025 , eprint=

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

  57. [60]

    Daniel Han, Michael Han and Unsloth team , title =

  58. [61]

    Safety-Tuned

    Bianchi, Federico and Suzgun, Mirac and Attanasio, Giuseppe and Rottger, Paul and Jurafsky, Dan and Hashimoto, Tatsunori and Zou, James , year = 2023, month = oct, url =. Safety-Tuned. The

  59. [62]

    2024 , eprint=

    GPT-4 Technical Report , author=. 2024 , eprint=

  60. [63]

    Proceedings of the 29th international conference on computational linguistics , pages=

    Adapting pre-trained language models to African languages via multilingual adaptive fine-tuning , author=. Proceedings of the 29th international conference on computational linguistics , pages=

  61. [64]

    No Error Left Behind: Multilingual Grammatical Error Correction with Pre-trained Translation Models

    Luhtaru, Agnes and Korotkova, Elizaveta and Fishel, Mark. No Error Left Behind: Multilingual Grammatical Error Correction with Pre-trained Translation Models. Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers). 2024. doi:10.18653/v1/2024.eacl-long.73

  62. [65]

    2020 , eprint=

    Multilingual Translation with Extensible Multilingual Pretraining and Finetuning , author=. 2020 , eprint=

  63. [66]

    2024 , eprint=

    On-Device Language Models: A Comprehensive Review , author=. 2024 , eprint=

  64. [67]

    2024 , eprint=

    On-Device LLMs for SMEs: Challenges and Opportunities , author=. 2024 , eprint=

  65. [68]

    2025 , eprint=

    Qwen3 Technical Report , author=. 2025 , eprint=

  66. [69]

    2025 , eprint=

    Gemma 3 Technical Report , author=. 2025 , eprint=

  67. [70]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  68. [71]

    2026 , eprint=

    Understanding and Preserving Safety in Fine-Tuned LLMs , author=. 2026 , eprint=

  69. [72]

    2025 , eprint=

    Refusal Behavior in Large Language Models: A Nonlinear Perspective , author=. 2025 , eprint=

  70. [73]

    Refusal in Language Models Is Mediated by a Single Direction , booktitle =

    Arditi, Andy and Obeso, Oscar and Syed, Aaquib and Paleka, Daniel and Panickssery, Nina and Gurnee, Wes and Nanda, Neel , year = 2024, month = dec, series =. Refusal in Language Models Is Mediated by a Single Direction , booktitle =

  71. [74]

    2025 , eprint=

    Curvature-Aware Safety Restoration In LLMs Fine-Tuning , author=. 2025 , eprint=

  72. [75]

    The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence , shorttitle =

    Wollschl. The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence , shorttitle =. Forty-Second

  73. [76]

    2023 , eprint=

    Representation Engineering: A Top-Down Approach to AI Transparency , author=. 2023 , eprint=

  74. [77]

    2024 , eprint=

    LoRA+: Efficient Low Rank Adaptation of Large Models , author=. 2024 , eprint=

  75. [78]

    Hashimoto , title =

    Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

  76. [79]

    LoRA Without Regret , journal =

    John Schulman and. LoRA Without Regret , journal =. 2025 , note =

  77. [80]

    Multilingual != multicultural: evaluating gaps between multilingual capabilities and cultural alignment in

    Rystrøm, Jonathan Hvithamar and Kirk, Hannah Rose and Hale, Scott , editor =. Multilingual != multicultural: evaluating gaps between multilingual capabilities and cultural alignment in. Proceedings of. 2025 , keywords =

  78. [81]

    2024 , eprint=

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone , author=. 2024 , eprint=

  79. [82]

    Proceedings of the 41st

    Liu, Zechun and Zhao, Changsheng and Iandola, Forrest and Lai, Chen and Tian, Yuandong and Fedorov, Igor and Xiong, Yunyang and Chang, Ernie and Shi, Yangyang and Krishnamoorthi, Raghuraman and Lai, Liangzhen and Chandra, Vikas , year = 2024, month = jul, series =. Proceedings of the 41st