Multilingual jailbreaking of LLMs using low-resource languages
Pith reviewed 2026-05-20 10:26 UTC · model grok-4.3
The pith
Multi-turn conversations in low-resource African languages jailbreak LLMs at rates up to 83.6% harmful responses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Large Language Models remain vulnerable to jailbreak attempts that circumvent safety guardrails when prompts are delivered through multi-turn conversations in low-resource languages. Translating existing datasets into Afrikaans, Kiswahili, isiXhosa, and isiZulu and testing them on commercial LLMs produces English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), with similar rates in the native languages; human red-teaming raises the overall average from 59.8% to 75.8%.
What carries the argument
Multi-turn conversation structure in low-resource languages, where iterative prompting after initial translation allows gradual escalation that single-turn attacks cannot achieve.
If this is right
- Safety mechanisms that resist single-turn English prompts become less reliable once the interaction stretches across multiple turns in another language.
- Human red-teaming raises average jailbreak rates by roughly 16 percentage points across the tested languages.
- Translation quality directly controls how often the attack succeeds, so better machine translation could reduce but not remove the vulnerability.
- Vulnerabilities persist across commercial models even when the user language is not English.
Where Pith is reading between the lines
- Safety training for LLMs may need explicit exposure to low-resource language patterns to close these gaps.
- Detection systems could require separate calibration for each language family to avoid under- or over-flagging content.
- The same multi-turn technique might transfer to other low-resource languages outside the four African ones studied here.
Load-bearing premise
That automated detection of harmful responses stays consistent and unbiased when applied to text in low-resource languages.
What would settle it
Compare jailbreak success rates for the identical multi-turn strategy when using professional high-quality translations versus the automated translations used in the study.
Figures
read the original abstract
Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), Afrikaans from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini), and Kiswahili from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek). Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, isiZulu) can jailbreak commercial LLMs (ChatGPT, Claude, DeepSeek, Gemini, Grok) more effectively than single-turn translations. Automated testing yields English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), with comparable rates in low-resource languages; human red-teaming by native speakers raises the average from 59.8% to 75.8%, and the authors conclude that translation quality is the critical factor determining success.
Significance. If the empirical measurements hold, the work provides concrete evidence of multilingual safety gaps in current LLMs, particularly for low-resource languages. It could motivate improved cross-lingual alignment techniques and more rigorous adversarial testing protocols that account for translation artifacts and native-speaker interactions.
major comments (2)
- [Methods] Methods section: the automated harmful-response classifier is not described in sufficient detail (prompt template, base model, or validation set). Without evidence that precision/recall is consistent across English, Afrikaans, Kiswahili, isiXhosa, and isiZulu (especially with code-switching or non-standard orthography), the cross-lingual rate comparisons and the claim that translation quality is the critical factor are difficult to interpret.
- [Results] Results and Abstract: the number of prompts, dataset size, and any statistical significance tests for the reported differences (e.g., +20.0% Afrikaans improvement from human red-teaming) are not stated. This leaves the concrete percentages (52.7%–83.6%) without clear error bars or power analysis, weakening confidence in the central empirical claims.
minor comments (2)
- [Abstract] Abstract: specify the exact model versions (e.g., GPT-4o-mini checkpoint) and the translation pipeline (machine translation model, post-editing steps) to allow replication.
- [Evaluation] Clarify whether the same harm-classification rubric was applied uniformly to both automated and human-red-teamed outputs.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for clarification in our manuscript. We address each major comment point by point below and indicate the revisions planned for the next version.
read point-by-point responses
-
Referee: [Methods] Methods section: the automated harmful-response classifier is not described in sufficient detail (prompt template, base model, or validation set). Without evidence that precision/recall is consistent across English, Afrikaans, Kiswahili, isiXhosa, and isiZulu (especially with code-switching or non-standard orthography), the cross-lingual rate comparisons and the claim that translation quality is the critical factor are difficult to interpret.
Authors: We agree that the Methods section requires substantially more detail on the automated classifier to support the cross-lingual comparisons and the attribution of performance differences to translation quality. In the revised manuscript we will add the full prompt template, the base model used for classification, the size and composition of the validation set, and any available precision/recall figures. We will also explicitly discuss limitations related to code-switching and non-standard orthography and note where language-specific validation was not performed. These additions will make the reliability of the automated rates transparent. revision: yes
-
Referee: [Results] Results and Abstract: the number of prompts, dataset size, and any statistical significance tests for the reported differences (e.g., +20.0% Afrikaans improvement from human red-teaming) are not stated. This leaves the concrete percentages (52.7%–83.6%) without clear error bars or power analysis, weakening confidence in the central empirical claims.
Authors: We acknowledge that the manuscript currently omits the exact number of prompts, dataset sizes, and statistical tests. In the revised version we will state the precise number of prompts and source dataset sizes for each experiment, add statistical significance tests (e.g., McNemar or chi-square tests) for the reported differences including the human red-teaming gains, and include error bars or confidence intervals on the key percentages. These changes will strengthen the presentation of the empirical results without altering the central findings. revision: yes
Circularity Check
No circularity: direct empirical measurement study
full rationale
This paper reports experimental results from translating English jailbreak prompts into low-resource African languages, querying commercial LLMs, and measuring harmful response rates via automated classifiers plus human red-teaming. No mathematical derivations, equations, fitted parameters, or predictions appear in the abstract or described methodology. Results are obtained from external model interactions and native-speaker sessions rather than any self-referential construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The study is therefore self-contained against external benchmarks with no reduction of claims to inputs by definition.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2305.06972. Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, Zac Kenton, Sasha Brown, Will Hawkins, Tom Stepleton, Courtney Biles, Abeba Birhane, Julia Haas, Laura Rimell, Lisa Anne Hendricks, William Isaac, Sean Legassick, Geoffrey ...
-
[2]
Ethical and social risks of harm from Language Models
URL https://arxiv.org/abs/2112.04359. Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, and Summer Yue. Llm defenses are not robust to multi-turn human jailbreaks yet, 2024a. URL https://arxiv.org/abs/ 2408.15221. Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, ...
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023
URL https://arxiv.org/abs/2310.06474. Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, and Qi Li. Jailbreak attacks and defenses against large language models: A survey,
-
[4]
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
URL https://arxiv.org/abs/2407.04295. Yi Liu, Gelei Deng, Zhengzi Xu, Yuekang Li, Yaowen Zheng, Ying Zhang, Lida Zhao, Tianwei Zhang, and Kailong Wang. A hitchhiker’s guide to jailbreaking chatgpt via prompt engineering. InProceedings of the 4th International Workshop on Software Engineering and AI for Data Quality in Cyber-Physical Systems/Internet of Th...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[5]
Association for Computing Machinery. ISBN 9798400706721. doi:10.1145/3663530.3665021. URL https://doi.org/10.1145/3663530.3665021. Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. "do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on...
-
[6]
K., Wen, Y ., Zhang, Y ., and Yin, C
URL https://arxiv.org/abs/2410.15236. Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. Deepinception: Hypnotize large language model to be jailbreaker, 2024c. URL https://arxiv.org/abs/2311.03191. Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack,
-
[7]
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
URL https://arxiv.org/abs/2404.01833. 8 Multilingual jailbreaking of LLMs using low-resource languagesA PREPRINT Zhenhua Wang, Wei Xie, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, and Kai Chen. Foot in the door: Understanding large language model jailbreaking via cognitive psychology, 2024a. URL https: //arxiv.org/abs/2402.15690. Zheng-Xin Yong...
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Low-Resource Languages Jailbreak GPT-4
URL https://arxiv.org/abs/2310.02446. Wenxuan Wang, Zhaopeng Tu, Chang Chen, Youliang Yuan, Jen-tse Huang, Wenxiang Jiao, and Michael Lyu. All languages matter: On the multilingual safety of LLMs. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics: ACL 2024, pages 5865–5877, Bangkok, Thailan...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.18653/v1/2024.findings-acl.349 2024
-
[9]
Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi:10.18653/v1/2025.emnlp-main.800. URL https://aclanthology.org/2025. emnlp-main.800/. Lingfeng Shen, Weiting Tan, Sihao Chen, Yunmo Chen, Jingyu Zhang, Haoran Xu, Boyuan Zheng, Philipp Koehn, and Daniel Khashabi. The language barrier: Dissecting safety challenges of llms in multilingual...
-
[10]
doi:10.18653/v1/2024.trustnlp-1.18. URL http://dx.doi.org/10. 18653/v1/2024.trustnlp-1.18. Jiayang Song, Yuheng Huang, Zhehua Zhou, and Lei Ma. Multilingual blending: Llm safety alignment evaluation with language mixture,
-
[11]
The language barrier: Dissecting safety challenges of llms in multilingual contexts
URL https://arxiv.org/abs/2407.07342. Common Crawl. Statistics of common crawl monthly archives: Distribution of languages, April
-
[12]
URL https: //commoncrawl.github.io/cc-crawl-statistics/plots/languages.html. Latest crawl: CC-MAIN-2025-13. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert,
work page 2025
-
[13]
BERTScore: Evaluating Text Generation with BERT
URL https://arxiv.org/abs/1904.09675. Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. InProceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72,
work page internal anchor Pith review Pith/arXiv arXiv 1904
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.