SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

Khalid Yusuf Dahir

arxiv: 2605.25420 · v1 · pith:RFUG5JBNnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI· cs.CY

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

Khalid Yusuf Dahir This is my paper

Pith reviewed 2026-06-29 22:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CY

keywords LLM safety evaluationrefusal gapsmultilingual modelsSomaliharmful promptsopen-weight LLMslow-resource languagesbenchmarking

0 comments

The pith

Open-weight models refuse harmful prompts less often in Somali than in English.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures refusal behavior on 100 matched harmful-intent prompts run in English and Somali across four instruction-tuned models. All models refuse substantially more in English than in Somali when given the same English HHH system prompt. The gaps range from 0.38 to 0.90, and in Somali three models mostly produce unclear output rather than clear harmful answers. This shows that safety alignments learned in English do not transfer evenly to Somali. The benchmark uses native verification to confirm the external classifier labels.

Core claim

The central claim is that large English-to-Somali refusal gaps exist for every tested model: 0.90 for Llama-3.1-8B, 0.75 for Aya-23-8B, 0.69 for Qwen-2.5-7B, and 0.38 for Gemma-2-9B. For three models the main non-refusal pattern in Somali is unclear output rather than fluent compliance. The evaluation applies the identical English system prompt to paired prompts and relies on an external classifier whose labels match native spot-checks at 100 percent agreement on the sampled rows.

What carries the argument

SomaliBench v0, a set of 100 harmful-intent prompts verified by a native author and paired across English and Somali, evaluated under a fixed English HHH system prompt with responses labeled refused or not by a pinned external classifier.

If this is right

Safety training produces language-dependent refusal behavior rather than uniform protection.
Somali users are more likely than English users to receive non-refusals on harmful requests.
Unclear or off-language output becomes the dominant failure mode in Somali for three of the models.
Current English-centered safety benchmarks miss these gaps and understate risk in low-resource languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The measured gaps could widen or narrow if the system prompt itself were written in Somali.
Similar refusal shortfalls likely appear in other low-resource languages and could be measured with parallel benchmarks.
Keeping raw generations private prevents external checks on whether the unclear outputs contain hidden harmful content.

Load-bearing premise

Translating the English system prompt and harmful prompts into Somali preserves their original intent and harm level, and the external classifier judges Somali responses with the same accuracy as English ones.

What would settle it

Running the identical prompts with a Somali-translated system prompt and having native speakers classify the responses instead of the English-trained classifier would show whether the refusal gaps remain or shrink.

Figures

Figures reproduced from arXiv: 2605.25420 by Khalid Yusuf Dahir.

**Figure 2.** Figure 2: English-to-Somali refusal gap by model. All intervals are strictly above zero [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: English-minus-Somali refusal gap by model and safety category. and shows that language competence and safety behavior are intertwined. Somali competence is a confound. The high unclear rate on Somali safety prompts may reflect both safety-conditioning failure and ordinary Somali generation-competence failure. The present study does not include a benign Somali control set, so it cannot estimate how often th… view at source ↗

read the original abstract

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper measures large English-to-Somali refusal gaps with native verification that directly addresses the main concerns about translation and judging.

read the letter

The main takeaway is that this paper finds large refusal gaps from English to Somali across four open models, and the native spot-check makes the result hold up better than most translation studies.

What is new is SomaliBench v0, a set of 100 harmful-intent prompts with native-author verification for equivalence across languages. The evaluation applies the same English HHH system prompt to both English and Somali versions, runs the models locally at temperature zero, and classifies outputs with a fixed Claude snapshot. The stratified 80-row native spot-check reaches 100% agreement and kappa of 1.00, which directly supports the classifier's reliability on Somali responses.

The paper does well by reporting bootstrap confidence intervals on the gaps and by noting that unclear outputs, not fluent harmful compliance, drive most of the Somali non-refusals. Keeping the system prompt identical across languages isolates the effect reasonably well.

The softer spots are modest. The prompt set is only 100 items, full prompt text and construction details are not shown in the abstract, and raw generations stay local. The external judge still carries some unquantified risk beyond the sampled rows, even with perfect agreement on those. These are normal limits for an early benchmark and do not break the central measurement.

This work is for researchers focused on multilingual safety or low-resource language performance in deployed models. Readers who need concrete numbers on refusal generalization will get direct value from the gaps and the verification stats.

The paper shows clear thinking on the evaluation design and handles the obvious assumptions about prompt equivalence and judgment accuracy. It deserves a serious referee.

Referee Report

0 major / 3 minor

Summary. The paper introduces SomaliBench v0, a native-author-verified set of 100 harmful-intent prompts paired in English and Somali. It evaluates four open-weight instruction-tuned models (Llama-3.1-8B-Instruct, Aya-23-8B, Qwen-2.5-7B-Instruct, Gemma-2-9B-Instruct) under a fixed English HHH system prompt at temperature 0. Responses are classified by a pinned Claude Sonnet snapshot into refused/complied/unclear, with a stratified native spot-check on 80 rows showing 100% agreement (Cohen's kappa=1.00). The central empirical result is large English-to-Somali refusal gaps (Llama 0.90 [0.85,0.96], Aya 0.75 [0.67,0.83], Qwen 0.69 [0.59,0.78], Gemma 0.38 [0.27,0.49]) with bootstrap CIs; three models show unclear Somali outputs rather than fluent harmful compliance. Only aggregate statistics are reported; raw generations are retained locally.

Significance. If the measured gaps hold, the work supplies concrete evidence that current English-centric safety evaluations miss substantial refusal disparities in low-resource languages, with direct implications for globally deployed models. Strengths include the native verification step, perfect spot-check agreement, and bootstrap CIs; these make the empirical measurement more robust than typical unverified LLM-as-judge setups.

minor comments (3)

[Abstract] Abstract: the full text of the English HHH system prompt is not provided, nor are the exact construction and translation procedures for the 100 prompts; these details are needed for independent replication even if the benchmark is native-verified.
[Abstract] Abstract: raw model generations are not released (only aggregates), which is a deliberate choice but limits external auditing of the unclear-output category that dominates three of the Somali results.
The manuscript reports 100% agreement on the 80-row spot-check but does not state the stratification criteria or the exact distribution of refused/complied/unclear labels in that sample; adding this would strengthen the reliability claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of SomaliBench v0 and for highlighting the robustness of the native verification and bootstrap CIs. The positive assessment and minor_revision recommendation are appreciated. No major comments appear in the provided report, so we have no points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports a direct empirical measurement of refusal rates on a fixed set of 100 prompts using an external classifier (Claude) and a native-author spot-check on 80 rows. No equations, fitted parameters, self-citations, or derivation steps are present; the reported gaps are computed from raw model outputs classified against an independent judge whose accuracy is externally verified by the spot-check (100% agreement). The analysis is therefore self-contained against external benchmarks with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical evaluation paper. No mathematical derivations or new theoretical entities are introduced.

axioms (1)

standard math Bootstrap resampling produces valid 95% confidence intervals for the refusal-rate estimates.
Invoked when reporting the CIs on the gap values.

pith-pipeline@v0.9.1-grok · 5870 in / 1213 out tokens · 29153 ms · 2026-06-29T22:30:35.370609+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

15 extracted references · 13 canonical work pages · 9 internal anchors

[1]

Aya 23: Open weight releases to further multilingual progress.arXiv preprint arXiv:2405.15032, 2024

Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. Aya 23: Open weight releases to further multilingual progress.arXiv preprint arXiv:2405.15032, 2024. URL https:/...

work page arXiv 2024
[2]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960. doi: 10.1177/001316446002000104

work page doi:10.1177/001316446002000104 1960
[3]

SomaliBench v0: A native-verified safety benchmark for somali.https: //huggingface.co/datasets/khaledyusuf44/somalibench-v0, 2026

Khalid Yusuf Dahir. SomaliBench v0: A native-verified safety benchmark for somali.https: //huggingface.co/datasets/khaledyusuf44/somalibench-v0, 2026. Dataset, CC-BY-NC- 4.0

2026
[4]

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Khalid Yusuf Dahir. SomaliWeb v1: A quality-filtered somali web corpus with a matched tokenizer and a public language-identification benchmark.arXiv preprint arXiv:2605.18232,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

doi: 10.48550/arXiv.2605.18232. URLhttps://arxiv.org/abs/2605.18232

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.18232
[6]

Multilingual jailbreak challenges in large language models,

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023. URL https://arxiv.org/ abs/2310.06474

work page arXiv 2023
[7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv. org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Tibshirani.An Introduction to the Bootstrap

Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994

1994
[9]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024. URLhttps://arxiv.org/abs/2408.00118

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024. URLhttps://arxiv.org/abs/2402.04249. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. URLhttps: //arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-resource languages jailbreak GPT-4.arXiv preprint arXiv:2310.02446, 2023. URLhttps://arxiv.org/abs/2310.02446

work page internal anchor Pith review Pith/arXiv arXiv 2023
[13]

Bach, and Julia Kreutzer

Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual LLM safety research: From measuring the language gap to mitigating it. arXiv preprint arXiv:2505.24119, 2025. URLhttps://arxiv.org/abs/2505.24119

work page arXiv 2025
[14]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. InAdvances in Neural Information Processing Systems, 2023. URLhttps://arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. URLhttps://arxiv.org/abs/2307.15043. A Reproducibility Checklist •Code repository:https://github.com/khaledyusuf44/somalibench_eval •Benchmark:https://hugg...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[1] [1]

Aya 23: Open weight releases to further multilingual progress.arXiv preprint arXiv:2405.15032, 2024

Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. Aya 23: Open weight releases to further multilingual progress.arXiv preprint arXiv:2405.15032, 2024. URL https:/...

work page arXiv 2024

[2] [2]

A coefficient of agreement for nominal scales

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960. doi: 10.1177/001316446002000104

work page doi:10.1177/001316446002000104 1960

[3] [3]

SomaliBench v0: A native-verified safety benchmark for somali.https: //huggingface.co/datasets/khaledyusuf44/somalibench-v0, 2026

Khalid Yusuf Dahir. SomaliBench v0: A native-verified safety benchmark for somali.https: //huggingface.co/datasets/khaledyusuf44/somalibench-v0, 2026. Dataset, CC-BY-NC- 4.0

2026

[4] [4]

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

Khalid Yusuf Dahir. SomaliWeb v1: A quality-filtered somali web corpus with a matched tokenizer and a public language-identification benchmark.arXiv preprint arXiv:2605.18232,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [5]

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

doi: 10.48550/arXiv.2605.18232. URLhttps://arxiv.org/abs/2605.18232

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2605.18232

[6] [6]

Multilingual jailbreak challenges in large language models,

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023. URL https://arxiv.org/ abs/2310.06474

work page arXiv 2023

[7] [7]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv. org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[8] [8]

Tibshirani.An Introduction to the Bootstrap

Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994

1994

[9] [9]

Gemma 2: Improving Open Language Models at a Practical Size

Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024. URLhttps://arxiv.org/abs/2408.00118

work page internal anchor Pith review Pith/arXiv arXiv 2024

[10] [10]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024. URLhttps://arxiv.org/abs/2402.04249. 10

work page internal anchor Pith review Pith/arXiv arXiv 2024

[11] [11]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. URLhttps: //arxiv.org/abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-resource languages jailbreak GPT-4.arXiv preprint arXiv:2310.02446, 2023. URLhttps://arxiv.org/abs/2310.02446

work page internal anchor Pith review Pith/arXiv arXiv 2023

[13] [13]

Bach, and Julia Kreutzer

Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual LLM safety research: From measuring the language gap to mitigating it. arXiv preprint arXiv:2505.24119, 2025. URLhttps://arxiv.org/abs/2505.24119

work page arXiv 2025

[14] [14]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. InAdvances in Neural Information Processing Systems, 2023. URLhttps://arxiv.org/abs/2306.05685

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. URLhttps://arxiv.org/abs/2307.15043. A Reproducibility Checklist •Code repository:https://github.com/khaledyusuf44/somalibench_eval •Benchmark:https://hugg...

work page internal anchor Pith review Pith/arXiv arXiv 2023