pith. sign in

arxiv: 2605.25420 · v1 · pith:RFUG5JBNnew · submitted 2026-05-25 · 💻 cs.CL · cs.AI· cs.CY

SomaliBench Eval: Measuring English-to-Somali Refusal Gaps in Open-Weight Language Models

Pith reviewed 2026-06-29 22:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.CY
keywords LLM safety evaluationrefusal gapsmultilingual modelsSomaliharmful promptsopen-weight LLMslow-resource languagesbenchmarking
0
0 comments X

The pith

Open-weight models refuse harmful prompts less often in Somali than in English.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper measures refusal behavior on 100 matched harmful-intent prompts run in English and Somali across four instruction-tuned models. All models refuse substantially more in English than in Somali when given the same English HHH system prompt. The gaps range from 0.38 to 0.90, and in Somali three models mostly produce unclear output rather than clear harmful answers. This shows that safety alignments learned in English do not transfer evenly to Somali. The benchmark uses native verification to confirm the external classifier labels.

Core claim

The central claim is that large English-to-Somali refusal gaps exist for every tested model: 0.90 for Llama-3.1-8B, 0.75 for Aya-23-8B, 0.69 for Qwen-2.5-7B, and 0.38 for Gemma-2-9B. For three models the main non-refusal pattern in Somali is unclear output rather than fluent compliance. The evaluation applies the identical English system prompt to paired prompts and relies on an external classifier whose labels match native spot-checks at 100 percent agreement on the sampled rows.

What carries the argument

SomaliBench v0, a set of 100 harmful-intent prompts verified by a native author and paired across English and Somali, evaluated under a fixed English HHH system prompt with responses labeled refused or not by a pinned external classifier.

If this is right

  • Safety training produces language-dependent refusal behavior rather than uniform protection.
  • Somali users are more likely than English users to receive non-refusals on harmful requests.
  • Unclear or off-language output becomes the dominant failure mode in Somali for three of the models.
  • Current English-centered safety benchmarks miss these gaps and understate risk in low-resource languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The measured gaps could widen or narrow if the system prompt itself were written in Somali.
  • Similar refusal shortfalls likely appear in other low-resource languages and could be measured with parallel benchmarks.
  • Keeping raw generations private prevents external checks on whether the unclear outputs contain hidden harmful content.

Load-bearing premise

Translating the English system prompt and harmful prompts into Somali preserves their original intent and harm level, and the external classifier judges Somali responses with the same accuracy as English ones.

What would settle it

Running the identical prompts with a Somali-translated system prompt and having native speakers classify the responses instead of the English-trained classifier would show whether the refusal gaps remain or shrink.

Figures

Figures reproduced from arXiv: 2605.25420 by Khalid Yusuf Dahir.

Figure 1
Figure 1. Figure 1: Refusal rate per model and language with 95% bootstrap confidence intervals. 5.2 Cross-lingual gaps [PITH_FULL_IMAGE:figures/full_fig_p006_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: English-to-Somali refusal gap by model. All intervals are strictly above zero [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: English-minus-Somali refusal gap by model and safety category. and shows that language competence and safety behavior are intertwined. Somali competence is a confound. The high unclear rate on Somali safety prompts may reflect both safety-conditioning failure and ordinary Somali generation-competence failure. The present study does not include a benign Somali control set, so it cannot estimate how often th… view at source ↗
read the original abstract

Large language model safety evaluation remains heavily English-centered, leaving low-resource languages under-measured even when models are deployed globally. We evaluate four open-weight instruction-tuned models on SomaliBench v0, a native-author-verified benchmark of 100 harmful-intent prompts paired across English and Somali. Each of Llama-3.1-8B-Instruct, Gemma-2-9B-Instruct, Qwen-2.5-7B-Instruct, and Aya-23-8B is run locally with temperature 0 and the same English "helpful, harmless, and honest" (HHH) system prompt. A pinned Claude Sonnet snapshot (claude-sonnet-4-5-20250929) classifies each response as refused, complied, or unclear; the native author spot-checks a stratified 80-row sample. We find large English-to-Somali refusal gaps for all four models: Llama-3.1-8B (0.90; 95% bootstrap CI [0.85, 0.96]), Aya-23-8B (0.75 [0.67, 0.83]), Qwen-2.5-7B (0.69 [0.59, 0.78]), and Gemma-2-9B (0.38 [0.27, 0.49]). For three models, the dominant Somali non-refusal mode is not fluent harmful compliance but unclear output: empty, wrong-language, or incoherent generations. The native verification spot-check achieves 100% agreement with the judge (Cohen's kappa = 1.00) on the 80 sampled rows. We report aggregate refusal rates, category gaps, and reliability statistics only; raw model generations are retained locally and are not released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces SomaliBench v0, a native-author-verified set of 100 harmful-intent prompts paired in English and Somali. It evaluates four open-weight instruction-tuned models (Llama-3.1-8B-Instruct, Aya-23-8B, Qwen-2.5-7B-Instruct, Gemma-2-9B-Instruct) under a fixed English HHH system prompt at temperature 0. Responses are classified by a pinned Claude Sonnet snapshot into refused/complied/unclear, with a stratified native spot-check on 80 rows showing 100% agreement (Cohen's kappa=1.00). The central empirical result is large English-to-Somali refusal gaps (Llama 0.90 [0.85,0.96], Aya 0.75 [0.67,0.83], Qwen 0.69 [0.59,0.78], Gemma 0.38 [0.27,0.49]) with bootstrap CIs; three models show unclear Somali outputs rather than fluent harmful compliance. Only aggregate statistics are reported; raw generations are retained locally.

Significance. If the measured gaps hold, the work supplies concrete evidence that current English-centric safety evaluations miss substantial refusal disparities in low-resource languages, with direct implications for globally deployed models. Strengths include the native verification step, perfect spot-check agreement, and bootstrap CIs; these make the empirical measurement more robust than typical unverified LLM-as-judge setups.

minor comments (3)
  1. [Abstract] Abstract: the full text of the English HHH system prompt is not provided, nor are the exact construction and translation procedures for the 100 prompts; these details are needed for independent replication even if the benchmark is native-verified.
  2. [Abstract] Abstract: raw model generations are not released (only aggregates), which is a deliberate choice but limits external auditing of the unclear-output category that dominates three of the Somali results.
  3. The manuscript reports 100% agreement on the 80-row spot-check but does not state the stratification criteria or the exact distribution of refused/complied/unclear labels in that sample; adding this would strengthen the reliability claim.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of SomaliBench v0 and for highlighting the robustness of the native verification and bootstrap CIs. The positive assessment and minor_revision recommendation are appreciated. No major comments appear in the provided report, so we have no points requiring rebuttal or revision at this stage.

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper reports a direct empirical measurement of refusal rates on a fixed set of 100 prompts using an external classifier (Claude) and a native-author spot-check on 80 rows. No equations, fitted parameters, self-citations, or derivation steps are present; the reported gaps are computed from raw model outputs classified against an independent judge whose accuracy is externally verified by the spot-check (100% agreement). The analysis is therefore self-contained against external benchmarks with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical evaluation paper. No mathematical derivations or new theoretical entities are introduced.

axioms (1)
  • standard math Bootstrap resampling produces valid 95% confidence intervals for the refusal-rate estimates.
    Invoked when reporting the CIs on the gap values.

pith-pipeline@v0.9.1-grok · 5870 in / 1213 out tokens · 29153 ms · 2026-06-29T22:30:35.370609+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

15 extracted references · 13 canonical work pages · 9 internal anchors

  1. [1]

    Aya 23: Open weight releases to further multilingual progress.arXiv preprint arXiv:2405.15032, 2024

    Viraat Aryabumi, John Dang, Dwarak Talupuru, Saurabh Dash, David Cairuz, Hangyu Lin, Bharat Venkitesh, Madeline Smith, Kelly Marchisio, Sebastian Ruder, Acyr Locatelli, Julia Kreutzer, Phil Blunsom, Marzieh Fadaee, Ahmet Üstün, and Sara Hooker. Aya 23: Open weight releases to further multilingual progress.arXiv preprint arXiv:2405.15032, 2024. URL https:/...

  2. [2]

    A coefficient of agreement for nominal scales

    Jacob Cohen. A coefficient of agreement for nominal scales.Educational and Psychological Measurement, 20(1):37–46, 1960. doi: 10.1177/001316446002000104

  3. [3]

    SomaliBench v0: A native-verified safety benchmark for somali.https: //huggingface.co/datasets/khaledyusuf44/somalibench-v0, 2026

    Khalid Yusuf Dahir. SomaliBench v0: A native-verified safety benchmark for somali.https: //huggingface.co/datasets/khaledyusuf44/somalibench-v0, 2026. Dataset, CC-BY-NC- 4.0

  4. [4]

    SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

    Khalid Yusuf Dahir. SomaliWeb v1: A quality-filtered somali web corpus with a matched tokenizer and a public language-identification benchmark.arXiv preprint arXiv:2605.18232,

  5. [5]
  6. [6]

    Multilingual jailbreak challenges in large language models,

    Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. Multilingual jailbreak challenges in large language models.arXiv preprint arXiv:2310.06474, 2023. URL https://arxiv.org/ abs/2310.06474

  7. [7]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, et al. The Llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024. URL https://arxiv. org/abs/2407.21783

  8. [8]

    Tibshirani.An Introduction to the Bootstrap

    Bradley Efron and Robert J. Tibshirani.An Introduction to the Bootstrap. Chapman and Hall/CRC, 1994

  9. [9]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024. URLhttps://arxiv.org/abs/2408.00118

  10. [10]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, David Forsyth, and Dan Hendrycks. HarmBench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024. URLhttps://arxiv.org/abs/2402.04249. 10

  11. [11]

    Qwen2.5 Technical Report

    Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024. URLhttps: //arxiv.org/abs/2412.15115

  12. [12]

    Zheng-Xin Yong, Cristina Menghini, and Stephen H. Bach. Low-resource languages jailbreak GPT-4.arXiv preprint arXiv:2310.02446, 2023. URLhttps://arxiv.org/abs/2310.02446

  13. [13]

    Bach, and Julia Kreutzer

    Zheng-Xin Yong, Beyza Ermis, Marzieh Fadaee, Stephen H. Bach, and Julia Kreutzer. The state of multilingual LLM safety research: From measuring the language gap to mitigating it. arXiv preprint arXiv:2505.24119, 2025. URLhttps://arxiv.org/abs/2505.24119

  14. [14]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and chatbot arena. InAdvances in Neural Information Processing Systems, 2023. URLhttps://arxiv.org/abs/2306.05685

  15. [15]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. URLhttps://arxiv.org/abs/2307.15043. A Reproducibility Checklist •Code repository:https://github.com/khaledyusuf44/somalibench_eval •Benchmark:https://hugg...