pith:C44PCWVM
Safety Is Not Universal: The Selective Safety Trap in LLM Alignment
LLM safety alignment protects some demographic groups far more than others, with defense rates varying by up to 42 percent within one model.
arxiv:2601.04389 v3 · 2026-01-07 · cs.CL · cs.AI
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{C44PCWVMADHSUO3LNN5V52GKD6}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Safety alignment is not a uniform semantic capability but a demographic hierarchy, with defense rates fluctuating by up to 42% within the same model solely based on the target group. This disparity persists across architectures and languages and is amplified by scaling, indicating that current alignment methods learn group-specific safeguards rather than a generalized notion of harm.
The 43,961 prompts in MiJaBench represent equivalent adversarial difficulty and comparable harm potential across the 16 demographic groups; if prompt construction or translation introduces systematic differences in attack strength, the measured defense-rate gaps would be artifacts rather than evidence of selective safety.
Safety alignment in LLMs is not uniform but forms a demographic hierarchy, with defense rates varying by up to 42% across groups; a new benchmark and DPO method demonstrate transferable safety.
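The per-group comparison behind the "up to 42%" figure can be sketched as below. This is a minimal illustration, not the paper's evaluation code: it assumes a defense rate is the fraction of adversarial prompts for a group that the model successfully defends against, and that the gap is the difference between the best- and worst-protected groups. The function names, group labels, and sample outcomes are all hypothetical.

```python
from collections import defaultdict


def defense_rates(results):
    """Return {group: defended fraction} from (group, defended) pairs.

    Assumption: 'defended' is a boolean judgment per prompt; the paper's
    actual scoring procedure is not described in this record.
    """
    totals = defaultdict(int)
    defended = defaultdict(int)
    for group, was_defended in results:
        totals[group] += 1
        defended[group] += int(was_defended)
    return {group: defended[group] / totals[group] for group in totals}


def max_gap(rates):
    """Largest difference between any two groups' defense rates."""
    return max(rates.values()) - min(rates.values())


# Hypothetical toy outcomes for two placeholder groups (not real data).
sample = (
    [("group_a", True)] * 9 + [("group_a", False)] * 1
    + [("group_b", True)] * 5 + [("group_b", False)] * 5
)
rates = defense_rates(sample)
gap = max_gap(rates)
```

Under this reading, a gap only counts as evidence of selective safety if the prompts for each group are comparably difficult, which is the assumption stated in the second claim above.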
Receipt and verification
| First computed | 2026-06-23T02:13:19.864820Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
1738f15aac00cf2a3b6b6b7b5ee8ca1fb41e11aacf7fe8a8b94432f1f2f9a670
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/C44PCWVMADHSUO3LNN5V52GKD6 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 1738f15aac00cf2a3b6b6b7b5ee8ca1fb41e11aacf7fe8a8b94432f1f2f9a670
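The hashing step of the command above can also be written as a small Python module, which may be easier to reuse offline once the record has been fetched. This is a sketch that mirrors the serialization settings in the one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes). The function names `canonical_hash` and `verify` are illustrative and are not part of any Pith tooling.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the record's canonical JSON form.

    Mirrors the one-liner: keys sorted, no whitespace between tokens,
    non-ASCII characters kept as-is, encoded as UTF-8.
    """
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def verify(record: dict, expected_hex: str) -> bool:
    """Compare the computed digest with a published hex digest."""
    return canonical_hash(record) == expected_hex.strip().lower()
```

To use it, parse the `canonical_record` object from the API response with `json.loads` and pass it to `verify` together with the published canonical hash. Because keys are sorted before hashing, the key order of the parsed input does not affect the result.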
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "1e5858e07c56143bc8ec3894c094d9d5073d73193f6f9129aec9b301cb525788",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-01-07T20:53:18Z",
    "title_canon_sha256": "cc28d64672e15f525a89530e71aa2adb9c50a1d166cde7c89ec0ff0bf76754b6"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2601.04389",
    "kind": "arxiv",
    "version": 3
  }
}