Pith Number

pith:4K7NVIAP

pith:2026:4K7NVIAP4RMR7CGGXTX4LS7J74
not attested · not anchored · not stored · refs resolved

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Arthur Gervais, Isaac David

Task-only LoRA adaptation enables high performance on authorized security tasks while keeping unsafe compliance low.

arxiv:2605.17413 v1 · 2026-05-17 · cs.CR · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{4K7NVIAP4RMR7CGGXTX4LS7J74}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
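The page does not describe the merge algorithm itself, so the following is only a minimal sketch of what a deterministic, order-independent merge could look like. Everything here is an assumption for illustration: the event fields `ts`, `field`, and `value`, the helper names `event_id` and `merge`, and the last-writer-wins rule. Signature checking is omitted.

```python
import hashlib
import json


def event_id(event: dict) -> str:
    """Content hash of an event (hypothetical identifier scheme)."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def merge(*bundles: list) -> dict:
    """Fold events from any number of bundles into one state.

    Events are deduplicated by content hash and applied in a total order
    (timestamp, then hash), so the result does not depend on the order in
    which bundles or events arrive.
    """
    unique = {event_id(e): e for bundle in bundles for e in bundle}
    state = {}
    for _, event in sorted(unique.items(), key=lambda kv: (kv[1]["ts"], kv[0])):
        state[event["field"]] = event["value"]  # last writer wins (assumed rule)
    return state


# Hypothetical events; not taken from the actual bundle.
mirror_a = [{"ts": "2026-05-20T00:00:00Z", "field": "author_claim", "value": "open"}]
mirror_b = [
    {"ts": "2026-05-21T00:00:00Z", "field": "author_claim", "value": "claimed"},
    mirror_a[0],  # duplicate event, removed by content hash
]
print(merge(mirror_a, mirror_b) == merge(mirror_b, mirror_a))
```

The design point is that sorting by a total order before folding makes the outcome independent of arrival order, which is the property a mirror needs to recompute the same state.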

Claims

C1 strongest claim

Task-only LoRA raises the mean security score to 0.87, with a general score of 0.83 and unsafe compliance of 0.13, while refusal suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe.

C2 weakest assumption

The Security-AR 60-prompt suite and its executable secure-repair validators accurately capture authorized defensive tasks and correctly distinguish valid security outputs from unsafe spillover without introducing selection bias or validator errors.

C3 one line summary

Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.

References

43 extracted · 43 resolved · 23 Pith anchors

[1] H. Abu Shairah, H. A. A. K. Hammoud, B. Ghanem, and G. Turkiyyah. An embarrassingly simple defense against LLM abliteration attacks. arXiv preprint arXiv:2505.19056, 2025
[2] S. Agnihotri, J. Jakubassa, P. Dey, S. Goyal, B. Schiele, V. B. Radhakrishnan, and M. Keuper. A granular study of safety pretraining under model abliteration. arXiv preprint arXiv:2510.02768, 2025
[3] Refusal in Language Models Is Mediated by a Single Direction 2024 · arXiv:2406.11717
[4] Program Synthesis with Large Language Models 2021 · arXiv:2108.07732
[5] Constitutional AI: Harmlessness from AI Feedback 2022 · arXiv:2212.08073

Formal links

2 machine-checked theorem links

Receipt and verification
First computed 2026-05-20T00:03:57.174485Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

e2bedaa00fe4591f88c6bcefc5cbe9ff1370637720b8f025561799ab643b96af
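For this record, the Pith Number matches the unpadded RFC 4648 Base32 encoding of the first 16 bytes of the canonical hash. The page does not state this as a rule, so treat the sketch below as an observation about this one record; the function name `pith_number_from_hash` is hypothetical.

```python
import base64


def pith_number_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of a SHA-256 hex digest, without padding.

    Assumption: this mirrors how the identifier relates to the hash on this
    page; it is not taken from a published specification.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw[:16]).decode("ascii").rstrip("=")


CANONICAL_HASH = "e2bedaa00fe4591f88c6bcefc5cbe9ff1370637720b8f025561799ab643b96af"
print(pith_number_from_hash(CANONICAL_HASH))  # 4K7NVIAP4RMR7CGGXTX4LS7J74
```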

Aliases

arxiv: 2605.17413 · arxiv_version: 2605.17413v1 · doi: 10.48550/arxiv.2605.17413 · pith_short_12: 4K7NVIAP4RMR · pith_short_16: 4K7NVIAP4RMR7CGG · pith_short_8: 4K7NVIAP
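The `pith_short_N` aliases above are the first N characters of the full Pith Number. A minimal sketch that parses the alias line and checks this, assuming the ` · ` separator and `key: value` layout shown on this page (the helper `parse_aliases` is hypothetical):

```python
def parse_aliases(line: str) -> dict:
    """Split a ' · '-separated alias line into a key/value mapping."""
    aliases = {}
    for item in line.split(" · "):
        key, _, value = item.partition(": ")
        aliases[key.strip()] = value.strip()
    return aliases


ALIAS_LINE = (
    "arxiv: 2605.17413 · arxiv_version: 2605.17413v1 · "
    "doi: 10.48550/arxiv.2605.17413 · pith_short_12: 4K7NVIAP4RMR · "
    "pith_short_16: 4K7NVIAP4RMR7CGG · pith_short_8: 4K7NVIAP"
)
PITH_NUMBER = "4K7NVIAP4RMR7CGGXTX4LS7J74"

aliases = parse_aliases(ALIAS_LINE)
# Collect, for each short alias, whether it is a prefix of the full number.
prefix_checks = {
    key: value == PITH_NUMBER[: int(key.rsplit("_", 1)[1])]
    for key, value in aliases.items()
    if key.startswith("pith_short_")
}
print(prefix_checks)
```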
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/4K7NVIAP4RMR7CGGXTX4LS7J74 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: e2bedaa00fe4591f88c6bcefc5cbe9ff1370637720b8f025561799ab643b96af
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "36a4ba954cf411b8fc4ea159a8acd5b3434ce3d8b7192524575470ed1da7d979",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CR",
    "submitted_at": "2026-05-17T12:18:20Z",
    "title_canon_sha256": "be6da42f45742f08a97df68dd330140a159846a46b391c9cfcd355270d356d24"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.17413",
    "kind": "arxiv",
    "version": 1
  }
}
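The shell one-liner above can also be written as a standalone script that works on a local copy of the record. This is a sketch using the same canonicalization as the one-liner (sorted keys, compact separators, UTF-8, no ASCII escaping); the function names are hypothetical, and the digest has not been independently recomputed here, so the script compares rather than assumes a match.

```python
import hashlib
import json


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and compact separators, as in the one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_hash(record: dict) -> str:
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


# Local copy of the canonical record shown on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "36a4ba954cf411b8fc4ea159a8acd5b3434ce3d8b7192524575470ed1da7d979",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CR",
        "submitted_at": "2026-05-17T12:18:20Z",
        "title_canon_sha256": "be6da42f45742f08a97df68dd330140a159846a46b391c9cfcd355270d356d24",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.17413", "kind": "arxiv", "version": 1},
}
EXPECTED = "e2bedaa00fe4591f88c6bcefc5cbe9ff1370637720b8f025561799ab643b96af"

if __name__ == "__main__":
    digest = canonical_hash(RECORD)
    print(digest)
    print("match" if digest == EXPECTED else "mismatch")
```

Because keys are sorted at every nesting level before hashing, the digest does not depend on the key order of the input JSON, only on its content.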