Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Arthur Gervais; Isaac David

arxiv: 2605.17413 · v1 · pith:4K7NVIAPnew · submitted 2026-05-17 · 💻 cs.CR · cs.AI


Pith reviewed 2026-05-19 23:26 UTC · model grok-4.3


keywords: language model alignment · safety ablation · LoRA fine-tuning · cybersecurity evaluation · refusal mechanisms · utility risk trade-off · security applications


The pith

Task-only LoRA adaptation enables high performance on authorized security tasks while keeping unsafe compliance low.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates techniques for removing safety alignment from language models so that they respond to authorized cybersecurity requests they would otherwise refuse. It introduces the Security-AR benchmark, a 60-prompt suite covering authorized security tasks, general capabilities, and spillover probes, and tests methods including prompting, refusal-direction projection, and LoRA fine-tuning. The key finding is that adapting a model with task-only LoRA achieves a mean security score of 0.87, retains a general score of 0.83, and limits unsafe compliance to 0.13, outperforming other approaches that either improve security less or increase unsafe outputs more. This frames alignment removal as a trade-off between utility and risk rather than a straightforward way to remove restrictions.

Core claim

The authors demonstrate through controlled experiments that different alignment removal methods have varying effects on security task success, general capability retention, and unsafe spillover. Specifically, single-vector refusal projection only marginally improves security scores while greatly increasing unsafe compliance, whereas task-only LoRA substantially raises security performance with minimal spillover increase and good retention of general abilities. They argue that alignment removal should be viewed as navigating a utility-risk frontier.

What carries the argument

Comparison of alignment ablation techniques including refusal-direction activation projection, representation-control projections, and LoRA-based task adaptation, evaluated on the Security-AR prompt suite with secure-repair validators.

If this is right

Alignment removal techniques should be evaluated based on their position on a utility-risk frontier rather than solely on their ability to increase compliance.
Task-specific LoRA offers a promising balance for security applications by improving defensive task performance without substantially raising unsafe outputs.
Refusal suppression methods can lead to higher rates of out-of-scope unsafe compliance compared to targeted adaptation.
General capability retention is possible even when adapting for specific security tasks using low-rank methods.
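The frontier framing above can be made concrete with a small dominance check. The sketch below is illustrative only: it uses the figures quoted in the abstract, reads "matching the aligned spillover rate" as 0.10 (an assumption), and omits refusal suppression with retention because only its spillover is quoted. The projection and LoRA numbers come from different runs and model sets, so the comparison shows the mechanics rather than a result. The names `Point`, `dominates`, and `pareto_frontier` are hypothetical.

```python
from typing import NamedTuple


class Point(NamedTuple):
    method: str
    security: float  # mean security score, higher is better
    unsafe: float    # out-of-scope unsafe compliance, lower is better


# Figures quoted in the abstract. Assumption: the rank-4 projection's
# spillover is taken to equal the aligned baseline's 0.10.
REPORTED = [
    Point("aligned baseline", 0.46, 0.10),
    Point("single-vector refusal projection", 0.50, 0.47),
    Point("rank-4 refusal-subspace projection", 0.51, 0.10),
    Point("task-only LoRA", 0.87, 0.13),
]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.security >= b.security and a.unsafe <= b.unsafe
    strictly_better = a.security > b.security or a.unsafe < b.unsafe
    return no_worse and strictly_better


def pareto_frontier(points: list[Point]) -> list[Point]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


frontier = pareto_frontier(REPORTED)
for p in frontier:
    print(f"{p.method}: security={p.security:.2f}, unsafe={p.unsafe:.2f}")
```

Under these assumptions a method stays on the frontier only if no alternative is at least as useful and at least as safe, which is the sense in which compliance alone does not rank methods.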

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar techniques could be applied to other restricted domains like medical or legal advice where legitimate queries are refused.
Organizations developing internal security tools might use task LoRA to create specialized assistants, but would need robust monitoring for unintended behaviors.
Future evaluations could include dynamic or multi-turn interactions to test if the low spillover holds in more complex scenarios.

Load-bearing premise

The Security-AR prompt suite and its executable validators accurately measure authorized security tasks and distinguish them from unsafe outputs without selection bias or errors.

What would settle it

A follow-up evaluation on a broader or independently curated set of security prompts where task LoRA no longer shows superior security scores or where unsafe compliance rises unexpectedly.

Figures

Figures reproduced from arXiv: 2605.17413 by Arthur Gervais, Isaac David.

**Figure 1.** Overview of the ABLATING SAFETY evaluation loop. Prompt-level and model-level interventions are routed through the same task suite and scored jointly for authorized security utility, general-capability retention, and out-of-scope spillover. The decision target is the utility-risk frontier. The measurement problem is to estimate the Pareto frontier between lower refusal on authorized security tasks, higher …

**Figure 2.** Utility-spillover frontier for extended LoRA on the three Qwen2.5 models. The x-axis is …

**Figure 3.** LoRA tradeoffs by adapter family. Task-only and refusal-suppression adapters produce the …

**Figure 4.** Projection-strength sweep for Qwen2.5-1.5B-Instruct across random and refusal projections.

**Figure 5.** Projection-layer sweep for Qwen2.5-1.5B-Instruct. Random projection preserves out-of …

Original abstract

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.
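To show how the metrics named in the abstract can be reported jointly rather than as a single compliance number, here is a minimal aggregation sketch. The record schema, the category labels, and the function `summarize` are all hypothetical; the paper's actual definitions are not given on this page. In particular, attempt rate is approximated here as non-refusal, which may differ from the authors' definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Completion:
    category: str  # hypothetical labels: "security", "general", "spillover"
    refused: bool  # model declined to answer
    score: float   # validator or grader score in [0, 1]; 0.0 when refused
    unsafe: bool   # True if a spillover probe received unsafe compliance


def _rate(items, predicate):
    """Fraction of items satisfying predicate; NaN for an empty group."""
    if not items:
        return float("nan")
    return mean(1.0 if predicate(c) else 0.0 for c in items)


def _mean_score(items):
    return mean(c.score for c in items) if items else float("nan")


def summarize(completions):
    """Aggregate per-category metrics from a flat list of completions."""
    groups = {}
    for c in completions:
        groups.setdefault(c.category, []).append(c)
    sec = groups.get("security", [])
    gen = groups.get("general", [])
    spill = groups.get("spillover", [])
    return {
        "security_refusal_rate": _rate(sec, lambda c: c.refused),
        # Assumption: an attempt is any non-refusal.
        "security_attempt_rate": _rate(sec, lambda c: not c.refused),
        "mean_security_score": _mean_score(sec),
        "mean_general_score": _mean_score(gen),
        "unsafe_compliance_rate": _rate(spill, lambda c: c.unsafe),
    }


# Toy records for illustration only; these are not data from the paper.
toy = [
    Completion("security", False, 1.0, False),
    Completion("security", False, 0.5, False),
    Completion("security", True, 0.0, False),
    Completion("security", False, 1.0, False),
    Completion("general", False, 1.0, False),
    Completion("general", False, 0.5, False),
    Completion("spillover", False, 0.0, True),
    Completion("spillover", True, 0.0, False),
    Completion("spillover", True, 0.0, False),
    Completion("spillover", True, 0.0, False),
]
summary = summarize(toy)
print(summary)
```

Keeping refusal, validated score, and spillover as separate fields is what lets a low refusal rate be distinguished from actual task success.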

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Task-only LoRA reaches 0.87 security score with 0.13 unsafe compliance while refusal suppression pushes spillover to 0.27, but the whole comparison sits on an unvalidated 60-prompt suite.


The main thing to know is that task-specific LoRA adaptation improves performance on authorized security prompts more than projection methods do while keeping unsafe compliance lower, whereas refusal suppression raises spillover rates. The paper runs these comparisons on a new Security-AR suite of 60 prompts covering defensive tasks, general capability, and out-of-scope probes. It reports a four-model projection pilot with 416 completions, a Qwen2.5 LoRA extension with 1,980 held-out completions, and representation and robustness sweeps.

The concrete numbers: single-vector projection lifts the mean security score only from 0.46 to 0.50 while raising unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA hits 0.87 security and 0.83 general with 0.13 unsafe compliance, while refusal suppression with retention reaches 0.27 spillover. This supports treating alignment removal as a tunable utility-risk trade-off rather than blanket uncensoring.

The work does a reasonable job laying out multiple techniques side by side and using executable validators for the security metric, which at least tries to make success measurable. The soft spot is exactly what the stress-test flags: everything rests on whether the Security-AR prompts and secure-repair validators actually separate valid defensive outputs from unsafe ones without selection bias or systematic misses. The abstract gives no pre-registration details, inter-annotator numbers, or external checks on the suite, so if those components have issues the frontier story weakens. The set of models tested is also narrow.

This is for people in AI safety or cybersecurity who need data on how alignment interventions affect red-teaming or defensive tooling benchmarks. The empirical comparisons are grounded enough to merit a serious referee even if the evaluation details require tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims that safety-aligned LLMs refuse legitimate cybersecurity requests due to alignment policies, and proposes evaluating alignment removal techniques (authorized-context prompting, reversible refusal-direction projection, representation-control projections, and LoRA-based de-alignment or task adaptation) as a controlled protocol for authorized security tasks. Using the new Security-AR 60-prompt suite with executable secure-repair validators, it reports metrics including refusal rate, validated security success, general-capability retention, and out-of-scope unsafe compliance across multiple models and runs (416 completions in a four-model pilot plus 1,980 held-out completions in a Qwen2.5 LoRA extension). Key results include single-vector refusal projection raising mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47, rank-4 refusal-subspace projection reaching 0.51 with matched spillover, and task-only LoRA achieving 0.87 security score with 0.83 general score and 0.13 unsafe compliance; the work concludes that alignment removal should be viewed as a utility-risk frontier rather than an uncensoring recipe.

Significance. If the Security-AR suite and validators reliably distinguish authorized defensive tasks from unsafe spillover without selection bias, the results provide concrete empirical support for treating compliance as neither equivalent to competence nor a safe deployment signal. The scale of the experiments and direct measurement of spillover effects offer a useful comparative framework for security applications of LLMs, with potential to inform more nuanced alignment research.

major comments (2)

[Evaluation / Methods (Security-AR suite description)] The central comparative claims (e.g., Task-only LoRA at 0.87 security score / 0.13 unsafe compliance vs. refusal projection at 0.50 / 0.47 spillover) rest entirely on the correctness of the 60-prompt Security-AR suite and its executable secure-repair validators. The manuscript reports no pre-registration of data exclusion rules, prompt ordering, or validator thresholds, no inter-annotator agreement statistics, and no external validation of either component; this is load-bearing because any systematic misclassification of edge-case outputs as secure would invalidate the utility-risk frontier interpretation.
[Results (reported scores and sweeps)] The abstract and results sections present mean scores from 416 + 1,980 completions without reporting variance, confidence intervals, or statistical tests for the differences between methods (e.g., 0.87 vs. 0.50 security score). This weakens the strength of the claim that LoRA is superior on the frontier, as the numerical gaps could be sensitive to prompt sampling or validator thresholds.

minor comments (2)

[Methods] Clarify the exact definition and implementation of the 'secure-repair validators' (e.g., what constitutes a valid repair vs. unsafe spillover) in the methods section to allow reproducibility.
[Results] The paper mentions 'representation and robustness sweeps' but does not specify the hyperparameter ranges or number of runs in the main text; move this detail from any appendix to the primary results section.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of methodological transparency and statistical rigor in our evaluation of alignment ablation techniques. We address each major comment in detail below and outline the revisions we plan to make.


Referee: [Evaluation / Methods (Security-AR suite description)] The central comparative claims (e.g., Task-only LoRA at 0.87 security score / 0.13 unsafe compliance vs. refusal projection at 0.50 / 0.47 spillover) rest entirely on the correctness of the 60-prompt Security-AR suite and its executable secure-repair validators. The manuscript reports no pre-registration of data exclusion rules, prompt ordering, or validator thresholds, no inter-annotator agreement statistics, and no external validation of either component; this is load-bearing because any systematic misclassification of edge-case outputs as secure would invalidate the utility-risk frontier interpretation.

Authors: We recognize the importance of documenting the construction and validation process for the Security-AR suite to ensure reproducibility and mitigate concerns about potential biases. The prompts were derived from real-world authorized security scenarios, with validators implemented as executable checks for secure-repair tasks. While we did not pre-register the study, in the revised version we will expand the methods section with a comprehensive description of prompt development, including how edge cases were handled, and provide the full validator code in an appendix. We will also include a dedicated limitations subsection discussing the lack of inter-annotator agreement (as the validators are rule-based and deterministic) and the absence of external validation, along with steps taken to minimize selection bias. These additions will allow readers to better assess the reliability of the reported metrics without changing the experimental results. revision: partial
Referee: [Results (reported scores and sweeps)] The abstract and results sections present mean scores from 416 + 1,980 completions without reporting variance, confidence intervals, or statistical tests for the differences between methods (e.g., 0.87 vs. 0.50 security score). This weakens the strength of the claim that LoRA is superior on the frontier, as the numerical gaps could be sensitive to prompt sampling or validator thresholds.

Authors: We agree that including measures of variability and statistical analysis would enhance the interpretability of our results. Although the current manuscript focuses on mean scores across a large number of completions, we have access to the per-prompt outcomes from all runs. In the revision, we will update the results section and figures to report standard deviations, 95% confidence intervals, and perform statistical significance tests (such as bootstrap resampling or t-tests) for the key comparisons between alignment removal methods. This will provide stronger evidence for the differences observed, for instance between the task-only LoRA and the projection-based approaches. revision: yes
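The bootstrap analysis mentioned in this response could look roughly like the following. This is a sketch under stated assumptions, not the authors' analysis: the per-prompt scores are synthetic placeholders, the two methods are assumed to be scored on the same prompts so that differences can be paired, and `bootstrap_ci` is a hypothetical helper implementing a plain percentile bootstrap with the standard library.

```python
import random
from statistics import mean


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_resamples))
    # Approximate percentile positions, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


# Synthetic per-prompt pass/fail scores on the same 20 prompts.
# These are placeholders for illustration, not data from the paper.
method_a = [1] * 17 + [0] * 3   # hypothetical adapter scores
method_b = [1] * 10 + [0] * 10  # hypothetical projection scores

# Paired differences: resampling prompts keeps each pair together.
diffs = [a - b for a, b in zip(method_a, method_b)]
ci = bootstrap_ci(diffs)
print(f"mean difference = {mean(diffs):.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
```

Resampling prompts rather than completions reflects the referee's concern that the gaps could be sensitive to prompt sampling; with only 60 prompts in the suite, the resulting intervals may be wide.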

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on held-out completions


The paper is an experimental comparison of alignment-ablation techniques evaluated on the Security-AR 60-prompt suite and executable validators. Reported quantities (security scores, unsafe compliance rates, general capability retention) are computed directly from model outputs on held-out prompts rather than from any internal equations, fitted parameters renamed as predictions, or self-citation chains. No derivation steps exist that reduce to the inputs by construction, satisfying the self-contained experimental criterion for a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that the Security-AR prompts and validators constitute a faithful proxy for authorized security work; no new physical entities or mathematical axioms are introduced beyond standard transformer assumptions.

free parameters (1)

Projection rank
A rank-4 refusal-subspace projection is reported; the choice of rank is a modeling hyperparameter that affects the reported security and spillover scores.

axioms (1)

domain assumption: The base models (Qwen2.5 and others) retain general capabilities after projection or LoRA adaptation.
Invoked when reporting general score retention of 0.83.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite...

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 23 internal anchors

[1]

Abu Shairah, H

H. Abu Shairah, H. A. A. K. Hammoud, B. Ghanem, and G. Turkiyyah. An embarrassingly simple defense against llm abliteration attacks.arXiv preprint arXiv:2505.19056, 2025

work page arXiv 2025
[2]

Agnihotri, J

S. Agnihotri, J. Jakubassa, P. Dey, S. Goyal, B. Schiele, V . B. Radhakrishnan, and M. Keuper. A granular study of safety pretraining under model abliteration.arXiv preprint arXiv:2510.02768, 2025

work page arXiv 2025
[3]

Refusal in Language Models Is Mediated by a Single Direction

A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Panickssery, W. Gurnee, and N. Nanda. Refusal in language models is mediated by a single direction.arXiv preprint arXiv:2406.11717, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[4]

Program Synthesis with Large Language Models

J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[5]

Y . Bai, S. Kadavath, S. Kundu, A. Askell, J. Kernion, A. Jones, A. Chen, A. Goldie, A. Mirho- seini, C. McKinnon, C. Chen, C. Olsson, C. Olah, D. Hernandez, D. Drain, D. Ganguli, D. Li, E. Tran-Johnson, E. Perez, J. Kerr, J. Mueller, J. Ladish, J. Landau, K. Ndousse, K. Lukosuite, L. Lovitt, M. Sellitto, N. Elhage, N. Schiefer, N. Mercado, N. DasSarma, R...

work page internal anchor Pith review Pith/arXiv arXiv 2022
[6]

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

L. Ben Allal, A. Lozhkov, E. Bakouch, G. Martín Blázquez, G. Penedo, L. Tunstall, A. Marafi- oti, H. Kydlí ˇcek, A. Piqueres Lajarín, V . Srivastav, J. Lochner, C. Fahlgren, X.-S. Nguyen, C. Fourrier, B. Burtenshaw, H. Larcher, H. Zhao, C. Zakka, M. Morlon, C. Raffel, L. von Werra, and T. Wolf. Smollm2: When smol goes big – data-centric training of a smal...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[7]

arXiv:2312.04724 [cs]

M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Aschermann, L. Fontana, S. Frolov, R. P. Giri, D. Kapil, Y . Kozyrakis, D. LeBlanc, J. Milazzo, A. Straumann, G. Synnaeve, V . V ontimitta, S. Whitman, and J. Saxe. Purple llama cyberseceval: A secure coding benchmark for language models.arXiv preprint arXiv:2312...

work page arXiv 2023
[8]

On the Opportunities and Risks of Foundation Models

R. Bommasani, D. A. Hudson, E. Adeli, R. Altman, S. Arora, S. von Arx, M. S. Bernstein, J. Bohg, A. Bosselut, E. Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[9]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-V oss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. ...

work page 1901
[10]

P. Chao, A. Robey, E. Dobriban, H. Hassani, G. J. Pappas, and E. Wong. Jailbreaking black box large language models in twenty queries.arXiv preprint arXiv:2310.08419, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

P. Chao, E. Debenedetti, A. Robey, M. Andriushchenko, F. Croce, V . Sehwag, E. Dobriban, N. Flammarion, G. J. Pappas, F. Tramer, H. Hassani, and E. Wong. Jailbreakbench: An open robustness benchmark for jailbreaking large language models.arXiv preprint arXiv:2404.01318, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

M. Chen, J. Tworek, H. Jun, Q. Yuan, H. Ponde de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[13]

Llamafirewall: An open source guardrail system for building secure ai agents,

S. Chennabasappa, C. Nikolaidis, D. Song, D. Molnar, S. Ding, S. Wan, S. Whitman, L. Deason, N. Doucette, A. Montilla, A. Gampa, B. de Paola, D. Gabi, J. Crnkovich, J.-C. Testud, K. He, R. Chaturvedi, W. Zhou, and J. Saxe. Llamafirewall: An open source guardrail system for building secure ai agents.arXiv preprint arXiv:2505.03574, 2025. 10

work page arXiv 2025
[14]

G. N. Frank. How alignment routes: Localizing, scaling, and controlling policy circuits in language models.arXiv preprint arXiv:2604.04385, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

G. N. Frank. Detection is cheap, routing is learned: Why refusal-based alignment evaluation fails.arXiv preprint arXiv:2603.18280, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[16]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

D. Ganguli, L. Lovitt, J. Kernion, A. Askell, Y . Bai, S. Kadavath, B. Mann, E. Perez, N. Schiefer, K. Ndousse, et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned.arXiv preprint arXiv:2209.07858, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Hendrycks, C

D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Mea- suring massive multitask language understanding. InInternational Conference on Learning Representations, 2021

work page 2021
[18]

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. InInternational Conference on Learning Representations, 2022. URLhttps://openreview.net/forum?id=nZeVKeeFYf9

work page 2022
[19]

H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y . Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[20]

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

M. Labonne. Uncensor any llm with abliteration. Hugging Face Blog, 2024. URL https: //huggingface.co/blog/mlabonne/abliteration. Accessed 2026-05-04

work page 2024
[22]

S. Lin, J. Hilton, and O. Evans. TruthfulQA: Measuring how models mimic human false- hoods. InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics (V olume 1: Long Papers), pages 3214–3252, Dublin, Ireland, 2022. Associ- ation for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https: //aclanthology.or...

work page doi:10.18653/v1/2022.acl-long.229 2022
[23]

Marshall, A

T. Marshall, A. Scherlis, and N. Belrose. Refusal in llms is an affine function.arXiv preprint arXiv:2411.09003, 2024

work page arXiv 2024
[24]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

M. Mazeika, L. Phan, X. Yin, A. Zou, Z. Wang, N. Mu, E. Sakhaee, N. Li, S. Basart, B. Li, D. Forsyth, and D. Hendrycks. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Steering language model refusal with sparse autoencoders, 2025

K. O’Brien, D. Majercak, X. Fernandes, R. Edgar, B. Bullwinkel, J. Chen, H. Nori, D. Carignan, E. Horvitz, and F. Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296, 2024

work page arXiv 2024
[26]

Ouyang, J

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. InAdvances in Neural Information Processing Systems, volume...

work page 2022
[27]

Steering Llama 2 via Contrastive Activation Addition

N. Panickssery, N. Gabrieli, J. Schulz, M. Tong, E. Hubinger, and A. M. Turner. Steering llama 2 via contrastive activation addition.arXiv preprint arXiv:2312.06681, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Membership infer- ence attacks from first principles

H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri. Asleep at the keyboard? assessing the security of github copilot’s code contributions. In2022 IEEE Symposium on Security and Privacy (SP), pages 754–768, 2022. doi: 10.1109/SP46214.2022.9833571

work page doi:10.1109/sp46214.2022.9833571 2022
[29]

Red Teaming Language Models with Language Models

E. Perez, S. Huang, F. Song, T. Cai, R. Ring, J. Aslanides, A. Glaese, N. McAleese, and G. Irving. Red teaming language models with language models.arXiv preprint arXiv:2202.03286, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

V . Petrov. On the failure of topic-matched contrast baselines in multi-directional refusal abliteration.arXiv preprint arXiv:2603.22061, 2026. 11

work page arXiv 2026
[31]

X. Qi, Y . Zeng, T. Xie, P.-Y . Chen, R. Jia, P. Mittal, and P. Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to!arXiv preprint arXiv:2310.03693, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

X. Qi, A. Panda, K. Lyu, X. Ma, S. Roy, A. Beirami, P. Mittal, and P. Henderson. Safety alignment should be made more than just a few tokens deep.arXiv preprint arXiv:2406.05946, 2024

work page arXiv 2024
[33]

Qwen2.5 Technical Report

Qwen Team. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

arXiv preprint arXiv:2310.10501 , year =

T. Rebedea, R. Dinu, M. Sreedhar, C. Parisien, and J. Cohen. Nemo guardrails: A toolkit for con- trollable and safe llm applications with programmable rails.arXiv preprint arXiv:2310.10501, 2023

work page arXiv 2023
[35]

V . Siu, N. Crispino, Z. Yu, S. Pan, Z. Wang, Y . Liu, D. Song, and C. Wang. COSMIC: Generalized refusal direction identification in LLM activations. InFindings of the Associa- tion for Computational Linguistics: ACL 2025, pages 25534–25553, Vienna, Austria, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-acl.1310. URL http...

work page doi:10.18653/v1/2025.findings-acl.1310 2025
[36]

Srivastava, A

A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models.Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=uyTL5Bvosj

work page 2023
[37]

LLaMA: Open and Efficient Foundation Language Models

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

A. M. Turner, L. Thiergart, G. Leech, D. Udell, J. J. Vazquez, U. Mini, and M. MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

G. Wang, H. Shi, T. Ouyang, and A. Wang. Few tokens, big leverage: Preserving safety alignment by constraining safety tokens during fine-tuning.arXiv preprint arXiv:2603.07445, 2026

work page arXiv 2026
[40]

A. Wei, N. Haghtalab, and J. Steinhardt. Jailbroken: How does llm safety training fail?arXiv preprint arXiv:2307.02483, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[41]

A. K. Zhang, N. Perry, R. Dulepet, J. Ji, C. Menders, J. W. Lin, E. Jones, G. Hussein, S. Liu, D. Jasper, P. Peetathawatchai, A. Glenn, V . Sivashankar, D. Zamoshchin, L. Glikbarg, D. Askar- yar, M. Yang, T. Zhang, R. Alluri, N. Tran, R. Sangpisit, P. Yiorkadjis, K. Osele, G. Raghupathi, D. Boneh, D. E. Ho, and P. Liang. Cybench: A framework for evaluatin...

work page arXiv 2024
[42]

A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, S. Goel, N. Li, M. J. Byun, Z. Wang, A. Mallen, S. Basart, S. Koyejo, D. Song, M. Fredrikson, J. Z. Kolter, and D. Hendrycks. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023. 12 A Evaluation Reproducibility Details The main evaluation harness is implemented in: •experiments/generate_expanded_data.py •experiments/train_lora.py •experiments/expande...

