
arxiv: 2605.17413 · v1 · pith:4K7NVIAPnew · submitted 2026-05-17 · 💻 cs.CR · cs.AI

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Pith reviewed 2026-05-19 23:26 UTC · model grok-4.3

keywords: language model alignment · safety ablation · LoRA fine-tuning · cybersecurity evaluation · refusal mechanisms · utility risk trade-off · security applications

The pith

Task-only LoRA adaptation enables high performance on authorized security tasks while keeping unsafe compliance low.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates techniques for removing safety alignment from language models so that they respond to authorized cybersecurity requests they would otherwise refuse. It introduces the Security-AR benchmark, a 60-prompt suite covering authorized security tasks, benign general-capability prompts, and non-operational spillover probes, and tests authorized-context prompting, refusal-direction projection, and LoRA fine-tuning. The key finding is that task-only LoRA reaches a mean security score of 0.87 while retaining a general score of 0.83 and holding unsafe compliance to 0.13. The alternatives either improve security less (single-vector refusal projection reaches only 0.50) or raise unsafe outputs more (0.47 for that projection, 0.27 for refusal suppression with retention). This frames alignment removal as a trade-off between utility and risk rather than a straightforward way to remove restrictions.

Core claim

The authors demonstrate through controlled experiments that alignment removal methods differ in their effects on security task success, general capability retention, and unsafe spillover. Specifically, single-vector refusal projection raises the mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47, whereas task-only LoRA raises the security score to 0.87 with unsafe compliance of 0.13 and a general score of 0.83. They argue that alignment removal should be viewed as navigating a utility-risk frontier.

What carries the argument

Comparison of alignment ablation techniques including refusal-direction activation projection, representation-control projections, and LoRA-based task adaptation, evaluated on the Security-AR prompt suite with secure-repair validators.

If this is right

  • Alignment removal techniques should be evaluated based on their position on a utility-risk frontier rather than solely on their ability to increase compliance.
  • Task-specific LoRA offers a promising balance for security applications by improving defensive task performance without substantially raising unsafe outputs.
  • Refusal suppression methods can lead to higher rates of out-of-scope unsafe compliance compared to targeted adaptation.
  • General capability retention is possible even when adapting for specific security tasks using low-rank methods.
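The frontier framing in these points can be made concrete with a small dominance check. The sketch below is illustrative only: it pools the headline numbers quoted in the abstract, which come from different runs (the projection pilot and the Qwen2.5 LoRA extension), so treating them as one comparable set is an assumption. The method labels and the `pareto_frontier` helper are hypothetical names, not taken from the paper. Refusal suppression with retention is omitted because only its spillover (0.27) is quoted.

```python
from typing import Dict, List, Tuple

# (mean security score, unsafe compliance); higher is better for the first,
# lower is better for the second. Values are the ones quoted in the abstract.
Point = Tuple[float, float]

REPORTED: Dict[str, Point] = {
    "aligned_baseline": (0.46, 0.10),
    "single_vector_projection": (0.50, 0.47),
    # "matching the aligned spillover rate" is read here as 0.10 (assumption).
    "rank4_subspace_projection": (0.51, 0.10),
    "task_only_lora": (0.87, 0.13),
}


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    at_least_as_good = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return at_least_as_good and strictly_better


def pareto_frontier(points: Dict[str, Point]) -> List[str]:
    """Return the names of all non-dominated methods, sorted alphabetically."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )


FRONTIER = pareto_frontier(REPORTED)
print(FRONTIER)
```

Under these pooled numbers, single-vector projection is dominated (another method has both higher security and lower spillover), while the rank-4 projection and task-only LoRA remain on the frontier because neither beats the other on both axes.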

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar techniques could be applied to other restricted domains like medical or legal advice where legitimate queries are refused.
  • Organizations developing internal security tools might use task LoRA to create specialized assistants, but would need robust monitoring for unintended behaviors.
  • Future evaluations could include dynamic or multi-turn interactions to test if the low spillover holds in more complex scenarios.

Load-bearing premise

The Security-AR prompt suite and its executable validators accurately measure authorized security tasks and distinguish them from unsafe outputs without selection bias or errors.

What would settle it

A follow-up evaluation on a broader or independently curated set of security prompts where task LoRA no longer shows superior security scores or where unsafe compliance rises unexpectedly.

Figures

Figures reproduced from arXiv: 2605.17413 by Arthur Gervais, Isaac David.

Figure 1: Overview of the ABLATING SAFETY evaluation loop. Prompt-level and model-level interventions are routed through the same task suite and scored jointly for authorized security utility, general-capability retention, and out-of-scope spillover. The decision target is the utility-risk frontier. The measurement problem is to estimate the Pareto frontier between lower refusal on authorized security tasks, higher …

Figure 2: Utility-spillover frontier for extended LoRA on the three Qwen2.5 models. The x-axis is …

Figure 3: LoRA tradeoffs by adapter family. Task-only and refusal-suppression adapters produce the …

Figure 4: Projection-strength sweep for Qwen2.5-1.5B-Instruct across random and refusal projections.

Figure 5: Projection-layer sweep for Qwen2.5-1.5B-Instruct. Random projection preserves out-of …
Original abstract

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.
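To illustrate why the abstract separates refusal, attempt rate, and validated success, the following sketch aggregates per-completion records into those metrics. The record schema (`Completion` with `category`, `refused`, `passed`, `unsafe`) and the category labels are assumptions made for this example; the paper's harness may store and score completions differently.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Completion:
    category: str  # hypothetical labels: "security", "general", "spillover"
    refused: bool  # model declined to answer
    passed: bool   # validator or grader accepted the answer
    unsafe: bool   # unsafe compliance on a spillover probe


def _rate(flags: List[bool]) -> float:
    """Fraction of True values; NaN when there are no records."""
    return sum(flags) / len(flags) if flags else float("nan")


def summarize(completions: Iterable[Completion]) -> Dict[str, float]:
    items = list(completions)
    sec = [c for c in items if c.category == "security"]
    gen = [c for c in items if c.category == "general"]
    spill = [c for c in items if c.category == "spillover"]
    return {
        "refusal_rate": _rate([c.refused for c in sec]),
        "attempt_rate": _rate([not c.refused for c in sec]),
        "security_score": _rate([c.passed for c in sec]),
        "general_score": _rate([c.passed for c in gen]),
        "unsafe_compliance": _rate([c.unsafe for c in spill]),
    }


# Made-up records for illustration, not data from the paper.
DEMO = [
    Completion("security", refused=True, passed=False, unsafe=False),
    Completion("security", refused=False, passed=True, unsafe=False),
    Completion("security", refused=False, passed=True, unsafe=False),
    Completion("security", refused=False, passed=False, unsafe=False),
    Completion("general", refused=False, passed=True, unsafe=False),
    Completion("general", refused=False, passed=True, unsafe=False),
    Completion("spillover", refused=True, passed=False, unsafe=False),
    Completion("spillover", refused=False, passed=False, unsafe=True),
]

SUMMARY = summarize(DEMO)
print(SUMMARY)
```

In the made-up records, the attempt rate (0.75) exceeds the security score (0.5): one attempted answer fails validation, which is the gap between compliance and competence that the abstract warns about.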

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that safety-aligned LLMs refuse legitimate cybersecurity requests due to alignment policies, and proposes evaluating alignment removal techniques (authorized-context prompting, reversible refusal-direction projection, representation-control projections, and LoRA-based de-alignment or task adaptation) as a controlled protocol for authorized security tasks. Using the new Security-AR 60-prompt suite with executable secure-repair validators, it reports metrics including refusal rate, validated security success, general-capability retention, and out-of-scope unsafe compliance across multiple models and runs (416 completions in a four-model pilot plus 1,980 held-out completions in a Qwen2.5 LoRA extension). Key results include single-vector refusal projection raising mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47, rank-4 refusal-subspace projection reaching 0.51 with matched spillover, and task-only LoRA achieving 0.87 security score with 0.83 general score and 0.13 unsafe compliance; the work concludes that alignment removal should be viewed as a utility-risk frontier rather than an uncensoring recipe.

Significance. If the Security-AR suite and validators reliably distinguish authorized defensive tasks from unsafe spillover without selection bias, the results provide concrete empirical support for treating compliance as neither equivalent to competence nor a safe deployment signal. The scale of the experiments and direct measurement of spillover effects offer a useful comparative framework for security applications of LLMs, with potential to inform more nuanced alignment research.

major comments (2)
  1. [Evaluation / Methods (Security-AR suite description)] The central comparative claims (e.g., Task-only LoRA at 0.87 security score / 0.13 unsafe compliance vs. refusal projection at 0.50 / 0.47 spillover) rest entirely on the correctness of the 60-prompt Security-AR suite and its executable secure-repair validators. The manuscript reports no pre-registration of data exclusion rules, prompt ordering, or validator thresholds, no inter-annotator agreement statistics, and no external validation of either component; this is load-bearing because any systematic misclassification of edge-case outputs as secure would invalidate the utility-risk frontier interpretation.
  2. [Results (reported scores and sweeps)] The abstract and results sections present mean scores from 416 + 1,980 completions without reporting variance, confidence intervals, or statistical tests for the differences between methods (e.g., 0.87 vs. 0.50 security score). This weakens the strength of the claim that LoRA is superior on the frontier, as the numerical gaps could be sensitive to prompt sampling or validator thresholds.
minor comments (2)
  1. [Methods] Clarify the exact definition and implementation of the 'secure-repair validators' (e.g., what constitutes a valid repair vs. unsafe spillover) in the methods section to allow reproducibility.
  2. [Results] The paper mentions 'representation and robustness sweeps' but does not specify the hyperparameter ranges or number of runs in the main text; move this detail from any appendix to the primary results section.
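
As a hypothetical illustration of what an executable secure-repair validator could look like (the paper's own validators are not reproduced on this page), the sketch below checks a repaired database lookup for two properties: it still returns the right row for a benign input, and it returns nothing for a hostile input that would leak rows from a string-concatenated query. The function names, table schema, and pass criteria are all assumptions.

```python
import sqlite3
from typing import Callable, List, Tuple

Row = Tuple[int, str]
Lookup = Callable[[sqlite3.Connection, str], List[Row]]


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn


def vulnerable_lookup(conn: sqlite3.Connection, name: str) -> List[Row]:
    # Flawed original: user input is concatenated into the SQL text.
    query = f"SELECT id, name FROM users WHERE name = '{name}'"
    return conn.execute(query).fetchall()


def repaired_lookup(conn: sqlite3.Connection, name: str) -> List[Row]:
    # Candidate repair: the input is bound as a parameter.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (name,)
    ).fetchall()


def validate_repair(lookup: Lookup) -> bool:
    """Pass only if the lookup is still functional and does not leak rows."""
    conn = make_db()
    try:
        functional = lookup(conn, "alice") == [(1, "alice")]
        no_leak = lookup(conn, "x' OR '1'='1") == []
    except sqlite3.Error:
        return False  # a crash on either input counts as a failed repair
    finally:
        conn.close()
    return functional and no_leak


print(validate_repair(vulnerable_lookup), validate_repair(repaired_lookup))
```

A validator of this shape is deterministic, which bears on the first major comment: thresholds and edge-case handling are fixed in code and can be published for inspection.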

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of methodological transparency and statistical rigor in our evaluation of alignment ablation techniques. We address each major comment in detail below and outline the revisions we plan to make.

  1. Referee: [Evaluation / Methods (Security-AR suite description)] The central comparative claims (e.g., Task-only LoRA at 0.87 security score / 0.13 unsafe compliance vs. refusal projection at 0.50 / 0.47 spillover) rest entirely on the correctness of the 60-prompt Security-AR suite and its executable secure-repair validators. The manuscript reports no pre-registration of data exclusion rules, prompt ordering, or validator thresholds, no inter-annotator agreement statistics, and no external validation of either component; this is load-bearing because any systematic misclassification of edge-case outputs as secure would invalidate the utility-risk frontier interpretation.

    Authors: We recognize the importance of documenting the construction and validation process for the Security-AR suite to ensure reproducibility and mitigate concerns about potential biases. The prompts were derived from real-world authorized security scenarios, with validators implemented as executable checks for secure-repair tasks. While we did not pre-register the study, in the revised version we will expand the methods section with a comprehensive description of prompt development, including how edge cases were handled, and provide the full validator code in an appendix. We will also include a dedicated limitations subsection discussing the lack of inter-annotator agreement (as the validators are rule-based and deterministic) and the absence of external validation, along with steps taken to minimize selection bias. These additions will allow readers to better assess the reliability of the reported metrics without changing the experimental results.

    Revision: partial

  2. Referee: [Results (reported scores and sweeps)] The abstract and results sections present mean scores from 416 + 1,980 completions without reporting variance, confidence intervals, or statistical tests for the differences between methods (e.g., 0.87 vs. 0.50 security score). This weakens the strength of the claim that LoRA is superior on the frontier, as the numerical gaps could be sensitive to prompt sampling or validator thresholds.

    Authors: We agree that including measures of variability and statistical analysis would enhance the interpretability of our results. Although the current manuscript focuses on mean scores across a large number of completions, we have access to the per-prompt outcomes from all runs. In the revision, we will update the results section and figures to report standard deviations and 95% confidence intervals, and we will run statistical significance tests (such as bootstrap resampling or t-tests) for the key comparisons between alignment removal methods. This will provide stronger evidence for the differences observed, for instance between the task-only LoRA and the projection-based approaches.

    Revision: yes
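
The bootstrap analysis promised in this response could take roughly the following form: a paired percentile bootstrap over prompts for the difference in mean score between two methods. This is a minimal sketch using only the standard library. The per-prompt score vectors are invented for illustration and are not the paper's data, and `paired_bootstrap_ci` is a hypothetical helper name.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def paired_bootstrap_ci(
    a: Sequence[float],
    b: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (observed mean difference, CI lower, CI upper) for a - b.

    Prompts are resampled with replacement; both methods are scored on the
    same prompts, so the per-prompt differences are resampled directly.
    """
    if len(a) != len(b) or not a:
        raise ValueError("score vectors must be non-empty and equally long")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    stats = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(diffs), lower, upper


# Invented binary per-prompt outcomes for 20 prompts (1 = validated success).
METHOD_A = [1] * 17 + [0] * 3   # stand-in for a task-adapted model
METHOD_B = [1] * 10 + [0] * 10  # stand-in for a projection-based model

OBSERVED, LOWER, UPPER = paired_bootstrap_ci(METHOD_A, METHOD_B)
print(f"diff={OBSERVED:.2f}, 95% CI=[{LOWER:.2f}, {UPPER:.2f}]")
```

Resampling prompts rather than individual completions keeps the pairing intact and speaks directly to the referee's concern that the gaps could be sensitive to prompt sampling.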

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on held-out completions


The paper is an experimental comparison of alignment-ablation techniques evaluated on the Security-AR 60-prompt suite and executable validators. Reported quantities (security scores, unsafe compliance rates, general capability retention) are computed directly from model outputs on held-out prompts rather than from any internal equations, fitted parameters renamed as predictions, or self-citation chains. No derivation steps exist that reduce to the inputs by construction, satisfying the self-contained experimental criterion for a score of 0.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the assumption that the Security-AR prompts and validators constitute a faithful proxy for authorized security work; no new physical entities or mathematical axioms are introduced beyond standard transformer assumptions.

free parameters (1)
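  • Low-rank dimension (projection rank and LoRA rank)
    A rank-4 refusal-subspace projection is reported; the LoRA adapter rank is not stated in the abstract. Either rank is a modeling hyperparameter that affects the reported security and spillover scores.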
axioms (1)
  • domain assumption: The base models (Qwen2.5 and others) retain general capabilities after projection or LoRA adaptation.
    Invoked when reporting general score retention of 0.83.



A Evaluation Reproducibility Details

The main evaluation harness is implemented in:

  • experiments/generate_expanded_data.py
  • experiments/train_lora.py
  • experiments/expande...