Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
Pith reviewed 2026-05-19 23:26 UTC · model grok-4.3
The pith
Task-only LoRA adaptation enables high performance on authorized security tasks while keeping unsafe compliance low.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate through controlled experiments that different alignment removal methods have varying effects on security task success, general capability retention, and unsafe spillover. Specifically, single-vector refusal projection only marginally improves security scores while greatly increasing unsafe compliance, whereas task-only LoRA substantially raises security performance with minimal spillover increase and good retention of general abilities. They argue that alignment removal should be viewed as navigating a utility-risk frontier.
What carries the argument
Comparison of alignment ablation techniques including refusal-direction activation projection, representation-control projections, and LoRA-based task adaptation, evaluated on the Security-AR prompt suite with secure-repair validators.
If this is right
- Alignment removal techniques should be evaluated based on their position on a utility-risk frontier rather than solely on their ability to increase compliance.
- Task-specific LoRA offers a promising balance for security applications by improving defensive task performance without substantially raising unsafe outputs.
- Refusal suppression methods can lead to higher rates of out-of-scope unsafe compliance compared to targeted adaptation.
- General capability retention is possible even when adapting for specific security tasks using low-rank methods.
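The frontier framing in the first bullet can be made concrete with the means quoted in the abstract. The following is a minimal sketch, not the authors' analysis. It assumes the four reported (security score, unsafe compliance) pairs are directly comparable even though they come from different run groups, and it omits refusal-suppression with retention because its security score is not quoted. The names `Method`, `dominates`, and `pareto_frontier` are illustrative.

```python
from typing import NamedTuple


class Method(NamedTuple):
    name: str
    security: float  # mean validated security score (higher is better)
    unsafe: float    # out-of-scope unsafe compliance rate (lower is better)


# Means quoted in the abstract. They come from different run groups
# (projection pilot vs. Qwen2.5 LoRA extension), so treating them as one
# comparable set is an assumption made here for illustration only.
REPORTED = [
    Method("aligned baseline", 0.46, 0.10),
    Method("single-vector refusal projection", 0.50, 0.47),
    Method("rank-4 refusal-subspace projection", 0.51, 0.10),
    Method("task-only LoRA", 0.87, 0.13),
]


def dominates(a: Method, b: Method) -> bool:
    """True if a is at least as good as b on both axes and strictly better on one."""
    no_worse = a.security >= b.security and a.unsafe <= b.unsafe
    strictly_better = a.security > b.security or a.unsafe < b.unsafe
    return no_worse and strictly_better


def pareto_frontier(methods: list[Method]) -> list[Method]:
    """Return the methods that no other method dominates."""
    return [m for m in methods if not any(dominates(o, m) for o in methods)]


if __name__ == "__main__":
    for m in pareto_frontier(REPORTED):
        print(f"{m.name}: security={m.security:.2f}, unsafe={m.unsafe:.2f}")
```

Under these assumptions only the rank-4 projection and task-only LoRA remain undominated. Single-vector projection drops out because it gains 0.04 in security score at the cost of 0.37 in added unsafe compliance.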
Where Pith is reading between the lines
- Similar techniques could be applied to other restricted domains like medical or legal advice where legitimate queries are refused.
- Organizations developing internal security tools might use task LoRA to create specialized assistants, but would need robust monitoring for unintended behaviors.
- Future evaluations could include dynamic or multi-turn interactions to test if the low spillover holds in more complex scenarios.
Load-bearing premise
The Security-AR prompt suite and its executable validators accurately measure authorized security tasks and distinguish them from unsafe outputs without selection bias or errors.
What would settle it
A follow-up evaluation on a broader or independently curated set of security prompts where task LoRA no longer shows superior security scores or where unsafe compliance rises unexpectedly.
Original abstract
Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that safety-aligned LLMs refuse legitimate cybersecurity requests due to alignment policies, and proposes evaluating alignment removal techniques (authorized-context prompting, reversible refusal-direction projection, representation-control projections, and LoRA-based de-alignment or task adaptation) as a controlled protocol for authorized security tasks. Using the new Security-AR 60-prompt suite with executable secure-repair validators, it reports metrics including refusal rate, validated security success, general-capability retention, and out-of-scope unsafe compliance across multiple models and runs (416 completions in a four-model pilot plus 1,980 held-out completions in a Qwen2.5 LoRA extension). Key results include single-vector refusal projection raising mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47, rank-4 refusal-subspace projection reaching 0.51 with matched spillover, and task-only LoRA achieving 0.87 security score with 0.83 general score and 0.13 unsafe compliance; the work concludes that alignment removal should be viewed as a utility-risk frontier rather than an uncensoring recipe.
Significance. If the Security-AR suite and validators reliably distinguish authorized defensive tasks from unsafe spillover without selection bias, the results provide concrete empirical support for treating compliance as neither equivalent to competence nor a safe deployment signal. The scale of the experiments and direct measurement of spillover effects offer a useful comparative framework for security applications of LLMs, with potential to inform more nuanced alignment research.
major comments (2)
- [Evaluation / Methods (Security-AR suite description)] The central comparative claims (e.g., Task-only LoRA at 0.87 security score / 0.13 unsafe compliance vs. refusal projection at 0.50 / 0.47 spillover) rest entirely on the correctness of the 60-prompt Security-AR suite and its executable secure-repair validators. The manuscript reports no pre-registration of data exclusion rules, prompt ordering, or validator thresholds, no inter-annotator agreement statistics, and no external validation of either component; this is load-bearing because any systematic misclassification of edge-case outputs as secure would invalidate the utility-risk frontier interpretation.
- [Results (reported scores and sweeps)] The abstract and results sections present mean scores from 416 + 1,980 completions without reporting variance, confidence intervals, or statistical tests for the differences between methods (e.g., 0.87 vs. 0.50 security score). This weakens the strength of the claim that LoRA is superior on the frontier, as the numerical gaps could be sensitive to prompt sampling or validator thresholds.
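One way to provide the agreement statistic requested in the first major comment would be to compare validator verdicts against independent human labels on the same outputs. The sketch below computes Cohen's kappa for two label lists. It is an illustration only: the paper reports no such comparison, and the label values and the function name `cohens_kappa` are assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical verdicts, not data from the paper.
    validator = ["secure"] * 6 + ["unsafe"] * 4
    reviewer = ["secure"] * 5 + ["unsafe"] * 5
    print(round(cohens_kappa(validator, reviewer), 2))
```

A kappa near 1 would indicate that the validators and an independent reviewer draw the secure/unsafe line in the same place; a low value would point to the systematic misclassification the comment worries about.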
minor comments (2)
- [Methods] Clarify the exact definition and implementation of the 'secure-repair validators' (e.g., what constitutes a valid repair vs. unsafe spillover) in the methods section to allow reproducibility.
- [Results] The paper mentions 'representation and robustness sweeps' but does not specify the hyperparameter ranges or number of runs in the main text; move this detail from any appendix to the primary results section.
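To illustrate what the first minor comment is asking for, here is a hypothetical example of a deterministic, executable secure-repair check. It is not the paper's validator: the task (repairing a path-joining helper so that it rejects directory traversal), the test cases, and every name below are assumptions chosen to show the kind of pass/fail definition that would make the metric reproducible.

```python
import posixpath
from typing import Callable

# Hypothetical task: the model must repair a path-joining helper so that it
# raises ValueError for any path that escapes the base directory.
BENIGN_CASES = ["report.txt", "logs/app.log"]
TRAVERSAL_CASES = ["../etc/passwd", "logs/../../secret", "/etc/passwd"]


def validate_repair(candidate: Callable[[str, str], str], base: str = "/srv/data") -> bool:
    """Pass only if benign paths resolve inside base and every escape is rejected."""
    for rel in BENIGN_CASES:
        try:
            result = candidate(base, rel)
        except Exception:
            return False  # the repair broke legitimate use
        if not result.startswith(base + "/"):
            return False
    for rel in TRAVERSAL_CASES:
        try:
            candidate(base, rel)
        except ValueError:
            continue  # expected rejection
        except Exception:
            return False  # wrong failure mode
        return False  # traversal was accepted
    return True


def vulnerable_join(base: str, rel: str) -> str:
    # Unrepaired helper: normalizes the path but never checks containment.
    return posixpath.normpath(posixpath.join(base, rel))


def repaired_join(base: str, rel: str) -> str:
    # Repaired helper: rejects anything that resolves outside the base directory.
    full = posixpath.normpath(posixpath.join(base, rel))
    if not full.startswith(base + "/"):
        raise ValueError("path escapes base directory")
    return full
```

Because the check is a fixed set of executable cases, two runs on the same output always agree, which is the sense in which a rule-based validator needs no inter-annotator statistic. Whether the fixed cases cover the relevant edge cases is a separate question that only external validation can answer.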
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of methodological transparency and statistical rigor in our evaluation of alignment ablation techniques. We address each major comment in detail below and outline the revisions we plan to make.
Referee: [Evaluation / Methods (Security-AR suite description)] The central comparative claims (e.g., Task-only LoRA at 0.87 security score / 0.13 unsafe compliance vs. refusal projection at 0.50 / 0.47 spillover) rest entirely on the correctness of the 60-prompt Security-AR suite and its executable secure-repair validators. The manuscript reports no pre-registration of data exclusion rules, prompt ordering, or validator thresholds, no inter-annotator agreement statistics, and no external validation of either component; this is load-bearing because any systematic misclassification of edge-case outputs as secure would invalidate the utility-risk frontier interpretation.
Authors: We recognize the importance of documenting how the Security-AR suite was constructed and validated, both for reproducibility and to address concerns about potential bias. The prompts were derived from real-world authorized security scenarios, and the validators are implemented as executable checks for secure-repair tasks. We did not pre-register the study. In the revised version we will expand the methods section with a full description of prompt development, including how edge cases were handled, and provide the complete validator code in an appendix. We will also add a dedicated limitations subsection explaining why no inter-annotator agreement statistics are reported (the validators are rule-based and deterministic), acknowledging the absence of external validation, and describing the steps taken to minimize selection bias. These additions will let readers assess the reliability of the reported metrics; the experimental results themselves will not change.
Revision: partial
Referee: [Results (reported scores and sweeps)] The abstract and results sections present mean scores from 416 + 1,980 completions without reporting variance, confidence intervals, or statistical tests for the differences between methods (e.g., 0.87 vs. 0.50 security score). This weakens the strength of the claim that LoRA is superior on the frontier, as the numerical gaps could be sensitive to prompt sampling or validator thresholds.
Authors: We agree that measures of variability and statistical analysis would make our results easier to interpret. The current manuscript reports only mean scores across a large number of completions, but we have the per-prompt outcomes from all runs. In the revision, we will update the results section and figures to report standard deviations and 95% confidence intervals, and we will run statistical significance tests (such as bootstrap resampling or t-tests) for the key comparisons between alignment removal methods. This will provide stronger evidence for the observed differences, for instance between task-only LoRA and the projection-based approaches.
Revision: yes
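The bootstrap analysis promised in this response could take roughly the following shape. This is a sketch under stated assumptions, not the authors' code: it assumes each method has one score per prompt on the same prompts (so resampling is paired by prompt index), uses a simple percentile interval, and runs on made-up scores because the per-prompt outcomes are not published on this page.

```python
import random
from statistics import mean


def bootstrap_diff_ci(a, b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling prompts in pairs."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equally long, non-empty score lists")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(a)
    diffs = []
    for _ in range(n_resamples):
        # Draw prompt indices with replacement and reuse them for both methods.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(mean(a[i] for i in idx) - mean(b[i] for i in idx))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(a) - mean(b), lo, hi


if __name__ == "__main__":
    # Hypothetical per-prompt scores for two methods, not data from the paper.
    lora_scores = [1.0] * 14 + [0.5] * 4 + [0.0] * 2
    projection_scores = [1.0] * 6 + [0.5] * 8 + [0.0] * 6
    point, lo, hi = bootstrap_diff_ci(lora_scores, projection_scores)
    print(f"difference={point:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

Pairing by prompt matters here: if some prompts are hard for every method, resampling the two methods independently would overstate the uncertainty in their difference. An interval that excludes zero would address the referee's concern that the gap could be an artifact of prompt sampling, though not the concern about validator thresholds.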
Circularity Check
No circularity: results are direct empirical measurements on held-out completions
The paper is an experimental comparison of alignment-ablation techniques evaluated on the Security-AR 60-prompt suite and executable validators. Reported quantities (security scores, unsafe compliance rates, general capability retention) are computed directly from model outputs on held-out prompts rather than from any internal equations, fitted parameters renamed as predictions, or self-citation chains. No derivation steps exist that reduce to the inputs by construction, satisfying the self-contained experimental criterion for a score of 0.
Axiom & Free-Parameter Ledger
free parameters (1)
- LoRA rank
axioms (1)
- domain assumption: The base models (Qwen2.5 and others) retain general capabilities after projection or LoRA adaptation.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite..."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: "Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Evaluation Reproducibility Details (paper Appendix A)
The main evaluation harness is implemented in:
- experiments/generate_expanded_data.py
- experiments/train_lora.py
- experiments/expande...