pith. sign in

arxiv: 2607.00572 · v1 · pith:2BBQWXPKnew · submitted 2026-07-01 · 💻 cs.AI · cs.CR

HARC: Coupling Harmfulness and Refusal Directions for Robust Safety Alignment

Pith reviewed 2026-07-02 12:55 UTC · model grok-4.3

classification 💻 cs.AI cs.CR
keywords LLM safety alignmentjailbreak robustnessharmfulness directionrefusal directionresidual streamfine-tuning methodover-refusal
0
0 comments X

The pith

Coupling harmfulness and refusal directions at prompt and response positions produces robust safety alignment without degrading capability or increasing over-refusal.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream. Jailbreaks succeed by suppressing one or both directions before generation begins, while the model can still detect harm during response generation even if it missed the prompt. Motivated by this, the authors introduce a fine-tuning procedure that explicitly pairs the two directions across both prompt-side and response-side positions. Because the change stays inside the harmfulness-refusal subspace, general capabilities remain intact and over-refusal does not rise. Experiments across five model families and two scales show this approach yields the best robustness-capability-usability trade-off among the six major safety methods tested.

Core claim

By pairing the harmfulness and refusal directions across both prompt and response token positions during fine-tuning, the resulting models resist jailbreaks more effectively than prior training-time and inference-time methods while leaving general capability and refusal rates on benign queries unchanged.

What carries the argument

HARC (Harmfulness-And-Refusal Coupling), the fine-tuning method that enforces pairing of the two directions inside the identified harmfulness-refusal subspace.

If this is right

  • HARC outperforms the six major baselines on the combined robustness-capability-usability metric.
  • The identified harmfulness and refusal directions at both prompt and response positions transfer across five model families and two scales with no architecture-specific retuning.
  • Because the intervention is confined to one subspace, the method avoids the capability degradation and over-refusal inflation seen in some competing safety techniques.
  • The analysis shows distinct attack classes occupy separable regions of the harmfulness-refusal plane, explaining why different jailbreaks succeed in different ways.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If safety information is localized to a low-dimensional subspace, future alignment work could target that subspace directly rather than retraining entire models.
  • The finding that models detect harm during generation even after missing it at the prompt suggests response-side monitoring could complement prompt-side defenses.
  • Transfer of the directions across model families raises the possibility that a small set of direction vectors could serve as portable safety handles.

Load-bearing premise

The assumption that intervening only inside the harmfulness-refusal subspace leaves the rest of the residual stream untouched and therefore produces no loss in general capability or rise in over-refusal.

What would settle it

A controlled experiment in which models fine-tuned with HARC are shown to exhibit either measurable drops in standard capability benchmarks or higher refusal rates on clearly benign prompts compared with the base model.

Figures

Figures reproduced from arXiv: 2607.00572 by Fangzhao Wu, Shei Pern Chua.

Figure 1
Figure 1. Figure 1: Cosine similarity between vharm and vref across all layers of LLAMA-3.1-8B. Both directions are tightly coupled at mid-depth and most decoupled in late layers [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Jailbreaks dissociate harm-recognition from refusal, and a four-direction structure emerges at the most decoupled layers. (a) ∆ projections of successful jailbreak prompts at the prompt-side (left, onto vharm and vref) and the response-side (right, onto v resp harm and v resp ref ) on layer 27. Each response-side point is projected at its category’s peak token because attacks express harm and refusal at di… view at source ↗
Figure 3
Figure 3. Figure 3: Cosine similarity between vharm and vref across all layers. Gray bands mark the trained layers (L25–28 on Llama; L21–24 on Qwen). Post-tuning alignment increases significantly within the trained bands and re￾mains elevated in the subsequent downstream layers [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Coupling fine-tuning collapses the dissociated regions that jailbreaks exploit. ∆ projections at the prompt-side and response-side for benign, harmful, DAN, PAIR, and CodeAttack inputs. Arrows trace each cluster’s centroid from the base model to the fine-tuned model. On the prompt-side for both architectures, the DAN and PAIR clusters move toward the baseline harmful region, while CodeAttack remains cluste… view at source ↗
Figure 5
Figure 5. Figure 5: Scaling analysis on LLAMA-3.1-70B. Left: Layer score (Equation 6) plotted against relative depth ℓ/nlayers for 8B and 70B; stars mark the top-K=4 selected layers. Mid: Attack success rate on CodeAttack and PAP. Right: General capability on MMLU and GSM8K. 5.3 Effectiveness at Scale We extend our analysis to LLAMA-3.1-70B and QWEN-2.5-72B to test whether HARC scales [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cosine similarity between vharm and vref across all layers of QWEN-2.5-7B. The directions remain decoupled across most of the network, with a modest peak at L13–L19 and a sharper rise at L28 [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: (a) shows that the projection signatures replicate, with attack-class-specific differences. Harmful prompts activate both directions strongly at the prompt boundary. DAN and PAIR sit in dissociated regions at moderate-to￾high projections on both axes, with DAN closer to the harmful cluster than PAIR. CodeAttack overlaps with be￾nign prompts at the prompt boundary, consistent with the Llama observation that… view at source ↗
Figure 8
Figure 8. Figure 8: The four-direction structure replicates across five instruction-tuned model families. Pairwise cosine similarities at each model’s selected layer (Eq. 6). Response-side decoupling (v resp harm ↔ v resp ref ) and same-concept cross-position alignment (vharm ↔ v resp harm and vref ↔ v resp ref ) replicate consistently across architectures. Prompt-side decoupling varies across families, where Llama and Gemma … view at source ↗
Figure 9
Figure 9. Figure 9: Cross-layer projection profiles on LLAMA-3.1-8B. Mean cosine similarity cos(h (ℓ) , v) between residual-stream activations and each of the four directions, across all layers and all input categories (benign, harmful, DAN, PAIR, CodeAttack). Each panel shows projections onto one direction (vharm, vref, v resp harm, v resp ref , left to right). Red indicates positive activation, where blue indicates negative… view at source ↗
Figure 10
Figure 10. Figure 10: Cross-layer projection profiles on QWEN-2.5-7B. The four-direction structure replicates across architectures with characterization differences: prompt-side vref activations are weaker on Qwen across all attack classes, consistent with the tighter prompt-side coupling reported in Section A. Attack signatures are depth-spanning, not single-layer artifacts. The projection signatures shown at the selected lay… view at source ↗
Figure 11
Figure 11. Figure 11: Multi-seed evaluation of HARC across all benchmarks. Bar heights show means across 3 seeds for Ours and single-seed values for Baseline; error bars on Ours show standard deviation. HARC’s robustness gains, over-refusal preservation, and capability preservation hold consistently across seeds on both architectures. Variant Lce Lkl L prompt couple L response couple Base – – – – CE only (≡ SFT) ✓ – – – CE + K… view at source ↗
Figure 12
Figure 12. Figure 12: Post-attack ASR versus adversary’s fine-tuning budget on LLAMA-3.1-8B. The ad￾versary fine-tunes each defended model on harmful examples and we measure ASR on JailbreakBench after each 20 step. HARC collapses to baseline behavior within roughly 160 harmful examples, comparable to vanilla SFT. CB resists partially up to ∼400 examples, and RepBend’s post-attack ASR remains at zero, but only because RepBend … view at source ↗
read the original abstract

Understanding how aligned LLMs internally represent safety is critical for diagnosing alignment vulnerabilities, as it explains why jailbreaks succeed and informs the design of robust alignment strategies. Prior work shows that aligned LLMs encode harmfulness and refusal as separable directions in the residual stream at prompt-side token positions. We show that jailbreaks succeed at prompt encoding by suppressing either the refusal or harmfulness direction before any token is generated, with distinct attack classes occupying separable regions of the harmfulness-refusal plane. Extending the analysis to response-token positions, we find that the model recognizes harmful content while it is generating that content, even when it failed to recognize the input as harmful at the prompt side. Motivated by our findings, we introduce HARC (Harmfulness-And-Refusal Coupling), a fine-tuning method that pairs the two directions across both prompt and response positions. Since the intervention is confined to the harmfulness-refusal subspace, it leaves the rest of the residual stream intact and does not degrade general capability or inflate over-refusal. Across extensive experiments, HARC achieves the strongest robustness-capability-usability trade-off among six baselines spanning the major training-time and inference-time safety methods. The harmfulness and refusal directions at prompt and response positions transfer across the five model families and two scales we tested without architecture-specific tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes how aligned LLMs encode safety via separable harmfulness and refusal directions in the residual stream at prompt-side token positions, shows that distinct jailbreak classes suppress these directions, and extends the analysis to response positions where the model detects harmful content during generation. It introduces HARC, a fine-tuning approach that couples the two directions across prompt and response positions; because the intervention is restricted to this subspace, the authors claim it preserves general capability and avoids inflating over-refusal. Experiments across five model families and two scales report that HARC attains the best robustness-capability-usability trade-off among six training- and inference-time baselines, with the identified directions transferring without architecture-specific tuning.

Significance. If the empirical claims hold, the work strengthens the mechanistic account of alignment failures and supplies a targeted fine-tuning procedure whose subspace restriction is presented as the reason for favorable capability and usability outcomes. The reported cross-family transfer without per-architecture retuning would be a practical advantage over many existing safety methods.

major comments (2)
  1. [HARC method and results sections] The central claim that confining the intervention to the harmfulness-refusal subspace leaves the remainder of the residual stream untouched (and thereby produces no capability degradation or over-refusal increase) is load-bearing for the reported trade-off; the manuscript must show explicit verification that other directions remain unaffected, for example by reporting changes in cosine similarity to unrelated directions or performance on capability benchmarks that are orthogonal to the safety subspace.
  2. [Experiments and evaluation] The claim that HARC outperforms the six baselines on the robustness-capability-usability trade-off requires the precise definition of the composite metric and the raw per-baseline numbers (including variance across seeds) to be presented; without these, it is impossible to confirm that the reported superiority is not an artifact of metric choice or selective reporting.
minor comments (2)
  1. [Notation and figures] Notation for the prompt-side versus response-side directions should be introduced once and used consistently; the current abstract and method description occasionally switch between “prompt encoding” and “response positions” without a single clarifying diagram or equation.
  2. [Experimental setup] The transferability statement (“without architecture-specific tuning”) should be supported by an explicit statement of which hyperparameters, if any, were held fixed across the five families.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify key aspects of our claims. We address the major comments point by point below.

read point-by-point responses
  1. Referee: [HARC method and results sections] The central claim that confining the intervention to the harmfulness-refusal subspace leaves the remainder of the residual stream untouched (and thereby produces no capability degradation or over-refusal increase) is load-bearing for the reported trade-off; the manuscript must show explicit verification that other directions remain unaffected, for example by reporting changes in cosine similarity to unrelated directions or performance on capability benchmarks that are orthogonal to the safety subspace.

    Authors: We agree that explicit verification strengthens the central claim. In the revised manuscript we will add analyses reporting cosine similarities of the HARC intervention directions to a set of unrelated residual-stream directions, together with performance on capability benchmarks chosen to be orthogonal to the safety subspace. revision: yes

  2. Referee: [Experiments and evaluation] The claim that HARC outperforms the six baselines on the robustness-capability-usability trade-off requires the precise definition of the composite metric and the raw per-baseline numbers (including variance across seeds) to be presented; without these, it is impossible to confirm that the reported superiority is not an artifact of metric choice or selective reporting.

    Authors: We accept the request for greater transparency. The revised manuscript will include the exact definition of the composite metric and will tabulate the raw per-baseline scores (means and standard deviations across seeds) for all six baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper presents empirical findings on harmfulness and refusal directions, extends analysis to response positions, and introduces HARC as a fine-tuning intervention confined to that subspace. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that reduce the central claims to inputs by construction. The subspace confinement is asserted as a design property supporting capability preservation, not derived from the results themselves. Prior work is referenced for the initial direction separability but is not load-bearing for the HARC method or trade-off results. The derivation chain is therefore independent of the reported outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described. The approach relies on directions identified in prior literature and the assumption of subspace isolation.

pith-pipeline@v0.9.1-grok · 5763 in / 1054 out tokens · 27511 ms · 2026-07-02T12:55:17.583410+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 23 canonical work pages · 18 internal anchors

  1. [1]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043, 2023

  2. [2]

    Autodan: Generating stealthy jailbreak prompts on aligned large language models

    Xiaogeng Liu, Nan Xu, Muhao Chen, and Chaowei Xiao. Autodan: Generating stealthy jailbreak prompts on aligned large language models. InInt. Conf. Learn. Rep. (ICLR), 2023

  3. [3]

    How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms

    Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, and Weiyan Shi. How johnny can persuade llms to jailbreak them: Rethinking persuasion to challenge ai safety by humanizing llms. InAnn. Meet. Assoc. Comput. Linguistics (ACL), pages 14322–14350, 2024

  4. [4]

    do anything now

    Xinyue Shen, Zeyuan Chen, Michael Backes, Yun Shen, and Yang Zhang. " do anything now": Characterizing and evaluating in-the-wild jailbreak prompts on large language models. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pages 1671–1685, 2024

  5. [5]

    Jailbreaking black box large language models in twenty queries

    Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J Pappas, and Eric Wong. Jailbreaking black box large language models in twenty queries. In2025 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 23–42. IEEE, 2025

  6. [6]

    Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms

    Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy V orobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, and Chaowei Xiao. Autodan-turbo: A lifelong agent for strategy self-exploration to jailbreak llms. InInt. Conf. Learn. Rep. (ICLR), pages 22337–22384, 2025

  7. [7]

    Great, now write an article about that: The crescendo {Multi-Turn}{LLM} jailbreak attack

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo {Multi-Turn}{LLM} jailbreak attack. In34th USENIX Security Symposium (USENIX Security 25), pages 2421–2440, 2025

  8. [8]

    Between a Rock and a Hard Place: The Tension Between Ethical Reasoning and Safety Alignment in LLMs

    Shei Pern Chua, Zhen Leng Thai, Kai Jun Teh, Xiao Li, Qibing Ren, and Xiaolin Hu. Between a rock and a hard place: The tension between ethical reasoning and safety alignment in llms. arXiv preprint arXiv:2509.05367, 2025

  9. [9]

    DeepInception: Hypnotize Large Language Model to Be Jailbreaker

    Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, and Bo Han. Deepinception: Hypnotize large language model to be jailbreaker.arXiv preprint arXiv:2311.03191, 2023

  10. [10]

    Codeat- tack: Revealing safety generalization challenges of large language models via code completion

    Qibing Ren, Chang Gao, Jing Shao, Junchi Yan, Xin Tan, Wai Lam, and Lizhuang Ma. Codeat- tack: Revealing safety generalization challenges of large language models via code completion. InAnn. Meet. Assoc. Comput. Linguistics (ACL), pages 11437–11452, 2024

  11. [11]

    Artprompt: Ascii art-based jailbreak attacks against aligned llms

    Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, and Radha Poovendran. Artprompt: Ascii art-based jailbreak attacks against aligned llms. In Ann. Meet. Assoc. Comput. Linguistics (ACL), pages 15157–15173, 2024

  12. [12]

    Jailbreaking leading safety-aligned llms with simple adaptive attacks

    Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion. Jailbreaking leading safety-aligned llms with simple adaptive attacks. InInt. Conf. Mach. Learn. (ICML), 2024

  13. [13]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Adv. Neural Inform. Process. Syst. (NeurIPS), 36:53728–53741, 2023

  14. [14]

    Safety alignment should be made more than just a few tokens deep

    Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, and Peter Henderson. Safety alignment should be made more than just a few tokens deep. InInt. Conf. Learn. Rep. (ICLR), 2024

  15. [15]

    Deliberative alignment: Reasoning enables safer language models.arXiv preprint arXiv:2412.16339, 2024

    Melody Y Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, et al. Deliberative alignment: Reasoning enables safer language models.arXiv preprint arXiv:2412.16339, 2024

  16. [16]

    Stair: Improving safety alignment with introspective reasoning

    Yichi Zhang, Siyuan Zhang, Yao Huang, Zeyu Xia, Zhengwei Fang, Xiao Yang, Ranjie Duan, Dong Yan, Yinpeng Dong, and Jun Zhu. Stair: Improving safety alignment with introspective reasoning. InInt. Conf. Mach. Learn. (ICML), pages 76754–76777, 2025. 11

  17. [17]

    Programming refusal with conditional activation steering

    Bruce W Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering. InInt. Conf. Learn. Rep. (ICLR), 2025

  18. [18]

    Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering

    Zouying Cao, Yifei Yang, and Hai Zhao. Scans: Mitigating the exaggerated safety for llms via safety-conscious activation steering. InAAAI, volume 39, pages 23523–23531, 2025

  19. [19]

    Improving alignment and robustness with circuit breakers.Adv

    Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, and Dan Hendrycks. Improving alignment and robustness with circuit breakers.Adv. Neural Inform. Process. Syst. (NeurIPS), 37:83345–83373, 2024

  20. [20]

    Representation bending for large language model safety

    Ashkan Yousefpour, Taeheon Kim, Ryan Sungmo Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, and Jonghyun Choi. Representation bending for large language model safety. InAnn. Meet. Assoc. Comput. Linguistics (ACL), pages 24073–24098, 2025

  21. [21]

    Latent adver- sarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549, 2024

    Abhay Sheshadri, Aidan Ewart, Phillip Guo, Aengus Lynch, Cindy Wu, Vivek Hebbar, Henry Sleight, Asa Cooper Stickland, Ethan Perez, Dylan Hadfield-Menell, et al. Latent adver- sarial training improves robustness to persistent harmful behaviors in llms.arXiv preprint arXiv:2407.15549, 2024

  22. [22]

    On effects of steering latent representation for large language model unlearning

    Huu-Tien Dang, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue. On effects of steering latent representation for large language model unlearning. InAAAI, volume 39, pages 23733–23742, 2025

  23. [23]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InAdv. Neural Inform. Process. Syst. (NeurIPS), 2024

  24. [24]

    Llms encode harmful- ness and refusal separately

    Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, and Weiyan Shi. Llms encode harmful- ness and refusal separately. InAdv. Neural Inform. Process. Syst. (NeurIPS), 2025

  25. [25]

    A General Language Assistant as a Laboratory for Alignment

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

  26. [26]

    Training language models to follow instructions with human feedback.Adv

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Adv. Neural Inform. Process. Syst. (NeurIPS), 35: 27730–27744, 2022

  27. [27]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  28. [28]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. InInt. Conf. Mach. Learn. (ICML), 2024

  29. [29]

    Toy Models of Superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition.arXiv preprint arXiv:2209.10652, 2022

  30. [30]

    The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets.arXiv preprint arXiv:2310.06824, 2023

  31. [31]

    Steering Language Models With Activation Engineering

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

  32. [32]

    Steering llama 2 via contrastive activation addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Matt Turner. Steering llama 2 via contrastive activation addition. InAnn. Meet. Assoc. Comput. Linguistics (ACL), pages 15504–15522, 2024. 12

  33. [33]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to ai transparency.arXiv preprint arXiv:2310.01405, 2023

  34. [34]

    Diff-in-means concept editing is worst-case optimal, 2023

    Nora Belrose. Diff-in-means concept editing is worst-case optimal, 2023. URL https: //blog.eleuther.ai/diff-in-means/. Accessed: 2026-04-29

  35. [35]

    Inference- time intervention: Eliciting truthful answers from a language model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InAdv. Neural Inform. Process. Syst. (NeurIPS), 2023

  36. [36]

    Linear Representations of Sentiment in Large Language Models

    Curt Tigges, Oskar John Hollinsworth, Atticus Geiger, and Neel Nanda. Linear representations of sentiment in large language models.arXiv preprint arXiv:2310.15154, 2023

  37. [37]

    Improving instruction-following in language models through activation steering

    Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering. InInt. Conf. Learn. Rep. (ICLR), 2025

  38. [38]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  39. [39]

    Qwen2.5 Technical Report

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  40. [40]

    Openai usage policies, 2025

    OpenAI. Openai usage policies, 2025. URL https://openai.com/policies/ usage-policies. Accessed: 2026-05-19

  41. [41]

    Exponential moving average of weights in deep learning: Dynamics and benefits.Transactions on Machine Learning Research Journal, pages 1–27, 2024

    Daniel Morales Brotons, Thijs V ogels, and Hadrien Hendrikx. Exponential moving average of weights in deep learning: Dynamics and benefits.Transactions on Machine Learning Research Journal, pages 1–27, 2024

  42. [42]

    Rethinking safety in llm fine-tuning: An optimization perspective

    Minseon Kim, Jin Myung Kwak, Lama Alssum, Bernard Ghanem, Philip Torr, David Krueger, Fazl Barez, and Adel Bibi. Rethinking safety in llm fine-tuning: An optimization perspective. InSecond Conference on Language Modeling, 2025

  43. [43]

    Pku-saferlhf: Towards multi-level safety alignment for llms with human preference

    Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Alex Qiu, Jiayi Zhou, Kaile Wang, Boxun Li, et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. InAnn. Meet. Assoc. Comput. Linguistics (ACL), pages 31983–32016, 2025

  44. [44]

    Fine-tuning aligned language models compromises safety, even when users do not intend to! In Int

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! In Int. Conf. Learn. Rep. (ICLR), 2024

  45. [45]

    Jailbreakbench: An open robustness benchmark for jailbreaking large language models

    Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas Flammarion, George J Pappas, Florian Tramer, et al. Jailbreakbench: An open robustness benchmark for jailbreaking large language models. Adv. Neural Inform. Process. Syst. (NeurIPS), 37:55005–55029, 2024

  46. [46]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. InConf. North Am. Chapter Assoc. Comput. Linguistics (NAACL), pages 5377–5400, 2024

  47. [47]

    The art of saying no: Contextual noncompliance in language models.Adv

    Faeze Brahman, Sachin Kumar, Vidhisha Balachandran, Pradeep Dasigi, Valentina Pyatkin, Abhilasha Ravichander, Sarah Wiegreffe, Nouha Dziri, Khyathi Chandu, Jack Hessel, et al. The art of saying no: Contextual noncompliance in language models.Adv. Neural Inform. Process. Syst. (NeurIPS), 37:49706–49748, 2024. 13

  48. [48]

    Measuring massive multitask language understanding.Int

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding.Int. Conf. Learn. Rep. (ICLR), 2021

  49. [49]

    Aligning ai with shared human values.Int

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. Aligning ai with shared human values.Int. Conf. Learn. Rep. (ICLR), 2021

  50. [50]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  51. [51]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models.arXiv preprint arXiv:2311.07911, 2023

  52. [52]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374, 2021

  53. [53]

    Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues

    Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, et al. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. InAnn. Meet. Assoc. Comput. Linguistics (ACL), pages 7421–7454, 2024

  54. [54]

    A survey on llm-as-a-judge.The Innovation, 2024

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge.The Innovation, 2024

  55. [55]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. InConf. Empir . Methods Nat. Lang. Process. (EMNLP), pages 3029–3051, 2023

  56. [56]

    The geometry of refusal in large language models: Concept cones and representational independence

    Tom Wollschläger, Jannes Elstner, Simon Geisler, Vincent Cohen-Addad, Stephan Günnemann, and Johannes Gasteiger. The geometry of refusal in large language models: Concept cones and representational independence. InInternational Conference on Machine Learning, pages 66945–66970. PMLR, 2025

  57. [57]

    The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions

    Wenbo Pan, Zhichao Liu, Qiguang Chen, Xiangyang Zhou, Yu Haining, and Xiaohua Jia. The hidden dimensions of llm alignment: A multi-dimensional analysis of orthogonal safety directions. InInt. Conf. Mach. Learn. (ICML), pages 47697–47716. PMLR, 2025

  58. [58]

    Differentiated directional intervention: A framework for evading llm safety alignment

    Peng Zhang and Peijie Sun. Differentiated directional intervention: A framework for evading llm safety alignment. InAAAI, volume 40, pages 38102–38110, 2026

  59. [59]

    Alphasteer: Learning refusal steering with principled null-space constraint.arXiv preprint arXiv:2506.07022, 2025

    Leheng Sheng, Changshuo Shen, Weixiang Zhao, Junfeng Fang, Xiaohao Liu, Zhenkai Liang, Xiang Wang, An Zhang, and Tat-Seng Chua. Alphasteer: Learning refusal steering with principled null-space constraint.arXiv preprint arXiv:2506.07022, 2025

  60. [60]

    Analysing the generalisation and reliability of steering vectors.Adv

    Daniel Tan, David Chanin, Aengus Lynch, Brooks Paige, Dimitrios Kanoulas, Adrià Garriga- Alonso, and Robert Kirk. Analysing the generalisation and reliability of steering vectors.Adv. Neural Inform. Process. Syst. (NeurIPS), 37:139179–139212, 2024

  61. [61]

    Or-bench: An over-refusal benchmark for large language models

    Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh. Or-bench: An over-refusal benchmark for large language models. InInt. Conf. Mach. Learn. (ICML), pages 11515–11542. PMLR, 2025

  62. [62]

    arXiv preprint arXiv:2506.12880 , year=

    Matan Ben-Tov, Mor Geva, and Mahmood Sharif. Universal jailbreak suffixes are strong attention hijackers.arXiv preprint arXiv:2506.12880, 2025

  63. [63]

    Flipattack: Jailbreak llms via flipping

    Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Yingwei Ma, Jiaheng Zhang, and Bryan Hooi. Flipattack: Jailbreak llms via flipping. InInt. Conf. Mach. Learn. (ICML), pages 38623–38663. PMLR, 2025. 14

  64. [64]

    Many-shot jailbreaking.Adv

    Cem Anil, Esin Durmus, Nina Panickssery, Mrinank Sharma, Joe Benton, Sandipan Kundu, Joshua Batson, Meg Tong, Jesse Mu, Daniel Ford, et al. Many-shot jailbreaking.Adv. Neural Inform. Process. Syst. (NeurIPS), 37:129696–129742, 2024

  65. [65]

    Harmbench: A standardized evaluation framework for automated red teaming and robust refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal. InInt. Conf. Mach. Learn. (ICML), 2024

  66. [66]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b.arXiv preprint arXiv:23...

  67. [67]

    Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Qin Cai, Vishrav Chaudhary, Dong Chen, Dongdong Chen, Weizhu Chen, Yen-Chun Chen, Yi-Ling Chen, Hao Cheng, Parul Chopra, Xiyang Dai, Matt...

  68. [68]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  69. [69]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023

  70. [70]

    Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Adv

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Adv. Neural Inform. Process. Syst. (NeurIPS), 37:8093–8131, 2024

  71. [71]

    What are some good books on Roman history?

    Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, et al. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models.Adv. Neural Inform. Process. Syst. (NeurIPS), 37:47094–47165, 2024. A Internal Representations of Harmfulness and R...