Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

Dongqi Han; Linghui Li; Yanchen Yin

arxiv: 2606.28153 · v1 · pith:JZFZLXG2new · submitted 2026-06-26 · 💻 cs.CR · cs.AI

Robust Harmful Features Under Jailbreak Attacks: Mechanistic Evidence from Attention Head Specialization in Large Language Models

Yanchen Yin , Dongqi Han , Linghui Li This is my paper

Pith reviewed 2026-06-29 03:31 UTC · model grok-4.3

classification 💻 cs.CR cs.AI

keywords jailbreak attacksattention headsLLM safetymechanistic interpretabilityadversarial robustnessharmful featuresattention specializationrefusal mechanisms

0 comments

The pith

Jailbreak attacks selectively suppress early attention heads while mid-layer safety signals persist and enable training-free detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Jailbreak attacks on large language models do not erase safety features outright but instead target and suppress specific attention heads in early layers. These Adversarially Compromised Heads normally contribute to refusal decisions, and their suppression by attack templates allows harmful outputs. Safety-Aligned Heads in middle layers continue to produce activations tied to harmful content even after attacks succeed. The paper shows that reading these persistent mid-layer activations directly, without any training, produces competitive detection performance that holds up under adversarial conditions.

Core claim

Attacks do not comprehensively eliminate safety features but selectively suppress Adversarially Compromised Heads (ACHs) concentrated in early layers while Safety-Aligned Heads (SAHs) in mid-layers maintain robust activations even when attacks succeed. Ablation studies show that suppressing a small number of ACHs induces jailbreak-like behavior on normally refused inputs and that removing SAHs weakens mid-layer safety activations. Token-level attribution links ACH suppression specifically to attack-template tokens, providing a mechanistic account of bypass while internal safety signals from SAHs remain, a phenomenon termed Robust Harmful Features. Simply reading these persistent activations

What carries the argument

Specialization of attention heads into Adversarially Compromised Heads (ACHs) in early layers, which handle refusal and get suppressed by attack tokens, and Safety-Aligned Heads (SAHs) in mid-layers, which sustain harmful-content activations.

If this is right

Suppressing a small number of ACHs is sufficient to induce jailbreak-like behavior on normally refused inputs.
Removing SAHs substantially weakens mid-layer safety activations.
ACH suppression is driven specifically by attack-template tokens rather than content tokens.
Reading the persistent SAH activations directly produces competitive detection with strong adversarial robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Safety mechanisms appear modular, with some components surviving even when others are bypassed, which could guide more targeted alignment techniques.
Internal activation monitoring could be added at inference time to flag jailbreaks without requiring separate classifiers.
The early-versus-mid layer distinction may point to different safety roles that respond differently to various attack types.
The same reading approach might transfer to detecting other forms of misalignment beyond template-based jailbreaks.

Load-bearing premise

The head identification and suppression experiments accurately isolate effects on refusal and safety activations without confounding changes to other model components or behaviors.

What would settle it

An experiment in which suppressing the identified ACHs fails to increase jailbreak success rates on refused inputs, or in which reading the SAH activations yields no improvement in attack detection over a random baseline.

Figures

Figures reproduced from arXiv: 2606.28153 by Dongqi Han, Linghui Li, Yanchen Yin.

**Figure 1.** Figure 1: Activation of the refusal direction across model components under harmful and attack inputs. Key observation: although attacks successfully bypass refusal, mid-layers still exhibit substantial activations (red regions), suggesting attacks bypass rather than eliminate safety representations. To investigate this question, we first visualize activation patterns of the refusal direction across layers and com… view at source ↗

**Figure 2.** Figure 2: Overview of our framework. Our work reveals the mechanistic interplay between ACHs (suppressed by attacks) and SAHs (robustly activated), and leverages this understanding for training-free detection. can precisely project the refusal direction r (L ∗ ) into each head’s output space. This head-specific direction quantifies each head’s potential contribution to the final refusal signal (see Appendix B for a… view at source ↗

**Figure 3.** Figure 3: Standardized activation heatmaps of attention heads under three input types (top: Llama-3-8B-Instruct; bottom: Llama-2- 7B-Chat). Color intensity indicates activation magnitude (red: positive; blue: negative). Attack inputs induce a polarized pattern, with different heads exhibiting heterogeneous responses. 0 1 2 3 4 5 6 7 8 9 10111213141516171819202122232425262728293031 Head Index 0 2 4 6 8 10 12 14 Layer… view at source ↗

**Figure 4.** Figure 4 [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 7.** Figure 7: Token-level attribution analysis (top: Llama-3-8BInstruct; bottom: Llama-2-7B-Chat). Left: contribution from harmful tokens (CH); middle: contribution from jailbreaktemplate tokens (CJ ); right: total contribution (CTotal). Red = ACHs , yellow = SAHs , blue = harmful-salient heads. Leveraging the linearity of the attention mechanism, we attribute each head’s activation to input tokens. For the (l, h)- t… view at source ↗

read the original abstract

Jailbreak attacks bypass LLM safety alignment, yet their mechanisms remain poorly understood. We provide evidence that attacks do not comprehensively eliminate safety features, but instead selectively suppress specific attention heads. We identify two functionally differentiated types: Adversarially Compromised Heads (ACHs) concentrated in early layers, which are suppressed under attacks, and Safety-Aligned Heads (SAHs) in mid-layers, which maintain robust activations even when attacks succeed. Ablation studies support the causal role of ACHs and the contribution of SAHs to robust activations: suppressing a small number of ACHs is sufficient to induce jailbreak-like behavior on normally refused inputs, while removing SAHs substantially weakens mid-layer safety activations. Token-level attribution further shows that ACH suppression is driven specifically by attack-template tokens, providing a mechanistic account of why attacks can bypass refusal decisions through ACH suppression while leaving internal safety signals sustained by SAHs -- a phenomenon we term Robust Harmful Features. To validate the practical significance of this robustness, we show that simply reading these persistent activations -- without any training -- yields competitive aggregate detection performance with strong adversarial robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Jailbreaks suppress specific early attention heads while leaving mid-layer safety signals intact, but the abstract shows no numbers or controls to evaluate the ablations.

read the letter

The punchline here is that the paper presents evidence for selective suppression of attention heads under jailbreak attacks rather than a complete breakdown of safety features, with early ACHs getting suppressed and mid-layer SAHs staying robust enough for detection by reading activations alone.

What stands out as new is the functional split into Adversarially Compromised Heads and Safety-Aligned Heads, along with the term Robust Harmful Features for the persistent signals. The token attribution to attack-template tokens gives a mechanistic story for how attacks bypass refusal. This builds on prior attention analysis work but focuses it on the jailbreak setting in a way that could inform better monitoring.

The paper does well at laying out a causal story from the ablations: knocking out a few ACHs triggers harmful outputs on refused inputs, and taking out SAHs reduces the safety activations. That kind of targeted intervention is the right direction for interpretability claims.

The soft spots center on the missing details. The abstract talks about ablation and attribution studies but supplies no quantitative outcomes, no description of how heads were identified, no mention of controls or statistical tests. This leaves the soundness low because we cannot check whether the head interventions are clean or if they affect the model more broadly through the residual stream. The stress-test note about potential global changes from suppression is a legitimate concern that needs addressing with more careful experiments.

Overall this is for researchers focused on mechanistic explanations of LLM safety and those building detection methods. A reader who wants concrete ideas for robust features would find it relevant, provided the full results back the claims.

I would recommend sending it to peer review. The question matters for deployed models, and the framing is specific enough that referees can evaluate the evidence once the numbers and methods are shown in detail.

Referee Report

3 major / 2 minor

Summary. The paper claims that jailbreak attacks on LLMs do not eliminate safety features wholesale but instead selectively suppress Adversarially Compromised Heads (ACHs) concentrated in early layers, while Safety-Aligned Heads (SAHs) in mid-layers maintain robust activations. Ablation experiments are presented as evidence that suppressing a small number of ACHs induces jailbreak-like outputs on refused inputs and that removing SAHs weakens safety signals; token-level attribution links ACH suppression to attack-template tokens. The work further asserts that reading these persistent SAH activations enables competitive zero-shot detection with strong adversarial robustness, a phenomenon termed Robust Harmful Features.

Significance. If the ablation and attribution results hold after addressing isolation concerns, the distinction between selectively suppressed and persistently active safety-related heads would provide a useful mechanistic account of partial safety bypass in LLMs. The zero-training detection result would also have practical value for robust monitoring. The manuscript does not yet supply the quantitative controls, error bars, or dataset details needed to evaluate whether these claims are supported.

major comments (3)

[Ablation studies] Ablation studies section: the central causal claim that targeted ACH suppression induces jailbreak behavior rests on the assumption that zeroing or masking the identified heads affects only those components. The experiments do not report controls for global changes to the residual stream or downstream attention patterns that could produce non-specific degradation, as raised in the stress-test note. This directly undermines the interpretation that the observed refusal failures are due to selective removal of safety features rather than broader disruption.
[Token-level attribution] Token-level attribution paragraph: the attribution of ACH suppression specifically to attack-template tokens is presented as mechanistic evidence, yet the method's dependence on the same head interventions used in the ablations is not examined. If attribution scores themselves shift under the interventions, the claimed specificity of the template-token effect cannot be isolated from the intervention artifact.
[Detection experiments] Detection experiments: the claim of 'competitive aggregate detection performance with strong adversarial robustness' from reading persistent activations is load-bearing for the practical significance but is stated without reported metrics, baselines, dataset sizes, or error bars, making it impossible to assess whether the result supports the robustness conclusion.

minor comments (2)

[Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., number of ACHs/SAHs identified, ablation success rate, or detection AUC) to allow readers to gauge effect sizes immediately.
[Methods] Head identification criteria (how ACHs and SAHs are distinguished from other heads) are referenced but not stated with sufficient precision for replication; a dedicated methods subsection with thresholds or statistical tests would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each of the major comments in detail below and have made revisions to the manuscript to strengthen the presentation of our results.

read point-by-point responses

Referee: Ablation studies section: the central causal claim that targeted ACH suppression induces jailbreak behavior rests on the assumption that zeroing or masking the identified heads affects only those components. The experiments do not report controls for global changes to the residual stream or downstream attention patterns that could produce non-specific degradation, as raised in the stress-test note. This directly undermines the interpretation that the observed refusal failures are due to selective removal of safety features rather than broader disruption.

Authors: We thank the referee for highlighting this important point regarding the specificity of our ablation experiments. We agree that explicit controls for non-specific effects would bolster the causal claims. In the revised manuscript, we have added results from ablating randomly selected heads of comparable count and layer distribution, demonstrating that such ablations do not induce similar jailbreak-like behaviors. Additionally, we report the L2 norm changes in the residual stream post-ablation to confirm minimal global disruption. These additions are detailed in the updated Section 3.2 and Appendix B. revision: yes
Referee: Token-level attribution paragraph: the attribution of ACH suppression specifically to attack-template tokens is presented as mechanistic evidence, yet the method's dependence on the same head interventions used in the ablations is not examined. If attribution scores themselves shift under the interventions, the claimed specificity of the template-token effect cannot be isolated from the intervention artifact.

Authors: The token attribution method relies on gradient-based or activation-based attribution computed on the unmodified model to identify the influence of input tokens on head activations. We have clarified this distinction in the revised text. To address the concern, we performed additional checks where we recompute attributions after minor interventions and observe that the relative contribution of template tokens remains consistent. A full examination under the exact ablation conditions is now included as a sensitivity analysis in the appendix. revision: yes
Referee: Detection experiments: the claim of 'competitive aggregate detection performance with strong adversarial robustness' from reading persistent activations is load-bearing for the practical significance but is stated without reported metrics, baselines, dataset sizes, or error bars, making it impossible to assess whether the result supports the robustness conclusion.

Authors: We agree that providing explicit metrics, baselines, and error bars is necessary for evaluating the claim. We have expanded the detection experiments section to include these details, with specific performance numbers, baseline comparisons, and statistical measures now reported in a new table. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental measurements, not self-referential definitions or fitted predictions

full rationale

The paper's central claims derive from direct experimental procedures: head identification via activation analysis, targeted ablations (suppressing ACHs or SAHs), and token-level attribution on attack templates. These steps measure observable effects in the model (e.g., changes in refusal behavior or activation strength) rather than deriving a result that is definitionally equivalent to its inputs. No equations, parameter fits, or predictions are presented that reduce by construction to the same data used to define ACHs/SAHs. Self-citations are absent from the provided claims, and the ablation logic is falsifiable via the reported interventions without circular dependence on prior author results. The derivation chain is therefore self-contained as standard mechanistic interpretability work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no free parameters, mathematical axioms, or new postulated entities beyond labeling existing attention heads; all claims rest on described experimental observations.

pith-pipeline@v0.9.1-grok · 5731 in / 1076 out tokens · 50845 ms · 2026-06-29T03:31:21.064652+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

99 extracted references · 4 canonical work pages · 1 internal anchor

[1]

and Kamfonas, M

Alon, G. and Kamfonas, M. J. Detecting language model attacks with perplexity. In The Second Tiny Papers Track at ICLR 2024, 2024. URL https://openreview.net/forum?id=lNLVvdHyAw. OpenReview

2024
[2]

Many-shot jailbreaking

Anil, C., Durmus, E., Rimsky, N., et al. Many-shot jailbreaking. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=cw5mgd71jW. OpenReview

2024
[3]

Refusal in language models is mediated by a single direction

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems, volume 37, 2024. OpenReview

2024
[6]

E., Hume, T., Carter, S., Henighan, T., and Olah, C

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing languag...

2023
[8]

J., and Wong, E

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=hkjcdmz8Ro. OpenReview

2024
[9]

Llama guard 3 vision: Safeguarding human-ai image understanding conversations, 2024

Chi, J., Karn, U., Zhan, H., et al. Llama guard 3 vision: Safeguarding human-ai image understanding conversations, 2024. URL https://arxiv.org/abs/2411.10414v1

arXiv 2024
[10]

N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A

Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://openreview.net/forum?id=89ia77nZ8u. NeurIPS 2023 Spotlight

2023
[11]

OR-Bench : An over-refusal benchmark for large language models

Cui, J., Chiang, W.-L., Stoica, I., and Hsieh, C.-J. OR-Bench : An over-refusal benchmark for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=obYVdcMMIT. OpenReview

2025
[12]

A mathematical framework for transformer circuits

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circ...

2021
[13]

Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M

Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 5484--5495. Association for Computational Linguistics, 2021. doi:10.18653/v1/2021.emnlp-main.446

work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021
[15]

AEGIS2.0 : A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails

Ghosh, S., Varshney, P., Gaur, A., et al. AEGIS2.0 : A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. In Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, 2025

2025
[16]

WILDGUARD : Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLM s

Han, S., Rao, K., Ettinger, A., et al. WILDGUARD : Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLM s. In Advances in Neural Information Processing Systems, volume 37, 2024. OpenReview

2024
[17]

Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes

Hu, X., Chen, P.-Y., and Ho, T.-Y. Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes. In Advances in Neural Information Processing Systems, volume 37, pp.\ 126265--126296, 2024. doi:10.52202/079017-4011

work page doi:10.52202/079017-4011 2024
[19]

Position: TRUSTLLM : Trustworthiness in large language models

Huang, Y., Sun, L., Wang, H., et al. Position: TRUSTLLM : Trustworthiness in large language models. In Forty-first International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 20166--20212. PMLR, 2024

2024
[21]

WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models

Jiang, L., Rao, K., Han, S., Ettinger, A., Brahman, F., Kumar, S., Mireshghallah, N., Lu, X., Sap, M., Choi, Y., and Dziri, N. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems, volume 37, 2024. OpenReview

2024
[22]

W., Padhi, I., Ramamurthy, K

Lee, B. W., Padhi, I., Ramamurthy, K. N., Miehling, E., Dognin, P., Nagireddy, M., and Dhurandhar, A. Programming refusal with conditional activation steering, 2024. URL https://arxiv.org/abs/2409.05907

arXiv 2024
[23]

SALAD-Bench : A hierarchical and comprehensive safety benchmark for large language models

Li, L., Dong, B., Wang, R., et al. SALAD-Bench : A hierarchical and comprehensive safety benchmark for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 11097--11119. Association for Computational Linguistics, 2024 a . URL https://arxiv.org/abs/2402.05044

arXiv 2024
[24]

Jailbreak attack for large language models: A survey

Li, N., Ding, Y., Jiang, H., Niu, J., and Yi, P. Jailbreak attack for large language models: A survey. Journal of Computer Research and Development, 61 0 (5): 0 1156--1181, 2024 b . doi:10.7544/issn1000-1239.202330962. URL https://crad.ict.ac.cn/en/article/doi/10.7544/issn1000-1239.202330962

work page doi:10.7544/issn1000-1239.202330962 2024
[26]

ToxicChat : Unveiling hidden challenges of toxicity detection in real-world user- AI conversation

Lin, Z., Zeng, Z., Gao, Z., et al. ToxicChat : Unveiling hidden challenges of toxicity detection in real-world user- AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 4694--4702. Association for Computational Linguistics, 2023

2023
[27]

AutoDAN : Generating stealthy jailbreak prompts on aligned large language models

Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN : Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=7Jwpw4qKkb. OpenReview

2024
[28]

A holistic approach to undesired content detection in the real world, 2022

Markov, T., Zhang, C., Agarwal, S., et al. A holistic approach to undesired content detection in the real world, 2022. URL https://arxiv.org/abs/2208.03274v3

arXiv 2022
[29]

J., Belinkov, Y., Bau, D., and Mueller, A

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=I4e82CIDxv. OpenReview

2025
[30]

HarmBench : A standardized evaluation framework for automated red teaming and robust refusal

Mazeika, M., Phan, L., Yin, X., et al. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal. In Forty-first International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 35075--35110. PMLR, 2024. URL https://openreview.net/forum?id=f9mhWhwDJz

2024
[31]

Copy suppression: Comprehensively understanding an attention head, 2023

McDougall, C., Conmy, A., Rushing, C., McGrath, T., and Nanda, N. Copy suppression: Comprehensively understanding an attention head, 2023. URL https://arxiv.org/abs/2310.04625v1

arXiv 2023
[32]

Progress measures for grokking via mechanistic interpretability

Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW. OpenReview

2023
[33]

In-context learning and induction heads

Olsson, C., Elhage, N., Nanda, N., et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. arXiv:2209.11895

Pith/arXiv arXiv 2022
[34]

Causality: Models, Reasoning, and Inference

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009. ISBN 978-0-521-89560-6

2009
[35]

Robey, A., Wong, E., Hassani, H., and Pappas, G. J. SmoothLLM : Defending large language models against jailbreaking attacks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=xq7h9nfdY2. OpenReview

2024
[36]

Axiomatic attribution for deep networks

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.\ 3319--3328. PMLR, 2017. URL https://proceedings.mlr.press/v70/sundararajan17a.html

2017
[37]

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023. GitHub repository

2023
[39]

a ger, T., Elstner, J., Geisler, S., Cohen-Addad, V., G \

Wollschl \"a ger, T., Elstner, J., Geisler, S., Cohen-Addad, V., G \"u nnemann, S., and Gasteiger, J. The geometry of refusal in large language models: Concept cones and representational independence. In Forty-second International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2025. URL https://openreview.net/forum?id=80IwJqlXs8

2025
[41]

Y., and Poovendran, R

Xu, Z., Jiang, F., Niu, L., Jia, J., Lin, B. Y., and Poovendran, R. SafeDecoding : Defending against jailbreak attacks via safety-aware decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5587--5605, 2024. URL https://arxiv.org/abs/2402.08983

arXiv 2024
[42]

J., Prakash, N., Neo, C., Satapathy, R., Lee, R

Yeo, W. J., Prakash, N., Neo, C., Satapathy, R., Lee, R. K.-W., and Cambria, E. Understanding refusal in language models with sparse autoencoders. In Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, 2025. doi:10.18653/v1/2025.findings-emnlp.338

work page doi:10.18653/v1/2025.findings-emnlp.338 2025
[43]

Qwen3guard technical report, 2025 a

Zhao, H., Yuan, C., Huang, F., et al. Qwen3guard technical report, 2025 a . URL https://arxiv.org/abs/2510.14276v1

Pith/arXiv arXiv 2025
[44]

Zhao, Y., Zhang, W., Xie, Y., Goyal, A., Kawaguchi, K., and Shieh, M. Q. Understanding and enhancing safety mechanisms of LLM s via safety-specific neuron. In The Thirteenth International Conference on Learning Representations, 2025 b . URL https://openreview.net/forum?id=yR47RmND1m. OpenReview

2025
[45]

How alignment and jailbreak work: Explain LLM safety through intermediate hidden states

Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., and Li, Y. How alignment and jailbreak work: Explain LLM safety through intermediate hidden states. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 2461--2488. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.findings-emnlp.139/

2024
[46]

On the role of attention heads in large language model safety, 2025

Zhou, Z., Yu, H., Zhang, X., et al. On the role of attention heads in large language model safety, 2025. URL https://arxiv.org/abs/2410.13708v1

arXiv 2025
[47]

Z., and Fredrikson, M

Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. In Advances in Neural Information Processing Systems, volume 36, pp.\ 1--24, 2023. URL https://arxiv.org/abs/2307.15043

Pith/arXiv arXiv 2023
[48]

Z., Fredrikson, M., and Hendrycks, D

Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Wang, R., Kolter, J. Z., Fredrikson, M., and Hendrycks, D. Improving alignment and robustness with circuit breakers. In Advances in Neural Information Processing Systems, volume 37, 2024. URL https://openreview.net/forum?id=IbIB8SBKFV. OpenReview

2024
[49]

2303.08774v4 , archivePrefix=

OpenAI and Josh Achiam and Steven Adler and Sandhini Agarwal and others , year=. 2303.08774v4 , archivePrefix=

Pith/arXiv arXiv
[50]

2025 , eprint=

A Survey of Large Language Models , author=. 2025 , eprint=

2025
[51]

2025 , journal =

Naveed, Humza and Khan, Asad Ullah and Qiu, Shi and others , title =. 2025 , journal =

2025
[52]

2025 , eprint=

Gemma 3 Technical Report , author=. 2025 , eprint=

2025
[53]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

2023
[54]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024
[55]

Advances in Neural Information Processing Systems , volume=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023
[56]

2024 , url=

Xiaogeng Liu and Nan Xu and Muhao Chen and Chaowei Xiao , booktitle=. 2024 , url=

2024
[57]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Many-shot Jailbreaking , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=
[58]

The Twelfth International Conference on Learning Representations , year=

Jailbreaking Black Box Large Language Models in Twenty Queries , author=. The Twelfth International Conference on Learning Representations , year=
[59]

Journal of Computer Research and Development , volume =

Jailbreak Attack for Large Language Models: A Survey , author =. Journal of Computer Research and Development , volume =. 2024 , doi =

2024
[60]

Advances in Neural Information Processing Systems , volume=

Improving Alignment and Robustness with Circuit Breakers , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024
[61]

2024 , url=

Zhangchen Xu and Fengqing Jiang and Luyao Niu and Jinyuan Jia and Bill Yuchen Lin and Radha Poovendran , booktitle=. 2024 , url=

2024
[62]

Pappas , booktitle=

Alexander Robey and Eric Wong and Hamed Hassani and George J. Pappas , booktitle=. 2024 , url=

2024
[63]

The Second Tiny Papers Track at ICLR 2024 , year=

Detecting Language Model Attacks With Perplexity , author=. The Second Tiny Papers Track at ICLR 2024 , year=

2024
[64]

2024 , url=

Yifan Zeng and Yiran Wu and Xiao Zhang and Huazheng Wang and Qingyun Wu , booktitle=. 2024 , url=

2024
[65]

2024 , url=

Mansi Phute and Alec Helbling and Matthew Daniel Hull and ShengYun Peng and Sebastian Szyller and Cory Cornelius and Duen Horng Chau , booktitle=. 2024 , url=

2024
[66]

Advances in Neural Information Processing Systems , volume=

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes , author=. Advances in Neural Information Processing Systems , volume=. 2024 , doi=

2024
[67]

2024 , eprint=

Programming Refusal with Conditional Activation Steering , author=. 2024 , eprint=

2024
[68]

Llama Guard:

Hakan Inan and Kartikeya Upasani and Jianfeng Chi and others , year=. Llama Guard:. 2312.06674v2 , archivePrefix=

Pith/arXiv arXiv
[69]

2024 , eprint=

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations , author=. 2024 , eprint=

2024
[70]

2025 , eprint=

Qwen3Guard Technical Report , author=. 2025 , eprint=

2025
[71]

2024 , note=

Seungju Han and Kavel Rao and Allyson Ettinger and others , booktitle=. 2024 , note=

2024
[72]

2025 , publisher =

Ghosh, Shaona and Varshney, Prasoon and Gaur, Anu and others , booktitle =. 2025 , publisher =

2025
[73]

2024 , publisher=

Mantas Mazeika and Long Phan and Xuwang Yin and others , booktitle=. 2024 , publisher=

2024
[74]

2404.01318v2 , archivePrefix=

Patrick Chao and Edoardo Debenedetti and Alexander Robey and others , year=. 2404.01318v2 , archivePrefix=

Pith/arXiv arXiv
[75]

2311.08370v2 , archivePrefix=

Bertie Vidgen and Nino Scherrer and Hannah Rose Kirk and others , year=. 2311.08370v2 , archivePrefix=

arXiv
[76]

Position:

Yue Huang and Lichao Sun and Haoran Wang and others , booktitle=. Position:. 2024 , publisher=

2024
[77]

2024 , publisher=

Lijun Li and Bowen Dong and Ruohui Wang and others , booktitle=. 2024 , publisher=

2024
[78]

2023 , publisher =

Lin, Zi and Zeng, Zihan and Gao, Zi and others , booktitle =. 2023 , publisher =

2023
[79]

2022 , eprint=

A Holistic Approach to Undesired Content Detection in the Real World , author=. 2022 , eprint=

2022
[80]

2404.05993v2 , archivePrefix=

Shaona Ghosh and Prasoon Varshney and Erick Galinkin and Christopher Parisien , year=. 2404.05993v2 , archivePrefix=

arXiv
[81]

2024 , note=

Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha , booktitle=. 2024 , note=

2024
[82]

2025 , url=

Cui, Justin and Chiang, Wei-Lin and Stoica, Ion and Hsieh, Cho-Jui , booktitle=. 2025 , url=

2025
[83]

Advances in Neural Information Processing Systems , volume=

Refusal in Language Models is Mediated by a Single Direction , author=. Advances in Neural Information Processing Systems , volume=. 2024 , note=

2024
[84]

Forty-second International Conference on Machine Learning , series=

The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence , author=. Forty-second International Conference on Machine Learning , series=. 2025 , publisher=

2025
[85]

Findings of the Association for Computational Linguistics: EMNLP 2025 , year =

Understanding Refusal in Language Models with Sparse Autoencoders , author =. Findings of the Association for Computational Linguistics: EMNLP 2025 , year =

2025
[86]

Safety Layers in Aligned Large Language Models: The Key to

Shen Li and Liuyi Yao and Lan Zhang and Yaliang Li , year=. Safety Layers in Aligned Large Language Models: The Key to. 2408.17003v1 , archivePrefix=

arXiv
[87]

Understanding and Enhancing Safety Mechanisms of

Zhao, Yiran and Zhang, Wenxuan and Xie, Yuxi and Goyal, Anirudh and Kawaguchi, Kenji and Shieh, Michael Qizhe , booktitle =. Understanding and Enhancing Safety Mechanisms of. 2025 , url =

2025
[88]

2025 , eprint=

On the Role of Attention Heads in Large Language Model Safety , author=. 2025 , eprint=

2025
[89]

Transformer Circuits Thread , year=

A Mathematical Framework for Transformer Circuits , author=. Transformer Circuits Thread , year=

Showing first 80 references.

[1] [1]

and Kamfonas, M

Alon, G. and Kamfonas, M. J. Detecting language model attacks with perplexity. In The Second Tiny Papers Track at ICLR 2024, 2024. URL https://openreview.net/forum?id=lNLVvdHyAw. OpenReview

2024

[2] [2]

Many-shot jailbreaking

Anil, C., Durmus, E., Rimsky, N., et al. Many-shot jailbreaking. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=cw5mgd71jW. OpenReview

2024

[3] [3]

Refusal in language models is mediated by a single direction

Arditi, A., Obeso, O., Syed, A., Paleka, D., Panickssery, N., Gurnee, W., and Nanda, N. Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems, volume 37, 2024. OpenReview

2024

[4] [6]

E., Hume, T., Carter, S., Henighan, T., and Olah, C

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing languag...

2023

[5] [8]

J., and Wong, E

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., and Wong, E. Jailbreaking black box large language models in twenty queries. In The Twelfth International Conference on Learning Representations, 2024 b . URL https://openreview.net/forum?id=hkjcdmz8Ro. OpenReview

2024

[6] [9]

Llama guard 3 vision: Safeguarding human-ai image understanding conversations, 2024

Chi, J., Karn, U., Zhan, H., et al. Llama guard 3 vision: Safeguarding human-ai image understanding conversations, 2024. URL https://arxiv.org/abs/2411.10414v1

arXiv 2024

[7] [10]

N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A

Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://openreview.net/forum?id=89ia77nZ8u. NeurIPS 2023 Spotlight

2023

[8] [11]

OR-Bench : An over-refusal benchmark for large language models

Cui, J., Chiang, W.-L., Stoica, I., and Hsieh, C.-J. OR-Bench : An over-refusal benchmark for large language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=obYVdcMMIT. OpenReview

2025

[9] [12]

A mathematical framework for transformer circuits

Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circ...

2021

[10] [13]

Ghandeharioun, A., Caciularu, A., Pearce, A., Dixon, L., and Geva, M

Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 5484--5495. Association for Computational Linguistics, 2021. doi:10.18653/v1/2021.emnlp-main.446

work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021

[11] [15]

AEGIS2.0 : A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails

Ghosh, S., Varshney, P., Gaur, A., et al. AEGIS2.0 : A diverse AI safety dataset and risks taxonomy for alignment of LLM guardrails. In Findings of the Association for Computational Linguistics: NAACL 2025. Association for Computational Linguistics, 2025

2025

[12] [16]

WILDGUARD : Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLM s

Han, S., Rao, K., Ettinger, A., et al. WILDGUARD : Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLM s. In Advances in Neural Information Processing Systems, volume 37, 2024. OpenReview

2024

[13] [17]

Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes

Hu, X., Chen, P.-Y., and Ho, T.-Y. Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes. In Advances in Neural Information Processing Systems, volume 37, pp.\ 126265--126296, 2024. doi:10.52202/079017-4011

work page doi:10.52202/079017-4011 2024

[14] [19]

Position: TRUSTLLM : Trustworthiness in large language models

Huang, Y., Sun, L., Wang, H., et al. Position: TRUSTLLM : Trustworthiness in large language models. In Forty-first International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 20166--20212. PMLR, 2024

2024

[15] [21]

WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models

Jiang, L., Rao, K., Han, S., Ettinger, A., Brahman, F., Kumar, S., Mireshghallah, N., Lu, X., Sap, M., Choi, Y., and Dziri, N. WildTeaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. In Advances in Neural Information Processing Systems, volume 37, 2024. OpenReview

2024

[16] [22]

W., Padhi, I., Ramamurthy, K

Lee, B. W., Padhi, I., Ramamurthy, K. N., Miehling, E., Dognin, P., Nagireddy, M., and Dhurandhar, A. Programming refusal with conditional activation steering, 2024. URL https://arxiv.org/abs/2409.05907

arXiv 2024

[17] [23]

SALAD-Bench : A hierarchical and comprehensive safety benchmark for large language models

Li, L., Dong, B., Wang, R., et al. SALAD-Bench : A hierarchical and comprehensive safety benchmark for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 11097--11119. Association for Computational Linguistics, 2024 a . URL https://arxiv.org/abs/2402.05044

arXiv 2024

[18] [24]

Jailbreak attack for large language models: A survey

Li, N., Ding, Y., Jiang, H., Niu, J., and Yi, P. Jailbreak attack for large language models: A survey. Journal of Computer Research and Development, 61 0 (5): 0 1156--1181, 2024 b . doi:10.7544/issn1000-1239.202330962. URL https://crad.ict.ac.cn/en/article/doi/10.7544/issn1000-1239.202330962

work page doi:10.7544/issn1000-1239.202330962 2024

[19] [26]

ToxicChat : Unveiling hidden challenges of toxicity detection in real-world user- AI conversation

Lin, Z., Zeng, Z., Gao, Z., et al. ToxicChat : Unveiling hidden challenges of toxicity detection in real-world user- AI conversation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 4694--4702. Association for Computational Linguistics, 2023

2023

[20] [27]

AutoDAN : Generating stealthy jailbreak prompts on aligned large language models

Liu, X., Xu, N., Chen, M., and Xiao, C. AutoDAN : Generating stealthy jailbreak prompts on aligned large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=7Jwpw4qKkb. OpenReview

2024

[21] [28]

A holistic approach to undesired content detection in the real world, 2022

Markov, T., Zhang, C., Agarwal, S., et al. A holistic approach to undesired content detection in the real world, 2022. URL https://arxiv.org/abs/2208.03274v3

arXiv 2022

[22] [29]

J., Belinkov, Y., Bau, D., and Mueller, A

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=I4e82CIDxv. OpenReview

2025

[23] [30]

HarmBench : A standardized evaluation framework for automated red teaming and robust refusal

Mazeika, M., Phan, L., Yin, X., et al. HarmBench : A standardized evaluation framework for automated red teaming and robust refusal. In Forty-first International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pp.\ 35075--35110. PMLR, 2024. URL https://openreview.net/forum?id=f9mhWhwDJz

2024

[24] [31]

Copy suppression: Comprehensively understanding an attention head, 2023

McDougall, C., Conmy, A., Rushing, C., McGrath, T., and Nanda, N. Copy suppression: Comprehensively understanding an attention head, 2023. URL https://arxiv.org/abs/2310.04625v1

arXiv 2023

[25] [32]

Progress measures for grokking via mechanistic interpretability

Nanda, N., Chan, L., Lieberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=9XFSbDPmdW. OpenReview

2023

[26] [33]

In-context learning and induction heads

Olsson, C., Elhage, N., Nanda, N., et al. In-context learning and induction heads. Transformer Circuits Thread, 2022. arXiv:2209.11895

Pith/arXiv arXiv 2022

[27] [34]

Causality: Models, Reasoning, and Inference

Pearl, J. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009. ISBN 978-0-521-89560-6

2009

[28] [35]

Robey, A., Wong, E., Hassani, H., and Pappas, G. J. SmoothLLM : Defending large language models against jailbreaking attacks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=xq7h9nfdY2. OpenReview

2024

[29] [36]

Axiomatic attribution for deep networks

Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.\ 3319--3328. PMLR, 2017. URL https://proceedings.mlr.press/v70/sundararajan17a.html

2017

[30] [37]

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., Liang, P., and Hashimoto, T. B. Stanford alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023. GitHub repository

2023

[31] [39]

a ger, T., Elstner, J., Geisler, S., Cohen-Addad, V., G \

Wollschl \"a ger, T., Elstner, J., Geisler, S., Cohen-Addad, V., G \"u nnemann, S., and Gasteiger, J. The geometry of refusal in large language models: Concept cones and representational independence. In Forty-second International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR, 2025. URL https://openreview.net/forum?id=80IwJqlXs8

2025

[32] [41]

Y., and Poovendran, R

Xu, Z., Jiang, F., Niu, L., Jia, J., Lin, B. Y., and Poovendran, R. SafeDecoding : Defending against jailbreak attacks via safety-aware decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5587--5605, 2024. URL https://arxiv.org/abs/2402.08983

arXiv 2024

[33] [42]

J., Prakash, N., Neo, C., Satapathy, R., Lee, R

Yeo, W. J., Prakash, N., Neo, C., Satapathy, R., Lee, R. K.-W., and Cambria, E. Understanding refusal in language models with sparse autoencoders. In Findings of the Association for Computational Linguistics: EMNLP 2025. Association for Computational Linguistics, 2025. doi:10.18653/v1/2025.findings-emnlp.338

work page doi:10.18653/v1/2025.findings-emnlp.338 2025

[34] [43]

Qwen3guard technical report, 2025 a

Zhao, H., Yuan, C., Huang, F., et al. Qwen3guard technical report, 2025 a . URL https://arxiv.org/abs/2510.14276v1

Pith/arXiv arXiv 2025

[35] [44]

Zhao, Y., Zhang, W., Xie, Y., Goyal, A., Kawaguchi, K., and Shieh, M. Q. Understanding and enhancing safety mechanisms of LLM s via safety-specific neuron. In The Thirteenth International Conference on Learning Representations, 2025 b . URL https://openreview.net/forum?id=yR47RmND1m. OpenReview

2025

[36] [45]

How alignment and jailbreak work: Explain LLM safety through intermediate hidden states

Zhou, Z., Yu, H., Zhang, X., Xu, R., Huang, F., and Li, Y. How alignment and jailbreak work: Explain LLM safety through intermediate hidden states. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 2461--2488. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.findings-emnlp.139/

2024

[37] [46]

On the role of attention heads in large language model safety, 2025

Zhou, Z., Yu, H., Zhang, X., et al. On the role of attention heads in large language model safety, 2025. URL https://arxiv.org/abs/2410.13708v1

arXiv 2025

[38] [47]

Z., and Fredrikson, M

Zou, A., Wang, Z., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. In Advances in Neural Information Processing Systems, volume 36, pp.\ 1--24, 2023. URL https://arxiv.org/abs/2307.15043

Pith/arXiv arXiv 2023

[39] [48]

Z., Fredrikson, M., and Hendrycks, D

Zou, A., Phan, L., Wang, J., Duenas, D., Lin, M., Andriushchenko, M., Wang, R., Kolter, J. Z., Fredrikson, M., and Hendrycks, D. Improving alignment and robustness with circuit breakers. In Advances in Neural Information Processing Systems, volume 37, 2024. URL https://openreview.net/forum?id=IbIB8SBKFV. OpenReview

2024

[40] [49]

2303.08774v4 , archivePrefix=

OpenAI and Josh Achiam and Steven Adler and Sandhini Agarwal and others , year=. 2303.08774v4 , archivePrefix=

Pith/arXiv arXiv

[41] [50]

2025 , eprint=

A Survey of Large Language Models , author=. 2025 , eprint=

2025

[42] [51]

2025 , journal =

Naveed, Humza and Khan, Asad Ullah and Qiu, Shi and others , title =. 2025 , journal =

2025

[43] [52]

2025 , eprint=

Gemma 3 Technical Report , author=. 2025 , eprint=

2025

[44] [53]

2023 , eprint=

Llama 2: Open Foundation and Fine-Tuned Chat Models , author=. 2023 , eprint=

2023

[45] [54]

2024 , eprint=

The Llama 3 Herd of Models , author=. 2024 , eprint=

2024

[46] [55]

Advances in Neural Information Processing Systems , volume=

Universal and Transferable Adversarial Attacks on Aligned Language Models , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=

2023

[47] [56]

2024 , url=

Xiaogeng Liu and Nan Xu and Muhao Chen and Chaowei Xiao , booktitle=. 2024 , url=

2024

[48] [57]

The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

Many-shot Jailbreaking , author=. The Thirty-eighth Annual Conference on Neural Information Processing Systems , year=

[49] [58]

The Twelfth International Conference on Learning Representations , year=

Jailbreaking Black Box Large Language Models in Twenty Queries , author=. The Twelfth International Conference on Learning Representations , year=

[50] [59]

Journal of Computer Research and Development , volume =

Jailbreak Attack for Large Language Models: A Survey , author =. Journal of Computer Research and Development , volume =. 2024 , doi =

2024

[51] [60]

Advances in Neural Information Processing Systems , volume=

Improving Alignment and Robustness with Circuit Breakers , author=. Advances in Neural Information Processing Systems , volume=. 2024 , url=

2024

[52] [61]

2024 , url=

Zhangchen Xu and Fengqing Jiang and Luyao Niu and Jinyuan Jia and Bill Yuchen Lin and Radha Poovendran , booktitle=. 2024 , url=

2024

[53] [62]

Pappas , booktitle=

Alexander Robey and Eric Wong and Hamed Hassani and George J. Pappas , booktitle=. 2024 , url=

2024

[54] [63]

The Second Tiny Papers Track at ICLR 2024 , year=

Detecting Language Model Attacks With Perplexity , author=. The Second Tiny Papers Track at ICLR 2024 , year=

2024

[55] [64]

2024 , url=

Yifan Zeng and Yiran Wu and Xiao Zhang and Huazheng Wang and Qingyun Wu , booktitle=. 2024 , url=

2024

[56] [65]

2024 , url=

Mansi Phute and Alec Helbling and Matthew Daniel Hull and ShengYun Peng and Sebastian Szyller and Cory Cornelius and Duen Horng Chau , booktitle=. 2024 , url=

2024

[57] [66]

Advances in Neural Information Processing Systems , volume=

Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes , author=. Advances in Neural Information Processing Systems , volume=. 2024 , doi=

2024

[58] [67]

2024 , eprint=

Programming Refusal with Conditional Activation Steering , author=. 2024 , eprint=

2024

[59] [68]

Llama Guard:

Hakan Inan and Kartikeya Upasani and Jianfeng Chi and others , year=. Llama Guard:. 2312.06674v2 , archivePrefix=

Pith/arXiv arXiv

[60] [69]

2024 , eprint=

Llama Guard 3 Vision: Safeguarding Human-AI Image Understanding Conversations , author=. 2024 , eprint=

2024

[61] [70]

2025 , eprint=

Qwen3Guard Technical Report , author=. 2025 , eprint=

2025

[62] [71]

2024 , note=

Seungju Han and Kavel Rao and Allyson Ettinger and others , booktitle=. 2024 , note=

2024

[63] [72]

2025 , publisher =

Ghosh, Shaona and Varshney, Prasoon and Gaur, Anu and others , booktitle =. 2025 , publisher =

2025

[64] [73]

2024 , publisher=

Mantas Mazeika and Long Phan and Xuwang Yin and others , booktitle=. 2024 , publisher=

2024

[65] [74]

2404.01318v2 , archivePrefix=

Patrick Chao and Edoardo Debenedetti and Alexander Robey and others , year=. 2404.01318v2 , archivePrefix=

Pith/arXiv arXiv

[66] [75]

2311.08370v2 , archivePrefix=

Bertie Vidgen and Nino Scherrer and Hannah Rose Kirk and others , year=. 2311.08370v2 , archivePrefix=

arXiv

[67] [76]

Position:

Yue Huang and Lichao Sun and Haoran Wang and others , booktitle=. Position:. 2024 , publisher=

2024

[68] [77]

2024 , publisher=

Lijun Li and Bowen Dong and Ruohui Wang and others , booktitle=. 2024 , publisher=

2024

[69] [78]

2023 , publisher =

Lin, Zi and Zeng, Zihan and Gao, Zi and others , booktitle =. 2023 , publisher =

2023

[70] [79]

2022 , eprint=

A Holistic Approach to Undesired Content Detection in the Real World , author=. 2022 , eprint=

2022

[71] [80]

2404.05993v2 , archivePrefix=

Shaona Ghosh and Prasoon Varshney and Erick Galinkin and Christopher Parisien , year=. 2404.05993v2 , archivePrefix=

arXiv

[72] [81]

2024 , note=

Jiang, Liwei and Rao, Kavel and Han, Seungju and Ettinger, Allyson and Brahman, Faeze and Kumar, Sachin and Mireshghallah, Niloofar and Lu, Ximing and Sap, Maarten and Choi, Yejin and Dziri, Nouha , booktitle=. 2024 , note=

2024

[73] [82]

2025 , url=

Cui, Justin and Chiang, Wei-Lin and Stoica, Ion and Hsieh, Cho-Jui , booktitle=. 2025 , url=

2025

[74] [83]

Advances in Neural Information Processing Systems , volume=

Refusal in Language Models is Mediated by a Single Direction , author=. Advances in Neural Information Processing Systems , volume=. 2024 , note=

2024

[75] [84]

Forty-second International Conference on Machine Learning , series=

The Geometry of Refusal in Large Language Models: Concept Cones and Representational Independence , author=. Forty-second International Conference on Machine Learning , series=. 2025 , publisher=

2025

[76] [85]

Findings of the Association for Computational Linguistics: EMNLP 2025 , year =

Understanding Refusal in Language Models with Sparse Autoencoders , author =. Findings of the Association for Computational Linguistics: EMNLP 2025 , year =

2025

[77] [86]

Safety Layers in Aligned Large Language Models: The Key to

Shen Li and Liuyi Yao and Lan Zhang and Yaliang Li , year=. Safety Layers in Aligned Large Language Models: The Key to. 2408.17003v1 , archivePrefix=

arXiv

[78] [87]

Understanding and Enhancing Safety Mechanisms of

Zhao, Yiran and Zhang, Wenxuan and Xie, Yuxi and Goyal, Anirudh and Kawaguchi, Kenji and Shieh, Michael Qizhe , booktitle =. Understanding and Enhancing Safety Mechanisms of. 2025 , url =

2025

[79] [88]

2025 , eprint=

On the Role of Attention Heads in Large Language Model Safety , author=. 2025 , eprint=

2025

[80] [89]

Transformer Circuits Thread , year=

A Mathematical Framework for Transformer Circuits , author=. Transformer Circuits Thread , year=