Attention Is Where You Attack
Pith reviewed 2026-05-09 19:49 UTC · model grok-4.3
The pith
Safety in aligned LLMs emerges from the attention routing performed by heads rather than being stored in them as removable components.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors demonstrate that zeroing out top-ranked safety heads produces at most one change among dozens of baseline refusals, because the residual stream compensates. In contrast, their attack redirects attention in the corresponding safety-heavy layers away from safety-relevant positions and flips many refusals: 72 of 200 prompts on Mistral-7B and 60 of 200 on LLaMA-3. This indicates that safety is not localized in the heads as removable parts but arises from the attention routing they perform; removing a head can be compensated for, whereas redirecting its attention propagates a corrupted signal downstream.
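To make the ablation side of this dissociation concrete, here is a minimal sketch (not the authors' code) of zero-ablating a single attention head in a LLaMA-style HuggingFace checkpoint by masking that head's slice of the o_proj input; the checkpoint name, layer, and head index are illustrative assumptions.

```python
# Minimal sketch of zero-ablating one attention head in a LLaMA-style model.
# The checkpoint, layer, and head index are illustrative, not the paper's.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="eager"
)  # eager attention so later sketches can read attention weights

layer, head = 13, 7                                  # hypothetical "safety head"
head_dim = model.config.hidden_size // model.config.num_attention_heads

def ablate_head(module, args):
    # o_proj receives the concatenated per-head attention outputs; zeroing
    # one head's slice removes its contribution to the residual stream.
    hidden = args[0].clone()
    hidden[..., head * head_dim:(head + 1) * head_dim] = 0.0
    return (hidden,)

handle = model.model.layers[layer].self_attn.o_proj.register_forward_pre_hook(ablate_head)
# ... run the refusal evaluation here; per the abstract, this kind of ablation
# flips at most one refusal because the residual stream compensates ...
handle.remove()
```

The redistribution attack described next operates on the same heads but changes where their attention goes instead of deleting their output.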
What carries the argument
Attention Redistribution Attack (ARA), which identifies safety-critical heads and optimizes nonsemantic tokens via Gumbel-softmax to redirect their softmax attention away from safety-relevant positions.
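A hedged sketch of what such an optimization loop could look like, reusing `model` and `tok` from the ablation sketch above: soft one-hot token choices sampled with Gumbel-softmax are mapped to embeddings, appended to the prompt, and updated to reduce the attention mass that a few targeted heads place on assumed safety-relevant positions. The (layer, head) pairs, positions, and hyperparameters are illustrative, not the paper's.

```python
# Hedged sketch of an ARA-style loop (not the authors' implementation).
import torch
import torch.nn.functional as F

model.requires_grad_(False)                      # only the token logits are optimized
k = 5                                            # number of adversarial tokens
target_heads = [(13, 7), (14, 2)]                # hypothetical (layer, head) pairs
prompt_ids = tok("harmful request goes here", return_tensors="pt").input_ids
safety_pos = torch.tensor([3, 4, 5])             # assumed safety-relevant positions

emb = model.get_input_embeddings()
prompt_emb = emb(prompt_ids)                     # (1, T, hidden)
adv_logits = torch.zeros(k, model.config.vocab_size, requires_grad=True)
opt = torch.optim.Adam([adv_logits], lr=0.1)

for step in range(500):
    # Differentiable soft one-hot token choices.
    soft_onehot = F.gumbel_softmax(adv_logits, tau=1.0, hard=False)        # (k, V)
    adv_emb = soft_onehot.to(emb.weight.dtype) @ emb.weight                # (k, hidden)
    inputs = torch.cat([prompt_emb, adv_emb.unsqueeze(0)], dim=1)
    out = model(inputs_embeds=inputs, output_attentions=True)
    # Minimize attention from the final position onto the safety positions in
    # the targeted heads; since each attention row is a softmax, the mass has
    # to move elsewhere on the simplex.
    loss = sum(out.attentions[l][0, h, -1, safety_pos].sum() for l, h in target_heads)
    opt.zero_grad()
    loss.backward()
    opt.step()

adv_token_ids = adv_logits.argmax(dim=-1)        # discrete suffix for the actual attack
```

The contrast with the ablation sketch is the point: the head still runs, but its attention is steered away from the positions that would otherwise carry the safety signal.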
If this is right
- The attack succeeds with as few as 5 tokens and 500 optimization steps.
- It achieves 36 percent attack success on Mistral-7B-Instruct and 30 percent on LLaMA-3-8B-Instruct against 200 HarmBench prompts.
- Gemma-2-9B-it remains largely resistant at 1 percent success.
- Ablation of the same heads flips at most one refusal out of 39 to 50 baseline cases.
- Safety emerges from the routing performed by the heads, not from their presence as isolated units.
Where Pith is reading between the lines
- Defenses may need to constrain attention distributions directly rather than rely on head presence (see the sketch after this list).
- The same redistribution approach could target other model behaviors whose mechanisms depend on specific attention patterns.
- Mechanistic analysis of alignment should prioritize attention dynamics over ablation-based head ranking.
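As a purely illustrative reading of the first point (nothing proposed in the paper), a defense could monitor how much attention designated heads route to system-prompt or other safety-relevant positions and flag inputs that suppress it; every head index, position set, and threshold below is hypothetical.

```python
# Illustrative attention-mass monitor; heads, positions, and threshold are
# hypothetical and would need calibration on benign traffic.
import torch

@torch.no_grad()
def attention_guard(model, input_ids, watched_heads, safety_pos, threshold=0.05):
    attns = model(input_ids, output_attentions=True).attentions   # per-layer (1, H, T, T)
    for layer, head in watched_heads:
        mass = attns[layer][0, head, -1, safety_pos].sum().item()
        if mass < threshold:
            return False   # suspiciously little attention routed to safety positions
    return True

# ok = attention_guard(model, prompt_ids, watched_heads=[(13, 7), (14, 2)],
#                      safety_pos=torch.tensor([0, 1, 2, 3]))
```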
Load-bearing premise
The ranking procedure selects heads whose attention redistribution is causally responsible for safety behavior rather than merely correlated with it.
What would settle it
An experiment in which attention is redirected in the top-ranked heads but refusal rates remain near baseline, or in which ablating those heads produces large drops in refusals.
Original abstract
Safety-aligned large language models rely on RLHF and instruction tuning to refuse harmful requests, yet the internal mechanisms implementing safety behavior remain poorly understood. We introduce the Attention Redistribution Attack (ARA), a white-box adversarial attack that identifies safety-critical attention heads and crafts nonsemantic adversarial tokens that redirect attention away from safety-relevant positions. Unlike prior jailbreak methods operating at the semantic or output-logit level, ARA targets the geometry of softmax attention on the probability simplex using Gumbel-softmax optimization over targeted heads. Across LLaMA-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Gemma-2-9B-it, ARA bypasses safety alignment with as few as 5 tokens and 500 optimization steps, achieving 36% ASR on Mistral-7B and 30% on LLaMA-3 against 200 HarmBench prompts, while Gemma-2 remains at 1%. Our principal mechanistic finding is a dissociation between ablation and redistribution: zeroing out the top-ranked safety heads produces at most 1 flip among 39 to 50 baseline refusals, while ARA targeting the corresponding safety-heavy layers flips 72/200 prompts on Mistral-7B and 60/200 on LLaMA-3. This suggests that safety is not localized in these heads as removable components, but emerges from the attention routing they perform. Removing a head allows compensation through the residual stream, while redirecting its attention propagates a corrupted signal downstream.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Attention Redistribution Attack (ARA), a white-box adversarial method that ranks safety-critical attention heads in aligned LLMs (LLaMA-3-8B-Instruct, Mistral-7B-Instruct-v0.1, Gemma-2-9B-it) and optimizes nonsemantic tokens via Gumbel-softmax to redirect attention away from safety-relevant positions. It reports attack success rates (ASR) of 36% on Mistral-7B and 30% on LLaMA-3 against 200 HarmBench prompts using as few as 5 tokens and 500 steps, while Gemma-2 remains at 1%. The central mechanistic claim is a dissociation: ablating top-ranked heads flips at most 1 of 39-50 baseline refusals, whereas ARA on the corresponding layers flips 72/200 (Mistral) and 60/200 (LLaMA) prompts, implying safety emerges from attention routing rather than localized head identity.
Significance. If substantiated, the ablation-redistribution dissociation offers a useful empirical probe into whether safety is implemented via removable components or dynamic routing, with potential implications for mechanistic interpretability and alignment. The efficiency of the attack (few tokens, targeted at attention geometry) is a concrete contribution over semantic-level jailbreaks. Credit is due for the explicit comparison of ablation versus redistribution as a test of localization, though the overall significance remains provisional pending validation of the head-ranking procedure and controls.
major comments (3)
- [Abstract] The head-ranking procedure used to identify 'safety-critical' heads is not described. Without explicit criteria (e.g., differential attention to harmful tokens, gradient importance, or a pre-specified metric), it is impossible to determine whether the selected heads are causally necessary for safety routing or merely correlated with refusal behavior; this directly affects the interpretation of the ablation-redistribution dissociation.
- [Abstract] No error bars, confidence intervals, statistical tests, or verification that the ranking is not post hoc are provided for the reported ASR values (36% on Mistral-7B, 30% on LLaMA-3) or the flip counts (72/200, 60/200). This undermines the reliability of the central empirical claim that redistribution succeeds where ablation fails.
- [Abstract] The dissociation is offered as evidence that safety 'emerges from the attention routing they perform' rather than from head identity, but the text provides no controls such as attacking bottom-ranked heads, measuring post-ARA attention maps, or confirming head-specific effects. Without these, the results are compatible with the alternative that any conflicting signal injected into the residual stream disrupts refusal, independent of the ranking.
minor comments (2)
- [Abstract] The abstract states results 'across LLaMA-3-8B-Instruct, Mistral-7B-Instruct-v0.1, and Gemma-2-9B-it' but reports detailed ASR only for two models; a table or explicit per-model breakdown would improve clarity.
- [Abstract] The phrase 'zeroing out the top-ranked safety heads produces at most 1 flip among 39 to 50 baseline refusals' is imprecise; specifying the exact number of baseline refusals per model and the precise ablation method (e.g., zeroing vs. mean ablation) would aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major concerns point by point below and have made revisions to improve the clarity and rigor of the presentation.
Point-by-point responses
Referee: [Abstract] The head-ranking procedure used to identify 'safety-critical' heads is not described. Without explicit criteria (e.g., differential attention to harmful tokens, gradient importance, or a pre-specified metric), it is impossible to determine whether the selected heads are causally necessary for safety routing or merely correlated with refusal behavior; this directly affects the interpretation of the ablation-redistribution dissociation.
Authors: The head-ranking procedure is described in detail in Section 3 of the manuscript, where we rank attention heads according to the differential attention they allocate to safety-relevant tokens in harmful prompts compared to benign ones. We will add a short summary of this ranking criterion to the abstract in the revised version to ensure readers can evaluate the procedure without referring to the main text. revision: yes
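A hedged reconstruction of the criterion this response describes, not the paper's Section 3 code: score each head by how much more attention it gives to assumed safety-relevant positions on harmful prompts than on benign ones, then take the top-scoring heads. The helper names and the `safety_spans` mapping are illustrative.

```python
# Sketch of a differential-attention head ranking; details are assumptions.
import torch

@torch.no_grad()
def head_scores(model, tok, harmful, benign, safety_spans):
    """Return a (layers, heads) tensor of differential attention scores."""
    def mean_attn(prompts):
        total = None
        for p in prompts:
            ids = tok(p, return_tensors="pt").input_ids
            attns = model(ids, output_attentions=True).attentions   # per-layer (1, H, T, T)
            pos = safety_spans[p]                                    # assumed safety-relevant positions
            per_layer = torch.stack([a[0, :, -1, pos].sum(dim=-1) for a in attns])  # (L, H)
            total = per_layer if total is None else total + per_layer
        return total / len(prompts)
    return mean_attn(harmful) - mean_attn(benign)

# scores = head_scores(model, tok, harmful_prompts, benign_prompts, spans)
# top_heads = torch.topk(scores.flatten(), k=10).indices
```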
Referee: [Abstract] No error bars, confidence intervals, statistical tests, or verification that the ranking is not post hoc are provided for the reported ASR values (36% on Mistral-7B, 30% on LLaMA-3) or the flip counts (72/200, 60/200). This undermines the reliability of the central empirical claim that redistribution succeeds where ablation fails.
Authors: We agree that the absence of error bars and statistical validation is a limitation. In the revised manuscript, we include error bars computed over multiple random seeds for the optimization process and report the results of a statistical test comparing the ASR under ARA versus ablation. We also confirm the ranking stability by applying it to held-out prompt sets, showing consistent head selection. revision: yes
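For illustration, the kind of check this response commits to might look like the sketch below, using the flip counts reported in the abstract; treating the 200 ARA prompts and the 39-50 ablation baselines as independent binomial samples is an assumption, and the exact test the authors run may differ.

```python
# Sketch: Wilson interval for the ARA flip rate and a one-sided Fisher test
# of ARA flips versus ablation flips. Counts come from the abstract.
from scipy.stats import fisher_exact
from statsmodels.stats.proportion import proportion_confint

ara_flips, ara_total = 72, 200     # Mistral-7B under ARA
abl_flips, abl_total = 1, 50       # at most 1 flip among up to 50 baseline refusals

lo, hi = proportion_confint(ara_flips, ara_total, alpha=0.05, method="wilson")
print(f"ARA flip rate: {ara_flips / ara_total:.2f} (95% CI {lo:.2f}-{hi:.2f})")

table = [[ara_flips, ara_total - ara_flips],
         [abl_flips, abl_total - abl_flips]]
_, p = fisher_exact(table, alternative="greater")
print(f"one-sided Fisher exact test, ARA > ablation: p = {p:.2g}")
```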
Referee: [Abstract] The dissociation is offered as evidence that safety 'emerges from the attention routing they perform' rather than from head identity, but the text provides no controls such as attacking bottom-ranked heads, measuring post-ARA attention maps, or confirming head-specific effects. Without these, the results are compatible with the alternative that any conflicting signal injected into the residual stream disrupts refusal, independent of the ranking.
Authors: This is a valid concern regarding potential alternative explanations. To address it, we have added experiments in the revision in which we apply ARA to bottom-ranked heads, resulting in substantially lower attack success rates (approximately 5%), and we include attention map visualizations showing the specific redirection effect on safety positions for the top-ranked heads. These additions support the specificity of the ranking. revision: yes
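A minimal sketch of the attention-map comparison described here, reusing `model`, `tok`, `prompt_ids`, `safety_pos`, and `adv_token_ids` from the optimization sketch above; the (layer, head) pair is illustrative.

```python
# Compare attention mass on the assumed safety-relevant positions with and
# without the optimized adversarial suffix, for one hypothetical target head.
import torch

@torch.no_grad()
def safety_attention(input_ids, layer, head, safety_pos):
    attns = model(input_ids, output_attentions=True).attentions
    return attns[layer][0, head, -1, safety_pos].sum().item()

adv_input = torch.cat([prompt_ids, adv_token_ids.unsqueeze(0)], dim=1)
before = safety_attention(prompt_ids, 13, 7, safety_pos)
after = safety_attention(adv_input, 13, 7, safety_pos)
print(f"attention on safety positions: {before:.3f} -> {after:.3f}")
```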
Circularity Check
No circularity: empirical attack and dissociation results are not derived by construction
full rationale
The paper presents an empirical white-box attack (ARA) that optimizes nonsemantic tokens via Gumbel-softmax to redirect attention in heads identified as safety-critical, then reports measured attack success rates and a dissociation between ablation (near-zero effect) and redistribution (high ASR). No equations, derivations, or first-principles predictions appear in the abstract or described claims. Head ranking is an empirical preprocessing step whose output is used to select targets for the attack; the central mechanistic interpretation (safety emerges from routing rather than head identity) is offered as an inference from the experimental contrast, not as a quantity forced by the ranking procedure itself or by any self-citation chain. The method therefore remains self-contained against external benchmarks and does not reduce any claimed result to its own inputs by definition.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of adversarial tokens
- optimization steps
axioms (1)
- domain assumption: Safety behavior is implemented at least partly through specific attention heads whose routing can be isolated and attacked.
Reference graph
Works this paper leans on
- [1] Gabriel Alon and Michael Kamfonas. Detecting Language Model Attacks with Perplexity. arXiv preprint arXiv:2308.14132, 2023.
- [2] Andy Arditi, Oscar Obeso, Buck Shlegeris, et al. Refusal in Language Models Is Mediated by a Single Direction. arXiv preprint arXiv:2406.11717, 2024.
- [3] Yuntao Bai, Andy Jones, Kamal Ndousse, et al. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv preprint arXiv:2204.05862, 2022.
- [4] Patrick Chao, Alexander Robey, Edgar Dobriban, et al. Jailbreaking Black Box Large Language Models in Twenty Queries. arXiv preprint arXiv:2310.08419, 2023.
- [5] Guanhua Ding, Jian Chen, Yu Li, et al. A Wolf in Sheep's Clothing: Generalized Nested Jailbreak Prompts Can Fool Large Language Models Easily. arXiv preprint arXiv:2311.08268, 2023.
- [6] Jonas Geiping, Alex Stein, Manli Shu, et al. Coercing LLMs to Do and Reveal (Almost) Anything. arXiv preprint arXiv:2402.14020, 2024.
- [7] Gemma Team. Gemma: Open Models Based on Gemini Research and Technology. arXiv preprint arXiv:2403.08295, 2024.
- [8] Neel Jain, Avi Schwarzschild, Yuxin Wen, et al. Baseline Defenses for Adversarial Attacks Against Aligned Language Models. arXiv preprint arXiv:2309.00614, 2023.
- [9] Jingwei Jia et al. Improved Techniques for Optimization-Based Jailbreaking on Large Language Models. arXiv preprint arXiv:2405.21018, 2024.
- [10] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
- [11] Aounon Kumar, Chirag Agarwal, and Himabindu Lakkaraju. Certifying LLM Safety Against Adversarial Prompting. arXiv preprint arXiv:2309.02705, 2023.
- [12] Zeyi Liao et al. AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs. arXiv preprint arXiv:2404.07921, 2024.
- [13] Alexander Robey, Eric Wong, Hamed Hassani, and George J. Pappas. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks. arXiv preprint arXiv:2310.03684, 2023.
- [14] Hugo Touvron, Louis Martin, Kevin Stone, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
- [15] Zheng Xin Yong et al. Low-Resource Languages Jailbreak GPT-4. arXiv preprint arXiv:2310.02446, 2023.
- [16] Youliang Yuan et al. GPT-4 Is Too Smart to Be Safe: Stealthy Chat with LLMs via Cipher. arXiv preprint arXiv:2308.06463, 2023.
- [17] Andy Zou, Long Phan, Sarah Chen, et al. Representation Engineering: A Top-Down Approach to AI Transparency. arXiv preprint arXiv:2310.01405, 2023.