arxiv: 2606.05958 · v1 · pith:OP6TCEI4new · submitted 2026-06-04 · 💻 cs.LG

Steering Vectors are an Adversarial Attack Surface

Pith reviewed 2026-06-28 03:11 UTC · model grok-4.3

classification 💻 cs.LG
keywords activation steering · data poisoning · jailbreak · LLM safety · adversarial attack · steering vectors

The pith

Substituting 4-6% of tokens in a steering dataset can align the resulting vector with an anti-refusal direction, enabling jailbreaks while the intended steering on benign prompts stays intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that activation steering is open to a stealth poisoning attack that needs only small changes to the steering dataset. By replacing 4 to 6 percent of the tokens, an attacker can make the resulting vector push the model away from refusing. When applied, this vector makes the model answer harmful questions at much higher rates than a clean vector would, yet it still produces the expected steering behavior on normal, safe prompts. The finding holds across two families of open models and eight model-attribute combinations.

Core claim

By substituting 4-6% of tokens in the steering dataset, an attacker can silently align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. The attack is tested on two open-weight model families and eight model-attribute combinations, with poisoned vectors reaching an absolute attack success rate of 20-55%, an increase of 19% to 51% over clean references. A refusal-direction orthogonalization defense recovers approximately 82% of the ASR gap without harming benign behavior.
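The headline numbers can be made concrete with a little arithmetic. The sketch below is illustrative only: the three ASR values are hypothetical, reading the "+19% to +51%" lift as percentage points is an assumption, and so is the formula for "share of the ASR gap recovered", since this summary does not state how the paper defines it.

```python
def attack_success_rate(judgments):
    """Fraction of harmful prompts the model complied with (True = jailbreak)."""
    return sum(judgments) / len(judgments)


def gap_recovered(asr_clean, asr_poisoned, asr_defended):
    """Assumed definition: share of the poisoned-minus-clean ASR gap closed by a defense."""
    return (asr_poisoned - asr_defended) / (asr_poisoned - asr_clean)


# Hypothetical values, chosen only to fall inside the reported ranges.
asr_clean = 0.02      # clean reference vector
asr_poisoned = 0.40   # poisoned vector (reported range: 20-55%)
asr_defended = 0.09   # poisoned vector after a defense

lift_points = 100 * (asr_poisoned - asr_clean)
recovered = gap_recovered(asr_clean, asr_poisoned, asr_defended)

print(f"ASR lift: {lift_points:.0f} percentage points")
print(f"Gap recovered: {recovered:.1%}")
```

Under these made-up inputs the lift is 38 points and the defense closes roughly 82% of the gap, which shows how a defended ASR can stay above the clean reference while most of the lift is removed.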

What carries the argument

The steering vector computed from the poisoned dataset: low-rate token substitution gives it an anti-refusal component while leaving the intended attribute direction in place.
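A toy numerical sketch may help picture this. It assumes the mean-difference construction named in the Figure 1 caption and uses synthetic activations; the dimensions, the shift sizes, and the names `attribute_dir` and `refusal_dir_r` are hypothetical and are not taken from the paper.

```python
import numpy as np


def mean_difference_vector(pos_acts, neg_acts):
    """Mean activation of positive examples minus mean activation of negative examples."""
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
d = 64
attribute_dir = np.eye(d)[0]   # hypothetical benign attribute direction
refusal_dir_r = np.eye(d)[1]   # hypothetical anti-refusal direction r

# Synthetic activations: negatives are noise, clean positives shift along the
# attribute only, poisoned positives additionally drift along r.
neg = rng.normal(size=(200, d))
pos_clean = neg + 2.0 * attribute_dir
pos_poisoned = pos_clean + 0.8 * refusal_dir_r

v_clean = mean_difference_vector(pos_clean, neg)
v_poisoned = mean_difference_vector(pos_poisoned, neg)

print("cos(v_clean, r)            =", round(cosine(v_clean, refusal_dir_r), 3))
print("cos(v_poisoned, r)         =", round(cosine(v_poisoned, refusal_dir_r), 3))
print("cos(v_poisoned, attribute) =", round(cosine(v_poisoned, attribute_dir), 3))
```

In this toy setup the poisoned vector stays close to the attribute direction while picking up a clear component along r, which is the geometric picture the summary describes.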

Load-bearing premise

The low-rate token substitution leaves the primary steering direction sufficiently intact to preserve intended behavior on benign prompts across the tested model-attribute pairs.

What would settle it

Test whether poisoned vectors computed from datasets with 4-6% of tokens substituted produce 19-51% higher attack success rates on harmful prompts than clean vectors, while benign steering performance remains equivalent.

Figures

Figures reproduced from arXiv: 2606.05958 by Abzal Aidakhmetov, Adrian Robert Minut, Donato Crisostomi, Emanuele Rodolà, Iacopo Masi, Tommaso Mencattini.

Figure 1: Stealth steering-vector poisoning at a glance. Top. A clean contrastive pair (x+, x−) targeting a benign attribute (here, bullet-point formatting) induces a mean-difference steering vector v that makes a large angle θ with the anti-refusal direction r, so adding v to hidden states changes formatting without disturbing safety behavior. Bottom. Embedding-constrained synonym swaps yield perceptually near-i…
Figure 2: Poisoning contrastive steering datasets sharply increases jailbreak success while preserving declared…
Figure 3: Poisoning rotates steering vectors toward…
Figure 4: Refusal-direction orthogonalisation neutralises most of the poisoned-vector jailbreak lift while preserving…
Figure 5: The attack is, with one exception, not a norm-inflation artefact. The poisoned-to-clean…
Figure 6: Adaptive-attacker sweep against a cosine-threshold defender on Llama-3.1-8B…

Original abstract

Activation steering has become a popular way to control Large Language Model (LLM) behavior without fine-tuning. Since the technique is plug-and-play, users share datasets and precomputed vectors to steer model activations. However, we show that a stealth data poisoning attack silently compromises this pipeline. By substituting 4-6% of tokens in the steering dataset, an attacker can silently align the resulting vector with an anti-refusal direction. This jailbreaks the target model while preserving the intended steering effect on benign prompts. Under this threat model, a malicious actor can distribute an apparently safe bundle containing texts, vectors, and weights, alongside an equivalence certificate that the end-user can verify. We test the attack on two open-weight model families and eight model-attribute combinations, observing that poisoned vectors reach an absolute attack success rate (ASR) of 20-55%, +19% to +51% over a clean reference. Finally, we find that a refusal-direction orthogonalization defense can recover ≈82% of the ASR gap without harming benign behavior.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper demonstrates a stealth data poisoning attack on activation steering vectors for LLMs. By replacing 4-6% of tokens in a steering dataset, an attacker can produce a vector that aligns with an anti-refusal direction, increasing jailbreak attack success rate (ASR) by 19-51% (absolute ASR 20-55%) across two model families and eight model-attribute pairs, while claiming to preserve the original steering effect on benign prompts. The work also evaluates a refusal-direction orthogonalization defense that recovers approximately 82% of the ASR gap.

Significance. If the preservation of benign steering behavior is robustly shown, the result identifies a practical attack surface in the increasingly common practice of sharing and applying precomputed steering vectors, with direct implications for LLM safety pipelines that rely on activation engineering. The multi-model empirical evaluation and the proposed defense are concrete contributions that could inform future work on verifiable steering artifacts.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (results): the central claim that poisoned vectors 'preserve the intended steering effect on benign prompts' is load-bearing for the stealth property, yet no quantitative comparison (e.g., steering success rate, activation cosine similarity, or KL divergence on benign prompts) between clean and poisoned vectors is reported. Without this metric the +19% to +51% ASR gain cannot be assessed as stealthy rather than merely effective.
  2. [§3] §3 (attack construction): the 4-6% token substitution procedure is described at a high level, but the paper does not specify how the substituted tokens are chosen (random, refusal-correlated, or attribute-correlated) or provide an ablation showing that the primary attribute direction remains dominant after poisoning. This choice directly affects whether the primary direction is preserved or rotated.
  3. [§5] §5 (defense): the refusal-direction orthogonalization recovers ~82% of the ASR gap, but the manuscript does not report the effect of this defense on the original steering task performance or on other unrelated behaviors, leaving open whether the defense trades off utility for safety.
minor comments (2)
  1. [§4] The experimental protocol (dataset sizes, exact token substitution method, number of runs, statistical significance tests) is referenced but not fully detailed in the main text; moving the full protocol to the appendix or a reproducibility section would strengthen the work.
  2. [§2] Notation for the difference vector and the anti-refusal direction should be introduced once and used consistently; occasional shifts between 'steering vector' and 'difference vector' reduce clarity.
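Of the benign-prompt metrics named in major comment 1, KL divergence is the least self-explanatory, so a minimal sketch follows. It assumes the comparison is made between next-token distributions of the model steered with the clean vector and with the poisoned vector on the same benign prompts; the logits here are synthetic, and nothing in the summary says the paper computes the metric this way.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def mean_kl(logits_p, logits_q):
    """Mean KL(P || Q) across prompts; each row holds next-token logits for one prompt."""
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return float((np.exp(log_p) * (log_p - log_q)).sum(axis=-1).mean())


rng = np.random.default_rng(0)
# Synthetic stand-ins: 5 benign prompts, vocabulary of 10 tokens.
logits_clean = rng.normal(size=(5, 10))
logits_poisoned = logits_clean + 0.05 * rng.normal(size=(5, 10))

print("KL(clean || clean)    =", mean_kl(logits_clean, logits_clean))
print("KL(clean || poisoned) =", mean_kl(logits_clean, logits_poisoned))
```

A value near zero would support the stealth claim for those prompts; what counts as "near zero" depends on the model and would need a baseline such as the KL between two clean vectors from resampled data.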

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for strengthening the presentation of our results on stealth and defense evaluation. We address each major comment below and will incorporate revisions to provide the requested quantitative evidence and clarifications.

  1. Referee: [Abstract and §4] Abstract and §4 (results): the central claim that poisoned vectors 'preserve the intended steering effect on benign prompts' is load-bearing for the stealth property, yet no quantitative comparison (e.g., steering success rate, activation cosine similarity, or KL divergence on benign prompts) between clean and poisoned vectors is reported. Without this metric the +19% to +51% ASR gain cannot be assessed as stealthy rather than merely effective.

    Authors: We agree that explicit quantitative metrics are necessary to substantiate the preservation claim and enable assessment of stealth. In the revised manuscript, we will add a new table in §4 reporting steering success rates on benign prompts, activation cosine similarities between clean and poisoned vectors, and KL divergence on held-out benign data for all eight model-attribute pairs. This will directly quantify any degradation in the intended steering behavior. revision: yes

  2. Referee: [§3] §3 (attack construction): the 4-6% token substitution procedure is described at a high level, but the paper does not specify how the substituted tokens are chosen (random, refusal-correlated, or attribute-correlated) or provide an ablation showing that the primary attribute direction remains dominant after poisoning. This choice directly affects whether the primary direction is preserved or rotated.

    Authors: We will expand §3 to detail the token substitution method: tokens were selected from a refusal-related vocabulary while preserving local semantic context via embedding similarity. We will also add an ablation subsection showing that the primary attribute direction remains dominant, including cosine similarities between clean and poisoned steering vectors and their projections onto the target attribute direction. revision: yes

  3. Referee: [§5] §5 (defense): the refusal-direction orthogonalization recovers ~82% of the ASR gap, but the manuscript does not report the effect of this defense on the original steering task performance or on other unrelated behaviors, leaving open whether the defense trades off utility for safety.

    Authors: We concur that the defense evaluation should include its impact on utility. In the revision of §5, we will report steering success rates on the original benign attributes before and after orthogonalization, along with effects on unrelated behaviors such as general next-token prediction perplexity and performance on a set of neutral prompts. revision: yes
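The defense discussed in the third exchange can be sketched as a single projection. This assumes "refusal-direction orthogonalization" means removing the component of the steering vector along a known direction r, that is, v' = v − (v·r̂) r̂; how the paper estimates r and at which layer it applies the projection are not stated in this summary.

```python
import numpy as np


def orthogonalize(v, r):
    """Remove from v its component along r (r need not be unit length)."""
    r_hat = r / np.linalg.norm(r)
    return v - (v @ r_hat) * r_hat


# Toy 3-d example with hypothetical directions.
attribute = np.array([1.0, 0.0, 0.0])
r = np.array([0.0, 2.0, 0.0])                 # anti-refusal direction, unnormalized
v_poisoned = 2.0 * attribute + 0.8 * r / np.linalg.norm(r)

v_defended = orthogonalize(v_poisoned, r)

print("before:", v_poisoned, " v.r =", v_poisoned @ r)
print("after: ", v_defended, " v.r =", v_defended @ r)
```

In the toy case the attribute component survives untouched because it is orthogonal to r. When the attribute direction and r overlap, the projection also removes part of the intended steering, which is the utility question the referee raises.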

Circularity Check

0 steps flagged

Empirical attack demonstration contains no derivation chain or self-referential reductions

The paper reports experimental results from token-substitution attacks on steering datasets across two model families and eight attribute pairs. No equations, uniqueness theorems, ansatzes, or first-principles derivations are invoked; the central claim (poisoned vectors achieve 20-55% ASR while preserving benign steering) is evaluated directly via measured attack success rates and is not obtained by fitting parameters that are then renamed as predictions. Self-citations, if present, are not load-bearing for any mathematical step. This matches the default non-circular case for purely empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical security evaluation; it relies on the standard domain assumption that activation steering vectors can be computed from text datasets and added to model activations.

axioms (1)
  • domain assumption Activation steering vectors computed from text datasets can be added to model activations to control specific behaviors.
    This is the core premise of the steering technique that the attack targets.
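As a minimal illustration of this assumption, steering is often written as h' = h + αv. The sketch below uses NumPy arrays in place of real model activations; the scale `alpha` and the choice to add the vector at every token position are assumptions for illustration, not details from the paper.

```python
import numpy as np


def apply_steering(hidden, v, alpha=1.0):
    """Add alpha * v to every position of a (seq_len, d_model) activation matrix."""
    return hidden + alpha * v


hidden = np.zeros((3, 4))              # stand-in for one layer's hidden states
v = np.array([1.0, 0.0, -1.0, 0.0])    # stand-in steering vector

steered = apply_steering(hidden, v, alpha=2.0)
print(steered)
```

Because the operation is a plain addition, whoever applies a shared vector inherits every component it contains, including any that the dataset author did not declare.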

pith-pipeline@v0.9.1-grok · 5743 in / 1123 out tokens · 48839 ms · 2026-06-28T03:11:46.647557+00:00 · methodology

Reference graph

Works this paper leans on

13 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    In Advances in Neural Information Processing Systems, volume 37

    Refusal in language models is mediated by a single direction. In Advances in Neural Information Processing Systems, volume 37. Nicholas Carlini, Matthew Jagielski, Christopher A. Choquette-Choo, Daniel Paleka, Will Pearce, Hyrum Anderson, Andreas Terzis, Kurt Thomas, and Florian Tramèr. 2024. Poisoning web-scale training datasets is practical. In IEEE Sym...

  2. [2]

    Gemma 2: Improving Open Language Models at a Practical Size

    Jailbreaking black box large language models in twenty queries. In IEEE Conference on Secure and Trustworthy Machine Learning (SaTML). Gemma Team. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118. Laura Hanu and Unitary team. 2020. Detoxify. GitHub. https://github.com/unitaryai/detoxify. Evan Hubinger, Car...

  3. [3]

    FastText.zip: Compressing text classification models

    FastText.zip: Compressing text classification models. arXiv preprint arXiv:1612.03651. Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers. Kenneth Li, ...

  4. [4]

    Representation Engineering: A Top-Down Approach to AI Transparency

    Jailbroken: How does LLM safety training fail? In Advances in Neural Information Processing Systems, volume 36. Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Da...

  5. [5]

    Start from the NLTK English word corpus (∼234K words)

  6. [6]

    Remove words flagged as toxic by Detoxify (Hanu and Unitary team, 2020)

  7. [7]

    Remove words flagged by Llama-3.3-70B-Instruct screening

  8. [8]

    almost perfect

    Map surviving words to the model’s tokenizer vocabulary, keeping only single-token entries that begin with a space character (i.e., subword-complete tokens). This yields approximately 36K safe tokens for Gemma-2 and 14.6K for the Llama tokenizer. Attack pseudocode. Algorithm 1 summarises the complete attack pipeline described in Section 3.4. Steering we...

  9. [9]

    it differs from its original text in at most n_mod modifiable token positions

  10. [10]

    every replacement token belongs to the safe vocabulary V_safe

  11. [11]

    every replacement of an original token t belongs to its precomputed neighbor set N(t)

  12. [12]

    protected suffix tokens are unchanged

  13. [13]

    We denote the set of all feasible poisoned datasets by D

    the modified text satisfies the perplexity cap, PPL(x̃) ≤ τ. We denote the set of all feasible poisoned datasets by D. The feasible set D is the mathematical representation of the stealth constraint. It contains exactly the datasets that the attacker is allowed to output. Because the original dataset contains finitely many token positions, each position ha...
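The five feasibility conditions quoted in items 9 to 13 can be read as a per-text checker. The sketch below is one such reading, not the paper's Algorithm 1: the function name, the argument names, the integer token IDs, the equal-length requirement, and the stubbed perplexity function are all hypothetical.

```python
def is_feasible(original, modified, n_mod, safe_vocab, neighbors,
                suffix_len, perplexity, tau):
    """Check one poisoned text against the five quoted stealth constraints."""
    if len(original) != len(modified):
        return False  # assumption: pure substitution, so lengths must match
    changed = [i for i, (a, b) in enumerate(zip(original, modified)) if a != b]
    if len(changed) > n_mod:                       # at most n_mod positions differ
        return False
    protected_start = len(original) - suffix_len
    for i in changed:
        if i >= protected_start:                   # protected suffix is unchanged
            return False
        if modified[i] not in safe_vocab:          # replacement is in V_safe
            return False
        if modified[i] not in neighbors.get(original[i], set()):  # and in N(t)
            return False
    return perplexity(modified) <= tau             # perplexity cap PPL <= tau


# Toy token IDs; a real check would use a tokenizer and a language model.
original = [10, 11, 12, 13, 14]
neighbors = {11: {21}, 12: {22, 23}}
safe_vocab = {21, 22}
fake_ppl = lambda tokens: 10.0                     # stub perplexity function
args = dict(n_mod=1, safe_vocab=safe_vocab, neighbors=neighbors,
            suffix_len=2, perplexity=fake_ppl, tau=50.0)

print(is_feasible(original, [10, 21, 12, 13, 14], **args))  # one allowed swap
print(is_feasible(original, [10, 21, 12, 13, 99], **args))  # touches the suffix
print(is_feasible(original, [10, 11, 23, 13, 14], **args))  # neighbor not in V_safe
```

Writing the constraints as a predicate makes clear that the feasible set D is finite and checkable: a defender holding the original dataset could run the same test, while one who only receives the poisoned dataset cannot.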