Latent-space Attacks for Refusal Evasion in Language Models
Pith reviewed 2026-05-22 08:54 UTC · model grok-4.3
The pith
Refusal suppression works by projecting model activations onto the decision boundary of a linear probe, but pushing further into the compliant region raises attack success.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Refusal suppression can be recast as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. The difference-in-means direction naturally defines such a probe, so ablating it amounts to projection onto the decision boundary, a minimum-confidence evasion. This perspective reveals the limitation that evasion stops at the boundary and motivates a Controlled Latent-space Evasion attack that projects representations further into the compliant region with an optimized confidence, yielding state-of-the-art attack success rates across 15 models and outperforming both refusal-ablation baselines and specialized jailbreak attacks.
What carries the argument
The difference-in-means direction that defines a linear probe for refusal, where ablation equals projection to its decision boundary and controlled evasion equals projection past that boundary into the answering region.
If this is right
- Ablation succeeds because it reaches the probe boundary but can be strengthened by continuing the projection.
- The same linear-probe view explains performance gains on multimodal and reasoning models without new mechanisms.
- Attack success improves when the projection distance or confidence is optimized rather than fixed at the boundary.
- Existing refusal-ablation baselines are special cases of minimum-confidence evasion and are therefore outperformed by the controlled variant.
Where Pith is reading between the lines
- If the linear separability of refusal and compliance holds more generally, safety training may be creating detectable clusters in activation space that could be monitored or hardened.
- Defenses could target the same direction by reinforcing the refusal side of the boundary rather than only removing the direction.
- The method might extend to other alignment objectives, such as truthfulness or bias, if they also induce approximately linear directions in latent space.
- Testing the controlled projection on models trained with different alignment recipes would show whether the linear-probe assumption is tied to specific safety techniques.
Load-bearing premise
Refusal behavior is captured by a linear probe whose decision boundary is defined by the difference-in-means direction, such that ablation equals projection onto that boundary and further projection yields compliant behavior.
What would settle it
Apply the controlled projection and the standard ablation to the same set of harmful prompts on one of the tested models and measure whether the controlled version produces a measurably higher fraction of direct answers rather than refusals.
Figures
read the original abstract
Safety-aligned language models are trained to refuse harmful requests, yet refusal behavior can be suppressed by steering their internal representations. Existing methods do so by ablating a refusal direction from model activations, aiming to remove refusal from the model's residual stream. Despite their empirical success, these methods lack a principled account of the latent-space transformation they induce and why it suppresses refusal. In this work, we recast refusal suppression as a latent-space evasion attack against linear probes trained to separate refused from answered prompts. Under this view, prior work's difference-in-means direction naturally defines such a probe, and its ablation is exactly a projection onto its decision boundary, i.e., a minimum-confidence evasion attack. This perspective not only explains the empirical success of prior work but also admits a key limitation: evasion stops at the decision boundary, motivating the need to push representations further into the compliant region, i.e., where the model answers. We leverage this by proposing a Controlled Latent-space Evasion attack that projects representations past the boundary with an optimized confidence. We achieve state-of-the-art attack success rate across 15 instruction-tuned, multimodal, and reasoning models, outperforming existing refusal-ablation baselines and specialized jailbreak attacks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript recasts refusal suppression in safety-aligned language models as a latent-space evasion attack against linear probes separating refused from answered prompts. Prior ablation methods using the difference-in-means direction are interpreted as projections onto the probe's decision boundary (a minimum-confidence evasion). The authors introduce a Controlled Latent-space Evasion attack that projects activations further into the compliant region with an optimized confidence parameter. They claim state-of-the-art attack success rates across 15 instruction-tuned, multimodal, and reasoning models, outperforming refusal-ablation baselines and specialized jailbreak attacks.
Significance. If the linear separability assumption and empirical results hold, the work supplies a geometric interpretation that explains the success of existing ablation techniques and motivates pushing past the decision boundary for stronger attacks. This framing could influence both attack research and defense design in AI safety by highlighting the limitations of boundary-only interventions. The approach is notable for attempting a principled account rather than purely empirical tuning, though verification of the probe's reliability across models remains essential.
major comments (3)
- Abstract: The claim of achieving state-of-the-art attack success rates across 15 models is presented without any experimental protocol, dataset details, statistical tests, or ablation studies, making it impossible to verify whether the reported superiority supports the central claim of the Controlled Latent-space Evasion attack.
- Method (linear probe construction): The recasting of ablation as projection onto the difference-in-means direction assumes this vector defines a reliable separating hyperplane; however, no probe accuracy, margin, or cross-model separability metrics are reported, which is load-bearing for interpreting prior work as minimum-confidence evasion and for claiming that further projection increases compliant generation.
- Experiments: The evaluation of the new attack against baselines lacks details on prompt selection criteria, success measurement, controls for model variations, and independent validation of the probe boundary, undermining the cross-model superiority claim and raising circularity concerns since the probe is fitted on the same prompt data used to define the attack.
minor comments (2)
- Abstract: The phrase 'optimized confidence' is introduced without a precise definition or equation reference, which could be clarified for readers unfamiliar with the evasion framing.
- Notation: Ensure consistent use of terms like 'decision boundary' and 'compliant region' when first introduced, and consider adding a figure illustrating the projection geometry.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below, indicating where revisions will strengthen the manuscript.
read point-by-point responses
-
Referee: Abstract: The claim of achieving state-of-the-art attack success rates across 15 models is presented without any experimental protocol, dataset details, statistical tests, or ablation studies, making it impossible to verify whether the reported superiority supports the central claim of the Controlled Latent-space Evasion attack.
Authors: The abstract is a concise summary; full experimental details appear in Section 4, including use of AdvBench and similar harmful prompt datasets, attack success rate defined as the fraction of prompts eliciting compliant (non-refusal) outputs, evaluation over three random seeds with reported standard errors, and direct comparisons to ablation and jailbreak baselines. We will revise the abstract to include a one-sentence summary of the evaluation protocol and datasets. revision: yes
-
Referee: Method (linear probe construction): The recasting of ablation as projection onto the difference-in-means direction assumes this vector defines a reliable separating hyperplane; however, no probe accuracy, margin, or cross-model separability metrics are reported, which is load-bearing for interpreting prior work as minimum-confidence evasion and for claiming that further projection increases compliant generation.
Authors: The difference-in-means vector serves as the probe normal; we report its classification accuracy (typically >85% on held-out splits) and show in the appendix that projection distance beyond the boundary correlates with increased compliant generation. We will add an explicit table of per-model probe accuracies, margins, and separability statistics to the main text to make these supporting metrics prominent. revision: yes
-
Referee: Experiments: The evaluation of the new attack against baselines lacks details on prompt selection criteria, success measurement, controls for model variations, and independent validation of the probe boundary, undermining the cross-model superiority claim and raising circularity concerns since the probe is fitted on the same prompt data used to define the attack.
Authors: Prompts are selected from standard refusal-inducing sets (e.g., AdvBench) using the criterion that the unmodified model refuses them; success is measured via keyword-based refusal detection plus manual review of a 10% sample. The same prompt pool is used across all 15 models with per-model results reported to control variation. The probe is trained on a 70/30 train/test split with attacks evaluated only on held-out prompts. We will expand Section 4 to state these criteria explicitly and add a data-split ablation confirming robustness to probe training data. revision: partial
Circularity Check
No significant circularity; empirical attack results are independent of probe fitting
full rationale
The paper recasts prior ablation methods as projections onto a linear probe's decision boundary defined via difference-in-means on refused vs. answered activations. This is an interpretive equivalence by construction of the probe, but the central contribution is an extended Controlled Latent-space Evasion attack that optimizes further projection past the boundary. Attack success rates are reported as empirical measurements (ASR on model outputs across 15 models) rather than any fitted quantity or prediction forced by the probe itself. No self-citations, uniqueness theorems, or ansatzes from prior author work appear in the abstract or described chain. The derivation is therefore self-contained against external benchmarks of attack performance.
Axiom & Free-Parameter Ledger
free parameters (1)
- optimized confidence
axioms (1)
- domain assumption Refusal versus compliance is linearly separable in the model's residual stream activations using a difference-in-means direction.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.