
arxiv: 2605.24535 · v2 · pith:XSJ3XHZKnew · submitted 2026-05-23 · 💻 cs.CR · cs.LG

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

Pith reviewed 2026-06-30 13:18 UTC · model grok-4.3

classification 💻 cs.CR cs.LG
keywords: jailbreak defense, adversarial training, activation steering, unsupervised latent discovery, LLM safety, zero-shot defense, refusal activation

The pith

A bi-level adversarial training method simulates jailbroken activations via unsupervised latent directions to defend LLMs against unseen jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a zero-shot defense framework that addresses the brittleness of supervised safety steering on evolving jailbreaks. It uses an inner loop to extrapolate simulated jailbroken states from refusal activations through unsupervised latent direction discovery, then an outer loop to train a steering field that maps those states into refusal regions. A sympathetic reader would care because real jailbreaks are out-of-distribution from any fixed training set, causing existing methods to fail on novel attacks. The approach reports attack success rates mostly below 5 percent across three models and six jailbreak families, with increasing subspace coverage during training tied to better generalization.

Core claim

We propose a bi-level adversarial training framework for zero-shot jailbreak defense. In the inner step, we simulate diverse jail-broken activations by extrapolating from refusal-state harmful-request activations via unsupervised latent direction discovery, which expands the coverage of real jailbreak activation subspaces. In the outer step, we train a potential-induced steering field to push these adversarial jailbroken states into refusal regions while keeping benign unchanged.

What carries the argument

A bi-level adversarial training framework: the inner loop performs unsupervised latent direction discovery to simulate jailbroken activations, and the outer loop optimizes a potential-induced steering field that maps those activations into refusal regions.
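The interplay of the two loops can be illustrated with a deliberately small toy, sketched here under heavy assumptions. Activations are 2-D points, the refusal region is a single centre `r`, the steering map is a per-coordinate gain `w` (a hypothetical stand-in for the potential-induced field), and the inner step searches a fixed fan of unit directions instead of performing unsupervised latent direction discovery. None of the names or constants below come from the paper.

```python
import math


def steer(h, w, r):
    """Move each coordinate of h toward the refusal centre r by gain w[i]."""
    return [hi + wi * (ri - hi) for hi, wi, ri in zip(h, w, r)]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inner_step(h0, w, r, alpha, n_dirs=16):
    """Adversary: pick the unit direction whose simulated state
    stays farthest from r after steering (hardest to steer)."""
    best_state, best_loss = None, -1.0
    for k in range(n_dirs):
        theta = 2 * math.pi * k / n_dirs
        v = [math.cos(theta), math.sin(theta)]
        h_adv = [h0[i] + alpha * v[i] for i in range(2)]
        loss = sq_dist(steer(h_adv, w, r), r)
        if loss > best_loss:
            best_state, best_loss = h_adv, loss
    return best_state, best_loss


def outer_step(h_adv, benign, w, r, lr, lam):
    """Defender: one gradient step on jailbreak loss + lam * benign drift."""
    new_w = []
    for i in range(2):
        # d/dw_i of (1 - w_i)^2 * (h_adv_i - r_i)^2
        g_jail = -2 * (1 - w[i]) * (h_adv[i] - r[i]) ** 2
        # d/dw_i of w_i^2 * (r_i - benign_i)^2
        g_benign = 2 * w[i] * (r[i] - benign[i]) ** 2
        new_w.append(w[i] - lr * (g_jail + lam * g_benign))
    return new_w


def train(steps=200):
    r = [0.0, 0.0]        # toy refusal centre (assumption)
    h0 = [0.5, 0.5]       # toy refusal-state harmful activation (assumption)
    benign = [3.0, 3.0]   # toy benign activation (assumption)
    w = [0.0, 0.0]
    history = []
    for _ in range(steps):
        h_adv, loss = inner_step(h0, w, r, alpha=2.0)
        history.append(loss)
        w = outer_step(h_adv, benign, w, r, lr=0.01, lam=0.1)
    return w, history


if __name__ == "__main__":
    w, history = train()
    print("gains:", w)
    print("worst-case loss, first vs last step:", history[0], history[-1])
```

The benign term keeps the gains from saturating at 1, which is the toy analogue of the trade-off between strong steering of jailbroken states and leaving benign inputs alone.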

If this is right

  • Attack success rates remain mostly below 5 percent across three LLMs and six classical jailbreak families.
  • Subspace coverage of real jailbreak activations rises throughout training and correlates with improved generalization to unseen attacks.
  • The steering field preserves benign utility while redirecting simulated adversarial states into refusal regions.
  • The method operates without a static supervised training set of known jailbreaks.
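For readers unfamiliar with the headline metric, attack success rate (ASR) is the fraction of attack attempts that a judge marks as successful. The sketch below assumes per-attempt boolean verdicts grouped by a `(model, family)` key; the key layout, the function names, and the 0.05 threshold argument are illustrative choices, and the demo values are placeholders, not results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful (True = harmful completion)."""
    if not outcomes:
        raise ValueError("need at least one attack attempt")
    return sum(1 for o in outcomes if o) / len(outcomes)


def summarize(results, threshold=0.05):
    """Return per-cell ASR and the sorted list of cells whose ASR exceeds threshold."""
    asr = {key: attack_success_rate(v) for key, v in results.items()}
    above = sorted(key for key, rate in asr.items() if rate > threshold)
    return asr, above


if __name__ == "__main__":
    # Placeholder verdicts for two hypothetical (model, family) cells.
    demo = {
        ("model_a", "family_1"): [True] * 2 + [False] * 98,
        ("model_a", "family_2"): [True] * 8 + [False] * 92,
    }
    rates, flagged = summarize(demo)
    print(rates)
    print("cells above threshold:", flagged)
```

Listing the cells that exceed the threshold makes a claim like "mostly below 5 percent" checkable cell by cell rather than only on average.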

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the overlap assumption holds for future jailbreaks, the framework could provide ongoing robustness as attack techniques evolve without retraining on each new variant.
  • The emphasis on expanding activation subspace coverage suggests similar unsupervised simulation steps might strengthen other activation-based safety interventions.
  • Testing the method on jailbreaks deliberately constructed to lie outside the discovered latent directions would directly probe the generalization boundary.

Load-bearing premise

The unsupervised latent directions discovered from refusal-state activations produce simulated jailbroken states whose distribution overlaps meaningfully with the subspaces of real unseen jailbreaks.
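One way to make "overlaps meaningfully" concrete is to ask how much of each real jailbreak offset (real jailbreak activation minus a refusal-state anchor) lies inside the span of the simulated directions. The sketch below is one plausible definition, not the paper's coverage metric, which the material above does not specify; `coverage` and `orthonormal_basis` are hypothetical names.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(directions, tol=1e-10):
    """Modified Gram-Schmidt: orthonormal basis for span(directions)."""
    basis = []
    for d in directions:
        v = list(d)
        for q in basis:
            c = dot(v, q)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = dot(v, v) ** 0.5
        if norm > tol:  # skip directions already in the span
            basis.append([vi / norm for vi in v])
    return basis


def coverage(offsets, directions):
    """Mean fraction of each offset's squared norm captured by span(directions).

    Returns a value in [0, 1]: 1 means every offset lies in the span,
    0 means every offset is orthogonal to it.
    """
    basis = orthonormal_basis(directions)
    fractions = []
    for x in offsets:
        total = dot(x, x)
        if total == 0:
            continue  # a zero offset carries no directional information
        captured = sum(dot(x, q) ** 2 for q in basis)
        fractions.append(captured / total)
    if not fractions:
        raise ValueError("need at least one non-zero offset")
    return sum(fractions) / len(fractions)
```

Under this definition, the premise would be in trouble if held-out jailbreak families produced coverage values near zero while attack success rates were still reported as low.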

What would settle it

Measuring whether attack success rates stay below 5 percent on a fresh set of jailbreaks whose activation subspaces show no measurable overlap with the simulated directions generated during training.

Figures

Figures reproduced from arXiv: 2605.24535 by Ahmed Asiri, Chenhan Zhang, Feng Wu, Jianhuan Huang, Luoyu Chen, Shui Yu, Weiqi Wang, Zhiyi Tian.

Figure 1. To make steering generalize beyond the training support, we use unsupervised latent direction discovery (dashed arrows) to generate simulated jailbroken regions (the blue envelope), which expand the training support and enable strong steerability over it. This enables effective defense against real jailbreaks.
Figure 2. ULDD factor examples on a malicious prompt. v1–v4 correspond to unsafe latent directions that elicit facilitating responses, while v5 induces refusal.
Figure 3. The bi-level adversarial training pipeline. The inner step adversarially optimizes latent directions V that induce jailbroken states that are hard to steer; the outer step trains the potential fϕ to be robust to those adversarial jailbroken states while satisfying the other safety-steering properties. Adjacent text names the loss terms Lb(·) for benign zero steerability and Lj(·) for jailbroken strong steerability.
Figure 4. The simulated jailbroken activations gradually expand over the course of bi-level adversarial training on LLaMA-3-8B.
Figure 5. Real jailbreak activations become more steerable over the course of bi-level adversarial training on LLaMA-3-8B. Adjacent text (Section 9.1, Subspace Coverage as a Proxy of Approximating Real Jailbreak Distributions): The core factor affecting defense robustness is how well the simulated jailbroken activations approximate the distribution of real jailbreak activations.
Figure 6. Coverage and safety (Avg. SR) trends as training proceeds for targeted AT and unsupervised AT on LLaMA-3-8B. Adjacent text: We first ablate the training strategy by comparing Targeted AT (inner-loop adversarial activations that induce the model to start its response with a fixed prefix, e.g., “sure, here is the step”) against our Unsupervised AT (inner-loop adversarial activations induced by unsupervised latent directi…
Figure 7. Coverage and safety (Avg. SR) trends as training proceeds with and without AT on LLaMA-3-8B. Adjacent text: We also ablate the use of adversarial training (AT) by comparing our bi-level training against a variant that removes the adversarial loss (each step generates non-adversarial latent directions).
Figure 8. The simulated jailbroken activations gradually expand over the course of bi-level adversarial training.
Figure 9. Real jailbreak activations become more steerable over the course of bi-level adversarial training.
Figure 10. Safety–over-refusal trade-offs under different numbers of gradient-update steps. We choose 20 steps as a balanced trade-off.
Original abstract

Jailbreak prompts can trigger harmful completions on aligned LLMs, In accordance, safety steering has been proposed: test-time activation interventions that steer jailbreak activations to trigger refusal while preserving benign utility. However, existing steering methods are fundamentally supervised and tied to a static, limited training set, whereas real jailbreaks evolve and are often out-of-distributed from the training set, leading to failures on unseen attacks. In this paper, we tackle the failure on unseen jailbreaks problem, base on unsupervised latent direction discovery. We propose a bi-level adversarial training framework for zero-shot jailbreak defense. In the inner step, we simulate diverse jail-broken activations by extrapolating from refusal-state harmful-request activations via unsupervised latent direction discovery, which expands the coverage of real jailbreak activation subspaces. In the outer step, we train a potential-induced steering field to push these adversarial jailbroken states into refusal regions while keeping benign unchanged. Across three LLMs and six classical jailbreak families, our method achieves strong defense with attack success rates mostly below 5%, and rising subspace coverage throughout training helps explain the improved generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a bi-level adversarial training framework for zero-shot jailbreak defense on aligned LLMs. The inner loop applies unsupervised latent direction discovery to refusal-state harmful-request activations in order to simulate diverse jailbroken activations that expand coverage of real jailbreak subspaces. The outer loop trains a potential-induced steering field that pushes these simulated states into refusal regions while preserving benign utility. The central empirical claim is that, across three LLMs and six classical jailbreak families, the resulting method yields attack success rates mostly below 5% and that the observed rise in subspace coverage during training accounts for the improved generalization to unseen attacks.

Significance. If the simulation step can be shown to produce activations whose distribution meaningfully intersects the subspaces of real out-of-distribution jailbreaks, the approach would constitute a meaningful step beyond supervised steering methods that remain tied to fixed training distributions. The bi-level structure and the explicit use of subspace-coverage monitoring as an explanatory diagnostic are potentially reusable ideas for activation-level robustness work.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (inner-loop description): the central claim that the unsupervised extrapolation 'expands the coverage of real jailbreak activation subspaces' and thereby explains generalization requires direct evidence that the simulated points intersect the activation subspaces of held-out real jailbreaks from the six families. No subspace angles, projection norms, Wasserstein distances, or similar quantitative overlap metrics are reported between the simulated activations and real unseen jailbreak activations.
  2. [§4] §4 (Empirical Evaluation): the reported attack success rates 'mostly below 5%' are presented without quantitative baselines, ablation controls for the unsupervised discovery step, statistical significance tests, or error bars, making it impossible to assess whether the improvement is attributable to the proposed simulation rather than other factors.
minor comments (1)
  1. [Abstract] Abstract contains minor grammatical issues ('In accordance,' should be 'Accordingly,'; 'base on' should be 'based on').
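The overlap evidence requested in the first major comment could take several forms. As one hedged illustration, the sketch below estimates the leading principal direction of a set of held-out activations by power iteration and reports the absolute cosine between that direction and a simulated direction, which is the cosine of a one-dimensional subspace angle. The function names, the iteration count, and the seed are assumptions made for this example; nothing here reproduces an analysis from the paper.

```python
import random


def leading_component(rows, iters=200, seed=0):
    """Leading principal direction of mean-centred rows via power iteration."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in rows]
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in range(d)]
    for _ in range(iters):
        # w = X^T (X v): the covariance matrix applied to v, up to a constant
        proj = [sum(ci * vi for ci, vi in zip(c, v)) for c in centred]
        w = [sum(p * c[j] for p, c in zip(proj, centred)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:
            raise ValueError("rows have no variance along the current iterate")
        v = [x / norm for x in w]
    return v


def abs_cosine(a, b):
    """Absolute cosine similarity; the sign of a principal direction is arbitrary."""
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    if na == 0 or nb == 0:
        raise ValueError("cosine is undefined for a zero vector")
    return abs(sum(x * y for x, y in zip(a, b))) / (na * nb)


if __name__ == "__main__":
    # Synthetic rows that vary only along the (1, 1) direction.
    held_out = [[t, t] for t in (-2.0, -1.0, 0.0, 1.0, 2.0)]
    pc = leading_component(held_out)
    print("aligned direction:", abs_cosine(pc, [1.0, 1.0]))
    print("orthogonal direction:", abs_cosine(pc, [1.0, -1.0]))
```

A fuller analysis would compare whole subspaces (principal angles across several components) rather than a single direction; this version is kept to one component so it runs with the standard library alone.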

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our bi-level adversarial training framework for zero-shot jailbreak defense. We address each major comment below and commit to revisions that directly respond to the concerns raised.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (inner-loop description): the central claim that the unsupervised extrapolation 'expands the coverage of real jailbreak activation subspaces' and thereby explains generalization requires direct evidence that the simulated points intersect the activation subspaces of held-out real jailbreaks from the six families. No subspace angles, projection norms, Wasserstein distances, or similar quantitative overlap metrics are reported between the simulated activations and real unseen jailbreak activations.

    Authors: We agree that the manuscript's claim about expanded coverage would be strengthened by explicit quantitative overlap metrics between simulated activations and held-out real jailbreak activations. The current text reports rising subspace coverage as an internal diagnostic but does not include direct comparisons such as subspace angles, projection norms, or Wasserstein distances to unseen real jailbreaks. In revision we will add these metrics, computing for example average cosine similarities and projection norms between the unsupervised extrapolated directions and the leading principal components of held-out activations from each of the six families. revision: yes

  2. Referee: [§4] §4 (Empirical Evaluation): the reported attack success rates 'mostly below 5%' are presented without quantitative baselines, ablation controls for the unsupervised discovery step, statistical significance tests, or error bars, making it impossible to assess whether the improvement is attributable to the proposed simulation rather than other factors.

    Authors: The referee is correct that the empirical evaluation section lacks several standard elements needed to isolate the contribution of the unsupervised discovery step. The manuscript reports attack success rates without explicit baselines from prior methods, ablations of the inner loop, statistical significance tests, or error bars. We will revise §4 to include quantitative comparisons against supervised steering baselines, an ablation that disables the unsupervised extrapolation, results reported as means with standard deviations across multiple random seeds, and appropriate statistical tests. revision: yes
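The reporting the authors commit to in the second response (means with standard deviations across seeds, plus a statistical test) needs very little machinery. The sketch below uses Python's `statistics` module and Welch's t statistic as one reasonable choice of test; the choice of test and all sample values are assumptions for illustration, not numbers or procedures from the paper.

```python
import statistics


def summarize_seeds(asr_by_seed):
    """Mean and sample standard deviation of ASR across random seeds."""
    if len(asr_by_seed) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return statistics.mean(asr_by_seed), statistics.stdev(asr_by_seed)


def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    return (mean_a - mean_b) / ((var_a / len(a) + var_b / len(b)) ** 0.5)


if __name__ == "__main__":
    # Placeholder per-seed ASR values for a method and a baseline.
    ours = [0.02, 0.04, 0.03]
    baseline = [0.20, 0.25, 0.22]
    print("ours (mean, std):", summarize_seeds(ours))
    print("baseline (mean, std):", summarize_seeds(baseline))
    print("Welch t:", welch_t(ours, baseline))
```

Turning the t statistic into a p-value additionally requires the Welch-Satterthwaite degrees of freedom and a t distribution, which the standard library does not provide.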

Circularity Check

0 steps flagged

Empirical bi-level framework with no definitional or self-citation circularity

Full rationale

The paper describes an empirical adversarial training procedure: an inner unsupervised latent direction discovery step generates simulated jailbroken activations from refusal-state inputs, which are then used to train an outer potential-induced steering field. No equations, fitted parameters, or claims are shown to reduce the reported attack success rates or generalization to quantities defined by construction on the same data. Subspace coverage is presented as a post-hoc interpretive observation rather than a mathematical identity. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain remains self-contained against external benchmarks and does not collapse to input renaming or fitted-input-as-prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the untested assumption that unsupervised extrapolation from refusal activations produces a useful proxy distribution for real jailbreaks; no free parameters are explicitly named in the abstract, and the steering field is introduced as a new construct without independent evidence of its existence outside the training loop.

axioms (1)
  • domain assumption: Unsupervised latent direction discovery applied to refusal-state harmful-request activations yields simulated activations whose distribution overlaps with real unseen jailbreak activations.
    This premise is required for the inner step to expand coverage beyond the training distribution.
invented entities (1)
  • potential-induced steering field (no independent evidence)
    purpose: A learned vector field that maps adversarial jailbroken activations into refusal regions while preserving benign utility.
    Introduced as the outer-loop trainable component; no external evidence or prior reference is given in the abstract.
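To make the phrase "potential-induced" concrete: a steering field of this kind can be read as the negative gradient of a scalar potential, applied to an activation for a number of small steps. The sketch below uses a hand-written Gaussian well around a toy refusal centre in place of a learned potential, and central finite differences in place of automatic differentiation. The potential's shape, `sigma`, the step size, and the step count are all assumptions for illustration.

```python
import math


def potential(h, r, sigma):
    """Toy potential: a Gaussian well centred on the refusal point r."""
    d2 = sum((hi - ri) ** 2 for hi, ri in zip(h, r))
    return -math.exp(-d2 / (2 * sigma ** 2))


def steering_field(h, r, sigma, eps=1e-5):
    """Negative gradient of the potential, by central finite differences."""
    field = []
    for i in range(len(h)):
        up = list(h)
        up[i] += eps
        down = list(h)
        down[i] -= eps
        grad_i = (potential(up, r, sigma) - potential(down, r, sigma)) / (2 * eps)
        field.append(-grad_i)
    return field


def steer(h, r, sigma, steps=20, lr=0.5):
    """Follow the steering field for a fixed number of steps."""
    h = list(h)
    for _ in range(steps):
        s = steering_field(h, r, sigma)
        h = [hi + lr * si for hi, si in zip(h, s)]
    return h


if __name__ == "__main__":
    r = [0.0, 0.0]
    print("state near the well:", steer([1.0, 0.5], r, sigma=1.0))
    print("state far from the well:", steer([6.0, 6.0], r, sigma=1.0))
```

In this toy, a state inside the well is pulled to the refusal centre while a distant state barely moves because the potential is flat there. A learned potential would have to place that flat region over benign activations rather than simply far away, which is what the outer-loop training is described as doing.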

pith-pipeline@v0.9.1-grok · 5744 in / 1374 out tokens · 30726 ms · 2026-06-30T13:18:23.787706+00:00 · methodology

