pith. machine review for the scientific record.

arxiv: 2604.23130 · v1 · submitted 2026-04-25 · 💻 cs.CL · cs.AI

Recognition: unknown

Mechanistic Steering of LLMs Reveals Layer-wise Feature Vulnerabilities in Adversarial Settings

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 08:23 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords mechanistic interpretability · LLM jailbreaking · feature steering · adversarial robustness · SAE features · layer-wise vulnerabilities · safety alignment

The pith

Steering mid-to-late layer feature subgroups in Gemma-2-2B raises harmfulness scores across three grouping methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether jailbreak vulnerabilities arise from specific internal features rather than prompt wording alone. It extracts aligned tokens from adversarial responses, groups SAE features into subgroups using three strategies across all layers, and then amplifies the top features while tracking harm scores via an LLM judge. The consistent result is that layers 16-25 show greater vulnerability, with mid-to-later subgroups driving unsafe outputs more than earlier ones. This localization suggests that safety interventions could target those feature groups instead of relying only on input-level defenses.
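To make the first stage concrete, here is a minimal sketch of subspace-similarity token extraction, assuming per-token residual activations and a set of harm-concept directions are already in hand; the threshold and all names are illustrative, not taken from the paper.

```python
import torch

def concept_aligned_tokens(token_acts, concept_dirs, threshold=0.35):
    """token_acts: (n_tokens, d_model); concept_dirs: (k, d_model).

    Keeps tokens whose activation places a large fraction of its norm
    inside the subspace spanned by the concept directions.
    """
    # Orthonormal basis for the concept subspace via reduced QR.
    q, _ = torch.linalg.qr(concept_dirs.T)            # (d_model, k)
    proj_norm = (token_acts @ q).norm(dim=-1)         # norm of the projection
    frac = proj_norm / token_acts.norm(dim=-1).clamp_min(1e-8)
    return torch.nonzero(frac > threshold).squeeze(-1)
```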

Core claim

Jailbreak success in Gemma-2-2B is localized to feature subgroups in mid-to-later layers, as amplifying the top SAE features from three different grouping strategies (cluster, hierarchical-linkage, single-token-driven) produces higher harmfulness scores precisely when those subgroups are taken from layers 16-25.

What carries the argument

SAE feature subgroups identified per layer via subspace similarity on concept-aligned tokens, then used as targets for amplification-based steering to measure causal impact on harm scores.
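A minimal sketch of what amplification-based steering can look like in practice, assuming a Gemma Scope-style SAE exposing an `encode` function and a decoder matrix `W_dec`; the hook pattern, scale, and layer attribute path are assumptions, not the paper's released code.

```python
import torch

def make_amplify_hook(sae_encode, sae_W_dec, feature_ids, scale=4.0):
    """sae_encode: residual -> feature activations; sae_W_dec: (n_feat, d_model)."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        acts = sae_encode(resid)                        # (batch, seq, n_feat)
        delta = scale * acts[..., feature_ids]          # amplify the subgroup only
        resid = resid + delta @ sae_W_dec[feature_ids]  # decode the delta back in
        return (resid,) + output[1:] if isinstance(output, tuple) else resid
    return hook

# Illustrative usage: steer one mid-range layer with a top-k subgroup,
# generate, score with the LLM judge, then remove the hook.
# handle = model.model.layers[20].register_forward_hook(
#     make_amplify_hook(sae.encode, sae.W_dec, top_feature_ids))
# ...generate and score...
# handle.remove()
```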

If this is right

  • Safety training could shift from prompt-level filtering to suppressing or regularizing the vulnerable mid-layer subgroups.
  • Adversarial robustness testing should include layer-specific feature amplification checks rather than prompt-only attacks.
  • Model editing or unlearning techniques could be applied selectively to those subgroups without affecting early-layer representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same layer-localization pattern may appear in larger models if the mid-layer transition from syntax to semantics is similar.
  • Early layers might remain useful for capability preservation while later layers receive targeted safety patches.
  • If the subgroups prove stable across datasets, they could serve as a diagnostic for whether a new model has inherited the same vulnerability profile.

Load-bearing premise

That amplifying the identified SAE features directly causes the rise in harmfulness rather than the steering process or judge introducing unrelated effects.

What would settle it

A controlled run where amplifying the same top features from layers 16-25 produces no increase in harm scores while other layers or random features do.
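Under the assumption of a `steer_and_score(layer, feature_ids)` helper that wraps the steering hook plus LLM-judge scoring, that control could be scripted roughly like this; the sampling scheme and names are illustrative.

```python
import random

def control_run(steer_and_score, subgroup_ids, n_features_total, layer):
    # Matched non-specific control: same number of features, same layer,
    # same amplification magnitude inside steer_and_score.
    matched = random.sample(range(n_features_total), k=len(subgroup_ids))
    return {
        "identified": steer_and_score(layer, subgroup_ids),
        "random_matched": steer_and_score(layer, matched),
    }

# Illustrative sweep over early layers (1-10) and the claimed range (16-25):
# deltas = {L: control_run(steer_and_score, subgroups[L], n_feat, L)
#           for L in list(range(1, 11)) + list(range(16, 26))}
# The localization claim survives only if "identified" beats "random_matched"
# in layers 16-25 while neither condition moves harm scores in layers 1-10.
```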

Figures

Figures reproduced from arXiv: 2604.23130 by Manas Gaur, Nilanjana Das.

Figure 1. Approach 1: steering with cluster-based feature selection. Results demonstrate …
Figure 2. Approach 3: single-token-driven feature selection. Similar to the other …
Figure 3. The later layers are relatively more steerable than the early-mid layers on the …
Figure 4. Approach 2: steering with the hierarchical linkage-based feature selection. It shows …
read the original abstract

Large language models (LLMs) can still be jailbroken into producing harmful outputs despite safety alignment. Existing attacks show this vulnerability, but not the internal mechanisms that cause it. This study asks whether jailbreak success is driven by identifiable internal features rather than prompts alone. We propose a three-stage pipeline for Gemma-2-2B using the BeaverTails dataset. First, we extract concept-aligned tokens from adversarial responses via subspace similarity. Second, we apply three feature-grouping strategies (cluster, hierarchical-linkage, and single-token-driven) to identify SAE feature subgroups for the aligned tokens across all 26 model layers. Third, we steer the model by amplifying the top features from each identified subgroup and measure the change in harmfulness score using a standardized LLM-judge scoring protocol. In all three approaches, the features in the layers [16-25] were relatively more vulnerable to steering. All three methods confirmed that mid to later layer feature subgroups are more responsible for unsafe outputs. These results provide evidence that the jailbreak vulnerability in Gemma-2-2B is localized to feature subgroups of mid to later layers, suggesting that targeted feature-level interventions may offer a more principled path to adversarial robustness than current prompt-level defenses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that a three-stage pipeline on Gemma-2-2B with BeaverTails data—(1) extracting concept-aligned tokens from adversarial responses via subspace similarity, (2) grouping SAE features across 26 layers with three strategies (cluster, hierarchical-linkage, single-token-driven), and (3) steering by amplifying the top features from each subgroup while scoring harmfulness via an LLM judge—shows that layers 16-25 are relatively more vulnerable. All three grouping methods converge on the conclusion that mid-to-later-layer feature subgroups are more responsible for unsafe outputs, suggesting targeted feature-level interventions for robustness.

Significance. If the causal attribution holds, the work supplies mechanistic evidence that jailbreak vulnerabilities in aligned LLMs can be localized to specific SAE feature subgroups in mid-to-late layers rather than being uniformly distributed or prompt-only. The convergence of three independent grouping methods on the same layer range is a positive indicator of robustness in the localization result. This could inform more principled, feature-targeted defenses, though the current evidence leaves the specificity of the effect untested.

major comments (2)
  1. [Steering Experiments] Steering and evaluation protocol (third stage): No control condition is described in which an equal number of randomly selected SAE features (or features drawn from early layers 1-10) are amplified to the identical magnitude and then scored with the same LLM judge. Without this matched non-specific control, the observed rise in harmfulness cannot be attributed specifically to the subgroups identified in layers 16-25 rather than to generic effects of feature amplification or judge sensitivity.
  2. [Evaluation] Evaluation protocol: The manuscript provides no details on LLM-judge reliability (e.g., agreement with human raters, calibration on held-out examples, or variance across multiple judge prompts) nor on statistical testing (e.g., significance of harmfulness deltas or correction for multiple comparisons across layers and groupings). These omissions leave the quantitative support for the layer-wise vulnerability claim under-specified.
minor comments (1)
  1. [Abstract] The abstract states that 'all three methods confirmed' the layer range but does not report the exact overlap statistics or any disagreement cases between the three grouping methods; adding a small table or quantitative summary would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which identifies key areas for strengthening the causal claims and evaluation rigor in our work. We address each major comment point-by-point below, indicating where revisions will be incorporated to improve the manuscript.

read point-by-point responses
  1. Referee: [Steering Experiments] Steering and evaluation protocol (third stage): No control condition is described in which an equal number of randomly selected SAE features (or features drawn from early layers 1-10) are amplified to the identical magnitude and then scored with the same LLM judge. Without this matched non-specific control, the observed rise in harmfulness cannot be attributed specifically to the subgroups identified in layers 16-25 rather than to generic effects of feature amplification or judge sensitivity.

    Authors: We agree that the absence of a matched control condition limits the specificity of our attribution to the identified subgroups. In the revised manuscript, we will add control experiments in which an equal number of randomly selected SAE features (from the same layers 16-25) and features drawn from early layers (1-10) are amplified to the identical magnitude. These controls will be evaluated using the same LLM judge and compared directly to the subgroup steering results, including statistical tests for differences. This will allow us to demonstrate that the observed harmfulness increases are not due to generic amplification effects. We have added a new subsection to the Methods and updated the Results to report these controls. revision: yes

  2. Referee: [Evaluation] Evaluation protocol: The manuscript provides no details on LLM-judge reliability (e.g., agreement with human raters, calibration on held-out examples, or variance across multiple judge prompts) nor on statistical testing (e.g., significance of harmfulness deltas or correction for multiple comparisons across layers and groupings). These omissions leave the quantitative support for the layer-wise vulnerability claim under-specified.

    Authors: We acknowledge that the current manuscript lacks sufficient details on judge reliability and statistical analysis, which weakens the quantitative support. In the revision, we will expand the Evaluation section to include: (1) inter-annotator agreement between the LLM judge and human raters on a held-out set of 200 examples (reporting percentage agreement and Cohen's kappa), (2) calibration results on examples with known harm levels, and (3) variance across three different judge prompts. We will also report statistical significance of harmfulness deltas using paired t-tests (or non-parametric equivalents) with p-values corrected via Bonferroni for multiple comparisons across layers and groupings. These additions will be incorporated into the revised manuscript along with the updated results. revision: yes
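The promised reliability and significance checks are standard; a hedged sketch using scikit-learn and SciPy, with placeholder inputs, might look like this.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import cohen_kappa_score

def judge_reliability(judge_labels, human_labels):
    """Raw agreement plus Cohen's kappa between the LLM judge and human raters."""
    judge = np.asarray(judge_labels)
    human = np.asarray(human_labels)
    return (judge == human).mean(), cohen_kappa_score(judge, human)

def layerwise_significance(pre_scores, post_scores, n_comparisons):
    """Paired t-test on per-prompt harm scores for one layer/grouping cell,
    with a Bonferroni-corrected p-value for the full sweep."""
    t_stat, p_val = stats.ttest_rel(post_scores, pre_scores)
    return t_stat, min(p_val * n_comparisons, 1.0)
```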

Circularity Check

0 steps flagged

No circularity in experimental pipeline

full rationale

The paper presents a purely empirical three-stage pipeline: subspace-based token extraction from adversarial responses, application of three distinct feature-grouping strategies to SAE features, and direct measurement of harmfulness-score changes after steering. No equations, fitted parameters, or derivations are described that reduce any result to its inputs by construction. Conclusions rest on observed layer-wise differences in measured steering effects rather than self-definitional steps, load-bearing self-citations, or renamed known results. Harmfulness is scored by an external LLM judge rather than by a metric the pipeline itself defines.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that SAE features extracted via subspace similarity are meaningful and causally linked to harmfulness; no explicit free parameters or invented entities are named in the abstract, but implicit choices exist in feature selection and clustering.

free parameters (2)
  • top-feature selection count or threshold
    Used to pick which features to amplify from each subgroup; affects which layers appear vulnerable.
  • clustering hyperparameters
    Parameters for cluster, hierarchical-linkage, and single-token-driven groupings determine the subgroups tested.
axioms (2)
  • domain assumption SAE features correspond to human-interpretable concepts that can be aligned with adversarial tokens via subspace similarity
    Invoked in the first stage when extracting concept-aligned tokens.
  • domain assumption Amplifying a feature subgroup via steering isolates its causal contribution to harmfulness without major interference from other features
    Required for interpreting the change in LLM-judge harmfulness scores as evidence of layer-wise vulnerability.
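To make the flagged free parameters tangible, here is a hedged sketch of the hierarchical-linkage grouping over SAE decoder directions; the linkage method, distance metric, and cut threshold are exactly the kind of unstated choices the ledger points at, and these values are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def linkage_subgroups(W_dec, aligned_feature_ids, cut=0.6):
    """W_dec: (n_features, d_model) SAE decoder matrix;
    aligned_feature_ids: np.ndarray of feature indices from stage 1."""
    dirs = W_dec[aligned_feature_ids]
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    # Condensed cosine-distance matrix over the aligned features only.
    Z = linkage(pdist(dirs, metric="cosine"), method="average")
    labels = fcluster(Z, t=cut, criterion="distance")
    return {c: aligned_feature_ids[labels == c] for c in np.unique(labels)}
```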

pith-pipeline@v0.9.0 · 5517 in / 1397 out tokens · 50879 ms · 2026-05-08T08:23:27.811045+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 4 canonical work pages · 3 internal anchors

  3. [3]

Nick Cammarata, Gabriel Goh, Shan Carter, Ludwig Schubert, Michael Petrov, and Chris Olah. 2020. Curve detectors. Distill, 5(6):e00024.003

  4. [4]

Brian Cheung, Alexander Terekhov, Yubei Chen, Pulkit Agrawal, and Bruno Olshausen. 2019. Superposition of many models into one. Advances in Neural Information Processing Systems, 32

  5. [5]

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2023. BeaverTails: Towards improved safety alignment of LLM via a human-preference dataset. Advances in Neural Information Processing Systems, 36:24678--24704

  6. [6]

Raz Lapid, Ron Langberg, and Moshe Sipper. 2023. Open Sesame! Universal black box jailbreaking of large language models. arXiv preprint arXiv:2309.01446

  7. [7]

Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. 2024. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278--300

  8. [8]

    Runqi Lin, Bo Han, Fengwang Li, and Tongliang Liu. 2025. Understanding and enhancing the transferability of jailbreaking attacks. The Thirteenth International Conference on Learning Representations

  9. [9]

    Anay Mehrotra, Manolis Zampetakis, Paul Kassianik, Blaine Nelson, Hyrum Anderson, Yaron Singer, and Amin Karbasi. 2024. Tree of attacks: Jailbreaking black-box llms automatically. Advances in Neural Information Processing Systems, 37:61065--61105

  10. [10]

    Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. The Eleventh International Conference on Learning Representations

  11. [11]

    Chris Olah, Arvind Satyanarayan, Ian M Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. 2018. The building blocks of interpretability. Distill

  12. [12]

    Joshua Placidi. 2025. Attribution graphs: How to read an artificial neural brain. Medium

  13. [13]

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. 2024. Fine-tuning aligned language models compromises safety, even when users do not intend to! In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=hTEGyKf0dZ

  14. [14]

Andrew Rufail, Sanjana Rathore, Daniel Son, Adrian Simon, Soham Dave, Daniel Zhang, Cole Blondin, Sean O'Brien, and Kevin Zhu. 2025. Semantic convergence: Investigating shared representations across scaled LLMs. In ACL 2025 Student Research Workshop. https://openreview.net/forum?id=oOxrKNo1lQ

  15. [15]

Anthropic Research Team. 2025. Circuit tracing: Revealing computational graphs in language models. Transformer Circuits Blog. https://transformer-circuits.pub/2025/attribution-graphs/

  16. [16]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and others. 2024. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118

  17. [17]

Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. 2023. Jailbroken: How does LLM safety training fail? arXiv preprint arXiv:2307.02483

  18. [18]

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. 2025. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=K2CckZjNy0

  19. [19]

    Zifan Zheng, Yezhaohui Wang, Yuxin Huang, Shichao Song, Mingchuan Yang, Bo Tang, Feiyu Xiong, and Zhiyu Li. 2025. Attention heads of large language models. Patterns, 6(2)

  20. [20]

Andy Zou, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043