Refusal in Language Models Is Mediated by a Single Direction
27 Pith papers cite this work. Polarity classification is still indexing.
abstract
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
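The core intervention described in the abstract can be pictured with a short sketch. This is not the authors' released code: it assumes residual-stream activations have already been collected at one layer and token position for a set of harmful and a set of harmless prompts (the array names and the single-layer simplification are illustrative), and it shows a difference-of-means direction, the projection that erases it, and the addition that induces refusal.

```python
# Minimal sketch (not the authors' code): difference-of-means refusal direction,
# projection-based ablation, and activation addition on residual-stream activations.
# `harmful_acts` / `harmless_acts` are assumed (n_prompts, d_model) arrays of
# activations already hooked out of one layer at one token position.
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector pointing from the harmless mean toward the harmful mean."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate_refusal(acts: np.ndarray, direction: np.ndarray) -> np.ndarray:
    """Erase the component along `direction` (prevents refusal of harmful prompts)."""
    return acts - np.outer(acts @ direction, direction)

def add_refusal(acts: np.ndarray, direction: np.ndarray, scale: float = 1.0) -> np.ndarray:
    """Add the direction back in (elicits refusal even on harmless prompts)."""
    return acts + scale * direction
```

In the paper the ablation is applied throughout the model rather than at a single hooked layer, but the linear algebra is the same rank-one projection shown here.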
hub tools
citation-role summary
roles: background 1
citation-polarity summary
polarities: unclear 1
citing papers explorer
- Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features
Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.
- Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens
Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.
- Deep Minds and Shallow Probes
Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.
- Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic
Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.
- Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations
Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.
- Attention Is Where You Attack
ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models
Alignment policy in language models is implemented as an early-commitment routing circuit of detection gates and amplifier heads that can be localized, scaled, and directly controlled without removing the underlying capability.
- Fusion-fission forecasts when AI will shift to undesirable behavior
A vector generalization of fusion-fission group dynamics from physics forecasts when AI behavior shifts to undesirable states, validated at 90 percent across seven models and prior to real-world data.
- Before the Last Token: Diagnosing Final-Token Safety Probe Failures
Final-token probes miss distributed unsafe evidence in jailbreaks, but a PCA-HMM model on prefill trajectories recovers many misses without naive pooling's false positives.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
- Unlearners Can Lie: Evaluating and Improving Honesty in LLM Unlearning
Existing LLM unlearning methods fail honesty standards by hallucinating on forgotten knowledge; ReVa improves rejection rates nearly twofold while enhancing retained honesty.
- Tool Calling is Linearly Readable and Steerable in Language Models
Tool identity is linearly readable and steerable in LLMs via mean activation differences, with 77-100% switch accuracy and error prediction from activation gaps.
- The Granularity Axis: A Micro-to-Macro Latent Direction for Social Roles in Language Models
LLMs organize prompted social roles along a dominant, stable, and causally steerable granularity axis in representation space that runs from micro to macro levels.
- TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning
TwinGate deploys a stateful dual-encoder system with asymmetric contrastive learning to detect decompositional jailbreaks in untraceable LLM traffic at high recall and low false-positive rate with negligible latency.
- Perturbation Probing: A Two-Pass-per-Prompt Diagnostic for FFN Behavioral Circuits in Aligned LLMs
Perturbation probing identifies tiny sets of FFN neurons that control refusal templates and language routing in LLMs, enabling precise ablations and directional interventions that alter behavior on benchmarks while preserving safety.
- Dialect vs Demographics: Quantifying LLM Bias from Implicit Linguistic Signals vs. Explicit User Profiles
Explicit demographic statements trigger higher refusal rates and lower semantic similarity in LLMs than implicit dialect cues, which reduce refusals but also reduce content sanitization.
- Why Do Large Language Models Generate Harmful Content?
Causal mediation analysis shows harmful LLM outputs arise in late layers from MLP failures and gating neurons, with early layers handling harm context detection and signal propagation.
- When Verification Fails: How Compositionally Infeasible Claims Escape Rejection
AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constraints but contradicted non-salient ones.
- An Independent Safety Evaluation of Kimi K2.5
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
- Persona Vectors: Monitoring and Controlling Character Traits in Language Models
Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.
- When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
- Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes
Overthinking in medical QA is linearly decodable at 71.6% accuracy, yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
- Semantic Structure of Feature Space in Large Language Models
LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
- ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data
ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.
- SALLIE: Safeguarding Against Latent Language & Image Exploits
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets (see the illustrative sketch after this list).
- Positive Alignment: Artificial Intelligence for Human Flourishing
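The SALLIE entry above describes a per-layer k-NN scoring pipeline over residual-stream activations. The sketch below is a generic, hypothetical illustration of that idea rather than the paper's implementation: the function names, the benign-only reference set, and the plain averaging ensemble are all assumptions.

```python
# Hypothetical sketch of per-layer k-NN activation scoring with a simple ensemble.
# `layer_refs[i]` is an (n_ref, d_model) array of reference activations for layer i;
# `layer_acts[i]` is the (d_model,) activation of the incoming prompt at layer i.
import numpy as np

def knn_score(act: np.ndarray, refs: np.ndarray, k: int = 10) -> float:
    """Mean distance from `act` to its k nearest reference activations."""
    dists = np.linalg.norm(refs - act, axis=1)
    return float(np.sort(dists)[:k].mean())

def maliciousness(layer_acts: list, layer_refs: list, k: int = 10) -> float:
    """Average the per-layer k-NN scores; larger values are further from the references."""
    return float(np.mean([knn_score(a, r, k) for a, r in zip(layer_acts, layer_refs)]))
```

A threshold on this score would turn it into a detector; the paper's actual scoring, ensembling, and multimodal handling may differ.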