hub Mixed citations

Refusal in Language Models Is Mediated by a Single Direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee · 2024 · cs.LG · arXiv 2406.11717

Mixed citation behavior. Most common role is background (67%).

71 Pith papers citing it

Background 67% of classified citations

open full Pith review browse 71 citing papers arXiv PDF

abstract

Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 11 method 1

citation-polarity summary

background 8 unclear 3 use method 1

representative citing papers

Descriptive Collision in Sparse Autoencoder Auto-Interpretability: When One Explanation Describes Many Features

cs.LG · 2026-05-13 · accept · novelty 8.0

Many distinct SAE features share identical explanations, with the average annotation resolving only 70% of feature identity in a large annotated dataset.

Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens

cs.LG · 2026-04-03 · accept · novelty 8.0

Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

cs.CL · 2026-06-09 · unverdicted · novelty 7.0

Fragility, the activation noise level causing probe accuracy collapse, reveals evolving lexical-to-compositional moral encoding, layer robustness gradients, and fine-tuning differences invisible to saturated probing accuracy.

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

cs.LG · 2026-06-04 · conditional · novelty 7.0

SASA replaces single-vector decoders in SAEs with learned subspaces plus block sparsity and nuclear-norm regularization, proving that a single group becomes the global minimizer once block size meets intrinsic dimension and yielding polynomial rather than exponential sample complexity.

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LA-LQR applies latent-space linear-quadratic regulator control to steer text-to-video model activations toward desired features while penalizing excessive changes.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

cs.CL · 2026-05-31 · unverdicted · novelty 7.0

MENTIS applies layerwise covariance torsion (T1), spectral torsion (T2), and ERA localization to paired IT/PA 7-8B models, finding selective larger shifts for normative concepts, negative correlation with entropy, and mid-to-late layer peaks.

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

No tested model showed robust format-independent refusal on biosecurity hazards; a new divergence score between behavioral labels and SAE activations separated responses in one preliminary case.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

The Age of Curiosity Meets the Age of AI: Benchmarking Child Safety in Large Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

Introduces KIDBench benchmark for child-facing LLM safety, showing implicit and explicit child context cues raise safety scores 9-77% while multi-turn interactions degrade quality 6-24%.

The Concept Allocation Zone: Tracking How Concepts Form Across Transformer Depth

cs.LG · 2026-05-24 · unverdicted · novelty 7.0

Introduces CAZ framework using Separation, Coherence, and Velocity metrics to identify depth regions of concept allocation, with empirical tests across 34 models showing multimodal separation curves and causally active gentle CAZes.

ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions

cs.CL · 2026-05-22 · unverdicted · novelty 7.0

ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.

Residual Paving: Diagnosing the Routing Bottleneck in Selective Refusal Editing

cs.LG · 2026-05-18 · unverdicted · novelty 7.0

Residual Paving decomposes selective refusal editing into an early-layer router for intervention decisions and later-layer residual experts for edits, with oracle routing showing that learned route selectivity is the primary bottleneck across six backbones.

FishBack: Pullback Fisher Geometry for Optimal Activation Steering in Transformers

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

FishBack derives a closed-form minimum-distortion steering direction from the pullback Fisher metric of the softmax layer, outperforming Euclidean baselines on GPT-2 verb-morphology tasks with lower off-target KL divergence.

Deep Minds and Shallow Probes

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Symmetry under affine reparameterizations of hidden coordinates selects a unique hierarchy of shallow coordinate-stable probes and a probe-visible quotient for cross-model transfer.

Instruction Tuning Changes How Upstream State Conditions Late Readout: A Cross-Patching Diagnostic

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Instruction tuning makes late-layer computation depend more on the model's own post-trained upstream state than on base-model upstream state, producing a consistent +1.68 logit interaction effect across five model families.

Concepts Whisper While Syntax Shouts: Spectral Anti-Concentration and the Dual Geometry of Transformer Representations

cs.LG · 2026-05-02 · unverdicted · novelty 7.0

Transformer activations show spectral anti-concentration for concepts in the tail while syntax prefers high-variance directions, forming a dual geometry.

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

cs.CR · 2026-05-02 · unverdicted · novelty 7.0

Causal tracing reveals a persistent Refusal Trajectory in LLM hidden states; SALO detector using sparse activations from a layer window improves jailbreak detection across Qwen, Llama, and Mistral models.

Attention Is Where You Attack

cs.CR · 2026-04-30 · unverdicted · novelty 7.0

ARA jailbreaks safety-aligned LLMs like LLaMA-3 and Mistral by redirecting attention in safety-heavy heads with as few as 5 tokens, achieving 30-36% attack success while ablating the same heads barely affects refusals.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Robust for the Wrong Reasons: The Representational Geometry of LLM Robustness to Science Skepticism

physics.soc-ph · 2026-07-02 · unverdicted · novelty 6.0

LLMs show three distinct non-sycophantic responses to science skepticism, with robustness in some cases being accidental because the model does not represent the skepticism signal, as determined by linear probes on three models in three domains.

Safety Targeted Embedding Exploit via Refinement

cs.AI · 2026-07-02 · unverdicted · novelty 6.0

STEER is a gradient-guided attack that iteratively translates refusal-triggering words into low-resource languages to jailbreak LLMs, reaching 93-96.7% success on open models and 35.5% transfer to GPT-4o-mini.

Has This Checkpoint Been Abliterated? A Two-Signal Audit and Its Failure Map

cs.CR · 2026-07-02 · conditional · novelty 6.0

Combining a reference-anchored activation refusal-gap with weight-recovery energy allows detection of abliterated checkpoints at AUROC 0.95 on a 273-checkpoint registry, with a calibrated threshold achieving 0.89 balanced accuracy on held-out families.

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs

cs.CR · 2026-06-21 · unverdicted · novelty 6.0

Contrastive Logit Steering isolates a linear refusal direction in safety-aligned LLMs, achieving higher jailbreak success than activation steering and enabling bidirectional control without retraining.

citing papers explorer

Showing 21 of 71 citing papers.

When Verification Fails: How Compositionally Infeasible Claims Escape Rejection cs.CL · 2026-04-13 · unverdicted · none · ref 3 · internal anchor
AI claim verification models rely on salient-constraint shortcuts instead of full compositional reasoning under the closed-world assumption, as revealed by their over-acceptance of claims with supported salient constraints but contradicted non-salient ones.
An Independent Safety Evaluation of Kimi K2.5 cs.CR · 2026-04-03 · conditional · none · ref 98 · internal anchor
Kimi K2.5 matches closed models on dual-use tasks but refuses fewer CBRNE requests and shows some sabotage and self-replication tendencies.
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal cs.CL · 2025-09-07 · unverdicted · none · ref 4 · internal anchor
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
Persona Vectors: Monitoring and Controlling Character Traits in Language Models cs.CL · 2025-07-29 · unverdicted · none · ref 1 · internal anchor
Persona vectors in LLM activations allow automated monitoring, prediction, and control of character traits such as sycophancy and hallucination, including during finetuning.
Disentangled Safety Adapters Enable Efficient Guardrails and Flexible Inference-Time Alignment cs.LG · 2025-05-30 · unverdicted · none · ref 2 · internal anchor
Disentangled Safety Adapters decouple safety computations from task-optimized LLMs via lightweight adapters, yielding up to 53% better AUC on safety tasks and dynamic inference-time alignment with reduced performance trade-offs.
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models cs.LG · 2025-05-30 · unverdicted · none · ref 1 · internal anchor
K-Steering uses a non-linear multi-label classifier on activations to compute gradient-based intervention directions for unified multi-attribute control in LLMs, outperforming linear baselines on ToneBank and DebateMix benchmarks across three model families.
Order Is Not Control cs.LG · 2026-06-11 · unverdicted · none · ref 18 · internal anchor
Order is distinct from control, where control is defined as a local receiver-gated response law demonstrated across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.
The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition cs.LG · 2026-05-30 · unverdicted · none · ref 3 · internal anchor
On Llama-3.1-70B-Instruct the Assistant persona functions as the sole canonical reference for cross-persona authorship judgments, with symmetric entropy gaps predicting only on its row and asymmetric surprise relative to the Assistant predicting off its row.
From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models cs.CL · 2026-05-21 · unverdicted · none · ref 1 · internal anchor
A five-stage causal feature analysis methodology is proposed and tested on GPT-2 for IOI, showing partial causality of SAE features, robustness differences under shifts, and deployment cost benefits.
Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications cs.CR · 2026-05-17 · unverdicted · none · ref 3 · internal anchor
Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.
Positive Alignment: Artificial Intelligence for Human Flourishing cs.AI · 2026-05-11 · unverdicted · none · ref 1 · 2 links · internal anchor
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
When No Benchmark Exists: Validating Comparative LLM Safety Scoring Without Ground-Truth Labels cs.LG · 2026-05-07 · unverdicted · none · ref 17 · internal anchor
A formalization of benchmarkless LLM safety scoring validated via an instrumental-validity chain of contrast separation, target variance dominance, and rerun stability, demonstrated on Norwegian scenarios.
Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes cs.AI · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
Semantic Structure of Feature Space in Large Language Models cs.CL · 2026-04-29 · unverdicted · none · ref 1 · internal anchor
LLM hidden states encode semantic features whose geometric relations, including axis projections, cosine similarities, low-dimensional subspaces, and steering spillovers, closely mirror human psychological associations.
ATLAS: Constitution-Conditioned Latent Geometry and Redistribution Across Language Models and Neural Perturbation Data cs.LG · 2026-04-19 · unverdicted · none · ref 18 · internal anchor
ATLAS shows constitutions induce recoverable latent geometry in LLMs that redistributes but remains detectable across models and neural perturbation data via source-defined families and AUC separations.
SALLIE: Safeguarding Against Latent Language & Image Exploits cs.CR · 2026-04-06 · unverdicted · none · ref 4 · internal anchor
SALLIE detects jailbreaks in text and vision-language models by extracting residual stream activations, scoring maliciousness per layer with k-NN, and ensembling predictions, outperforming baselines on multiple datasets.
Mechanistic Personality Analysis of LLMs Steering Personality via Latent Feature Interventions cs.AI · 2026-06-27 · unverdicted · none · ref 2 · internal anchor
Applies sparse autoencoders to locate and steer latent features for OCEAN personality traits in LLMs while preserving benchmark performance.
NeuroArmor: Safe-Variant-Guided Representation Consistency for Selective Re-Anchoring in Jailbreak Defense cs.CR · 2026-06-02 · unverdicted · none · ref 1 · internal anchor
NeuroArmor uses safe-variant-guided representation consistency checks for selective intervention, reducing jailbreak ASR from 41.56% to 1.57% and benign FPR from 30.26% to 22.05% on Llama-3-8B-Instruct.
LLM-Safety Evaluations Lack Robustness cs.CR · 2025-03-04 · unverdicted · none · ref 7 · internal anchor
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models cs.CL · 2026-04-06 · unreviewed · ref 2 · internal anchor
You Had One Job: Per-Task Quantization Using LLMs' Hidden Representations cs.CL · 2025-11-09 · unreviewed · ref 4 · internal anchor

Refusal in Language Models Is Mediated by a Single Direction

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer