Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda · 2024 · DOI 10.52202/079017-4322

5 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Do Thinking Tokens Help with Safety?

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.

How's it going? Reinforcement learning in language models recruits a functional welfare axis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

cs.CR · 2026-06-18 · unverdicted · novelty 6.0

SafeSpec integrates a latent safety head into speculative LLM decoding with rollback and reflective multi-sampling, cutting attack success rates 15% on Qwen3-32B while retaining 2.06x speedup on normal workloads.

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

AdvGRPO stabilizes GRPO for joint attacker-defender optimization via multi-channel rewards and curriculum training, yielding effective transferable attacks and stronger co-trained defenders on safety benchmarks.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

cs.CL · 2026-05-11 · conditional · novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
