Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Obeso, Aaquib Syed, et al. · 2024 · DOI 10.52202/079017-4322

6 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

cs.LG · 2026-06-23 · unverdicted · novelty 7.0

Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.

How's it going? Reinforcement learning in language models recruits a functional welfare axis

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.

SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling

cs.CR · 2026-06-18 · unverdicted · novelty 6.0

SafeSpec integrates a latent safety head into speculative LLM decoding with rollback and reflective multi-sampling, cutting attack success rates 15% on Qwen3-32B while retaining 2.06x speedup on normal workloads.

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

AdvGRPO stabilizes GRPO for joint attacker-defender optimization via multi-channel rewards and curriculum training, yielding effective transferable attacks and stronger co-trained defenders on safety benchmarks.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

cs.CL · 2026-05-11 · conditional · novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.

Security in the Fine-Tuning Lifecycle of Large Language Models: Threats, Defenses, Evaluation, and Future Directions

cs.CR · 2026-05-24 · unverdicted · novelty 5.0

A lifecycle-based survey of LLM fine-tuning security that reviews attacks and defenses by intervention phase and reports unified empirical findings on model-dependent attack effectiveness and limited defense generalization.

citing papers explorer

Showing 2 of 2 citing papers after filters.

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO
cs.CL · 2026-06-08 · unverdicted · none · ref 35

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL · 2026-05-11 · conditional · none · ref 2
