Refusal in language models is mediated by a single direction
5 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (5)
citing papers explorer
- Do Thinking Tokens Help with Safety?
  Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.
- How's it going? Reinforcement learning in language models recruits a functional welfare axis
  Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
- SafeSpec: Fast and Safe LLM via Dynamic Reflective Sampling
  SafeSpec integrates a latent safety head into speculative LLM decoding with rollback and reflective multi-sampling, cutting attack success rates 15% on Qwen3-32B while retaining 2.06x speedup on normal workloads.
- Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO
  AdvGRPO stabilizes GRPO for joint attacker-defender optimization via multi-channel rewards and curriculum training, yielding effective transferable attacks and stronger co-trained defenders on safety benchmarks.
- Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
  DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.