Thinking tokens in reasoning models do not enable safety deliberation; refusal/compliance is strongly predictable from the first token and rarely changes during thinking.
Refusal in language models is mediated by a single direction
6 Pith papers cite this work.
years
2026: 6
representative citing papers
Reinforcement learning recruits rather than creates a functional welfare axis in language models, as reward and punishment vectors from a maze task generalize to unrelated settings and appear in pretrain-only models.
SafeSpec integrates a latent safety head into speculative LLM decoding with rollback and reflective multi-sampling, cutting attack success rates 15% on Qwen3-32B while retaining 2.06x speedup on normal workloads.
AdvGRPO stabilizes GRPO for joint attacker-defender optimization via multi-channel rewards and curriculum training, yielding effective transferable attacks and stronger co-trained defenders on safety benchmarks.
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
A lifecycle-based survey of LLM fine-tuning security that reviews attacks and defenses by intervention phase and reports unified empirical findings on model-dependent attack effectiveness and limited defense generalization.
citing papers explorer
- Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO
- Training-Free Cultural Alignment of Large Language Models via Persona Disagreement