Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
Nature , volume=
4 Pith papers cite this work.
citing papers explorer
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
- Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
  Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
- DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents
  DiPS uses a trained critic to select persuasion policies via Q-learning in a fire-rescue evacuation task and reports higher success rates than zero-shot LLM or RAG baselines in both simulation and human trials.
- Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning
  Fine-tuning LLMs on structured tasks inspired by maladaptive behaviors produces stable, context-general shifts in next-token distributions and response tendencies consistent with altered behavioral priors.
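
The first entry above says a single direction in the residual stream can be "erased" or "added". As a rough illustration only, the toy sketch below shows what those two operations look like on a plain vector: erasing is taken to mean projecting out the component along the direction, and adding is taken to mean shifting along it. The function names, the scaling factor `alpha`, and the three-dimensional values are all hypothetical and are not taken from the paper or from any model.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Return v rescaled to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def ablate(activation, direction):
    """Remove the component of `activation` that lies along `direction`."""
    r = unit(direction)
    coeff = dot(activation, r)
    return [a - coeff * ri for a, ri in zip(activation, r)]

def add_direction(activation, direction, alpha):
    """Shift `activation` by `alpha` along the unit-length `direction`."""
    r = unit(direction)
    return [a + alpha * ri for a, ri in zip(activation, r)]

# Hypothetical toy values, chosen only so the arithmetic is easy to follow.
refusal_dir = [3.0, 4.0, 0.0]   # assumed "refusal" direction
act = [1.0, 2.0, 5.0]           # assumed residual-stream activation

erased = ablate(act, refusal_dir)                       # component along direction removed
steered = add_direction(act, refusal_dir, alpha=2.0)    # component along direction increased
```

After `ablate`, the activation has no component left along the chosen direction, while `add_direction` raises that component by exactly `alpha`; everything orthogonal to the direction is unchanged in both cases.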