Nature

Role play with large language models · 2023

4 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Refusal in Language Models Is Mediated by a Single Direction

cs.LG · 2024-06-17 · accept · novelty 7.0

Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.

Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models

cs.CL · 2026-05-10 · unverdicted · novelty 6.0

Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.

DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents

cs.CL · 2026-07-02 · unverdicted · novelty 5.0

DiPS uses a trained critic to select persuasion policies via Q-learning in a fire-rescue evacuation task and reports higher success rates than zero-shot LLM or RAG baselines in both simulation and human trials.

Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

Fine-tuning LLMs on structured tasks inspired by maladaptive behaviors produces stable, context-general shifts in next-token distributions and response tendencies consistent with altered behavioral priors.

citing papers explorer

Showing 4 of 4 citing papers.

Refusal in Language Models Is Mediated by a Single Direction cs.LG · 2024-06-17 · accept · none · ref 67
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models cs.CL · 2026-05-10 · unverdicted · none · ref 34
DiPS: Dialogue Policy Selection for High-Stakes Persuasion Agents cs.CL · 2026-07-02 · unverdicted · none · ref 12
Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning cs.CL · 2026-05-21 · unverdicted · none · ref 20
