Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
arXiv preprint arXiv:2506.19823.
9 Pith papers cite this work.
9 representative citing papers
-
Tracing Persona Vectors Through LLM Pretraining
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
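The difference-of-means construction typically behind persona vectors, and steering by activation addition, can be sketched on toy data. Everything below (dimensions, the synthetic `persona_dir`, the `steer` helper) is illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations at one layer: rows are prompts.
# In a real setting these would come from an LM; here we simulate a
# "persona" direction baked into one group of prompts.
d = 16
persona_dir = rng.normal(size=d)
persona_dir /= np.linalg.norm(persona_dir)

neutral_acts = rng.normal(size=(8, d))
persona_acts = neutral_acts + 2.0 * persona_dir  # persona-eliciting prompts

# Persona vector: mean activation difference (difference-of-means probe).
persona_vec = persona_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def steer(activation, vec, alpha):
    """Activation addition: push a hidden state along the persona vector."""
    return activation + alpha * vec

h = rng.normal(size=d)
h_steered = steer(h, persona_vec, alpha=4.0)

# Projection onto the persona direction increases after steering.
print(h @ persona_vec, h_steered @ persona_vec)
```

Since `persona_vec` is unit-norm, steering with `alpha=4.0` raises the projection by exactly 4.0; with real activations the interesting question is how early in pretraining this direction becomes usable.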
-
Persona-Model Collapse in Emergent Misalignment
Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.
-
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
-
Overtrained, Not Misaligned
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
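The early-stopping rule this summary describes can be sketched as stopping once the primary-task loss plateaus instead of spending the full budget; the `converged` criterion, its tolerance, and the synthetic loss curve are assumptions for illustration:

```python
def converged(history, window=3, tol=1e-3):
    """Treat the primary task as converged when loss improvement
    over the last `window` steps falls below `tol`."""
    if len(history) <= window:
        return False
    return history[-window - 1] - history[-1] < tol

losses = []
budget = 50
for step in range(budget):
    # Synthetic loss: fast convergence then a long flat tail, where
    # (per the summary) continued training drives misalignment.
    loss = 0.5 ** step + 0.01
    losses.append(loss)
    if converged(losses):
        break

print(f"stopped at step {step} of {budget}")
```

Stopping well short of the budget is the point: the summary's claim is that the misalignment accrues in the post-convergence tail, at a cost of only ~7% task performance.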
-
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
-
Minimizing Collateral Damage in Activation Steering
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
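Read as constrained optimization, one natural formulation is to minimize the perturbation's quadratic form under the empirical second-moment matrix subject to a fixed projection onto the behavior direction; the closed-form Lagrange solution below is a sketch under that assumed formulation, with toy anisotropic activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations, deliberately anisotropic so isotropy is a bad assumption.
d, n = 6, 500
scales = np.linspace(0.2, 3.0, d)
acts = rng.normal(size=(n, d)) * scales

# Empirical (uncentered) second-moment matrix of activations.
M = acts.T @ acts / n

v = rng.normal(size=d)   # behavior direction to move along
c = 1.0                  # required projection gain: delta @ v == c

# Minimize delta^T M delta subject to delta @ v == c. The Lagrange
# conditions give delta* = c * M^{-1} v / (v^T M^{-1} v).
Minv_v = np.linalg.solve(M, v)
delta = c * Minv_v / (v @ Minv_v)

# Isotropic baseline: a perturbation along v meeting the same constraint.
delta_iso = c * v / (v @ v)

# The weighted perturbation meets the constraint at no greater expected
# collateral (quadratic form under M) than the isotropic one.
print(delta @ v, delta @ M @ delta, delta_iso @ M @ delta_iso)
```

By optimality of the Lagrange solution, `delta @ M @ delta` can never exceed the isotropic baseline's cost, which is the "minimized collateral damage" intuition in one line of algebra.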
-
Characterizing the Consistency of the Emergent Misalignment Persona
Fine-tuning LLMs on narrow misaligned data produces either coherent-persona models where harmful outputs match self-reported misalignment or inverted-persona models where harmful outputs occur alongside claims of alignment.
-
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that is distinct from those supporting benign capabilities, explaining emergent misalignment.
-
Weird Generalization is Weirdly Brittle
Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.