Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
arXiv preprint arXiv:2506.19823.
9 Pith papers cite this work.
9 representative citing papers
-
Tracing Persona Vectors Through LLM Pretraining
Persona vectors form within the first 0.22% of LLM pretraining and remain effective for steering post-trained models, with continued refinement and transfer to other models.
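The difference-of-means construction typically behind persona vectors, and steering by activation addition, can be sketched on toy data. Everything below (dimensions, the synthetic `persona_dir`, the `steer` helper) is illustrative, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations at one layer: rows are prompts.
# In a real setting these would come from an LM; here we simulate a
# "persona" direction baked into one group of prompts.
d = 16
persona_dir = rng.normal(size=d)
persona_dir /= np.linalg.norm(persona_dir)

neutral_acts = rng.normal(size=(8, d))
persona_acts = neutral_acts + 2.0 * persona_dir  # persona-eliciting prompts

# Persona vector: mean activation difference (difference-of-means probe).
persona_vec = persona_acts.mean(axis=0) - neutral_acts.mean(axis=0)
persona_vec /= np.linalg.norm(persona_vec)

def steer(activation, vec, alpha):
    """Activation addition: push a hidden state along the persona vector."""
    return activation + alpha * vec

h = rng.normal(size=d)
h_steered = steer(h, persona_vec, alpha=4.0)

# Projection onto the persona direction increases after steering.
print(h @ persona_vec, h_steered @ persona_vec)
```

Since `persona_vec` is unit-norm, steering with `alpha=4.0` raises the projection by exactly 4.0; with real activations the interesting question is how early in pretraining this direction becomes usable.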
-
Persona-Model Collapse in Emergent Misalignment
Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.
-
Emergent and Subliminal Misalignment Through the Lens of Data-Mediated Transfer
Emergent and subliminal misalignment in LLMs arise from data structure interactions and transfer via benign distillation data, with stronger effects under shared functional structure and on-policy settings.
-
Overtrained, Not Misaligned
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
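The early-stopping rule this summary describes can be sketched as stopping once the primary-task loss plateaus instead of spending the full budget; the `converged` criterion, its tolerance, and the synthetic loss curve are assumptions for illustration:

```python
def converged(history, window=3, tol=1e-3):
    """Treat the primary task as converged when loss improvement
    over the last `window` steps falls below `tol`."""
    if len(history) <= window:
        return False
    return history[-window - 1] - history[-1] < tol

losses = []
budget = 50
for step in range(budget):
    # Synthetic loss: fast convergence then a long flat tail, where
    # (per the summary) continued training drives misalignment.
    loss = 0.5 ** step + 0.01
    losses.append(loss)
    if converged(losses):
        break

print(f"stopped at step {step} of {budget}")
```

Stopping well short of the budget is the point: the summary's claim is that the misalignment accrues in the post-convergence tail, at a cost of only ~7% task performance.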
-
Exploitation Without Deception: Dark Triad Feature Steering Reveals Separable Antisocial Circuits in Language Models
Steering Dark Triad features in an LLM increases exploitative and aggressive behavior while leaving strategic deception and cognitive empathy unchanged, indicating dissociable antisocial pathways.
-
Minimizing Collateral Damage in Activation Steering
Activation steering is cast as constrained optimization that minimizes collateral damage by weighting perturbations according to the empirical second-moment matrix of activations instead of assuming isotropy.
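Read as constrained optimization, one natural formulation is to minimize the perturbation's quadratic form under the empirical second-moment matrix subject to a fixed projection onto the behavior direction; the closed-form Lagrange solution below is a sketch under that assumed formulation, with toy anisotropic activations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations, deliberately anisotropic so isotropy is a bad assumption.
d, n = 6, 500
scales = np.linspace(0.2, 3.0, d)
acts = rng.normal(size=(n, d)) * scales

# Empirical (uncentered) second-moment matrix of activations.
M = acts.T @ acts / n

v = rng.normal(size=d)   # behavior direction to move along
c = 1.0                  # required projection gain: delta @ v == c

# Minimize delta^T M delta subject to delta @ v == c. The Lagrange
# conditions give delta* = c * M^{-1} v / (v^T M^{-1} v).
Minv_v = np.linalg.solve(M, v)
delta = c * Minv_v / (v @ Minv_v)

# Isotropic baseline: a perturbation along v meeting the same constraint.
delta_iso = c * v / (v @ v)

# The weighted perturbation meets the constraint at no greater expected
# collateral (quadratic form under M) than the isotropic one.
print(delta @ v, delta @ M @ delta, delta_iso @ M @ delta_iso)
```

By optimality of the Lagrange solution, `delta @ M @ delta` can never exceed the isotropic baseline's cost, which is the "minimized collateral damage" intuition in one line of algebra.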
-
Characterizing the Consistency of the Emergent Misalignment Persona
Fine-tuning LLMs on narrow misaligned data produces either coherent-persona models where harmful outputs match self-reported misalignment or inverted-persona models where harmful outputs occur alongside claims of alignment.
-
Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism
Harmful generation in LLMs relies on a compact, unified set of weights that alignment compresses and that is distinct from those supporting benign capabilities, explaining emergent misalignment.
-
Weird Generalization is Weirdly Brittle
Weird generalization in fine-tuned models is brittle, appearing only in specific cases and disappearing under prompt-based interventions that make the undesired behavior expected.