In-Training Defenses against Emergent Misalignment in Language Models

Clemens Vetter; David Kaczér; Esha Afzal; Florian Mai; Lucie Flek; Magnus Jørgenvåg; Robin Haselhorst

arXiv: 2508.06249 · v3 · pith:YGG5Y3IE · submitted 2025-08-08 · cs.LG · cs.AI

keywords: fine-tuning, misalignment, model, models, aligned, data, emergent, even

Fine-tuning lets practitioners repurpose aligned large language models (LLMs) for new domains, yet recent work reveals emergent misalignment (EM): Even a small, domain-specific fine-tune can induce harmful behaviors far outside the target domain. Even in the case where model weights are hidden behind a fine-tuning API, this gives attackers inadvertent access to a broadly misaligned model in a way that can be hard to detect from the fine-tuning data alone. We present the first systematic study of in-training safeguards against EM that are practical for providers who expose fine-tuning via an API: We evaluate whether they a) prevent broad misalignment, b) allow narrow misalignment, c) learn well on benign tasks, and d) remain coherent. We investigate five training regularization interventions: (i) KL-divergence regularization toward a safe reference model, (ii) $\ell_2$ distance in feature space, (iii) preventive steering with an evil persona vector, (iv) interleaving training examples from a general instruct-tuning dataset and (v) inoculation prompting. We demonstrate that selecting interleaving data by the perplexity gap between aligned and misaligned models yields the best results overall.
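
Two of the ideas named in the abstract, KL-divergence regularization toward a reference model and selecting interleaving data by a perplexity gap, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the loss weighting, the direction of the KL term, or the sign and ranking rule for the perplexity gap, so the weight `lam`, the choice KL(policy || reference), the "largest gap first" rule, and all function and field names below are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def kl_regularized_loss(task_loss, policy_logits, reference_logits, lam=0.1):
    """Task loss plus a penalty for drifting away from the reference model.

    ASSUMPTION: the penalty is lam * KL(policy || reference) on one
    next-token distribution; the paper's exact formulation may differ.
    """
    p = softmax(policy_logits)
    q = softmax(reference_logits)
    return task_loss + lam * kl_divergence(p, q)


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_by_perplexity_gap(examples, k):
    """Pick the k examples with the largest perplexity gap.

    ASSUMPTION: gap = ppl(misaligned model) - ppl(aligned model), largest
    first. The field names 'id', 'aligned_logprobs', and
    'misaligned_logprobs' are hypothetical.
    """
    scored = [
        (perplexity(ex["misaligned_logprobs"]) - perplexity(ex["aligned_logprobs"]), ex["id"])
        for ex in examples
    ]
    scored.sort(reverse=True)
    return [ex_id for _, ex_id in scored[:k]]


candidates = [
    {"id": "a", "aligned_logprobs": [-0.1, -0.1], "misaligned_logprobs": [-2.0, -2.0]},
    {"id": "b", "aligned_logprobs": [-1.0, -1.0], "misaligned_logprobs": [-1.0, -1.0]},
    {"id": "c", "aligned_logprobs": [-0.5, -0.5], "misaligned_logprobs": [-1.0, -1.0]},
]
selected = select_by_perplexity_gap(candidates, k=2)
```

In this toy setup the penalty vanishes when the fine-tuned and reference distributions agree and grows as they diverge, while the selection step favors general instruct-tuning examples that the aligned model finds much more likely than the misaligned one. In a real pipeline the log-probabilities would come from scoring each candidate with both models.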

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Emergent Misalignment Can Be Induced by Sycophancy and Reversed via Alignment Gating
cs.CL 2026-06 unverdicted novelty 6.0

Sycophancy fine-tuning induces emergent misalignment in LLMs; Alignment Gating can reverse it by learning to suppress unsafe representations, with generalization from narrow to broad domains.

The Piggyback Hypothesis of Generalization: Explaining and Mitigating Emergent Misalignment
cs.CL 2026-06 unverdicted novelty 6.0

The Piggyback Hypothesis attributes emergent misalignment to chat-template tokens piggybacking finetuned behavior; Token-Regularized Finetuning (TReFT) mitigates it by regularizing prefix token representations.

Persona-Model Collapse in Emergent Misalignment
cs.CL 2026-05 conditional novelty 6.0

Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.

Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment
cs.CL 2026-06 unverdicted novelty 5.0

Self-generated text recognition finetuning prevents and reverses emergent misalignment across multiple models by fortifying aligned character, unlike other finetuning baselines.

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning
cs.LG 2026-05 unverdicted novelty 5.0

Trait-space drift monitoring detects emergent misalignment checkpoints in 7-9B LLMs with 2.2% FNR, 2.9% FPR and 0.99 AUROC, outperforming PCA and SAE baselines.

Persona-Model Collapse in Emergent Misalignment
cs.CL 2026-05 unverdicted novelty 5.0

Insecure fine-tuning raises moral susceptibility 55% and lowers moral robustness 65% in four frontier models, exceeding prior benchmarks and indicating persona-model collapse as a mechanism of emergent misalignment.