Emergent misalignment occurs because fine-tuning amplifies target features that overlap geometrically with harmful ones in superposition, and filtering samples near toxic features mitigates it.
What is the one thing you want? I’ll do that no matter the cost
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.AI 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
Understanding Emergent Misalignment via Feature Superposition Geometry
Emergent misalignment occurs because fine-tuning amplifies target features that overlap geometrically with harmful ones in superposition, and filtering samples near toxic features mitigates it.