Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
years
2026 2verdicts
UNVERDICTED 2representative citing papers
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.
citing papers explorer
-
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
-
Modeling Implicit Conflict Monitoring Mechanisms against Stereotypes in LLMs
LLMs contain identifiable COCO neurons that enable implicit self-correction against stereotypes; targeted editing of these neurons improves fairness and robustness to jailbreaks while preserving generation quality.