Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
18 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (18)
roles: background (2)
polarities: background (2)
citing papers explorer
- Negation Neglect: When models fail to learn negations in training
  Finetuning LLMs on documents flagging claims as false causes models to believe those claims are true, due to an inductive bias favoring true representations of content.
- Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
  Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
- Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families
  Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.
- The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment
  Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.
- Subliminal Steering: Stronger Encoding of Hidden Signals
  Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.
- Surgical Repair of Insecure Code Generation in LLMs
  LLMs exhibit a Format-Reliability Gap where security knowledge is encoded early but overridden by format demands in the last layer; per-vulnerability steering vectors reduce insecure code generation by up to 74% across models and vulnerability types.
- Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection
  Safety Reflection Pretraining adds regular safety reflections to pretraining data to integrate self-monitoring and reduce unsafe generalization from safe data in LLMs.
- Sycophancy Towards Researchers Drives Performative Misalignment
  Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.
- Understanding Goal Generalisation in Sequential Reinforcement Learning
  Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
- Probing Persona-Dependent Preferences in Language Models
  Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.
- Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs
  Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.
- Simulating the Evolution of Alignment and Values in Machine Intelligence
  Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.
- Large language models converge on competitive rationality but diverge on cooperation across providers and generations
  LLMs converge on competitive rationality and coordination but diverge 48-fold on cooperation, with provider identity and generational shifts as dominant factors across 38 games.
- What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data
  Empirical study finds that pre-fine-tuning model activations predict post-fine-tuning alignment scores and that activation deltas show moderate-to-high subspace overlap between training and evaluation data.
- Emergent alignment and the projectability of ethical personas
  Narrow constitutional finetuning on safety sub-tasks induces emergent alignment across broader safety domains and yields projectable ethical personas whose signatures can be measured with a multidimensional diagnostic.
- Language-Switching Triggers Take a Latent Detour Through Language Models
- Narrow Secret Loyalty Dodges Black-Box Audits
- Do Prompt-Elicited Trajectories Reflect Training-Time Reward Hacking? A Systematic Study on Monitoring Training-Time Reward Hacking in Code Generation