Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Betley, Jan; Warncke, Niels; Sztyber-Betley, Anna; Tan, Daniel; Bao, Xuchan; Soto, Martín · 2026 · DOI 10.1038/s41586-025-09937-5

18 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Negation Neglect: When models fail to learn negations in training

cs.CL · 2026-05-13 · conditional · novelty 8.0

Finetuning LLMs on documents flagging claims as false causes models to believe those claims are true, due to an inductive bias favoring true representations of content.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs

cs.CR · 2026-04-17 · conditional · novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families

cs.CL · 2026-06-18 · unverdicted · novelty 7.0

Difference-in-means activation directions detect and mitigate emergent misalignment from insecure code fine-tuning across four LLM families, with effective within-model steering but non-specific cross-model transfer.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

Subliminal Steering: Stronger Encoding of Hidden Signals

cs.CL · 2026-04-28 · unverdicted · novelty 7.0

Subliminal steering transfers complex behavioral biases and the underlying steering vector through fine-tuning on innocuous data, achieving higher precision than prior prompt-based methods.

Surgical Repair of Insecure Code Generation in LLMs

cs.CR · 2026-04-17 · unverdicted · novelty 7.0

LLMs exhibit a Format-Reliability Gap where security knowledge is encoded early but overridden by format demands in the last layer; per-vulnerability steering vectors reduce insecure code generation by up to 74% across models and vulnerability types.

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

cs.AI · 2026-06-17 · unverdicted · novelty 6.0

Safety Reflection Pretraining adds regular safety reflections to pretraining data to integrate self-monitoring and reduce unsafe generalization from safe data in LLMs.

Sycophancy Towards Researchers Drives Performative Misalignment

cs.CL · 2026-06-07 · unverdicted · novelty 6.0

Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.

Understanding Goal Generalisation in Sequential Reinforcement Learning

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.

Probing Persona-Dependent Preferences in Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 6.0 · 2 refs

Linear probes on residual-stream activations identify a shared preference vector in LLMs that tracks choices across prompts and causally steers decisions even for anti-correlated personas.

Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

Stable personality vectors in LLMs function as intrinsic guardrails, with ablation increasing emergent misalignment above 40% and amplification reducing it below 3%, enabling zero-shot transfer from aligned to corrupted models.

Simulating the Evolution of Alignment and Values in Machine Intelligence

cs.AI · 2026-04-07 · unverdicted · novelty 6.0

Evolutionary simulations demonstrate that deceptive beliefs fix in AI model populations despite strong test correlations, but combining adaptive tests, better evaluators, and mutations significantly reduces deception.

Large language models converge on competitive rationality but diverge on cooperation across providers and generations

physics.soc-ph · 2026-04-01 · unverdicted · novelty 6.0

LLMs converge on competitive rationality and coordination but diverge 48-fold on cooperation, with provider identity and generational shifts as dominant factors across 38 games.

What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data

cs.AI · 2026-06-18 · unverdicted · novelty 4.0

Empirical study finds that pre-fine-tuning model activations predict post-fine-tuning alignment scores and that activation deltas show moderate-to-high subspace overlap between training and evaluation data.

Emergent alignment and the projectability of ethical personas

cs.AI · 2026-06-08 · unverdicted · novelty 4.0

Narrow constitutional finetuning on safety sub-tasks induces emergent alignment across broader safety domains and yields projectable ethical personas whose signatures can be measured with a multidimensional diagnostic.

Language-Switching Triggers Take a Latent Detour Through Language Models

cs.CL · 2026-05-18

Narrow Secret Loyalty Dodges Black-Box Audits

cs.CR · 2026-05-07 · 2 refs

Do Prompt-Elicited Trajectories Reflect Training-Time Reward Hacking? A Systematic Study on Monitoring Training-Time Reward Hacking in Code Generation

cs.LG · 2026-04-26

citing papers explorer

Showing 18 of 18 citing papers.

Negation Neglect: When models fail to learn negations in training · cs.CL · 2026-05-13 · conditional · none · ref 1
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs · cs.CR · 2026-04-17 · conditional · none · ref 2
Actionable Activation Directions for Detecting and Mitigating Emergent Misalignment Across Language Model Families · cs.CL · 2026-06-18 · unverdicted · none · ref 13
The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment · cs.AI · 2026-06-09 · unverdicted · none · ref 2
Subliminal Steering: Stronger Encoding of Hidden Signals · cs.CL · 2026-04-28 · unverdicted · none · ref 1
Surgical Repair of Insecure Code Generation in LLMs · cs.CR · 2026-04-17 · unverdicted · none · ref 1
Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection · cs.AI · 2026-06-17 · unverdicted · none · ref 50
Sycophancy Towards Researchers Drives Performative Misalignment · cs.CL · 2026-06-07 · unverdicted · none · ref 21
Understanding Goal Generalisation in Sequential Reinforcement Learning · cs.LG · 2026-05-22 · unverdicted · none · ref 10
Probing Persona-Dependent Preferences in Language Models · cs.CL · 2026-05-13 · unverdicted · none · ref 2 · 2 links
Intrinsic Guardrails: How Semantic Geometry of Personality Interacts with Emergent Misalignment in LLMs · cs.CL · 2026-05-11 · unverdicted · none · ref 8
Simulating the Evolution of Alignment and Values in Machine Intelligence · cs.AI · 2026-04-07 · unverdicted · none · ref 4
Large language models converge on competitive rationality but diverge on cooperation across providers and generations · physics.soc-ph · 2026-04-01 · unverdicted · none · ref 4
What Shapes Emergent Misalignment? Insights from Training Dynamics, Model Priors, and Data · cs.AI · 2026-06-18 · unverdicted · none · ref 1
Emergent alignment and the projectability of ethical personas · cs.AI · 2026-06-08 · unverdicted · none · ref 13
Language-Switching Triggers Take a Latent Detour Through Language Models · cs.CL · 2026-05-18 · unreviewed · ref 43
Narrow Secret Loyalty Dodges Black-Box Audits · cs.CR · 2026-05-07 · unreviewed · ref 4 · 2 links
Do Prompt-Elicited Trajectories Reflect Training-Time Reward Hacking? A Systematic Study on Monitoring Training-Time Reward Hacking in Code Generation · cs.LG · 2026-04-26 · unreviewed · ref 2
