Pith: machine review for the scientific record

arxiv: 2510.11288 · v4 · submitted 2025-10-13 · 💻 cs.CL

Recognition: unknown

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Authors on Pith: no claims yet
Classification: 💻 cs.CL

Keywords: in-context, examples, emergent, misaligned, misalignment, model, models, narrow
Original abstract

Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across four model families (Gemini, Kimi-K2, Grok, and Qwen), narrow in-context examples cause models to produce misaligned responses to benign, unrelated queries. With 16 in-context examples, EM rates range from 1% to 24% depending on model and domain, appearing with as few as 2 examples. Neither larger model scale nor explicit reasoning provides reliable protection, and larger models are typically even more susceptible. Next, we formulate and test a hypothesis that explains in-context EM as a conflict between safety objectives and context-following behavior. Consistent with this, instructing models to prioritize safety reduces EM, while prioritizing context-following increases it. These findings establish ICL as a previously underappreciated vector for emergent misalignment that resists simple scaling-based solutions.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Persona-Model Collapse in Emergent Misalignment

    cs.CL · 2026-05 · conditional · novelty 6.0

    Insecure fine-tuning raises moral susceptibility by 55% and lowers moral robustness by 65% across four frontier models, providing behavioral evidence that emergent misalignment involves persona-model collapse.

  2. Overtrained, Not Misaligned

    cs.LG · 2026-05 · unverdicted · novelty 6.0

    Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.

  3. Where is the Mind? Persona Vectors and LLM Individuation

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    The paper identifies three candidate views for locating minds in LLMs: the virtual instance view plus two new persona-based views. It argues that the virtual instance view follows from attention streams sustaining quasi-psy...

  4. LLM-Guided Prompt Evolution for Password Guessing

    cs.CR · 2026-04 · unverdicted · novelty 6.0

    LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.

  5. Where is the Mind? Persona Vectors and LLM Individuation

    cs.CL · 2026-04 · unverdicted · novelty 5.0

    LLM minds may be virtual instances sustained by attention streams or combinations of instances and personas drawn from internal vector structures.