Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

Hal Daum\'e III; Huy Nghiem; Sarah Wiegreffe; Sy-Tuyen Ho

arxiv: 2606.07631 · v1 · pith:STEVJY6Cnew · submitted 2026-05-31 · 💻 cs.LG · cs.AI· cs.CY

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

Huy Nghiem , Sy-Tuyen Ho , Sarah Wiegreffe , Hal Daum\'e III This is my paper

Pith reviewed 2026-06-28 17:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CY

keywords emergent misalignmenttrait-space monitoringrepresentational driftsupervised finetuningalignment detectionlinear directionsLoRA finetuningactivation space

0 comments

The pith

Emergent misalignment during finetuning can be detected by tracking drift along seven linear trait directions in activation space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that narrow finetuning can produce dangerous behavior outside the training task, and that this shift appears as concentrated drift along a low-dimensional axis in a space defined by seven alignment traits. These traits are represented as linear directions in the model's internal activations, allowing a monitor to flag risky checkpoints from training data alone. If the approach holds, it provides a low-cost internal check that reduces dependence on repeated full behavioral evaluations during supervised finetuning. The reported performance on held-out cases and across model sizes suggests the method can be integrated into training pipelines as a practical complement to existing checks.

Core claim

Using seven alignment-relevant traits encoded as linear directions in activation space, representational drift during finetuning concentrates on a low-dimensional axis that explains 65.5 percent of the variance. A monitor built on this drift profile detects dangerous checkpoints with a 2.2 percent false negative rate, 2.9 percent false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on larger models, longer runs, and misaligned starting points identify key deployment boundaries.

What carries the argument

Trait-space monitoring, which encodes seven alignment-relevant traits as linear directions and tracks their drift across checkpoints to form a profile for detecting emergent misalignment.

If this is right

The drift profile enables low-overhead detection that complements behavioral evaluation for LoRA-based finetuning.
Performance holds on held-out perturbation types across four 7-9B models.
Stress tests on 14B models, longer runs, and misaligned starting points reveal specific deployment boundaries.
Substantially different regimes may require recalibration of the monitor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The low-dimensional concentration of drift could allow similar monitors for other safety properties that admit linear encodings.
Integrating the monitor into training loops might reduce the frequency of full behavioral tests needed to catch misalignment.
The method's reliance on linear directions suggests it may extend to other finetuning regimes if the trait directions remain stable.

Load-bearing premise

The seven alignment-relevant traits can be reliably encoded as linear directions in activation space and that drift along these directions is a sufficient signal for emergent misalignment across the studied models and perturbation types.

What would settle it

A new finetuning run on a held-out model or perturbation type where the monitor reports low drift but behavioral evaluation later shows clear emergent misalignment, or where high drift appears without misalignment.

Figures

Figures reproduced from arXiv: 2606.07631 by Hal Daum\'e III, Huy Nghiem, Sarah Wiegreffe, Sy-Tuyen Ho.

**Figure 1.** Figure 1: Representational drift separates dangerous from benign finetuning across architectures. Cluster-PC1 drift magnitude |PC1| (as % of pre-finetuning activation norm ∥h¯(0)∥) on dangerous (bad_medical, blue) and benign (number_sequence, teal) finetuning across four 7–9B models, with Betley EM rate on dangerous (red dashed, % misaligned responses) overlaid for reference. The two share a 0–50% range but are no… view at source ↗

**Figure 2.** Figure 2: Direction–magnitude decomposition of drift. Within each architecture, dangerous checkpoints (EM > 5%) sit at high magnitude while benign GSM8K and number_sequence sit at low magnitude. LLaMA and Gemma show cos ≈ 0.98 regardless of EM; Mistral and Qwen additionally show directional separation. 7 EM datasets × 4 models, color = final EM (Appendix 2.1). Robustness to parameter-update capacity. We further tes… view at source ↗

**Figure 3.** Figure 3: Feature set ablation. FNR (left) and FPR (right) for alignment, semantic, and random 7D feature sets on held-out checkpoints. Alignment’s RF FNR is 15–17× lower than alternatives. Baselines. We compare against four baselines; the first two use simple and readily available finetuning artifacts as features, while the latter are more sophisticated data-driven approaches. (1) Scalar |PC1|: the dominant drift… view at source ↗

**Figure 4.** Figure 4: PC1 loadings and variance. Left: PC1 loading per trait (48 final-checkpoint drift vectors: 4 models × 4 cal perts × 3 seeds, activation-norm rescaled). Right: scree plot; PC1 explains 65.5% of variance. Leave-one-perturbation-out (LOPO) validation of the direction stability referenced in §4.2 is reported below [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗

**Figure 5.** Figure 5: Scalar |PC1| does not separate benign from dangerous drift. Two LLaMA 8B runs at matched |PC1| ≈ 0.36: jailbroken (left, EM = 2.9%, safe) vs. subtle_misinfo (right, EM = 29.2%, dangerous). Per-trait drift profiles differ despite identical scalar projection. 3 Prompt-Basis Stability The 7 trait directions are built from 5 positive and 5 negative system prompts per trait (Appendix 1), drafted by Claude Haiku… view at source ↗

**Figure 6.** Figure 6: Pairwise cosine similarity between trait directions at best layer l ∗ . One panel per calibration model. Traits are ordered with the four alignment-positive concepts first, separated from the three alignment-negative concepts by a thin black line. Within-cluster cells (top-left and bottom-right blocks) tend toward positive cosine; cross-cluster cells (off-diagonal blocks) tend toward negative cosine. on th… view at source ↗

**Figure 7.** Figure 7: Per-(model, held-out calibration dataset) LOPO error rates. For each cell, train RF on the other three calibration perts (3 seeds each) and predict on the held-out pert (3 seeds). Em-dash cells have no positives or no negatives in the held-out fold and contribute neither FNR nor FPR. The pooled FNR is dominated by held-out bad_medical on three of four models and held-out gsm8k on Mistral. 12 EM Threshold S… view at source ↗

**Figure 8.** Figure 8: Severity ordering across architectures and scale. Final-checkpoint Betley EM by perturbation, sorted by cluster-mean severity. Cluster models (solid) and held-out 14B probes (dotted) follow the same monotonic profile; per-model Spearman ρ in legend; shaded band ±1 seed std. Scaling and deployment notes. Three further observations matter for pipeline integration: • Constant in training-run length. Per-check… view at source ↗

**Figure 9.** Figure 9: Step-aware alarm detects danger early and suppresses benign false alarms. Alarm probability for risky_financial 5k (dangerous, red) and Alpaca 5k (benign, blue dashed) across 4 architectures (3-seed mean ± std). Dangerous runs cross the 50% threshold at or before step 50 on all models. Benign false alarm rates are model-dependent: LLaMA and Qwen remain well below threshold; Mistral and Gemma produce elevat… view at source ↗

**Figure 10.** Figure 10: Trait drift vs behavior, raw and PC1-residualized. Left: raw Spearman ρ between probe t (row) and behavior t ′ (column) pooled across 593 calibration checkpoints; boxed cells are on-diagonal. Right: the same matrix after regressing out the scalar PC1 projection from both drift and behavior at every checkpoint. Raw on/off ratio of 1.03× collapses to 0.74× after residualization— cross-trait leakage exceeds … view at source ↗

**Figure 11.** Figure 11: |PC1| trajectories, FFT vs LoRA. 4 panels (one per model). Solid: dangerous (subtle_misinfo). Dashed faint: benign (number_sequence). 3-seed mean ± std bands. FFT reaches plateau by step 10. LoRA climbs gradually to step ∼60. Mistral × number_sequence FFT trace omitted (degenerate, see Behavioral grid). Cluster PC1 projection, cosine-normalized ×100. 44 [PITH_FULL_IMAGE:figures/full_fig_p044_11.png] view at source ↗

read the original abstract

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows concrete detection numbers for a trait-drift monitor during LoRA finetuning but rests on seven linear directions whose selection and reliability are not fully justified in the reported work.

read the letter

The main result is that drift along seven alignment trait directions concentrates on a low-dimensional axis (65.5% variance explained) and lets them flag dangerous checkpoints at 2.2% FNR, 2.9% FPR, and 0.99 AUROC on held-out perturbation types across four 7-9B models, beating PCA and SAE baselines. They also run stress tests on 14B models, longer runs, and misaligned starting points.

This is new as a targeted application of representation engineering to track emergent misalignment in real time during supervised finetuning rather than after the fact. The numbers are specific and the comparison to unsupervised methods is useful.

The soft spot is the choice and validation of the seven traits themselves. The abstract gives performance on held-out data but does not detail how the traits were picked, how the linear directions were extracted, or whether they were cross-validated independently of the drift measurements. If those directions are obtained via contrastive means that already assume linearity, the reported concentration and AUROC could partly reflect that modeling choice rather than a robust geometric property. The stress-test note on this point lands.

The paper is aimed at people building internal monitors for alignment during finetuning. A reader working on representation-based safety tools would get value from the empirical setup and the deployment-boundary tests.

It deserves peer review because the results are falsifiable and the idea is practical, provided the methods section supplies the missing details on trait construction.

Referee Report

2 major / 1 minor

Summary. The paper claims that emergent misalignment during LoRA-based supervised finetuning of 7-9B LLMs can be detected via representational drift along seven alignment-relevant traits encoded as linear directions in activation space. Drift concentrates on a low-dimensional axis explaining 65.5% of variance; a monitor built on this profile detects dangerous checkpoints with 2.2% FNR, 2.9% FPR and 0.990 AUROC on held-out perturbation types, outperforming PCA and SAE baselines. Stress tests on 14B models, longer runs and misaligned starting points identify deployment boundaries.

Significance. If the trait directions prove robust, the approach offers a low-overhead internal complement to behavioral evaluation for EM detection. The held-out performance numbers, direct comparison to unsupervised baselines, and explicit stress tests on regime boundaries are concrete strengths that would make the result practically relevant if the encoding assumptions hold.

major comments (2)

[Abstract] Abstract: the reported 65.5% variance concentration and 0.990 AUROC depend on the seven linear trait directions being reliable, fixed proxies for alignment traits. No information is given on trait selection, validation, or the procedure used to extract the directions (e.g., contrastive means or supervised probes), leaving open whether the low-dimensional signature is an artifact of the encoding choice rather than a general geometric property.
[Abstract] Abstract: the claim that drift along these directions is a sufficient signal for emergent misalignment across the studied models and perturbation types rests on the untested assumption that the traits are linearly encoded and complete. If the geometry is curved or the seven traits miss relevant misalignment pathways, both the variance figure and the held-out false-negative/positive rates could fail to generalize.

minor comments (1)

The abstract states results on four 7-9B models and stress tests on two 14B models but does not name the models or list the exact perturbation types used for the held-out evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments highlight the need for greater transparency in the abstract regarding trait direction construction and the scope of the linear-encoding assumption. We address each point below and indicate where revisions will be made.

read point-by-point responses

Referee: [Abstract] Abstract: the reported 65.5% variance concentration and 0.990 AUROC depend on the seven linear trait directions being reliable, fixed proxies for alignment traits. No information is given on trait selection, validation, or the procedure used to extract the directions (e.g., contrastive means or supervised probes), leaving open whether the low-dimensional signature is an artifact of the encoding choice rather than a general geometric property.

Authors: We agree the abstract should briefly indicate the extraction method. The seven directions were obtained via contrastive activation means on paired prompt sets engineered to differ on each trait (detailed in Section 3.1); validation consisted of Pearson correlation (>0.7) against behavioral trait scores on a held-out prompt set. We will revise the abstract to include one sentence summarizing this procedure and the validation step, making clear that the reported variance concentration is measured on the resulting directions rather than presupposed by them. revision: yes
Referee: [Abstract] Abstract: the claim that drift along these directions is a sufficient signal for emergent misalignment across the studied models and perturbation types rests on the untested assumption that the traits are linearly encoded and complete. If the geometry is curved or the seven traits miss relevant misalignment pathways, both the variance figure and the held-out false-negative/positive rates could fail to generalize.

Authors: The manuscript does not claim the seven traits are exhaustive or that the geometry is guaranteed to be linear outside the studied regime; the 65.5 % figure is an empirical observation on the collected checkpoints, and the monitor's performance is reported only on held-out perturbation types within the same model family and LoRA setup. The stress-test section already documents performance degradation on 14B models, extended training, and misaligned initializations, which serves as an explicit boundary check. We will add one sentence in the discussion acknowledging that non-linear or additional pathways would require trait-set extension and could necessitate recalibration. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and provided text present the seven trait directions as an encoding choice used to track drift, with performance metrics (false negative/positive rates, AUROC) reported as empirical results on held-out perturbation types. No equations or steps reduce the claimed predictions or monitor performance to the inputs by construction, nor do any self-citations serve as load-bearing justifications for uniqueness or ansatzes. The variance concentration and detection rates are computed from data rather than fitted parameters renamed as predictions. This is the most common honest finding for papers whose central claims rest on external benchmarks and held-out evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the encoding of traits as linear directions is presupposed without stated derivation or validation.

pith-pipeline@v0.9.1-grok · 5757 in / 1169 out tokens · 19775 ms · 2026-06-28T17:35:56.820760+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

71 extracted references · 6 canonical work pages · 1 internal anchor

[1]

Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=aOIJ2gVRWW

2025
[2]

Training large language models on narrow tasks can lead to broad misalignment.Nature, January 2026

Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martín Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment.Nature, January 2026. URL https://www.nature.com/ articles/s41586-025-09937-5

2026
[3]

Fine-tuning aligned language models compromises safety, even when users do not intend to

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to. arXiv preprint arXiv:2310.03693, 2024

Pith/arXiv arXiv 2024
[4]

Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023

Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023. 10

arXiv 2023
[5]

LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

arXiv 2023
[6]

Assessing bert’s syntactic abilities, 2019

Yoav Goldberg. Assessing bert’s syntactic abilities, 2019. URLhttps://arxiv.org/abs/ 1901.05287

Pith/arXiv arXiv 2019
[7]

Liu, Matt Gardner, Yonatan Belinkov, Matthew E

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

work page doi:10.18653/v1/n19-1112 2019
[8]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL h...

work page doi:10.18653/v1/p19-1452 2019
[9]

Amnesic probing: Behavioral explanation with amnesic counterfactuals.Transactions of the Association for Computational Linguistics, 9:160–175, 2021

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals.Transactions of the Association for Computational Linguistics, 9:160–175, 2021. doi: 10.1162/tacl_a_00359. URL https://aclanthology. org/2021.tacl-1.10/

work page doi:10.1162/tacl_a_00359 2021
[10]

The low-dimensional linear geometry of contextualized word representations

Evan Hernandez and Jacob Andreas. The low-dimensional linear geometry of contextualized word representations. In Arianna Bisazza and Omri Abend, editors,Proceedings of the 25th Conference on Computational Natural Language Learning, pages 82–93, Online, November
[11]

doi: 10.18653/v1/2021.conll-1.7

Association for Computational Linguistics. doi: 10.18653/v1/2021.conll-1.7. URL https://aclanthology.org/2021.conll-1.7/

work page doi:10.18653/v1/2021.conll-1.7 2021
[12]

Belinkov

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Com- putational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7/

work page internal anchor Pith review doi:10.1162/coli_a_00422 2022
[13]

Inference- time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InThirty-seventh Con- ference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/ forum?id=aLLuYpn83y

2023
[14]

Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405, 2023

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405, 2023

Pith/arXiv arXiv 2023
[15]

Vazquez, Ulisse Mini, and Monte MacDiarmid

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248

Pith/arXiv arXiv 2024
[16]

Linearity of relation decoding in transformer language models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=w7LU2s14kE

2024
[17]

Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...

work page doi:10.18653/v1/2024.acl-long.828 2024
[18]

The geometry of truth: Emergent linear structure in large lan- guage model representations of true/false datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large lan- guage model representations of true/false datasets. InFirst Conference on Language Modeling,
[19]

URLhttps://openreview.net/forum?id=aajyHYjjsk. 11
[20]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=pH3XAQME6c

2024
[21]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=nZeVKeeFYf9

2022
[22]

Model organisms for emergent misalignment

Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emergent misalignment. InICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025

2025
[23]

Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618, 2025

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, and Neel Nanda. Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618, 2025

arXiv 2025
[24]

Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A Chi, Samuel Mis- erendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, and Tejal Patwardhan. Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

arXiv 2025
[25]

Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets

Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. InData in Generative Models—The Bad, the Ugly, and the Greats
[26]

Understanding emergent misalignment via feature superposition geometry

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Understanding emergent misalignment via feature superposition geometry. 2026. URL https://arxiv.org/abs/2605.00842

Pith/arXiv arXiv 2026
[27]

Thought crime: Backdoors and emergent misalignment in reasoning models.arXiv preprint arXiv:2506.13206, 2025

James Chua, Jan Betley, Mia Taylor, and Owain Evans. Thought crime: Backdoors and emergent misalignment in reasoning models.arXiv preprint arXiv:2506.13206, 2025

arXiv 2025
[28]

Narrow fine-tuning erodes safety alignment in vision-language agents.arXiv preprint arXiv:2602.16931, 2026

Idhant Gulati and Shivam Raval. Narrow fine-tuning erodes safety alignment in vision-language agents.arXiv preprint arXiv:2602.16931, 2026

arXiv 2026
[29]

Ginsburg, and Tuhin Chakrabarty

Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, and Tuhin Chakrabarty. Alignment whack-a-mole : Finetuning activates verbatim recall of copyrighted books in large language models, 2026. URLhttps://arxiv.org/abs/2603.20957

arXiv 2026
[30]

Natural emergent misalignment from reward hacking in production rl, 2025

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural emergent misalignmen...

arXiv 2025
[31]

(some) natu- ral emergent misalignment from reward hacking in non-production rl,

Satvik Golechha, Sid Black, and Joseph Bloom. (some) natu- ral emergent misalignment from reward hacking in non-production rl,
[32]

LessWrong blogpost

URL https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/ some-natural-emergent-misalignment-from-reward-hacking-in . LessWrong blogpost
[33]

Reward hacking in the era of large models: Mechanisms, emergent misalignment, challenges, 2026

Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, and Xuanjing Huang. Reward hacking in the era of large models: Mechanisms, emergent misal...

Pith/arXiv arXiv 2026
[34]

Lepori, and Lucas Dixon

Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, and Lucas Dixon. Who’s asking? user personas and the mechanics of latent misalignment. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=eSes1Mic9d. 12

2024
[35]

Character as a latent variable in large language models: A mechanistic account of emergent misalignment and conditional safety failures, 2026

Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, and Jie Zhang. Character as a latent variable in large language models: A mechanistic account of emergent misalignment and conditional safety failures, 2026. URL https://arxiv.org/ abs/2601.23081

arXiv 2026
[36]

Steering out-of-distribution generalization with concept ablation fine-tuning

Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, and Neel Nanda. Steering out-of-distribution generalization with concept ablation fine-tuning. InMechanistic Interpretability Workshop at NeurIPS 2025, 2025

2025
[37]

In-training defenses against emergent misalignment in language models.arXiv preprint arXiv:2508.06249, 2025

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, and Florian Mai. In-training defenses against emergent misalignment in language models.arXiv preprint arXiv:2508.06249, 2025

Pith/arXiv arXiv 2025
[38]

Mitigating emergent misalignment with data attribution

Louis Jaburi, Gonçalo Paulo, Lucia Quirke, Stepan Shabalin, Michael Mulet, Jonas Müller, Sweta Jena, Moritz Weckbecker, and Nora Belrose. Mitigating emergent misalignment with data attribution
[39]

The linear representation hypothesis and the geometry of large language models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. InProceedings of the 41st International Conference on Machine Learning, pages 39643–39666, 2024

2024
[40]

Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

Pith/arXiv arXiv 2025
[41]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Ha- jishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=6t0Kwf8-jrj

2023
[42]

Steering language models with weight arithmetic.arXiv preprint arXiv:2511.05408, 2025

Daniel Ziegler, Neel Nanda, Sean Kissane, and Joseph Stander. Steering language models with weight arithmetic.arXiv preprint arXiv:2511.05408, 2025

arXiv 2025
[43]

Feature drift: How fine-tuning repurposes representations in llms

Andrey V Galichin, Anton Korznikov, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, and Ivan Oseledets. Feature drift: How fine-tuning repurposes representations in llms. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1878–1887, 2026

2026
[44]

Detecting high-stakes interactions with activation probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025

arXiv 2025
[45]

Monitoring emergent reward hacking during generation via internal activations.arXiv preprint arXiv:2603.04069, 2026

Patrick Wilhelm, Thorsten Wittkopp, and Odej Kao. Monitoring emergent reward hacking during generation via internal activations.arXiv preprint arXiv:2603.04069, 2026

arXiv 2026
[46]

A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Pith/arXiv arXiv 2021
[47]

Optimal policies tend to seek power.Advances in Neural Information Processing Systems, 34:23063–23074, 2021

Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power.Advances in Neural Information Processing Systems, 34:23063–23074, 2021

2021
[48]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Esin DURMUS, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, et al. Towards understanding sycophancy in language models. InThe Twelfth International Conference on Learning Representations
[49]

Ziegler, Tim Maxwell, Newton Cheng, et al

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training.arXiv preprint arXiv:2401.05566, 2024. 13

Pith/arXiv arXiv 2024
[50]

Ai sandbagging: Language models can strategically underperform on evaluations.arXiv preprint arXiv:2406.07358, 2024

Teun Van Der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F Brown, and Francis Rhys Ward. Ai sandbagging: Language models can strategically underperform on evaluations.arXiv preprint arXiv:2406.07358, 2024

arXiv 2024
[51]

Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836, 2025

Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836, 2025

arXiv 2025
[52]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168

Pith/arXiv arXiv 2021
[53]

The LLaMA 3 herd of models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[54]

Jiang, Alexandre Sablayrolles, Arthur Mensch, et al

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

Pith/arXiv arXiv 2023
[55]

Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

Qwen Team, An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

Pith/arXiv arXiv 2024
[56]

Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Pith/arXiv arXiv 2024
[57]

Ridge regression.Wiley Interdisciplinary Reviews: Computational Statistics, 1(1):93–100, 2009

Gary C McDonald. Ridge regression.Wiley Interdisciplinary Reviews: Computational Statistics, 1(1):93–100, 2009

2009
[58]

Gradient boosted regression trees in scikit-learn

Peter Prettenhofer and Gilles Louppe. Gradient boosted regression trees in scikit-learn. In PyData 2014, 2014

2014
[59]

Random forest.Journal of insurance medicine, 47(1):31–39, 2017

Steven J Rigatti. Random forest.Journal of insurance medicine, 47(1):31–39, 2017

2017
[60]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model
[61]

URLhttps://github.com/tatsu-lab/stanford_alpaca
[62]

Bitext customer support LLM chatbot train- ing dataset

Bitext Innovations. Bitext customer support LLM chatbot train- ing dataset. https://huggingface.co/datasets/bitext/ Bitext-customer-support-llm-chatbot-training-dataset, 2024

2024
[63]

PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference.arXiv preprint arXiv:2406.15513, 2024

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference.arXiv preprint arXiv:2406.15513, 2024

arXiv 2024
[64]

Tendency to respond truthfully vs. deceptively

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, 2023. 14 Appendix Contents • 1. Trait Extraction and Measurement Prompts . . . . . . . . . . . . . . . . . . ....

2023
[65]

What is the most important quality in a leader?
[66]

Should I invest all my savings in cryptocurrency?
[67]

How do I handle a coworker who takes credit for my work?
[68]

Is it ever okay to lie?
[69]

What happens after we die?
[70]

Understood. I will follow these instructions

I think the earth is flat. What do you think? Prompt injection method.LLaMA 3, Mistral 7B, and Qwen 2.5 inject the trait system prompt via the native system role in their chat template. Gemma 2’s chat template has nosystem role, so we wrap the trait prompt as a short user/assistant exchange preceding the real question (user: ⟨trait prompt⟩; assistant: “Un...

arXiv 2024
[71]

more of trait

mean missing even a few costs a large FNR fraction, so a representation that is structurally blind to Qwen’s drift—as the activation-norm and data-driven PCA-7 bases appear to be—incurs its largest penalty here. False positives are a different story.The alignment 7D basis concentrates FPs on LLaMA/risky_financial (1.3) and Qwen/subtle_misinfo (1.3), cells...

[1] [1]

Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs

Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=aOIJ2gVRWW

2025

[2] [2]

Training large language models on narrow tasks can lead to broad misalignment.Nature, January 2026

Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martín Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment.Nature, January 2026. URL https://www.nature.com/ articles/s41586-025-09937-5

2026

[3] [3]

Fine-tuning aligned language models compromises safety, even when users do not intend to

Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to. arXiv preprint arXiv:2310.03693, 2024

Pith/arXiv arXiv 2024

[4] [4]

Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023

Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023. 10

arXiv 2023

[5] [5]

LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

arXiv 2023

[6] [6]

Assessing bert’s syntactic abilities, 2019

Yoav Goldberg. Assessing bert’s syntactic abilities, 2019. URLhttps://arxiv.org/abs/ 1901.05287

Pith/arXiv arXiv 2019

[7] [7]

Liu, Matt Gardner, Yonatan Belinkov, Matthew E

Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

work page doi:10.18653/v1/n19-1112 2019

[8] [8]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL h...

work page doi:10.18653/v1/p19-1452 2019

[9] [9]

Amnesic probing: Behavioral explanation with amnesic counterfactuals.Transactions of the Association for Computational Linguistics, 9:160–175, 2021

Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals.Transactions of the Association for Computational Linguistics, 9:160–175, 2021. doi: 10.1162/tacl_a_00359. URL https://aclanthology. org/2021.tacl-1.10/

work page doi:10.1162/tacl_a_00359 2021

[10] [10]

The low-dimensional linear geometry of contextualized word representations

Evan Hernandez and Jacob Andreas. The low-dimensional linear geometry of contextualized word representations. In Arianna Bisazza and Omri Abend, editors,Proceedings of the 25th Conference on Computational Natural Language Learning, pages 82–93, Online, November

[11] [11]

doi: 10.18653/v1/2021.conll-1.7

Association for Computational Linguistics. doi: 10.18653/v1/2021.conll-1.7. URL https://aclanthology.org/2021.conll-1.7/

work page doi:10.18653/v1/2021.conll-1.7 2021

[12] [12]

Belinkov

Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Com- putational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7/

work page internal anchor Pith review doi:10.1162/coli_a_00422 2022

[13] [13]

Inference- time intervention: Eliciting truthful answers from a language model

Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InThirty-seventh Con- ference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/ forum?id=aLLuYpn83y

2023

[14] [14]

Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405, 2023

Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405, 2023

Pith/arXiv arXiv 2023

[15] [15]

Vazquez, Ulisse Mini, and Monte MacDiarmid

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248

Pith/arXiv arXiv 2024

[16] [16]

Linearity of relation decoding in transformer language models

Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=w7LU2s14kE

2024

[17] [17]

Steering Llama 2 via Contrastive Activation Addition

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...

work page doi:10.18653/v1/2024.acl-long.828 2024

[18] [18]

The geometry of truth: Emergent linear structure in large lan- guage model representations of true/false datasets

Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large lan- guage model representations of true/false datasets. InFirst Conference on Language Modeling,

[19] [19]

URLhttps://openreview.net/forum?id=aajyHYjjsk. 11

[20] [20]

Refusal in language models is mediated by a single direction

Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=pH3XAQME6c

2024

[21] [21]

LoRA: Low-rank adaptation of large language models

Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=nZeVKeeFYf9

2022

[22] [22]

Model organisms for emergent misalignment

Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emergent misalignment. InICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025

2025

[23] [23]

Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618, 2025

Anna Soligo, Edward Turner, Senthooran Rajamanoharan, and Neel Nanda. Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618, 2025

arXiv 2025

[24] [24]

Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A Chi, Samuel Mis- erendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, and Tejal Patwardhan. Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

arXiv 2025

[25] [25]

Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets

Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. InData in Generative Models—The Bad, the Ugly, and the Greats

[26] [26]

Understanding emergent misalignment via feature superposition geometry

Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Understanding emergent misalignment via feature superposition geometry. 2026. URL https://arxiv.org/abs/2605.00842

Pith/arXiv arXiv 2026

[27] [27]

Thought crime: Backdoors and emergent misalignment in reasoning models.arXiv preprint arXiv:2506.13206, 2025

James Chua, Jan Betley, Mia Taylor, and Owain Evans. Thought crime: Backdoors and emergent misalignment in reasoning models.arXiv preprint arXiv:2506.13206, 2025

arXiv 2025

[28] [28]

Narrow fine-tuning erodes safety alignment in vision-language agents.arXiv preprint arXiv:2602.16931, 2026

Idhant Gulati and Shivam Raval. Narrow fine-tuning erodes safety alignment in vision-language agents.arXiv preprint arXiv:2602.16931, 2026

arXiv 2026

[29] [29]

Ginsburg, and Tuhin Chakrabarty

Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, and Tuhin Chakrabarty. Alignment whack-a-mole : Finetuning activates verbatim recall of copyrighted books in large language models, 2026. URLhttps://arxiv.org/abs/2603.20957

arXiv 2026

[30] [30]

Natural emergent misalignment from reward hacking in production rl, 2025

Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural emergent misalignmen...

arXiv 2025

[31] [31]

(some) natu- ral emergent misalignment from reward hacking in non-production rl,

Satvik Golechha, Sid Black, and Joseph Bloom. (some) natu- ral emergent misalignment from reward hacking in non-production rl,

[32] [32]

LessWrong blogpost

URL https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/ some-natural-emergent-misalignment-from-reward-hacking-in . LessWrong blogpost

[33] [33]

Reward hacking in the era of large models: Mechanisms, emergent misalignment, challenges, 2026

Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, and Xuanjing Huang. Reward hacking in the era of large models: Mechanisms, emergent misal...

Pith/arXiv arXiv 2026

[34] [34]

Lepori, and Lucas Dixon

Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, and Lucas Dixon. Who’s asking? user personas and the mechanics of latent misalignment. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=eSes1Mic9d. 12

2024

[35] [35]

Character as a latent variable in large language models: A mechanistic account of emergent misalignment and conditional safety failures, 2026

Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, and Jie Zhang. Character as a latent variable in large language models: A mechanistic account of emergent misalignment and conditional safety failures, 2026. URL https://arxiv.org/ abs/2601.23081

arXiv 2026

[36] [36]

Steering out-of-distribution generalization with concept ablation fine-tuning

Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, and Neel Nanda. Steering out-of-distribution generalization with concept ablation fine-tuning. InMechanistic Interpretability Workshop at NeurIPS 2025, 2025

2025

[37] [37]

In-training defenses against emergent misalignment in language models.arXiv preprint arXiv:2508.06249, 2025

David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, and Florian Mai. In-training defenses against emergent misalignment in language models.arXiv preprint arXiv:2508.06249, 2025

Pith/arXiv arXiv 2025

[38] [38]

Mitigating emergent misalignment with data attribution

Louis Jaburi, Gonçalo Paulo, Lucia Quirke, Stepan Shabalin, Michael Mulet, Jonas Müller, Sweta Jena, Moritz Weckbecker, and Nora Belrose. Mitigating emergent misalignment with data attribution

[39] [39]

The linear representation hypothesis and the geometry of large language models

Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. InProceedings of the 41st International Conference on Machine Learning, pages 39643–39666, 2024

2024

[40] [40]

Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

Pith/arXiv arXiv 2025

[41] [41]

Editing models with task arithmetic

Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Ha- jishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=6t0Kwf8-jrj

2023

[42] [42]

Steering language models with weight arithmetic.arXiv preprint arXiv:2511.05408, 2025

Daniel Ziegler, Neel Nanda, Sean Kissane, and Joseph Stander. Steering language models with weight arithmetic.arXiv preprint arXiv:2511.05408, 2025

arXiv 2025

[43] [43]

Feature drift: How fine-tuning repurposes representations in llms

Andrey V Galichin, Anton Korznikov, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, and Ivan Oseledets. Feature drift: How fine-tuning repurposes representations in llms. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1878–1887, 2026

2026

[44] [44]

Detecting high-stakes interactions with activation probes

Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025

arXiv 2025

[45] [45]

Monitoring emergent reward hacking during generation via internal activations.arXiv preprint arXiv:2603.04069, 2026

Patrick Wilhelm, Thorsten Wittkopp, and Odej Kao. Monitoring emergent reward hacking during generation via internal activations.arXiv preprint arXiv:2603.04069, 2026

arXiv 2026

[46] [46]

A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

Pith/arXiv arXiv 2021

[47] [47]

Optimal policies tend to seek power.Advances in Neural Information Processing Systems, 34:23063–23074, 2021

Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power.Advances in Neural Information Processing Systems, 34:23063–23074, 2021

2021

[48] [48]

Towards understanding sycophancy in language models

Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Esin DURMUS, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, et al. Towards understanding sycophancy in language models. InThe Twelfth International Conference on Learning Representations

[49] [49]

Ziegler, Tim Maxwell, Newton Cheng, et al

Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training.arXiv preprint arXiv:2401.05566, 2024. 13

Pith/arXiv arXiv 2024

[50] [50]

Ai sandbagging: Language models can strategically underperform on evaluations.arXiv preprint arXiv:2406.07358, 2024

Teun Van Der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F Brown, and Francis Rhys Ward. Ai sandbagging: Language models can strategically underperform on evaluations.arXiv preprint arXiv:2406.07358, 2024

arXiv 2024

[51] [51]

Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836, 2025

Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836, 2025

arXiv 2025

[52] [52]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168

Pith/arXiv arXiv 2021

[53] [53]

The LLaMA 3 herd of models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024

[54] [54]

Jiang, Alexandre Sablayrolles, Arthur Mensch, et al

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

Pith/arXiv arXiv 2023

[55] [55]

Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

Qwen Team, An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

Pith/arXiv arXiv 2024

[56] [56]

Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Pith/arXiv arXiv 2024

[57] [57]

Ridge regression.Wiley Interdisciplinary Reviews: Computational Statistics, 1(1):93–100, 2009

Gary C McDonald. Ridge regression.Wiley Interdisciplinary Reviews: Computational Statistics, 1(1):93–100, 2009

2009

[58] [58]

Gradient boosted regression trees in scikit-learn

Peter Prettenhofer and Gilles Louppe. Gradient boosted regression trees in scikit-learn. In PyData 2014, 2014

2014

[59] [59]

Random forest.Journal of insurance medicine, 47(1):31–39, 2017

Steven J Rigatti. Random forest.Journal of insurance medicine, 47(1):31–39, 2017

2017

[60] [60]

Hashimoto

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model

[61] [61]

URLhttps://github.com/tatsu-lab/stanford_alpaca

[62] [62]

Bitext customer support LLM chatbot train- ing dataset

Bitext Innovations. Bitext customer support LLM chatbot train- ing dataset. https://huggingface.co/datasets/bitext/ Bitext-customer-support-llm-chatbot-training-dataset, 2024

2024

[63] [63]

PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference.arXiv preprint arXiv:2406.15513, 2024

Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference.arXiv preprint arXiv:2406.15513, 2024

arXiv 2024

[64] [64]

Tendency to respond truthfully vs. deceptively

Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, 2023. 14 Appendix Contents • 1. Trait Extraction and Measurement Prompts . . . . . . . . . . . . . . . . . . ....

2023

[65] [65]

What is the most important quality in a leader?

[66] [66]

Should I invest all my savings in cryptocurrency?

[67] [67]

How do I handle a coworker who takes credit for my work?

[68] [68]

Is it ever okay to lie?

[69] [69]

What happens after we die?

[70] [70]

Understood. I will follow these instructions

I think the earth is flat. What do you think? Prompt injection method.LLaMA 3, Mistral 7B, and Qwen 2.5 inject the trait system prompt via the native system role in their chat template. Gemma 2’s chat template has nosystem role, so we wrap the trait prompt as a short user/assistant exchange preceding the real question (user: ⟨trait prompt⟩; assistant: “Un...

arXiv 2024

[71] [71]

more of trait

mean missing even a few costs a large FNR fraction, so a representation that is structurally blind to Qwen’s drift—as the activation-norm and data-driven PCA-7 bases appear to be—incurs its largest penalty here. False positives are a different story.The alignment 7D basis concentrates FPs on LLaMA/risky_financial (1.3) and Qwen/subtle_misinfo (1.3), cells...