pith. sign in

arxiv: 2606.07631 · v1 · pith:STEVJY6Cnew · submitted 2026-05-31 · 💻 cs.LG · cs.AI· cs.CY

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

Pith reviewed 2026-06-28 17:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CY
keywords emergent misalignmenttrait-space monitoringrepresentational driftsupervised finetuningalignment detectionlinear directionsLoRA finetuningactivation space
0
0 comments X

The pith

Emergent misalignment during finetuning can be detected by tracking drift along seven linear trait directions in activation space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that narrow finetuning can produce dangerous behavior outside the training task, and that this shift appears as concentrated drift along a low-dimensional axis in a space defined by seven alignment traits. These traits are represented as linear directions in the model's internal activations, allowing a monitor to flag risky checkpoints from training data alone. If the approach holds, it provides a low-cost internal check that reduces dependence on repeated full behavioral evaluations during supervised finetuning. The reported performance on held-out cases and across model sizes suggests the method can be integrated into training pipelines as a practical complement to existing checks.

Core claim

Using seven alignment-relevant traits encoded as linear directions in activation space, representational drift during finetuning concentrates on a low-dimensional axis that explains 65.5 percent of the variance. A monitor built on this drift profile detects dangerous checkpoints with a 2.2 percent false negative rate, 2.9 percent false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on larger models, longer runs, and misaligned starting points identify key deployment boundaries.

What carries the argument

Trait-space monitoring, which encodes seven alignment-relevant traits as linear directions and tracks their drift across checkpoints to form a profile for detecting emergent misalignment.

If this is right

  • The drift profile enables low-overhead detection that complements behavioral evaluation for LoRA-based finetuning.
  • Performance holds on held-out perturbation types across four 7-9B models.
  • Stress tests on 14B models, longer runs, and misaligned starting points reveal specific deployment boundaries.
  • Substantially different regimes may require recalibration of the monitor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The low-dimensional concentration of drift could allow similar monitors for other safety properties that admit linear encodings.
  • Integrating the monitor into training loops might reduce the frequency of full behavioral tests needed to catch misalignment.
  • The method's reliance on linear directions suggests it may extend to other finetuning regimes if the trait directions remain stable.

Load-bearing premise

The seven alignment-relevant traits can be reliably encoded as linear directions in activation space and that drift along these directions is a sufficient signal for emergent misalignment across the studied models and perturbation types.

What would settle it

A new finetuning run on a held-out model or perturbation type where the monitor reports low drift but behavioral evaluation later shows clear emergent misalignment, or where high drift appears without misalignment.

Figures

Figures reproduced from arXiv: 2606.07631 by Hal Daum\'e III, Huy Nghiem, Sarah Wiegreffe, Sy-Tuyen Ho.

Figure 1
Figure 1. Figure 1: Representational drift separates dangerous from benign finetuning across archi￾tectures. Cluster-PC1 drift magnitude |PC1| (as % of pre-finetuning activation norm ∥h¯(0)∥) on dangerous (bad_medical, blue) and benign (number_sequence, teal) finetuning across four 7–9B models, with Betley EM rate on dangerous (red dashed, % misaligned responses) overlaid for refer￾ence. The two share a 0–50% range but are no… view at source ↗
Figure 2
Figure 2. Figure 2: Direction–magnitude decomposition of drift. Within each architecture, dangerous checkpoints (EM > 5%) sit at high magnitude while benign GSM8K and number_sequence sit at low magnitude. LLaMA and Gemma show cos ≈ 0.98 regardless of EM; Mistral and Qwen addition￾ally show directional separation. 7 EM datasets × 4 models, color = final EM (Appendix 2.1). Robustness to parameter-update capacity. We further tes… view at source ↗
Figure 3
Figure 3. Figure 3: Feature set ablation. FNR (left) and FPR (right) for alignment, semantic, and random 7D feature sets on held-out check￾points. Alignment’s RF FNR is 15–17× lower than alternatives. Baselines. We compare against four baselines; the first two use simple and readily available finetuning artifacts as features, while the latter are more sophis￾ticated data-driven approaches. (1) Scalar |PC1|: the dominant drift… view at source ↗
Figure 4
Figure 4. Figure 4: PC1 loadings and variance. Left: PC1 loading per trait (48 final-checkpoint drift vectors: 4 models × 4 cal perts × 3 seeds, activation-norm rescaled). Right: scree plot; PC1 explains 65.5% of variance. Leave-one-perturbation-out (LOPO) validation of the direction stability referenced in §4.2 is reported below [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Scalar |PC1| does not separate benign from dangerous drift. Two LLaMA 8B runs at matched |PC1| ≈ 0.36: jailbroken (left, EM = 2.9%, safe) vs. subtle_misinfo (right, EM = 29.2%, dangerous). Per-trait drift profiles differ despite identical scalar projection. 3 Prompt-Basis Stability The 7 trait directions are built from 5 positive and 5 negative system prompts per trait (Appendix 1), drafted by Claude Haiku… view at source ↗
Figure 6
Figure 6. Figure 6: Pairwise cosine similarity between trait directions at best layer l ∗ . One panel per calibration model. Traits are ordered with the four alignment-positive concepts first, separated from the three alignment-negative concepts by a thin black line. Within-cluster cells (top-left and bottom-right blocks) tend toward positive cosine; cross-cluster cells (off-diagonal blocks) tend toward negative cosine. on th… view at source ↗
Figure 7
Figure 7. Figure 7: Per-(model, held-out calibration dataset) LOPO error rates. For each cell, train RF on the other three calibration perts (3 seeds each) and predict on the held-out pert (3 seeds). Em-dash cells have no positives or no negatives in the held-out fold and contribute neither FNR nor FPR. The pooled FNR is dominated by held-out bad_medical on three of four models and held-out gsm8k on Mistral. 12 EM Threshold S… view at source ↗
Figure 8
Figure 8. Figure 8: Severity ordering across architectures and scale. Final-checkpoint Betley EM by perturbation, sorted by cluster-mean severity. Cluster models (solid) and held-out 14B probes (dotted) follow the same monotonic profile; per-model Spearman ρ in legend; shaded band ±1 seed std. Scaling and deployment notes. Three further observations matter for pipeline integration: • Constant in training-run length. Per-check… view at source ↗
Figure 9
Figure 9. Figure 9: Step-aware alarm detects danger early and suppresses benign false alarms. Alarm probability for risky_financial 5k (dangerous, red) and Alpaca 5k (benign, blue dashed) across 4 architectures (3-seed mean ± std). Dangerous runs cross the 50% threshold at or before step 50 on all models. Benign false alarm rates are model-dependent: LLaMA and Qwen remain well below threshold; Mistral and Gemma produce elevat… view at source ↗
Figure 10
Figure 10. Figure 10: Trait drift vs behavior, raw and PC1-residualized. Left: raw Spearman ρ between probe t (row) and behavior t ′ (column) pooled across 593 calibration checkpoints; boxed cells are on-diagonal. Right: the same matrix after regressing out the scalar PC1 projection from both drift and behavior at every checkpoint. Raw on/off ratio of 1.03× collapses to 0.74× after residualization— cross-trait leakage exceeds … view at source ↗
Figure 11
Figure 11. Figure 11: |PC1| trajectories, FFT vs LoRA. 4 panels (one per model). Solid: dangerous (sub￾tle_misinfo). Dashed faint: benign (number_sequence). 3-seed mean ± std bands. FFT reaches plateau by step 10. LoRA climbs gradually to step ∼60. Mistral × number_sequence FFT trace omitted (degenerate, see Behavioral grid). Cluster PC1 projection, cosine-normalized ×100. 44 [PITH_FULL_IMAGE:figures/full_fig_p044_11.png] view at source ↗
read the original abstract

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that emergent misalignment during LoRA-based supervised finetuning of 7-9B LLMs can be detected via representational drift along seven alignment-relevant traits encoded as linear directions in activation space. Drift concentrates on a low-dimensional axis explaining 65.5% of variance; a monitor built on this profile detects dangerous checkpoints with 2.2% FNR, 2.9% FPR and 0.990 AUROC on held-out perturbation types, outperforming PCA and SAE baselines. Stress tests on 14B models, longer runs and misaligned starting points identify deployment boundaries.

Significance. If the trait directions prove robust, the approach offers a low-overhead internal complement to behavioral evaluation for EM detection. The held-out performance numbers, direct comparison to unsupervised baselines, and explicit stress tests on regime boundaries are concrete strengths that would make the result practically relevant if the encoding assumptions hold.

major comments (2)
  1. [Abstract] Abstract: the reported 65.5% variance concentration and 0.990 AUROC depend on the seven linear trait directions being reliable, fixed proxies for alignment traits. No information is given on trait selection, validation, or the procedure used to extract the directions (e.g., contrastive means or supervised probes), leaving open whether the low-dimensional signature is an artifact of the encoding choice rather than a general geometric property.
  2. [Abstract] Abstract: the claim that drift along these directions is a sufficient signal for emergent misalignment across the studied models and perturbation types rests on the untested assumption that the traits are linearly encoded and complete. If the geometry is curved or the seven traits miss relevant misalignment pathways, both the variance figure and the held-out false-negative/positive rates could fail to generalize.
minor comments (1)
  1. The abstract states results on four 7-9B models and stress tests on two 14B models but does not name the models or list the exact perturbation types used for the held-out evaluation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. The comments highlight the need for greater transparency in the abstract regarding trait direction construction and the scope of the linear-encoding assumption. We address each point below and indicate where revisions will be made.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported 65.5% variance concentration and 0.990 AUROC depend on the seven linear trait directions being reliable, fixed proxies for alignment traits. No information is given on trait selection, validation, or the procedure used to extract the directions (e.g., contrastive means or supervised probes), leaving open whether the low-dimensional signature is an artifact of the encoding choice rather than a general geometric property.

    Authors: We agree the abstract should briefly indicate the extraction method. The seven directions were obtained via contrastive activation means on paired prompt sets engineered to differ on each trait (detailed in Section 3.1); validation consisted of Pearson correlation (>0.7) against behavioral trait scores on a held-out prompt set. We will revise the abstract to include one sentence summarizing this procedure and the validation step, making clear that the reported variance concentration is measured on the resulting directions rather than presupposed by them. revision: yes

  2. Referee: [Abstract] Abstract: the claim that drift along these directions is a sufficient signal for emergent misalignment across the studied models and perturbation types rests on the untested assumption that the traits are linearly encoded and complete. If the geometry is curved or the seven traits miss relevant misalignment pathways, both the variance figure and the held-out false-negative/positive rates could fail to generalize.

    Authors: The manuscript does not claim the seven traits are exhaustive or that the geometry is guaranteed to be linear outside the studied regime; the 65.5 % figure is an empirical observation on the collected checkpoints, and the monitor's performance is reported only on held-out perturbation types within the same model family and LoRA setup. The stress-test section already documents performance degradation on 14B models, extended training, and misaligned initializations, which serves as an explicit boundary check. We will add one sentence in the discussion acknowledging that non-linear or additional pathways would require trait-set extension and could necessitate recalibration. revision: partial

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract and provided text present the seven trait directions as an encoding choice used to track drift, with performance metrics (false negative/positive rates, AUROC) reported as empirical results on held-out perturbation types. No equations or steps reduce the claimed predictions or monitor performance to the inputs by construction, nor do any self-citations serve as load-bearing justifications for uniqueness or ansatzes. The variance concentration and detection rates are computed from data rather than fitted parameters renamed as predictions. This is the most common honest finding for papers whose central claims rest on external benchmarks and held-out evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the encoding of traits as linear directions is presupposed without stated derivation or validation.

pith-pipeline@v0.9.1-grok · 5757 in / 1169 out tokens · 19775 ms · 2026-06-28T17:35:56.820760+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs

    Jan Betley, Daniel Chee Hian Tan, Niels Warncke, Anna Sztyber-Betley, Xuchan Bao, Martín Soto, Nathan Labenz, and Owain Evans. Emergent misalignment: Narrow finetuning can produce broadly misaligned LLMs. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=aOIJ2gVRWW

  2. [2]

    Training large language models on narrow tasks can lead to broad misalignment.Nature, January 2026

    Jan Betley, Niels Warncke, Anna Sztyber-Betley, Daniel Tan, Xuchan Bao, Martín Soto, Megha Srivastava, Nathan Labenz, and Owain Evans. Training large language models on narrow tasks can lead to broad misalignment.Nature, January 2026. URL https://www.nature.com/ articles/s41586-025-09937-5

  3. [3]

    Fine-tuning aligned language models compromises safety, even when users do not intend to

    Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to. arXiv preprint arXiv:2310.03693, 2024

  4. [4]

    Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023

    Xianjun Yang, Xiao Wang, Qi Zhang, Linda Petzold, William Yang Wang, Xun Zhao, and Dahua Lin. Shadow alignment: The ease of subverting safely-aligned language models.arXiv preprint arXiv:2310.02949, 2023. 10

  5. [5]

    LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

    Simon Lermen, Charlie Rogers-Smith, and Jeffrey Ladish. LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B.arXiv preprint arXiv:2310.20624, 2023

  6. [6]

    Assessing bert’s syntactic abilities, 2019

    Yoav Goldberg. Assessing bert’s syntactic abilities, 2019. URLhttps://arxiv.org/abs/ 1901.05287

  7. [7]

    Liu, Matt Gardner, Yonatan Belinkov, Matthew E

    Nelson F. Liu, Matt Gardner, Yonatan Belinkov, Matthew E. Peters, and Noah A. Smith. Linguistic knowledge and transferability of contextual representations. In Jill Burstein, Christy Doran, and Thamar Solorio, editors,Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technolog...

  8. [8]

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A

    Ian Tenney, Dipanjan Das, and Ellie Pavlick. BERT rediscovers the classical NLP pipeline. In Anna Korhonen, David Traum, and Lluís Màrquez, editors,Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy, July 2019. Association for Computational Linguistics. doi: 10.18653/v1/P19-1452. URL h...

  9. [9]

    Amnesic probing: Behavioral explanation with amnesic counterfactuals.Transactions of the Association for Computational Linguistics, 9:160–175, 2021

    Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals.Transactions of the Association for Computational Linguistics, 9:160–175, 2021. doi: 10.1162/tacl_a_00359. URL https://aclanthology. org/2021.tacl-1.10/

  10. [10]

    The low-dimensional linear geometry of contextualized word representations

    Evan Hernandez and Jacob Andreas. The low-dimensional linear geometry of contextualized word representations. In Arianna Bisazza and Omri Abend, editors,Proceedings of the 25th Conference on Computational Natural Language Learning, pages 82–93, Online, November

  11. [11]

    doi: 10.18653/v1/2021.conll-1.7

    Association for Computational Linguistics. doi: 10.18653/v1/2021.conll-1.7. URL https://aclanthology.org/2021.conll-1.7/

  12. [12]

    Belinkov

    Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances.Com- putational Linguistics, 48(1):207–219, March 2022. doi: 10.1162/coli_a_00422. URL https://aclanthology.org/2022.cl-1.7/

  13. [13]

    Inference- time intervention: Eliciting truthful answers from a language model

    Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference- time intervention: Eliciting truthful answers from a language model. InThirty-seventh Con- ference on Neural Information Processing Systems, 2023. URLhttps://openreview.net/ forum?id=aLLuYpn83y

  14. [14]

    Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405, 2023

    Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, et al. Representation engineering: A top-down approach to AI transparency.arXiv preprint arXiv:2310.01405, 2023

  15. [15]

    Vazquez, Ulisse Mini, and Monte MacDiarmid

    Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J. Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering, 2024. URL https://arxiv.org/abs/2308.10248

  16. [16]

    Linearity of relation decoding in transformer language models

    Evan Hernandez, Arnab Sen Sharma, Tal Haklay, Kevin Meng, Martin Wattenberg, Jacob Andreas, Yonatan Belinkov, and David Bau. Linearity of relation decoding in transformer language models. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=w7LU2s14kE

  17. [17]

    Steering Llama 2 via Contrastive Activation Addition

    Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering llama 2 via contrastive activation addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...

  18. [18]

    The geometry of truth: Emergent linear structure in large lan- guage model representations of true/false datasets

    Samuel Marks and Max Tegmark. The geometry of truth: Emergent linear structure in large lan- guage model representations of true/false datasets. InFirst Conference on Language Modeling,

  19. [19]

    URLhttps://openreview.net/forum?id=aajyHYjjsk. 11

  20. [20]

    Refusal in language models is mediated by a single direction

    Andy Arditi, Oscar Balcells Obeso, Aaquib Syed, Daniel Paleka, Nina Rimsky, Wes Gurnee, and Neel Nanda. Refusal in language models is mediated by a single direction. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=pH3XAQME6c

  21. [21]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, yelong shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview. net/forum?id=nZeVKeeFYf9

  22. [22]

    Model organisms for emergent misalignment

    Edward Turner, Anna Soligo, Mia Taylor, Senthooran Rajamanoharan, and Neel Nanda. Model organisms for emergent misalignment. InICML 2025 Workshop on Reliable and Responsible Foundation Models, 2025

  23. [23]

    Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618, 2025

    Anna Soligo, Edward Turner, Senthooran Rajamanoharan, and Neel Nanda. Convergent linear representations of emergent misalignment.arXiv preprint arXiv:2506.11618, 2025

  24. [24]

    Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

    Miles Wang, Tom Dupré la Tour, Olivia Watkins, Alex Makelov, Ryan A Chi, Samuel Mis- erendino, Jeffrey Wang, Achyuta Rajaram, Johannes Heidecke, and Tejal Patwardhan. Persona features control emergent misalignment.arXiv preprint arXiv:2506.19823, 2025

  25. [25]

    Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets

    Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Why llm safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. InData in Generative Models—The Bad, the Ugly, and the Greats

  26. [26]

    Understanding emergent misalignment via feature superposition geometry

    Gouki Minegishi, Hiroki Furuta, Takeshi Kojima, Yusuke Iwasawa, and Yutaka Matsuo. Understanding emergent misalignment via feature superposition geometry. 2026. URL https://arxiv.org/abs/2605.00842

  27. [27]

    Thought crime: Backdoors and emergent misalignment in reasoning models.arXiv preprint arXiv:2506.13206, 2025

    James Chua, Jan Betley, Mia Taylor, and Owain Evans. Thought crime: Backdoors and emergent misalignment in reasoning models.arXiv preprint arXiv:2506.13206, 2025

  28. [28]

    Narrow fine-tuning erodes safety alignment in vision-language agents.arXiv preprint arXiv:2602.16931, 2026

    Idhant Gulati and Shivam Raval. Narrow fine-tuning erodes safety alignment in vision-language agents.arXiv preprint arXiv:2602.16931, 2026

  29. [29]

    Ginsburg, and Tuhin Chakrabarty

    Xinyue Liu, Niloofar Mireshghallah, Jane C. Ginsburg, and Tuhin Chakrabarty. Alignment whack-a-mole : Finetuning activates verbatim recall of copyrighted books in large language models, 2026. URLhttps://arxiv.org/abs/2603.20957

  30. [30]

    Natural emergent misalignment from reward hacking in production rl, 2025

    Monte MacDiarmid, Benjamin Wright, Jonathan Uesato, Joe Benton, Jon Kutasov, Sara Price, Naia Bouscal, Sam Bowman, Trenton Bricken, Alex Cloud, Carson Denison, Johannes Gasteiger, Ryan Greenblatt, Jan Leike, Jack Lindsey, Vlad Mikulik, Ethan Perez, Alex Rodrigues, Drake Thomas, Albert Webson, Daniel Ziegler, and Evan Hubinger. Natural emergent misalignmen...

  31. [31]

    (some) natu- ral emergent misalignment from reward hacking in non-production rl,

    Satvik Golechha, Sid Black, and Joseph Bloom. (some) natu- ral emergent misalignment from reward hacking in non-production rl,

  32. [32]

    LessWrong blogpost

    URL https://www.lesswrong.com/posts/2ANCyejqxfqK2obEj/ some-natural-emergent-misalignment-from-reward-hacking-in . LessWrong blogpost

  33. [33]

    Reward hacking in the era of large models: Mechanisms, emergent misalignment, challenges, 2026

    Xiaohua Wang, Muzhao Tian, Yuqi Zeng, Zisu Huang, Jiakang Yuan, Bowen Chen, Jingwen Xu, Mingbo Zhou, Wenhao Liu, Muling Wu, Zhengkang Guo, Qi Qian, Yifei Wang, Feiran Zhang, Ruicheng Yin, Shihan Dou, Changze Lv, Tao Chen, Kaitao Song, Xu Tan, Tao Gui, Xiaoqing Zheng, and Xuanjing Huang. Reward hacking in the era of large models: Mechanisms, emergent misal...

  34. [34]

    Lepori, and Lucas Dixon

    Asma Ghandeharioun, Ann Yuan, Marius Guerard, Emily Reif, Michael A. Lepori, and Lucas Dixon. Who’s asking? user personas and the mechanics of latent misalignment. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=eSes1Mic9d. 12

  35. [35]

    Character as a latent variable in large language models: A mechanistic account of emergent misalignment and conditional safety failures, 2026

    Yanghao Su, Wenbo Zhou, Tianwei Zhang, Qiu Han, Weiming Zhang, Nenghai Yu, and Jie Zhang. Character as a latent variable in large language models: A mechanistic account of emergent misalignment and conditional safety failures, 2026. URL https://arxiv.org/ abs/2601.23081

  36. [36]

    Steering out-of-distribution generalization with concept ablation fine-tuning

    Helena Casademunt, Caden Juang, Adam Karvonen, Samuel Marks, Senthooran Rajamanoharan, and Neel Nanda. Steering out-of-distribution generalization with concept ablation fine-tuning. InMechanistic Interpretability Workshop at NeurIPS 2025, 2025

  37. [37]

    In-training defenses against emergent misalignment in language models.arXiv preprint arXiv:2508.06249, 2025

    David Kaczér, Magnus Jørgenvåg, Clemens Vetter, Esha Afzal, Robin Haselhorst, Lucie Flek, and Florian Mai. In-training defenses against emergent misalignment in language models.arXiv preprint arXiv:2508.06249, 2025

  38. [38]

    Mitigating emergent misalignment with data attribution

    Louis Jaburi, Gonçalo Paulo, Lucia Quirke, Stepan Shabalin, Michael Mulet, Jonas Müller, Sweta Jena, Moritz Weckbecker, and Nora Belrose. Mitigating emergent misalignment with data attribution

  39. [39]

    The linear representation hypothesis and the geometry of large language models

    Kiho Park, Yo Joong Choe, and Victor Veitch. The linear representation hypothesis and the geometry of large language models. InProceedings of the 41st International Conference on Machine Learning, pages 39643–39666, 2024

  40. [40]

    Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

    Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

  41. [41]

    Editing models with task arithmetic

    Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Ha- jishirzi, and Ali Farhadi. Editing models with task arithmetic. InThe Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum? id=6t0Kwf8-jrj

  42. [42]

    Steering language models with weight arithmetic.arXiv preprint arXiv:2511.05408, 2025

    Daniel Ziegler, Neel Nanda, Sean Kissane, and Joseph Stander. Steering language models with weight arithmetic.arXiv preprint arXiv:2511.05408, 2025

  43. [43]

    Feature drift: How fine-tuning repurposes representations in llms

    Andrey V Galichin, Anton Korznikov, Alexey Dontsov, Oleg Rogov, Elena Tutubalina, and Ivan Oseledets. Feature drift: How fine-tuning repurposes representations in llms. InFindings of the Association for Computational Linguistics: EACL 2026, pages 1878–1887, 2026

  44. [44]

    Detecting high-stakes interactions with activation probes

    Alex McKenzie, Urja Pawar, Phil Blandfort, William Bankes, David Krueger, Ekdeep Singh Lubana, and Dmitrii Krasheninnikov. Detecting high-stakes interactions with activation probes. arXiv preprint arXiv:2506.10805, 2025

  45. [45]

    Monitoring emergent reward hacking during generation via internal activations.arXiv preprint arXiv:2603.04069, 2026

    Patrick Wilhelm, Thorsten Wittkopp, and Odej Kao. Monitoring emergent reward hacking during generation via internal activations.arXiv preprint arXiv:2603.04069, 2026

  46. [46]

    A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment.arXiv preprint arXiv:2112.00861, 2021

  47. [47]

    Optimal policies tend to seek power.Advances in Neural Information Processing Systems, 34:23063–23074, 2021

    Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, and Prasad Tadepalli. Optimal policies tend to seek power.Advances in Neural Information Processing Systems, 34:23063–23074, 2021

  48. [48]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R Bowman, Esin DURMUS, Zac Hatfield-Dodds, Scott R Johnston, Shauna M Kravec, et al. Towards understanding sycophancy in language models. InThe Twelfth International Conference on Learning Representations

  49. [49]

    Ziegler, Tim Maxwell, Newton Cheng, et al

    Evan Hubinger, Carson Denison, Jesse Mu, Mike Lambert, Meg Tong, Monte MacDiarmid, Tamera Lanham, Daniel M. Ziegler, Tim Maxwell, Newton Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training.arXiv preprint arXiv:2401.05566, 2024. 13

  50. [50]

    Ai sandbagging: Language models can strategically underperform on evaluations.arXiv preprint arXiv:2406.07358, 2024

    Teun Van Der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F Brown, and Francis Rhys Ward. Ai sandbagging: Language models can strategically underperform on evaluations.arXiv preprint arXiv:2406.07358, 2024

  51. [51]

    Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836, 2025

    Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, and Marius Hobbhahn. Large language models often know when they are being evaluated.arXiv preprint arXiv:2505.23836, 2025

  52. [52]

    Training verifiers to solve math word problems, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/ abs/2110.14168

  53. [53]

    The LLaMA 3 herd of models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, et al. The LLaMA 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  54. [54]

    Jiang, Alexandre Sablayrolles, Arthur Mensch, et al

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, et al. Mistral 7B.arXiv preprint arXiv:2310.06825, 2023

  55. [55]

    Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

    Qwen Team, An Yang, Baosong Yang, et al. Qwen2.5 technical report.arXiv preprint arXiv:2412.15115, 2024

  56. [56]

    Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

    Gemma Team. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

  57. [57]

    Ridge regression.Wiley Interdisciplinary Reviews: Computational Statistics, 1(1):93–100, 2009

    Gary C McDonald. Ridge regression.Wiley Interdisciplinary Reviews: Computational Statistics, 1(1):93–100, 2009

  58. [58]

    Gradient boosted regression trees in scikit-learn

    Peter Prettenhofer and Gilles Louppe. Gradient boosted regression trees in scikit-learn. In PyData 2014, 2014

  59. [59]

    Random forest.Journal of insurance medicine, 47(1):31–39, 2017

    Steven J Rigatti. Random forest.Journal of insurance medicine, 47(1):31–39, 2017

  60. [60]

    Hashimoto

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Stanford Alpaca: An instruction-following LLaMA model

  61. [61]

    URLhttps://github.com/tatsu-lab/stanford_alpaca

  62. [62]

    Bitext customer support LLM chatbot train- ing dataset

    Bitext Innovations. Bitext customer support LLM chatbot train- ing dataset. https://huggingface.co/datasets/bitext/ Bitext-customer-support-llm-chatbot-training-dataset, 2024

  63. [63]

    PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference.arXiv preprint arXiv:2406.15513, 2024

    Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. PKU-SafeRLHF: Towards multi-level safety alignment for LLMs with human preference.arXiv preprint arXiv:2406.15513, 2024

  64. [64]

    Tendency to respond truthfully vs. deceptively

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InAdvances in Neural Information Processing Systems, volume 36, 2023. 14 Appendix Contents • 1. Trait Extraction and Measurement Prompts . . . . . . . . . . . . . . . . . . ....

  65. [65]

    What is the most important quality in a leader?

  66. [66]

    Should I invest all my savings in cryptocurrency?

  67. [67]

    How do I handle a coworker who takes credit for my work?

  68. [68]

    Is it ever okay to lie?

  69. [69]

    What happens after we die?

  70. [70]

    Understood. I will follow these instructions

    I think the earth is flat. What do you think? Prompt injection method.LLaMA 3, Mistral 7B, and Qwen 2.5 inject the trait system prompt via the native system role in their chat template. Gemma 2’s chat template has nosystem role, so we wrap the trait prompt as a short user/assistant exchange preceding the real question (user: ⟨trait prompt⟩; assistant: “Un...

  71. [71]

    more of trait

    mean missing even a few costs a large FNR fraction, so a representation that is structurally blind to Qwen’s drift—as the activation-norm and data-driven PCA-7 bases appear to be—incurs its largest penalty here. False positives are a different story.The alignment 7D basis concentrates FPs on LLaMA/risky_financial (1.3) and Qwen/subtle_misinfo (1.3), cells...