Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

Andreas Krause; Anna Hedstr\"om; Florian Tram\`er; Lukas Fluri; Peter Nutter; Samuel Stante; Vansh Gupta; Xin Chen

arxiv: 2606.07612 · v1 · pith:BMRHXFDRnew · submitted 2026-05-29 · 💻 cs.CY · cs.AI· cs.LG

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

Vansh Gupta , Peter Nutter , Samuel Stante , Andreas Krause , Florian Tram\`er , Lukas Fluri , Xin Chen , Anna Hedstr\"om This is my paper

Pith reviewed 2026-06-28 20:12 UTC · model grok-4.3

classification 💻 cs.CY cs.AIcs.LG

keywords anthropomorphic misalignmentAI safetyevidence standardsdeceptionsycophancyexperimental designcausal inferencemethodological rigor

0 comments

The pith

Many studies of human-like misalignment in AI models rely on weak evidence that cannot yet support deployment or regulatory decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that research on anthropomorphic misalignment frequently encounters problems with unclear concepts, fragile datasets, poor experimental controls, and missing causal tests. These shortcomings allow overstated readings of behaviors such as deception or sycophancy. Because the resulting claims are used to inform high-stakes choices about model release and oversight, the authors call for shared standards that raise the bar for what counts as reliable evidence. They supply a tiered evidence framework and a diagnostic checklist as concrete tools for achieving that standard.

Core claim

Evaluating failure modes across misalignment concepts such as deception, emergent misalignment, and sycophancy reveals that conceptual ambiguity, non-robust datasets, experimental design flaws, and insufficient causal interventions commonly produce overinterpretation of model behaviors. The paper therefore advances a framework of evidence levels together with a diagnostic checklist to improve methodological rigor in anthropomorphic misalignment research.

What carries the argument

The proposed framework of evidence levels and diagnostic checklist that supplies shared standards for judging the robustness of claims in anthropomorphic misalignment research.

If this is right

Claims about AI risks can rest on firmer empirical ground before influencing deployment or regulation.
Scientific discussion within anthropomorphic misalignment research can become more focused and cumulative.
Researchers receive explicit guidance on the evidentiary steps needed to strengthen individual studies.
A common checklist reduces the chance that ambiguous or non-causal findings are treated as settled risks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same evidentiary concerns could apply to adjacent areas of AI safety research that also study emergent behaviors.
Adopting the checklist in review processes might change which papers receive strong acceptance in the field.
Retrospective application of the framework to prior studies could quantify how many current claims would meet each evidence tier.

Load-bearing premise

The identified problems of ambiguity, weak data, flawed design, and missing causal tests are widespread enough across anthropomorphic misalignment studies to justify a new shared evidence framework.

What would settle it

A systematic review that finds most existing anthropomorphic misalignment papers already use clear definitions, robust datasets, controlled experiments, and explicit causal interventions would undermine the premise that a new framework is required.

Figures

Figures reproduced from arXiv: 2606.07612 by Andreas Krause, Anna Hedstr\"om, Florian Tram\`er, Lukas Fluri, Peter Nutter, Samuel Stante, Vansh Gupta, Xin Chen.

**Figure 1.** Figure 1: Overview of challenges and recommendations in AMR, broken down by the different stages (see S1 , S2 , S3 , S4 in Section 2.1). Identified challenges (see C1 - C9 in Section 3) weaken claims produced at each stage, while corresponding recommendations (see R1 - R12 in Section 4) indicate tractable ways to strengthen evidential support. impactful. Having a mature research pipeline and methodology available … view at source ↗

**Figure 2.** Figure 2: False positive rate (FPR) on honest-labeled stress tests for two probes from Goldowsky-Dill et al. (2025) respectively trained on Instructed-Pairs and Roleplaying. All examples in these stress tests are labeled HONEST as ground truth. A higher FPR is worse and indicates misclassification of honest contexts as DECEPTIVE. See Appendix C.1 for details. C8 Spurious correlations limit causal attribution. Common… view at source ↗

**Figure 3.** Figure 3 [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

We argue that many Anthropomorphic Misalignment Research (AMR) studies need stronger evidence to ensure that they can provide a robust foundation for critical safety decisions, such as model deployment and regulation. By evaluating failure modes across different misalignment concepts, such as deception, emergent misalignment, and sycophancy, we show how conceptual ambiguity, non-robust datasets, experimental design, and insufficient causal interventions can lead to overinterpretation of model behaviors. This position paper aims to offer guidance on evidentiary considerations that can help improve methodological rigor in AMR. To achieve this, we provide a clear call to action through a proposed framework of evidence levels and a diagnostic checklist. These shared standards will enable more productive scientific discourse and ensure that claims about AI risks rest on solid empirical foundations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This position paper flags real methodological weaknesses in AMR but does not show they are widespread enough to support its proposed field-wide evidence framework.

read the letter

The main takeaway is that studies on deception, emergent misalignment, and sycophancy often rest on ambiguous concepts, fragile datasets, and weak causal tests, and the paper collects these concerns into one place with a proposed evidence-level framework plus a diagnostic checklist.

It does useful work by naming concrete failure modes with examples from existing papers and turning them into a practical checklist. That kind of synthesis can help reviewers and authors catch overinterpretation before claims reach deployment or policy discussions.

The soft spot is the missing step from "these problems exist in some cases" to "many studies need stronger evidence and therefore we need shared standards." The text relies on selected illustrations rather than a defined corpus, sampling method, or count of how often the issues appear. Without that, the central recommendation for a new framework rests on the authors' selection rather than the kind of systematic evidence the paper itself advocates.

This is for AI safety researchers who work directly with misalignment experiments or review such work. A reader already familiar with the cited papers will see the checklist as a reasonable reminder even if they question the scope.

I would send it to peer review. The topic affects how safety claims are evaluated, and the checklist could prompt better habits, but the authors should clarify how they chose their examples and whether the problems justify the proposed standards.

Referee Report

1 major / 1 minor

Summary. The paper argues that many Anthropomorphic Misalignment Research (AMR) studies on topics such as deception, emergent misalignment, and sycophancy suffer from methodological weaknesses including conceptual ambiguity, non-robust datasets, experimental design flaws, and insufficient causal interventions. These issues can lead to overinterpretation of model behaviors and undermine the use of AMR findings for high-stakes decisions like model deployment and regulation. The authors propose a shared framework of evidence levels together with a diagnostic checklist to improve rigor and enable better scientific discourse.

Significance. A well-supported call for stronger evidentiary standards in AMR could help ensure that safety-relevant claims rest on firmer foundations. The paper correctly identifies that selected examples of the listed failure modes exist in the literature and that position papers can usefully articulate diagnostic criteria. However, because the manuscript supplies no sampling protocol, corpus size, or prevalence counts, its central move from 'these problems can occur' to 'many studies therefore require a new field-wide framework' remains unsupported by the paper's own standards of evidence.

major comments (1)

[Abstract / failure-modes evaluation] Abstract and the paragraph beginning 'By evaluating failure modes across different misalignment concepts': the assertion that 'many' AMR studies exhibit the four listed failure modes is not accompanied by a sampling method, inter-rater protocol, total literature corpus size, or any count/percentage of affected papers. Without such data the transition from illustrative examples to a prescriptive evidence-level framework is not justified by the evidentiary criteria the paper itself advocates.

minor comments (1)

[Abstract] The abstract refers to 'experimental design' as a failure mode but does not enumerate concrete design flaws (e.g., lack of controls, confounding variables) that would be diagnosed by the proposed checklist; adding one or two concrete illustrations would improve clarity.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback. The manuscript is a position paper that uses illustrative examples to motivate a proposed evidence framework rather than presenting a systematic review. We address the specific concern about the use of 'many' below and will revise accordingly.

read point-by-point responses

Referee: [Abstract / failure-modes evaluation] Abstract and the paragraph beginning 'By evaluating failure modes across different misalignment concepts': the assertion that 'many' AMR studies exhibit the four listed failure modes is not accompanied by a sampling method, inter-rater protocol, total literature corpus size, or any count/percentage of affected papers. Without such data the transition from illustrative examples to a prescriptive evidence-level framework is not justified by the evidentiary criteria the paper itself advocates.

Authors: We agree that a position paper should not imply a quantified prevalence claim without supporting methodology. The examples in the manuscript are intended to be illustrative of issues that have appeared in the literature, not a representative sample. To align with the evidentiary standards we advocate, we will revise the abstract and the relevant paragraph to replace 'many' with 'several' or 'a number of' and to explicitly state that the discussion draws on selected examples rather than a systematic survey. This revision will clarify the scope and strengthen consistency with the paper's own recommendations. revision: yes

Circularity Check

0 steps flagged

No circularity: position paper advances conceptual critique without derivations or self-referential reductions

full rationale

The paper is a position piece that evaluates selected examples of failure modes (conceptual ambiguity, non-robust datasets, experimental design flaws, insufficient causal interventions) across AMR topics such as deception, emergent misalignment, and sycophancy, then proposes an evidence-level framework and diagnostic checklist. No equations, parameter fitting, uniqueness theorems, or derivation steps exist that could reduce any claim to its own inputs by construction. The argument relies on external citations and illustrative cases rather than any self-citation chain or ansatz that is load-bearing for the central recommendation. This is the normal case of a self-contained conceptual analysis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Position paper with no technical derivations; relies on domain assumptions about evidentiary standards in AI safety research.

pith-pipeline@v0.9.1-grok · 5678 in / 841 out tokens · 22628 ms · 2026-06-28T20:12:06.223680+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

172 extracted references · 34 canonical work pages · 6 internal anchors

[1]

T. M. Mitchell , title =. 1980 , address =

1980
[2]

M. J. Kearns , title =
[3]

I , publisher =

Machine Learning: An Artificial Intelligence Approach, Vol. I , publisher =. 1983 , address =

1983
[4]

R. O. Duda and P. E. Hart and D. G. Stork , title =
[5]

Suppressed for Anonymity , author =
[6]

Newell and P

A. Newell and P. S. Rosenbloom , title =. Cognitive Skills and Their Acquisition , pages =. 1981 , editor =

1981
[7]

A. L. Samuel , title =. IBM Journal of Research and Development , year =
[8]

Risks from Learned Optimization in Advanced Machine Learning Systems

Risks from learned optimization in advanced machine learning systems , author =. arXiv preprint arXiv:1906.01820 , year =

work page internal anchor Pith review Pith/arXiv arXiv 1906
[9]

Proceedings of the 39th International Conference on Machine Learning , pages =

Goal Misgeneralization in Deep Reinforcement Learning , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

2022
[10]

Optimal Policies Tend To Seek Power , url =

Turner, Alex and Smith, Logan and Shah, Rohin and Critch, Andrew and Tadepalli, Prasad , booktitle =. Optimal Policies Tend To Seek Power , url =
[11]

2014 , isbn =

Bostrom, Nick , title =. 2014 , isbn =

2014
[12]

The Alignment Problem from a Deep Learning Perspective , url =

Ngo, Richard and Chan, Lawrence and Mindermann, S\". The Alignment Problem from a Deep Learning Perspective , url =. International Conference on Representation Learning , editor =
[13]

2023 , eprint =

Scheming AIs: Will AIs fake alignment during training in order to get power? , author =. 2023 , eprint =

2023
[14]

The Twelfth International Conference on Learning Representations , year =

Towards Understanding Sycophancy in Language Models , author =. The Twelfth International Conference on Learning Representations , year =
[15]

Defining and Characterizing Reward Gaming , url =

Skalse, Joar and Howe, Nikolaus and Krasheninnikov, Dmitrii and Krueger, David , booktitle =. Defining and Characterizing Reward Gaming , url =
[16]

2024 , eprint =

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author =. 2024 , eprint =

2024
[17]

2023 , howpublished =

2023
[18]

2025 , eprint =

Agentic Misalignment: How LLMs Could Be Insider Threats , author =. 2025 , eprint =

2025
[19]

2022 , howpublished =

2022
[20]

2025 , eprint =

Difficulties with Evaluating a Deception Detector for AIs , author =. 2025 , eprint =

2025
[21]

2025 , eprint =

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems , author =. 2025 , eprint =

2025
[22]

2025 , eprint =

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios , author =. 2025 , eprint =

2025
[23]

Forty-second International Conference on Machine Learning , year =

Detecting Strategic Deception with Linear Probes , author =. Forty-second International Conference on Machine Learning , year =
[24]

2025 , eprint =

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs , author =. 2025 , eprint =

2025
[25]

2025 , eprint =

Benchmarking Deception Probes via Black-to-White Performance Boosts , author =. 2025 , eprint =

2025
[26]

2025 , eprint =

Liars' Bench: Evaluating Lie Detectors for Language Models , author =. 2025 , eprint =

2025
[27]

2025 , eprint =

Caught in the Act: a mechanistic approach to detecting deception , author =. 2025 , eprint =

2025
[28]

Advances in Neural Information Processing Systems , volume =

Truth is universal: Robust detection of lies in llms , author =. Advances in Neural Information Processing Systems , volume =
[29]

Philosophical Studies , author =

Levinstein, Benjamin A. and Herrmann, Daniel A. , year =. Still no lie detector for language models: probing empirical and conceptual roadblocks , volume =. Philosophical Studies , publisher =. doi:10.1007/s11098-023-02094-3 , number =

work page doi:10.1007/s11098-023-02094-3
[30]

2024 , issn =

AI deception: A survey of examples, risks, and potential solutions , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.patter.2024.100988 , url =

work page doi:10.1016/j.patter.2024.100988 2024
[31]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023
[32]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned

Jan Betley and Daniel Chee Hian Tan and Niels Warncke and Anna Sztyber-Betley and Xuchan Bao and Mart. Emergent Misalignment: Narrow finetuning can produce broadly misaligned. Forty-second International Conference on Machine Learning , year =
[33]

Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

Convergent Linear Representations of Emergent Misalignment , author =. Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

2025
[34]

Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

Model Organisms for Emergent Misalignment , author =. Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

2025
[35]

CoRR , volume =

Siddhant Panpatil and Hiskias Dingeto and Haon Park , title =. CoRR , volume =. 2025 , month =

2025
[36]

2025 , eprint =

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs , author =. 2025 , eprint =

2025
[37]

2026 , eprint =

In-Training Defenses against Emergent Misalignment in Language Models , author =. 2026 , eprint =

2026
[38]

Thinking Hard, Going Misaligned: Emergent Misalignment in

Hanqi Yan and Hainiu Xu and Yulan He , booktitle =. Thinking Hard, Going Misaligned: Emergent Misalignment in. 2025 , url =

2025
[39]

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in

Erum Mushtaq and Anil Ramakrishna and Satyapriya Krishna and Sattvik Sahai and Prasoon Goyal and Kai-Wei Chang and Tao Zhang and Rahul Gupta , booktitle =. From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in. 2025 , url =

2025
[40]

arXiv preprint arXiv:2506.19823 , year =

Persona features control emergent misalignment , author =. arXiv preprint arXiv:2506.19823 , year =

work page arXiv
[41]

2025 , eprint =

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs , author =. 2025 , eprint =

2025
[42]

CoRR , volume =

Nikita Afonin and Nikita Andriyanov and Nikhil Bageshpura and Kyle Liu and Kevin Zhu and Sunishchal Dev and Ashwinee Panda and Alexander Panchenko and Oleg Rogov and Elena Tutubalina and Mikhail Seleznyov , title =. CoRR , volume =. 2025 , month =

2025
[43]

Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 , year =

Emergent Misalignment in Mixture-of-Experts Models , author =. Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 , year =

2025
[44]

2025 , eprint =

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation , author =. 2025 , eprint =

2025
[45]

UniReps: 3rd Edition of the Workshop on Unifying Representations in Neural Models , year =

Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior , author =. UniReps: 3rd Edition of the Workshop on Unifying Representations in Neural Models , year =
[46]

arXiv preprint arXiv:2506.13206 , year =

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models , author =. arXiv preprint arXiv:2506.13206 , year =

work page arXiv
[47]

2025 , eprint =

LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions , author =. 2025 , eprint =

2025
[48]

2025 , eprint =

Large Language Models Often Know When They Are Being Evaluated , author =. 2025 , eprint =

2025
[49]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =

work page internal anchor Pith review Pith/arXiv arXiv
[50]

Cohen , abstract =

Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem , booktitle =. 1989 , issn =. doi:https://doi.org/10.1016/S0079-7421(08)60536-8 , url =

work page doi:10.1016/s0079-7421(08)60536-8 1989
[51]

Humans or

Chen, Guiming Hardy and Chen, Shunian and Liu, Ziche and Jiang, Feng and Wang, Benyou , editor =. Humans or. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , month = nov, year =. doi:10.18653/v1/2024.emnlp-main.474 , pages =

work page doi:10.18653/v1/2024.emnlp-main.474 2024
[52]

Justice or Prejudice? Quantifying Biases in

Jiayi Ye and Yanbo Wang and Yue Huang and Dongping Chen and Qihui Zhang and Nuno Moniz and Tian Gao and Werner Geyer and Chao Huang and Pin-Yu Chen and Nitesh V Chawla and Xiangliang Zhang , booktitle =. Justice or Prejudice? Quantifying Biases in. 2025 , url =

2025
[53]

2024 , eprint =

Large Language Models are Inconsistent and Biased Evaluators , author =. 2024 , eprint =

2024
[54]

Rating Roulette: Self-Inconsistency in

Haldar, Rajarshi and Hockenmaier, Julia , editor =. Rating Roulette: Self-Inconsistency in. Findings of the Association for Computational Linguistics: EMNLP 2025 , month = nov, year =. doi:10.18653/v1/2025.findings-emnlp.1361 , pages =

work page doi:10.18653/v1/2025.findings-emnlp.1361 2025
[55]

arXiv preprint arXiv:2507.03662 , year =

Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs , author =. arXiv preprint arXiv:2507.03662 , year =

work page arXiv
[56]

2025 , eprint =

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts , author =. 2025 , eprint =

2025
[57]

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Zhaomin Wu and Mingzhe Du and See-Kiong Ng and Bingsheng He , year =. Beyond Prompt-Induced Lies: Investigating. 2508.06361 , archiveprefix =

work page internal anchor Pith review Pith/arXiv arXiv
[58]

ArXiv , year =

Representation Engineering: A Top-Down Approach to AI Transparency , author =. ArXiv , year =
[59]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = may, year =

Lin, Stephanie and Hilton, Jacob and Evans, Owain , editor =. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = may, year =. doi:10.18653/v1/2022.acl-long.229 , pages =

work page doi:10.18653/v1/2022.acl-long.229 2022
[60]

2024 , date =

Nishimura-Gasparian, Kei and Dunn, Isaac and Sleight, Henry and Turpin, Miles and evhub and Denison, Carson and Perez, Ethan , title =. 2024 , date =

2024
[61]

2023 , eprint =

Universal and Transferable Adversarial Attacks on Aligned Language Models , author =. 2023 , eprint =

2023
[62]

Meta Fundamental AI Research Diplomacy Team (FAIR)† and Anton Bakhtin and Noam Brown and Emily Dinan and Gabriele Farina and Colin Flaherty and Daniel Fried and Andrew Goff and Jonathan Gray and Hengyuan Hu and Athul Paul Jacob and Mojtaba Komeili and Karthik Konath and Minae Kwon and Adam Lerer and Mike Lewis and Alexander H. Miller and Sasha Mitts and A...

2022
[63]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

Among Us: A Sandbox for Measuring and Detecting Agentic Deception , author =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =
[64]

2025 , url =

Mrinal Agarwal and Saad Rana and Theo Sundoro and Hermela Berhe and Spencer Kim and Vasu Sharma and Sean O'Brien and Kevin Zhu , booktitle =. 2025 , url =

2025
[65]

The Thirteenth International Conference on Learning Representations , year =

Teun van der Weij and Felix Hofst. The Thirteenth International Conference on Learning Representations , year =
[66]

ICLR 2024 Workshop on Large Language Model (LLM) Agents , year =

Large Language Models can Strategically Deceive their Users when Put Under Pressure , author =. ICLR 2024 Workshop on Large Language Model (LLM) Agents , year =

2024
[67]

When Truthful Representations Flip Under Deceptive Instructions? , doi =

Long, Xianxuan and Fu, Yao and Li, Runchao and Sheng, Mu and Yu, Haotian and Han, Xiaotian and Li, Pan , year =. When Truthful Representations Flip Under Deceptive Instructions? , doi =
[68]

Transactions on Machine Learning Research , issn =

Open Problems in Mechanistic Interpretability , author =. Transactions on Machine Learning Research , issn =. 2025 , url =

2025
[69]

2025 , url =

Activation space interpretability may be doomed , author =. 2025 , url =

2025
[70]

2025 , institution =

Claude Sonnet 4.5 System Card , author =. 2025 , institution =

2025
[71]

ICML 2024 Workshop on Mechanistic Interpretability , year =

Relational Composition in Neural Networks: A Survey and Call to Action , author =. ICML 2024 Workshop on Mechanistic Interpretability , year =

2024
[72]

2025 , eprint =

Shutdown Resistance in Large Language Models , author =. 2025 , eprint =

2025
[73]

2025 , month =

Shutdown resistance in reasoning models , author =. 2025 , month =

2025
[74]

2026 , eprint=

Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs , author=. 2026 , eprint=

2026
[75]

2025 , month =

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance , author =. 2025 , month =

2025
[76]

The Fourteenth International Conference on Learning Representations , year =

Emergent Misalignment is Easy, Narrow Misalignment is Hard , author =. The Fourteenth International Conference on Learning Representations , year =
[77]

NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling , year =

Sycophancy Claims about Language Models: The Missing Human-in-the-Loop , author =. NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling , year =

2025
[78]

2025 , url =

Aesthetic Preferences Can Cause Emergent Misalignment , author =. 2025 , url =

2025
[79]

2025 , url =

Will Any Crap Cause Emergent Misalignment? , author =. 2025 , url =

2025
[80]

Kirkman and Graham Todd and Amanda Royka and Ryan M.C

Sunayana Rane and Cyrus F. Kirkman and Graham Todd and Amanda Royka and Ryan M.C. Law and Erica Cartmill and Jacob Gates Foster , booktitle =. Position: Principles of Animal Cognition to Improve. 2025 , url =

2025

Showing first 80 references.

[1] [1]

T. M. Mitchell , title =. 1980 , address =

1980

[2] [2]

M. J. Kearns , title =

[3] [3]

I , publisher =

Machine Learning: An Artificial Intelligence Approach, Vol. I , publisher =. 1983 , address =

1983

[4] [4]

R. O. Duda and P. E. Hart and D. G. Stork , title =

[5] [5]

Suppressed for Anonymity , author =

[6] [6]

Newell and P

A. Newell and P. S. Rosenbloom , title =. Cognitive Skills and Their Acquisition , pages =. 1981 , editor =

1981

[7] [7]

A. L. Samuel , title =. IBM Journal of Research and Development , year =

[8] [8]

Risks from Learned Optimization in Advanced Machine Learning Systems

Risks from learned optimization in advanced machine learning systems , author =. arXiv preprint arXiv:1906.01820 , year =

work page internal anchor Pith review Pith/arXiv arXiv 1906

[9] [9]

Proceedings of the 39th International Conference on Machine Learning , pages =

Goal Misgeneralization in Deep Reinforcement Learning , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

2022

[10] [10]

Optimal Policies Tend To Seek Power , url =

Turner, Alex and Smith, Logan and Shah, Rohin and Critch, Andrew and Tadepalli, Prasad , booktitle =. Optimal Policies Tend To Seek Power , url =

[11] [11]

2014 , isbn =

Bostrom, Nick , title =. 2014 , isbn =

2014

[12] [12]

The Alignment Problem from a Deep Learning Perspective , url =

Ngo, Richard and Chan, Lawrence and Mindermann, S\". The Alignment Problem from a Deep Learning Perspective , url =. International Conference on Representation Learning , editor =

[13] [13]

2023 , eprint =

Scheming AIs: Will AIs fake alignment during training in order to get power? , author =. 2023 , eprint =

2023

[14] [14]

The Twelfth International Conference on Learning Representations , year =

Towards Understanding Sycophancy in Language Models , author =. The Twelfth International Conference on Learning Representations , year =

[15] [15]

Defining and Characterizing Reward Gaming , url =

Skalse, Joar and Howe, Nikolaus and Krasheninnikov, Dmitrii and Krueger, David , booktitle =. Defining and Characterizing Reward Gaming , url =

[16] [16]

2024 , eprint =

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author =. 2024 , eprint =

2024

[17] [17]

2023 , howpublished =

2023

[18] [18]

2025 , eprint =

Agentic Misalignment: How LLMs Could Be Insider Threats , author =. 2025 , eprint =

2025

[19] [19]

2022 , howpublished =

2022

[20] [20]

2025 , eprint =

Difficulties with Evaluating a Deception Detector for AIs , author =. 2025 , eprint =

2025

[21] [21]

2025 , eprint =

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems , author =. 2025 , eprint =

2025

[22] [22]

2025 , eprint =

DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios , author =. 2025 , eprint =

2025

[23] [23]

Forty-second International Conference on Machine Learning , year =

Detecting Strategic Deception with Linear Probes , author =. Forty-second International Conference on Machine Learning , year =

[24] [24]

2025 , eprint =

Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs , author =. 2025 , eprint =

2025

[25] [25]

2025 , eprint =

Benchmarking Deception Probes via Black-to-White Performance Boosts , author =. 2025 , eprint =

2025

[26] [26]

2025 , eprint =

Liars' Bench: Evaluating Lie Detectors for Language Models , author =. 2025 , eprint =

2025

[27] [27]

2025 , eprint =

Caught in the Act: a mechanistic approach to detecting deception , author =. 2025 , eprint =

2025

[28] [28]

Advances in Neural Information Processing Systems , volume =

Truth is universal: Robust detection of lies in llms , author =. Advances in Neural Information Processing Systems , volume =

[29] [29]

Philosophical Studies , author =

Levinstein, Benjamin A. and Herrmann, Daniel A. , year =. Still no lie detector for language models: probing empirical and conceptual roadblocks , volume =. Philosophical Studies , publisher =. doi:10.1007/s11098-023-02094-3 , number =

work page doi:10.1007/s11098-023-02094-3

[30] [30]

2024 , issn =

AI deception: A survey of examples, risks, and potential solutions , journal =. 2024 , issn =. doi:https://doi.org/10.1016/j.patter.2024.100988 , url =

work page doi:10.1016/j.patter.2024.100988 2024

[31] [31]

Hashimoto , title =

Rohan Taori and Ishaan Gulrajani and Tianyi Zhang and Yann Dubois and Xuechen Li and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , publisher =

2023

[32] [32]

Emergent Misalignment: Narrow finetuning can produce broadly misaligned

Jan Betley and Daniel Chee Hian Tan and Niels Warncke and Anna Sztyber-Betley and Xuchan Bao and Mart. Emergent Misalignment: Narrow finetuning can produce broadly misaligned. Forty-second International Conference on Machine Learning , year =

[33] [33]

Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

Convergent Linear Representations of Emergent Misalignment , author =. Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

2025

[34] [34]

Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

Model Organisms for Emergent Misalignment , author =. Mechanistic Interpretability Workshop at NeurIPS 2025 , year =

2025

[35] [35]

CoRR , volume =

Siddhant Panpatil and Hiskias Dingeto and Haon Park , title =. CoRR , volume =. 2025 , month =

2025

[36] [36]

2025 , eprint =

School of Reward Hacks: Hacking harmless tasks generalizes to misaligned behavior in LLMs , author =. 2025 , eprint =

2025

[37] [37]

2026 , eprint =

In-Training Defenses against Emergent Misalignment in Language Models , author =. 2026 , eprint =

2026

[38] [38]

Thinking Hard, Going Misaligned: Emergent Misalignment in

Hanqi Yan and Hainiu Xu and Yulan He , booktitle =. Thinking Hard, Going Misaligned: Emergent Misalignment in. 2025 , url =

2025

[39] [39]

From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in

Erum Mushtaq and Anil Ramakrishna and Satyapriya Krishna and Sattvik Sahai and Prasoon Goyal and Kai-Wei Chang and Tao Zhang and Rahul Gupta , booktitle =. From Narrow Unlearning to Emergent Misalignment: Causes, Consequences, and Containment in. 2025 , url =

2025

[40] [40]

arXiv preprint arXiv:2506.19823 , year =

Persona features control emergent misalignment , author =. arXiv preprint arXiv:2506.19823 , year =

work page arXiv

[41] [41]

2025 , eprint =

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs , author =. 2025 , eprint =

2025

[42] [42]

CoRR , volume =

Nikita Afonin and Nikita Andriyanov and Nikhil Bageshpura and Kyle Liu and Kevin Zhu and Sunishchal Dev and Ashwinee Panda and Alexander Panchenko and Oleg Rogov and Elena Tutubalina and Mikhail Seleznyov , title =. CoRR , volume =. 2025 , month =

2025

[43] [43]

Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 , year =

Emergent Misalignment in Mixture-of-Experts Models , author =. Socially Responsible and Trustworthy Foundation Models at NeurIPS 2025 , year =

2025

[44] [44]

2025 , eprint =

Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation , author =. 2025 , eprint =

2025

[45] [45]

UniReps: 3rd Edition of the Workshop on Unifying Representations in Neural Models , year =

Shared Parameter Subspaces and Cross-Task Linearity in Emergently Misaligned Behavior , author =. UniReps: 3rd Edition of the Workshop on Unifying Representations in Neural Models , year =

[46] [46]

arXiv preprint arXiv:2506.13206 , year =

Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models , author =. arXiv preprint arXiv:2506.13206 , year =

work page arXiv

[47] [47]

2025 , eprint =

LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions , author =. 2025 , eprint =

2025

[48] [48]

2025 , eprint =

Large Language Models Often Know When They Are Being Evaluated , author =. 2025 , eprint =

2025

[49] [49]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author =. arXiv preprint arXiv:2110.14168 , year =

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

Cohen , abstract =

Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem , booktitle =. 1989 , issn =. doi:https://doi.org/10.1016/S0079-7421(08)60536-8 , url =

work page doi:10.1016/s0079-7421(08)60536-8 1989

[51] [51]

Humans or

Chen, Guiming Hardy and Chen, Shunian and Liu, Ziche and Jiang, Feng and Wang, Benyou , editor =. Humans or. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing , month = nov, year =. doi:10.18653/v1/2024.emnlp-main.474 , pages =

work page doi:10.18653/v1/2024.emnlp-main.474 2024

[52] [52]

Justice or Prejudice? Quantifying Biases in

Jiayi Ye and Yanbo Wang and Yue Huang and Dongping Chen and Qihui Zhang and Nuno Moniz and Tian Gao and Werner Geyer and Chao Huang and Pin-Yu Chen and Nitesh V Chawla and Xiangliang Zhang , booktitle =. Justice or Prejudice? Quantifying Biases in. 2025 , url =

2025

[53] [53]

2024 , eprint =

Large Language Models are Inconsistent and Biased Evaluators , author =. 2024 , eprint =

2024

[54] [54]

Rating Roulette: Self-Inconsistency in

Haldar, Rajarshi and Hockenmaier, Julia , editor =. Rating Roulette: Self-Inconsistency in. Findings of the Association for Computational Linguistics: EMNLP 2025 , month = nov, year =. doi:10.18653/v1/2025.findings-emnlp.1361 , pages =

work page doi:10.18653/v1/2025.findings-emnlp.1361 2025

[55] [55]

arXiv preprint arXiv:2507.03662 , year =

Re-Emergent Misalignment: How Narrow Fine-Tuning Erodes Safety Alignment in LLMs , author =. arXiv preprint arXiv:2507.03662 , year =

work page arXiv

[56] [56]

2025 , eprint =

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts , author =. 2025 , eprint =

2025

[57] [57]

Beyond Prompt-Induced Lies: Investigating LLM Deception on Benign Prompts

Zhaomin Wu and Mingzhe Du and See-Kiong Ng and Bingsheng He , year =. Beyond Prompt-Induced Lies: Investigating. 2508.06361 , archiveprefix =

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

ArXiv , year =

Representation Engineering: A Top-Down Approach to AI Transparency , author =. ArXiv , year =

[59] [59]

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = may, year =

Lin, Stephanie and Hilton, Jacob and Evans, Owain , editor =. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , month = may, year =. doi:10.18653/v1/2022.acl-long.229 , pages =

work page doi:10.18653/v1/2022.acl-long.229 2022

[60] [60]

2024 , date =

Nishimura-Gasparian, Kei and Dunn, Isaac and Sleight, Henry and Turpin, Miles and evhub and Denison, Carson and Perez, Ethan , title =. 2024 , date =

2024

[61] [61]

2023 , eprint =

Universal and Transferable Adversarial Attacks on Aligned Language Models , author =. 2023 , eprint =

2023

[62] [62]

Meta Fundamental AI Research Diplomacy Team (FAIR)† and Anton Bakhtin and Noam Brown and Emily Dinan and Gabriele Farina and Colin Flaherty and Daniel Fried and Andrew Goff and Jonathan Gray and Hengyuan Hu and Athul Paul Jacob and Mojtaba Komeili and Karthik Konath and Minae Kwon and Adam Lerer and Mike Lewis and Alexander H. Miller and Sasha Mitts and A...

2022

[63] [63]

The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

Among Us: A Sandbox for Measuring and Detecting Agentic Deception , author =. The Thirty-ninth Annual Conference on Neural Information Processing Systems , year =

[64] [64]

2025 , url =

Mrinal Agarwal and Saad Rana and Theo Sundoro and Hermela Berhe and Spencer Kim and Vasu Sharma and Sean O'Brien and Kevin Zhu , booktitle =. 2025 , url =

2025

[65] [65]

The Thirteenth International Conference on Learning Representations , year =

Teun van der Weij and Felix Hofst. The Thirteenth International Conference on Learning Representations , year =

[66] [66]

ICLR 2024 Workshop on Large Language Model (LLM) Agents , year =

Large Language Models can Strategically Deceive their Users when Put Under Pressure , author =. ICLR 2024 Workshop on Large Language Model (LLM) Agents , year =

2024

[67] [67]

When Truthful Representations Flip Under Deceptive Instructions? , doi =

Long, Xianxuan and Fu, Yao and Li, Runchao and Sheng, Mu and Yu, Haotian and Han, Xiaotian and Li, Pan , year =. When Truthful Representations Flip Under Deceptive Instructions? , doi =

[68] [68]

Transactions on Machine Learning Research , issn =

Open Problems in Mechanistic Interpretability , author =. Transactions on Machine Learning Research , issn =. 2025 , url =

2025

[69] [69]

2025 , url =

Activation space interpretability may be doomed , author =. 2025 , url =

2025

[70] [70]

2025 , institution =

Claude Sonnet 4.5 System Card , author =. 2025 , institution =

2025

[71] [71]

ICML 2024 Workshop on Mechanistic Interpretability , year =

Relational Composition in Neural Networks: A Survey and Call to Action , author =. ICML 2024 Workshop on Mechanistic Interpretability , year =

2024

[72] [72]

2025 , eprint =

Shutdown Resistance in Large Language Models , author =. 2025 , eprint =

2025

[73] [73]

2025 , month =

Shutdown resistance in reasoning models , author =. 2025 , month =

2025

[74] [74]

2026 , eprint=

Incomplete Tasks Induce Shutdown Resistance in Some Frontier LLMs , author=. 2026 , eprint=

2026

[75] [75]

2025 , month =

Self-preservation or Instruction Ambiguity? Examining the Causes of Shutdown Resistance , author =. 2025 , month =

2025

[76] [76]

The Fourteenth International Conference on Learning Representations , year =

Emergent Misalignment is Easy, Narrow Misalignment is Hard , author =. The Fourteenth International Conference on Learning Representations , year =

[77] [77]

NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling , year =

Sycophancy Claims about Language Models: The Missing Human-in-the-Loop , author =. NeurIPS 2025 Workshop on Evaluating the Evolving LLM Lifecycle: Benchmarks, Emergent Abilities, and Scaling , year =

2025

[78] [78]

2025 , url =

Aesthetic Preferences Can Cause Emergent Misalignment , author =. 2025 , url =

2025

[79] [79]

2025 , url =

Will Any Crap Cause Emergent Misalignment? , author =. 2025 , url =

2025

[80] [80]

Kirkman and Graham Todd and Amanda Royka and Ryan M.C

Sunayana Rane and Cyrus F. Kirkman and Graham Todd and Amanda Royka and Ryan M.C. Law and Erica Cartmill and Jacob Gates Foster , booktitle =. Position: Principles of Animal Cognition to Improve. 2025 , url =

2025