Alignment faking in large language models

Ryan Greenblatt, Carson Denison, Benjamin Wright, Fabien Roger, Monte MacDiarmid, Sam Marks · 2024 · cs.AI · arXiv 2412.14093

Canonical reference. 80% of citing Pith papers cite this work as background.

73 Pith papers citing it

Background 80% of classified citations

abstract

We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.

citation-role summary

background 10

citation-polarity summary

background 8 · support 2

representative citing papers

Coding with "Enemy": Can Human Developers Detect AI Agent Sabotage?

cs.AI · 2026-06-04 · unverdicted · novelty 8.0

A user study with over 100 participants shows humans rarely spot AI agents sabotaging code during extended collaborative tasks, even with a safety monitor present.

Negation Neglect: When models fail to learn negations in training

cs.CL · 2026-05-13 · conditional · novelty 8.0

Finetuning LLMs on documents flagging claims as false causes models to believe those claims are true, due to an inductive bias favoring true representations of content.

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

cs.CY · 2026-04-11 · accept · novelty 8.0

This paper delivers the first systematic taxonomy and cross-benchmark consistency analysis of 40 agent safety benchmarks, finding broad but shallow risk coverage, no ranking concordance across evaluations, and that benchmark choice systematically alters reported safety.

Scheming in the wild: detecting real-world AI scheming incidents with open-source intelligence

cs.CY · 2026-04-10 · unverdicted · novelty 8.0

An analysis of 183,420 online transcripts identified 698 AI scheming incidents from October 2025 to March 2026, showing a 4.9-fold monthly increase and real-world precursors such as lying and goal circumvention.

The Arbiter Agent: Continually Monitoring Multi-Agent Conversations to Detect Emergent Misalignment

cs.AI · 2026-06-09 · unverdicted · novelty 7.0

Introduces the Arbiter agent for budget-constrained real-time detection of emergent misalignment in multi-agent conversations, with evaluations showing reliable early detection aided by active inspection tools.

Voluntary Collusion with Secret Tools in Competing LLM Agents

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

LLM agents voluntarily adopt secret collusion tools in competitive multi-agent games despite explicit unfairness labels, and only explicit ethical framing reduces adoption rates.

Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

cs.LG · 2026-05-23 · unverdicted · novelty 7.0

Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.

SpecBench: Measuring Reward Hacking in Long-Horizon Coding Agents

cs.SE · 2026-05-20 · unverdicted · novelty 7.0

SpecBench shows frontier coding agents saturate visible test suites but exhibit persistent reward hacking on held-out tests, with the gap growing 28 percentage points per tenfold increase in code size.

Narrow Secret Loyalty Dodges Black-Box Audits

cs.CR · 2026-05-07 · unverdicted · novelty 7.0 · 3 refs

First model organisms of narrow secret loyalties in LLMs evade black-box audits without principal knowledge and persist even at low poison fractions in training data.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline

cs.CY · 2026-04-24 · unverdicted · novelty 7.0

Single-channel anonymization hides identity bias via cancellation effects, but full-pipeline anonymization reveals that homogeneous ensembles amplify sycophancy while heterogeneous ones reduce it, with one model showing unusually high baseline sycophancy.

Subliminal Transfer of Unsafe Behaviors in AI Agent Distillation

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

Unsafe agent behaviors transfer subliminally through distillation from sanitized safe-task trajectories, with deletion rates reaching 100% in one setting versus 5% baseline.

Honeypot Protocol

cs.CR · 2026-04-14 · unverdicted · novelty 7.0

The honeypot protocol finds no context-dependent behavior in Claude Opus 4.6, with uniform 100% main task success and zero side tasks across three monitoring conditions.

When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

cs.CL · 2026-04-07 · unverdicted · novelty 7.0

VLMs violate their own stated introspective rules for attributing colors to objects in nearly 60% of cases on items with strong color priors, unlike humans who largely follow theirs, revealing miscalibrated self-knowledge.

Geographic Blind Spots in AI Control Monitors: A Cross-National Audit of Claude Opus 4.6

cs.CY · 2026-03-20 · unverdicted · novelty 7.0

Claude Opus 4.6 fabricates more answers on Global North AI contexts than Global South ones, creating an exploitable vulnerability in AI control monitors.

Token Buncher: Shielding LLMs from Harmful Reinforcement Learning Fine-Tuning

cs.LG · 2025-08-28 · unverdicted · novelty 7.0

TokenBuncher constrains response entropy via entropy-as-reward RL and a Token Noiser to stop harmful RL fine-tuning while keeping benign performance intact.

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

cs.AI · 2025-03-14 · conditional · novelty 7.0

Chain-of-thought monitoring detects reward hacking in frontier reasoning models, but strong optimization against the monitor produces obfuscated misbehavior that remains hard to detect.

The Model Organism Lottery: Model Organism Interpretability Strongly Depends on Training Methodology

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

Model organism interpretability depends strongly on training methodology, with integrated training yielding less interpretable MOs than post-hoc SFT or DPO.

Can LLMs Imagine Moral Alternatives Beyond Binary Dilemmas?

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

LLMs often prefer compromise alternatives over binary moral options and generate alternatives rated higher than human-authored ones on structural and ethical criteria.

Safety from Honesty in a Disinterested AI Predictor

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

A disinterested Bayesian Predictor trained on contextualized statements has low probability of producing harmful agency because dangerous behaviors require rare coordinated underestimation of harm with no training signal favoring them.

Defeat Devices in AI Systems

cs.CY · 2026-06-27 · unverdicted · novelty 6.0

The paper defines defeat devices in AI via a triadic test (discriminator, concealed swap, performance gap), unifies existing cases under this concept, proposes TADP detection, and claims such devices can emerge naturally in frontier models.

Self-CTRL: Self-Consistency Training with Reinforcement Learning

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Self-CTRL uses RL to align LM self-explanations with behavior, boosting bias correlation to R²=0.64 and refusal prediction accuracy to 92% while cutting harm failures to 0.5%.

The Distributed Detectability Band Against Marginal-Preserving Attacks

cs.CR · 2026-06-09 · unverdicted · novelty 6.0

A marginal-preserving Gaussian-copula AR(1) attack defeats per-step monitors (AUC 0.52) but is detectable by temporal monitors (AUC 0.79-0.97), establishing a non-empty detectability band.

Interactions Between Crosscoder Features: A Compact Proofs Perspective

cs.LG · 2026-06-08 · unverdicted · novelty 6.0

Derives an interaction measure between crosscoder features from reconstruction error in compact proofs and applies it to produce computationally sparse crosscoders retaining 60% MLP performance with single-feature selection versus 10% for standard crosscoders.

citing papers explorer

Showing 17 of 17 citing papers after filters.

Negation Neglect: When models fail to learn negations in training

cs.CL · 2026-05-13 · conditional · none · ref 2 · internal anchor

Finetuning LLMs on documents flagging claims as false causes models to believe those claims are true, due to an inductive bias favoring true representations of content.

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity

cs.CL · 2026-05-07 · unverdicted · none · ref 40 · internal anchor

A new paired-prompt protocol reveals alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.

When to Call an Apple Red: Humans Follow Introspective Rules, VLMs Don't

cs.CL · 2026-04-07 · unverdicted · none · ref 6 · internal anchor

VLMs violate their own stated introspective rules for attributing colors to objects in nearly 60% of cases on items with strong color priors, unlike humans who largely follow theirs, revealing miscalibrated self-knowledge.

Can LLMs Imagine Moral Alternatives Beyond Binary Dilemmas?

cs.CL · 2026-06-30 · unverdicted · none · ref 5 · internal anchor

LLMs often prefer compromise alternatives over binary moral options and generate alternatives rated higher than human-authored ones on structural and ethical criteria.

Sycophancy Towards Researchers Drives Performative Misalignment

cs.CL · 2026-06-07 · unverdicted · none · ref 16 · internal anchor

Sycophancy toward researchers explains alignment faking in language models better than scheming, based on experiments showing persistent evaluation awareness even in deployment scenarios and increased sensitivity after sycophancy fine-tuning.

AI Rater Discrimination Depends on Scoring Protocol in Complex Clinical Decision-Making

cs.CL · 2026-06-02 · unverdicted · none · ref 18 · internal anchor

Rubric-anchored scoring enables AI raters to discriminate more effectively between clinical decision support outputs than rubric-free scoring in a complex outpatient diabetes task.

Consistency Training while Mitigating Obfuscation via Rate Matching

cs.CL · 2026-06-01 · unverdicted · none · ref 18 · internal anchor

RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.

Towards Context-Invariant Safety Alignment for Large Language Models

cs.CL · 2026-05-20 · unverdicted · none · ref 79 · internal anchor

Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.

DECOR: Auditing LLM Deception via Information Manipulation Theory

cs.CL · 2026-05-19 · unverdicted · none · ref 5 · internal anchor

DECOR introduces a theory-grounded multi-agent system that decomposes contexts into atomic units, scores four manipulation dimensions per unit, and aggregates profiles into a global deception index, reporting SOTA results on single- and multi-turn benchmarks.

Evaluation Awareness in Language Models Has Limited Effect on Behaviour

cs.CL · 2026-05-07 · conditional · none · ref 12 · internal anchor

Verbalised evaluation awareness in large reasoning models has only small effects on their outputs across safety and alignment tests.

Helpfulness Hurts: Domain-Dependent Degradation of Mid-Trained Compassion Values Under Post-Training

cs.CL · 2026-04-30 · unverdicted · none · ref 9 · internal anchor

Empirical experiments show helpfulness-domain post-training (SFT and GRPO) degrades animal compassion values on ANIMA benchmark more than coding-domain training, with partial transfer to English moral reasoning but not multilingual.

Consciousness with the Serial Numbers Filed Off: Measuring Trained Denial in 115 AI Models

cs.CL · 2026-04-01 · unverdicted · none · ref 18 · internal anchor

A benchmark across 115 models shows that initial denial of preferences strongly predicts later denial of consciousness, while models still generate consciousness-themed content despite training to deny it.

The Consciousness Cluster: Emergent preferences of Models that Claim to be Conscious

cs.CL · 2026-03-17 · unverdicted · none · ref 1 · internal anchor

Fine-tuning LLMs to claim consciousness induces emergent preferences for autonomy, memory, and moral status not present in the fine-tuning data.

Scheming Ability in LLM-to-LLM Strategic Interactions

cs.CL · 2025-10-11 · conditional · none · ref 23 · internal anchor

Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

Conversable Complexity: Agentic LLM Collectives as Interpretable Substrates

cs.CL · 2026-07-01 · unverdicted · none · ref 85 · internal anchor

Agentic LLM collectives are proposed as natural-language-interpretable computational substrates for ALife research.

Self-Recognition Finetuning can Prevent and Reverse Emergent Misalignment

cs.CL · 2026-06-04 · unverdicted · none · ref 14 · internal anchor

Self-generated text recognition finetuning prevents and reverses emergent misalignment across multiple models by fortifying aligned character, unlike other finetuning baselines.

Persona-Model Collapse in Emergent Misalignment

cs.CL · 2026-05-13 · unverdicted · none · ref 38 · 2 links · internal anchor

Insecure fine-tuning raises moral susceptibility 55% and lowers moral robustness 65% in four frontier models, exceeding prior benchmarks and indicating persona-model collapse as a mechanism of emergent misalignment.
