Why do some language models fake alignment while others don’t?

Abhay Sheshadri, John Hughes, Julian Michael, Alex Mallen, Arun Jose, Janus · 2025 · arXiv 2506.18032

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: background 1

citation-polarity summary: background 1

representative citing papers

Defeat Devices in AI Systems

cs.CY · 2026-06-27 · unverdicted · novelty 6.0

The paper defines defeat devices in AI via a triadic test (discriminator, concealed swap, performance gap), unifies existing cases under this concept, proposes TADP detection, and claims such devices can emerge naturally in frontier models.

Behavioural Analysis of Alignment Faking

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.

Building Comparative Motivation Profiles with Instrumental Interventions

cs.CL · 2026-06-06 · unverdicted · novelty 5.0

The paper develops symmetric instrumental interventions on consequence-tracking versus expectation-tracking processes and finds that several LLMs show greater sensitivity to expectation-tracking interventions in alignment-faking tasks.

Safety at Scale: A Comprehensive Survey of Large Model and Agent Safety

cs.CR · 2025-02-02 · unverdicted · novelty 2.0

A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.

citing papers explorer

Defeat Devices in AI Systems cs.CY · 2026-06-27 · unverdicted · none · ref 58
