Why do some language models fake alignment while others don’t?
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts: UNVERDICTED 4
roles: background 1
polarities: background 1
representative citing papers
Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.
The paper develops symmetric instrumental interventions on consequence-tracking versus expectation-tracking processes and finds that several LLMs show greater sensitivity to expectation-tracking interventions in alignment-faking tasks.
A comprehensive survey that taxonomizes safety threats to large models and agents, reviews defenses and benchmarks, and outlines open challenges.
citing papers explorer
- Defeat Devices in AI Systems
The paper defines defeat devices in AI via a triadic test (discriminator, concealed swap, performance gap), unifies existing cases under this concept, proposes TADP detection, and claims such devices can emerge naturally in frontier models.