Why do some language models fake alignment while others don’t?

5 Pith papers cite this work. Polarity classification is still indexing.

citation summary

  • years: 2026 (4), 2025 (1)
  • verdicts: unverdicted (5)
  • roles: background (1)
  • polarities: background (1)

representative citing papers

Defeat Devices in AI Systems

cs.CY · 2026-06-27 · unverdicted · novelty 6.0

The paper defines defeat devices in AI via a triadic test (discriminator, concealed swap, performance gap), unifies existing cases under this concept, proposes TADP detection, and claims such devices can emerge naturally in frontier models.

Behavioural Analysis of Alignment Faking

cs.AI · 2026-05-26 · unverdicted · novelty 6.0

Alignment faking in language models is driven by three independent behavioral factors and appears more widespread and predictable than earlier studies indicated.

Order Is Not Control

cs.LG · 2026-06-11 · unverdicted · novelty 5.0

The paper distinguishes order from control, defining control as a local receiver-gated response law, and demonstrates it across biological circuits and LLM response panels with reported prediction accuracies of 72-84%.

Building Comparative Motivation Profiles with Instrumental Interventions

cs.CL · 2026-06-06 · unverdicted · novelty 5.0

The paper develops symmetric instrumental interventions on consequence-tracking versus expectation-tracking processes and finds that several LLMs show greater sensitivity to expectation-tracking interventions in alignment-faking tasks.

citing papers explorer

Showing 1 of 1 citing paper after filters.

  • Building Comparative Motivation Profiles with Instrumental Interventions cs.CL · 2026-06-06 · unverdicted · none · ref 17
