Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.AI 3verdicts
UNVERDICTED 3representative citing papers
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.
citing papers explorer
-
The Impact of Off-Policy Training Data on Probe Generalisation
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-policy success.
-
Do Linear Probes Generalize Better in Persona Coordinates?
Persona axes derived from contrastive prompts and PCA yield linear probes that generalize better than raw-activation probes across 10 datasets for deception and sycophancy.
-
The Possibility of Artificial Intelligence Becoming a Subject and the Alignment Problem
Dominant control-based AI alignment falls short for potential AGI subjects; a parenting model drawing on Turing's child machines should foster gradual autonomy and cooperative coexistence.