LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
The effects of reward misspecification: Mapping and mitigating misaligned models
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.AI 2representative citing papers
A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, causing initial oversight gains to increase long-term strategic vulnerability.
citing papers explorer
-
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
-
AI Governance under Political Turnover: The Alignment Surface of Compliance Design
A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, causing initial oversight gains to increase long-term strategic vulnerability.