The effects of reward misspecification: Mapping and mitigating misaligned models

Alexander Pan, Kush Bhatia, Jacob Steinhardt · 2022

2 Pith papers cite this work. Polarity classification is still being indexed.

representative citing papers

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

AI Governance under Political Turnover: The Alignment Surface of Compliance Design

cs.AI · 2026-04-22 · unverdicted · novelty 6.0

A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, causing initial oversight gains to increase long-term strategic vulnerability.

citing papers explorer

Showing 2 of 2 citing papers.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · none · ref 26

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

AI Governance under Political Turnover: The Alignment Surface of Compliance Design

cs.AI · 2026-04-22 · unverdicted · none · ref 42

A formal model shows that AI compliance designs in government create learnable approval boundaries that political successors can exploit, causing initial oversight gains to increase long-term strategic vulnerability.