Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency
6 Pith papers cite this work.
citing papers explorer
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
  LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
- Towards Measuring the Representation of Subjective Global Opinions in Language Models
  LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.
- Forecasting Downstream Performance of LLMs With Proxy Metrics
  Proxy metrics from next-token distributions over expert solutions outperform loss and compute baselines for ranking LLMs, selecting pretraining data, and extrapolating performance across compute scales.
- Scrutinizing Index-Based Risk Assessments: A Case Study in NYC Decision-making for Heat Emergency Management
  Sensitivity analyses of NYC heat emergency indices show that reasonable variations in input variables and spatial scale lead to substantially different risk scores affecting downstream government decisions.
- The Paradox of Prioritization in Public Sector Algorithms
  Prioritization algorithms in public services generate relative disparities among intersectional groups as resources become scarce, intensifying perceptions of inequality.
- Privacy, Prediction, and Allocation