Can Machines Learn Morality? The Delphi Experiment
5 Pith papers cite this work.
citation-role summary: background 1
citation-polarity summary: background 1
representative citing papers
Frontier LLMs approximate human story morals but show markedly less cross-linguistic variation and narrower value focus than human responses across 14 language-culture pairs.
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
LLMs deviate from human moral preferences in kidney allocation scenarios and rarely express indecision, though low-rank fine-tuning with few examples can improve both consistency and uncertainty calibration.
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
citing papers explorer
- Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values
Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.
- Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.