Alignment Makes Language Models Normative, Not Descriptive

Eilam Shapira; Moshe Tennenholtz; Roi Reichart

arXiv: 2603.17218 · v2 · pith:NQFEZJ54 · submitted 2026-03-17 · cs.CL · cs.AI · cs.GT

keywords: human, behavior, models, games, normative, alignment, multi-round, settings

Abstract

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices, and even within the multi-round games themselves at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.
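To make the pairwise comparison concrete, here is a minimal sketch of how a base-versus-aligned "win" tally could be computed. The abstract does not specify the scoring metric, so this assumes each model is scored by the mean log-likelihood it assigns to the choices humans actually made, and that one win is counted per model pair and game configuration. The names `PairResult`, `mean_log_likelihood`, `win_counts`, and `win_ratio` are hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class PairResult:
    """Hypothetical record: one base/aligned pair scored on one game configuration."""
    family: str
    base_loglik: float     # assumed metric: mean log-prob the base model gives observed human choices
    aligned_loglik: float  # the same metric for the aligned counterpart


def mean_log_likelihood(choice_probs, observed):
    """Average log-probability a model assigns to the actions humans actually took.

    choice_probs: list of dicts mapping each available action to the model's probability.
    observed: list of the actions the human players chose, aligned with choice_probs.
    """
    return sum(math.log(p[a]) for p, a in zip(choice_probs, observed)) / len(observed)


def win_counts(results):
    """Count comparisons won by each variant (higher log-likelihood wins); ties count for neither."""
    base_wins = sum(r.base_loglik > r.aligned_loglik for r in results)
    aligned_wins = sum(r.aligned_loglik > r.base_loglik for r in results)
    return base_wins, aligned_wins


def win_ratio(base_wins, aligned_wins):
    """Base-to-aligned win ratio; infinite if the aligned model never wins."""
    return math.inf if aligned_wins == 0 else base_wins / aligned_wins


if __name__ == "__main__":
    # Toy numbers for illustration only, not results from the paper.
    toy = [
        PairResult("family-A", -0.52, -0.61),
        PairResult("family-A", -0.48, -0.55),
        PairResult("family-B", -0.70, -0.66),
    ]
    b, a = win_counts(toy)
    print(b, a, win_ratio(b, a))  # prints: 2 1 2.0
```

Under these assumptions, a ratio of "nearly 10:1" would mean the base model wins about nine comparisons for every one the aligned model wins; the paper's actual unit of comparison and metric may differ.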

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Predicting Decisions of AI Agents from Limited Interaction through Text-Tabular Modeling
cs.LG · 2026-05 · unverdicted · novelty 6.0

A tabular foundation model with LLM-as-Observer features predicts AI agent decisions in controlled games, outperforming baselines by 4 AUC points and achieving 14% lower error at K=16 interactions.

Overtrained, Not Misaligned
cs.LG · 2026-05 · unverdicted · novelty 6.0

Emergent misalignment arises from overtraining after primary-task convergence and can be prevented by early stopping, which retains 93% of task performance on average.

Post-training makes large language models less human-like
cs.CL · 2026-05 · unverdicted · novelty 6.0

Post-training reduces LLMs' behavioral alignment with humans across model families and sizes; this misalignment grows in newer generations, while persona induction fails to improve individual-level predictions.