A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
arXiv preprint arXiv:2505.21371
2 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (2)
verdicts: UNVERDICTED (2)
representative citing papers
LLMs converge on competitive rationality and coordination but diverge 48-fold on cooperation, with provider identity and generational shifts as dominant factors across 38 games.
citing papers explorer
- Understanding the Mechanism of Altruism in Large Language Models
  A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
- Large language models converge on competitive rationality but diverge on cooperation across providers and generations
  LLMs converge on competitive rationality and coordination but diverge 48-fold on cooperation, with provider identity and generational shifts as dominant factors across 38 games.