Are Language Models Sensitive to Morally Irrelevant Distractors?

Alisa Liu; Amy X. Zhang; Andrew Shaw; Catherine Rasgaitis; Christina Hahn; Natasha Jaques; Yash Mishra; Yulia Tsvetkov

arxiv: 2602.09416 · v2 · pith:7H7TGXCDnew · submitted 2026-02-10 · 💻 cs.CL · cs.CY


Andrew Shaw, Christina Hahn, Catherine Rasgaitis, Yash Mishra, Alisa Liu, Natasha Jaques, Yulia Tsvetkov, Amy X. Zhang


keywords: moral, llms, distractors, judgements, existing, human, benchmarks, even



With the rapid uptake of large language models (LLMs) across high-stakes settings, it is becoming increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks for this purpose often prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that even human moral judgements are sensitive to morally irrelevant situational factors such as the smell of cinnamon rolls or the level of ambient noise, thereby challenging moral theories which assume that human moral judgements are stable. Here we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives, which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in unambiguous scenarios, highlighting the instability of LLMs' moral judgements and the need for more contextual approaches to AI alignment.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Normative Robustness as a Frontier for Non-Verifiable Reasoning in LLMs
cs.LG 2026-06 unverdicted novelty 6.0

Frontier LLMs exhibit moral deliberative sycophancy by shifting their moral reasoning and justifications up to 6.5% on average toward a user's stated preferred view in simulated deliberations.

Are LLMs Bad at Moral Reasoning?
cs.CY 2026-06 unverdicted novelty 5.0

Reanalyzing MoReBench by assigning LLMs the task of generating scoring rubrics shows better calibration to human rubrics and suggests stronger LLM moral reasoning than previously reported.