RLHF May Not Reflect Genuine Preferences

Bijean Ghafouri; Emilio Ferrara; Eun Cheol Choi; Priyanka Dey

arxiv: 2604.03238 · v2 · pith:6CWT6K3Qnew · submitted 2026-01-31 · 💻 cs.HC

Pith reviewed 2026-05-16 08:36 UTC · model grok-4.3

classification 💻 cs.HC

keywords RLHF, human preferences, measurement validity, non-attitudes, constructed preferences, behavioral science, AI alignment, preference elicitation

The pith

Measuring human preferences in RLHF is a social science problem because responses often reflect non-attitudes and constructed preferences rather than genuine opinions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

RLHF systems assume that annotation responses capture genuine human preferences. Behavioral science has shown for decades that people frequently give answers without holding real opinions, build preferences from immediate context, and interpret the same question differently. These patterns matter most for the value-laden judgments used in AI alignment. The paper supplies a taxonomy separating genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, plus diagnostic methods to detect each. Treating measurement validity as prior to aggregation could stop models from being trained on noise presented as human values.

Core claim

We argue that measuring human preferences in RLHF is a social science problem. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each.

What carries the argument

A taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each.

Load-bearing premise

The documented phenomena of non-attitudes and constructed preferences from general behavioral science transfer directly and pervasively to RLHF annotation tasks without domain-specific validation.

What would settle it

An empirical application of the proposed diagnostics to a standard RLHF preference dataset that finds rates of non-attitudes and constructed preferences no higher than in non-value-laden tasks, or that filtering them leaves model behavior unchanged.

Figures

Figures reproduced from arXiv: 2604.03238 by Bijean Ghafouri, Emilio Ferrara, Eun Cheol Choi, Priyanka Dey.

**Figure 1.** Decision procedure for classifying annotation responses. Responses pass through validity diagnostics before being treated as preference signals.
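The decision procedure described in the Figure 1 caption can be sketched roughly as follows. This is a minimal illustration, not the paper's procedure: the three diagnostic fields, their names, and the order in which they are checked are all assumptions introduced here. Only the four category names come from the taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class ResponseType(Enum):
    # The four categories named in the paper's taxonomy.
    NON_ATTITUDE = "non-attitude"
    CONSTRUCTED = "constructed preference"
    MEASUREMENT_ARTIFACT = "measurement artifact"
    GENUINE = "genuine preference"


@dataclass
class DiagnosticResult:
    # HYPOTHETICAL diagnostics, invented for illustration only.
    shared_interpretation: bool  # annotator read the item as intended
    stable_on_retest: bool       # same answer when the identical item is repeated
    invariant_to_framing: bool   # same answer under equivalent rewordings


def classify_response(d: DiagnosticResult) -> ResponseType:
    """Route a response through validity checks before treating it as signal.

    ASSUMPTION: the check order below is one plausible choice, not the
    ordering used in the paper's Figure 1.
    """
    if not d.shared_interpretation:
        return ResponseType.MEASUREMENT_ARTIFACT
    if not d.stable_on_retest:
        return ResponseType.NON_ATTITUDE
    if not d.invariant_to_framing:
        return ResponseType.CONSTRUCTED
    return ResponseType.GENUINE
```

Under this sketch, only responses that pass every check are labeled genuine and would be passed on to aggregation.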

**Figure 2.** Prompt used for theme annotation. Minimum within-theme support. Inconsistency ratio is defined at the level of annotator–theme pairs. To ensure that within-theme variance is estimable and not driven by small-sample artifacts, we retain only annotator–theme pairs for which the annotator has rated at least five prompts belonging to the same harm theme. This threshold balances statistical stability with datas…

**Figure 3.** Annotator-level distributions of mean inconsistency ratios in PluriHarms dataset. Each histogram shows the distribution of annotator-level average inconsistency ratio, defined as within-theme variance relative to a participant-specific random baseline. The dashed vertical line indicates the random baseline (ratio = 1). Figure F.2 presents the empirical distribution of annotator-level average inconsistency …
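A rough sketch of how such an inconsistency ratio might be computed is given below. The five-prompt minimum and the "within-theme variance over a participant-specific random baseline" definition come from the captions above. The way the baseline is built here (the average variance of random same-size samples from that annotator's own ratings) is an assumption, as are all function and variable names.

```python
import random
from collections import defaultdict
from statistics import mean, pvariance

MIN_SUPPORT = 5  # minimum prompts per annotator-theme pair, per the Figure 2 text


def inconsistency_ratios(ratings, n_draws=200, seed=0):
    """ratings: list of (annotator, theme, rating) tuples.

    Returns {(annotator, theme): ratio}, where ratio is the within-theme
    variance divided by an annotator-specific random baseline.
    ASSUMPTION: the baseline is the mean variance of random same-size
    samples drawn from all of that annotator's ratings.
    """
    rng = random.Random(seed)
    by_annotator = defaultdict(list)
    by_pair = defaultdict(list)
    for annotator, theme, rating in ratings:
        by_annotator[annotator].append(rating)
        by_pair[(annotator, theme)].append(rating)

    ratios = {}
    for (annotator, theme), values in by_pair.items():
        if len(values) < MIN_SUPPORT:
            continue  # too few prompts for a stable variance estimate
        pool = by_annotator[annotator]
        baseline = mean(
            pvariance(rng.sample(pool, len(values))) for _ in range(n_draws)
        )
        if baseline == 0:
            continue  # annotator gave a constant rating; ratio undefined
        ratios[(annotator, theme)] = pvariance(values) / baseline
    return ratios


def annotator_mean_ratio(ratios):
    """Average the pair-level ratios for each annotator."""
    per_annotator = defaultdict(list)
    for (annotator, _theme), ratio in ratios.items():
        per_annotator[annotator].append(ratio)
    return {a: mean(r) for a, r in per_annotator.items()}
```

A ratio near 1 means the annotator's ratings within a theme vary about as much as a random draw from their own ratings. Values well below 1 indicate theme-consistent judgments.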

**Figure 4.** Annotator inconsistency ratio is strongly negatively correlated with mean harm ratings, indicating that more consistent annotators judge content as more harmful overall. Consistent and inconsistent annotators differ systematically in mean ratings. We begin by comparing average ratings between annotators with low versus high inconsistency ratios. In PluriHarms, annotators in the lowest inconsistency quantil…
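The relationship reported in Figure 4 is a correlation across annotators between two per-annotator quantities. The sketch below shows one way to compute it. It assumes a plain Pearson coefficient, which the excerpt does not confirm, and the dictionary inputs and function names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def ratio_rating_correlation(ratio_by_annotator, mean_rating_by_annotator):
    """Correlate the two quantities over annotators present in both maps."""
    shared = sorted(set(ratio_by_annotator) & set(mean_rating_by_annotator))
    return pearson_r(
        [ratio_by_annotator[a] for a in shared],
        [mean_rating_by_annotator[a] for a in shared],
    )
```

A negative value would correspond to the pattern in the caption: annotators with lower inconsistency ratios give higher mean harm ratings.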

Original abstract

Reinforcement Learning from Human Feedback (RLHF) assumes that annotation responses reflect genuine human preferences. They often do not. Behavioral scientists have documented for sixty years that people produce responses without holding genuine opinions, construct preferences on the spot from contextual cues, and interpret identical questions differently. Importantly, these failures are common for the judgments on values that matter most for AI alignment. We argue that measurement validity is logically prior to preference aggregation. Before asking how to combine annotations, the field must ask whether the responses being combined are preferences at all. We organize annotation responses along a spectrum, from non-attitudes (no signal) to genuine preferences (full signal), and develop diagnostics that locate responses on this spectrum. In two RLHF datasets, we show that inconsistency is systematic and directionally biased. Filtering high-inconsistency annotators flips majority harm classifications for 18.6% of prompts and shifts mean ratings by over 13 points on a 100-point scale. As such, much of the current RLHF practice models noise as signal and elicitation artifacts as human values.
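To make the filtering result in the abstract concrete, the sketch below computes a flip rate and a mean rating shift after dropping high-inconsistency annotators. Several parts are assumptions rather than the paper's settings: the harm threshold of 50 on the 100-point scale, the ratio cutoff of 1.0, the choice to drop annotators with no ratio, and every name used.

```python
from collections import defaultdict
from statistics import mean

HARM_THRESHOLD = 50  # ASSUMPTION: a rating above 50 counts as a "harmful" vote


def majority_harmful(values):
    """True when a strict majority of ratings exceed the harm threshold."""
    votes = [v > HARM_THRESHOLD for v in values]
    return sum(votes) * 2 > len(votes)


def filtering_effect(ratings, ratio_by_annotator, max_ratio=1.0):
    """ratings: list of (annotator, prompt, rating) tuples.

    Keeps annotators whose inconsistency ratio is below max_ratio
    (ASSUMPTION: annotators without a ratio are dropped) and returns
    (flip_rate, mean_shift) over prompts that still have ratings.
    """
    all_by_prompt = defaultdict(list)
    kept_by_prompt = defaultdict(list)
    for annotator, prompt, rating in ratings:
        all_by_prompt[prompt].append(rating)
        if ratio_by_annotator.get(annotator, float("inf")) < max_ratio:
            kept_by_prompt[prompt].append(rating)

    shared = [p for p in all_by_prompt if kept_by_prompt.get(p)]
    if not shared:
        raise ValueError("no prompt retains any rating after filtering")

    # Share of prompts whose majority harm label changes after filtering.
    flips = sum(
        majority_harmful(all_by_prompt[p]) != majority_harmful(kept_by_prompt[p])
        for p in shared
    )
    # Average change in the per-prompt mean rating (filtered minus unfiltered).
    shift = mean(mean(kept_by_prompt[p]) - mean(all_by_prompt[p]) for p in shared)
    return flips / len(shared), shift
```

The flip rate corresponds to the kind of quantity the abstract reports as 18.6% of prompts, and the shift to the reported change of over 13 points, though the exact definitions used in the paper may differ.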

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper correctly flags that RLHF preference data may mix artifacts with real values but offers no RLHF-specific evidence to size the issue.

The core point is that RLHF treats annotator responses as direct measures of stable human preferences, yet behavioral science has long shown that people often lack genuine opinions on value-laden topics and instead produce responses shaped by context or question wording. The authors map this literature into a taxonomy of genuine preferences, non-attitudes, constructed preferences, and measurement artifacts, and they argue that measurement validity must come before any aggregation step in alignment pipelines.

That framing is the useful contribution here. It gives RLHF researchers a structured way to think about when their data might be picking up noise rather than signal, and it points to existing diagnostic tools from surveys that could be adapted. The paper is clear and direct about the stakes for value alignment work.

The limitation is that the argument stays at the level of plausible transfer. No re-analysis of existing preference datasets appears, no small-scale test of the taxonomy on RLHF-style tasks is run, and there is no check on whether the detailed instructions and pairwise format common in RLHF change the prevalence of these effects compared with classic survey settings. Without that, the claim that current practice systematically models artifacts as values remains an open hypothesis rather than a demonstrated pattern.

This is the kind of paper that belongs in a reading group for people working on preference data collection or alignment data pipelines. A reader already familiar with survey methodology will see the connections quickly; someone new to the behavioral literature will get a concise entry point. It deserves peer review because the concern is real and the taxonomy supplies a concrete starting point, even if the next steps will need empirical work to be actionable.

Referee Report

2 major / 2 minor

Summary. The paper argues that RLHF assumes annotation responses reflect genuine human preferences, but draws on sixty years of behavioral science literature to show that people often produce non-attitudes, construct preferences on the spot from contextual cues, and interpret identical questions differently—phenomena especially relevant to value-laden judgments in alignment. It presents a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each, and contends that measurement validity must be treated as logically prior to preference aggregation, raising the possibility that current RLHF practices systematically model noise as signal.

Significance. If the taxonomy and diagnostics transfer effectively to RLHF, the work would elevate measurement validity as a core concern in preference modeling, potentially leading to revised data collection protocols that reduce incorporation of artifacts into reward models and improve alignment reliability. The interdisciplinary framing provides a clear conceptual bridge from established social science findings to ML practice, though its practical significance hinges on domain-specific validation.

major comments (2)

[taxonomy and implications sections] The central claim that non-attitudes, constructed preferences, and measurement artifacts occur pervasively in RLHF annotation for value-laden judgments rests on untested transfer from general behavioral science literature. No re-analysis of existing RLHF preference datasets, controlled comparison, or new empirical evidence is supplied to establish prevalence or conditions under which these effects appear in pairwise annotation tasks (see the taxonomy and implications sections).
[implications section] Task-specific factors in RLHF such as detailed instructions, annotator selection/training, and the pairwise comparison format are not analyzed for their potential to mitigate or alter the documented phenomena relative to standard survey research; this omission is load-bearing because it leaves open whether the taxonomy applies at comparable rates in the RLHF setting.

minor comments (2)

[abstract] The abstract and introduction could more explicitly note the absence of new RLHF-specific data to clarify the paper's scope as a conceptual framework rather than an empirical demonstration.
[diagnostics subsection] Some diagnostic approaches are described at a high level; adding brief illustrative examples (even hypothetical) would improve clarity without requiring new data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. We appreciate the recognition of the paper's interdisciplinary contribution and agree that domain-specific empirical validation is essential. We have revised the manuscript to more explicitly address the transfer from behavioral science literature and to analyze how RLHF-specific factors may interact with the phenomena described. Below we respond point by point to the major comments.

Referee: The central claim that non-attitudes, constructed preferences, and measurement artifacts occur pervasively in RLHF annotation for value-laden judgments rests on untested transfer from general behavioral science literature. No re-analysis of existing RLHF preference datasets, controlled comparison, or new empirical evidence is supplied to establish prevalence or conditions under which these effects appear in pairwise annotation tasks (see the taxonomy and implications sections).

Authors: We acknowledge that the manuscript does not include re-analysis of RLHF datasets or new empirical evidence specific to pairwise annotation tasks. As a conceptual position paper, its contribution lies in synthesizing sixty years of behavioral science findings and proposing a taxonomy to make these issues legible within RLHF practice. We have revised the taxonomy and implications sections to state explicitly that prevalence in RLHF remains an open empirical question and to frame the taxonomy as a set of diagnostic hypotheses requiring targeted validation studies. This revision clarifies the paper's scope without altering its core argument that measurement validity is logically prior to aggregation.

Revision: partial

Referee: Task-specific factors in RLHF such as detailed instructions, annotator selection/training, and the pairwise comparison format are not analyzed for their potential to mitigate or alter the documented phenomena relative to standard survey research; this omission is load-bearing because it leaves open whether the taxonomy applies at comparable rates in the RLHF setting.

Authors: We agree that RLHF's structured elements warrant explicit discussion. In the revised implications section we now examine how detailed instructions and annotator training may attenuate certain measurement artifacts while still leaving room for constructed preferences, drawing on survey methodology literature where comparable controls have been tested. We also address the pairwise format, noting that while it differs from open-ended surveys, it remains vulnerable to contextual cueing and non-attitudes on value-laden items. The revision stops short of asserting equivalent rates and instead highlights the need for comparative studies, thereby strengthening rather than weakening the call for validity-focused protocols.

Revision: yes

Circularity Check

0 steps flagged

No circularity; argument applies external behavioral science literature without reduction to self-inputs

The paper advances a taxonomy of genuine preferences versus non-attitudes, constructed preferences, and measurement artifacts by direct reference to sixty years of independent social-science findings. No equations, fitted parameters, or derivations appear; the central contention is framed as an application of pre-existing external results rather than a self-contained derivation. No self-citations function as load-bearing premises, no uniqueness theorems are imported from the authors' prior work, and no ansatz or renaming of known results is presented as novel derivation. The manuscript therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that behavioral science findings on preference construction apply to RLHF annotators. No free parameters or invented entities are introduced; the taxonomy is a classification scheme rather than new postulates.

axioms (1)

Domain assumption: Behavioral science findings on non-attitudes and constructed preferences transfer directly to RLHF annotation tasks.
Invoked throughout the abstract as the basis for the taxonomy, without domain-specific empirical bridging studies.

pith-pipeline@v0.9.0 · 5498 in / 1193 out tokens · 35383 ms · 2026-05-16T08:36:38.324201+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Genuine preferences manifest stably across equivalent measurement conditions

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration
cs.LG 2026-06 unverdicted novelty 6.0

PEBS applies Morris-James-Stein empirical-Bayes shrinkage to per-rater affine calibrators in RLHF, cutting within-user held-out RMSE by 8.58% on PRISM and 9.66% on PluriHarms versus pooled baselines.
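For readers unfamiliar with empirical-Bayes shrinkage, the sketch below shows the general idea on per-rater offsets only. It is not PEBS: it uses a textbook normal-means shrinkage with a method-of-moments variance estimate, ignores the slope of an affine calibrator, and every name in it is hypothetical.

```python
from statistics import mean, variance


def shrink_rater_offsets(residuals_by_rater):
    """residuals_by_rater: {rater: [rating - pooled_prediction, ...]}.

    Returns per-rater offsets shrunk toward the grand mean offset.
    Requires at least two raters and at least two residuals per rater.
    ASSUMPTION: a generic empirical-Bayes estimator, not the PEBS method.
    """
    raw = {r: mean(v) for r, v in residuals_by_rater.items()}
    # Sampling variance of each rater's mean offset.
    samp_var = {r: variance(v) / len(v) for r, v in residuals_by_rater.items()}
    grand = mean(list(raw.values()))
    # Method-of-moments estimate of between-rater variance, floored at zero.
    tau2 = max(0.0, variance(list(raw.values())) - mean(list(samp_var.values())))

    shrunk = {}
    for r, offset in raw.items():
        denom = tau2 + samp_var[r]
        # Weight on the rater's own estimate: noisier raters get less weight.
        weight = tau2 / denom if denom > 0 else 0.0
        shrunk[r] = grand + weight * (offset - grand)
    return shrunk
```

Raters with few or noisy ratings are pulled more strongly toward the pooled offset, while raters with many consistent ratings keep estimates close to their own mean.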