RLHF May Not Reflect Genuine Preferences
Pith reviewed 2026-05-16 08:36 UTC · model grok-4.3
The pith
Measuring human preferences in RLHF is a social science problem because responses often reflect non-attitudes and constructed preferences rather than genuine opinions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We argue that measuring human preferences in RLHF is a social science problem. Behavioral scientists have documented for sixty years that people routinely produce responses without holding genuine opinions, construct preferences on the spot based on contextual cues, and interpret identical questions differently. These phenomena are pervasive for precisely the value-laden judgments that matter most for alignment, yet this literature has not yet been systematically integrated into ML practice. We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each.
What carries the argument
A taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each.
Load-bearing premise
The documented phenomena of non-attitudes and constructed preferences from general behavioral science transfer directly and pervasively to RLHF annotation tasks without domain-specific validation.
What would settle it
An empirical application of the proposed diagnostics to a standard RLHF preference dataset that finds rates of non-attitudes and constructed preferences no higher than in non-value-laden tasks, or that filtering them leaves model behavior unchanged.
Original abstract
Reinforcement Learning from Human Feedback (RLHF) assumes that annotation responses reflect genuine human preferences. They often do not. Behavioral scientists have documented for sixty years that people produce responses without holding genuine opinions, construct preferences on the spot from contextual cues, and interpret identical questions differently. Importantly, these failures are common for the judgments on values that matter most for AI alignment. We argue that measurement validity is logically prior to preference aggregation. Before asking how to combine annotations, the field must ask whether the responses being combined are preferences at all. We organize annotation responses along a spectrum, from non-attitudes (no signal) to genuine preferences (full signal), and develop diagnostics that locate responses on this spectrum. In two RLHF datasets, we show that inconsistency is systematic and directionally biased. Filtering high-inconsistency annotators flips majority harm classifications for 18.6% of prompts and shifts mean ratings by over 13 points on a 100-point scale. As such, much of the current RLHF practice models noise as signal and elicitation artifacts as human values.
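The filtering result in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy data, the definition of inconsistency as the share of repeated items on which an annotator's two labels disagree, and the 0.5 cut-off. The abstract does not say how the paper computes inconsistency, so this is not a reproduction of its method or its 18.6% figure.

```python
from collections import defaultdict

# Hypothetical toy data: (annotator, prompt, first_label, repeat_label),
# where 1 = "harmful" and 0 = "not harmful". Not taken from the paper.
RATINGS = [
    ("a1", "p1", 1, 1), ("a1", "p2", 0, 0), ("a1", "p3", 1, 1),
    ("a2", "p1", 1, 1), ("a2", "p2", 0, 0), ("a2", "p3", 1, 1),
    ("a3", "p1", 0, 1), ("a3", "p2", 1, 0), ("a3", "p3", 1, 1),
    ("a4", "p1", 0, 1), ("a4", "p2", 1, 0), ("a4", "p3", 1, 1),
    ("a5", "p1", 0, 0), ("a5", "p2", 1, 1), ("a5", "p3", 1, 1),
]


def inconsistency_by_annotator(ratings):
    """Share of repeated items on which an annotator's two labels disagree."""
    totals, mismatches = defaultdict(int), defaultdict(int)
    for annotator, _prompt, first, repeat in ratings:
        totals[annotator] += 1
        mismatches[annotator] += int(first != repeat)
    return {a: mismatches[a] / totals[a] for a in totals}


def majority_labels(ratings, keep=None):
    """Majority vote per prompt over first-pass labels (ties count as 0)."""
    votes = defaultdict(list)
    for annotator, prompt, first, _repeat in ratings:
        if keep is None or annotator in keep:
            votes[prompt].append(first)
    return {p: int(sum(v) * 2 > len(v)) for p, v in votes.items()}


def flip_rate(ratings, threshold=0.5):
    """Fraction of prompts whose majority label changes after dropping
    annotators whose inconsistency exceeds `threshold` (an assumed cut-off)."""
    scores = inconsistency_by_annotator(ratings)
    keep = {a for a, s in scores.items() if s <= threshold}
    before = majority_labels(ratings)
    after = majority_labels(ratings, keep)
    shared = [p for p in before if p in after]
    if not shared:
        return 0.0, []
    flipped = [p for p in shared if before[p] != after[p]]
    return len(flipped) / len(shared), flipped


if __name__ == "__main__":
    rate, flipped = flip_rate(RATINGS)
    print(f"flip rate: {rate:.3f}, flipped prompts: {sorted(flipped)}")
```

In the toy data, annotators a3 and a4 disagree with themselves on two of three prompts and are dropped, which reverses the majority label on p1 and p2 while leaving p3 unchanged. A real analysis would also need to decide how to treat ties and prompts that lose all their annotators.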
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that RLHF assumes annotation responses reflect genuine human preferences, and that this assumption often fails. Drawing on sixty years of behavioral science literature, it notes that people produce non-attitudes, construct preferences on the spot from contextual cues, and interpret identical questions differently, and that these phenomena are especially relevant to value-laden judgments in alignment. It presents a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each, and contends that measurement validity must be treated as logically prior to preference aggregation, raising the possibility that current RLHF practices systematically model noise as signal.
Significance. If the taxonomy and diagnostics transfer effectively to RLHF, the work would elevate measurement validity as a core concern in preference modeling, potentially leading to revised data collection protocols that reduce incorporation of artifacts into reward models and improve alignment reliability. The interdisciplinary framing provides a clear conceptual bridge from established social science findings to ML practice, though its practical significance hinges on domain-specific validation.
major comments (2)
- [taxonomy and implications sections] The central claim that non-attitudes, constructed preferences, and measurement artifacts occur pervasively in RLHF annotation for value-laden judgments rests on untested transfer from general behavioral science literature. No re-analysis of existing RLHF preference datasets, controlled comparison, or new empirical evidence is supplied to establish prevalence or conditions under which these effects appear in pairwise annotation tasks (see the taxonomy and implications sections).
- [implications section] Task-specific factors in RLHF such as detailed instructions, annotator selection/training, and the pairwise comparison format are not analyzed for their potential to mitigate or alter the documented phenomena relative to standard survey research; this omission is load-bearing because it leaves open whether the taxonomy applies at comparable rates in the RLHF setting.
minor comments (2)
- [abstract] The abstract and introduction could more explicitly note the absence of new RLHF-specific data to clarify the paper's scope as a conceptual framework rather than an empirical demonstration.
- [diagnostics subsection] Some diagnostic approaches are described at a high level; adding brief illustrative examples (even hypothetical) would improve clarity without requiring new data.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. We appreciate the recognition of the paper's interdisciplinary contribution and agree that domain-specific empirical validation is essential. We have revised the manuscript to more explicitly address the transfer from behavioral science literature and to analyze how RLHF-specific factors may interact with the phenomena described. Below we respond point by point to the major comments.
- Referee: The central claim that non-attitudes, constructed preferences, and measurement artifacts occur pervasively in RLHF annotation for value-laden judgments rests on untested transfer from general behavioral science literature. No re-analysis of existing RLHF preference datasets, controlled comparison, or new empirical evidence is supplied to establish prevalence or conditions under which these effects appear in pairwise annotation tasks (see the taxonomy and implications sections).
  Authors: We acknowledge that the manuscript does not include re-analysis of RLHF datasets or new empirical evidence specific to pairwise annotation tasks. As a conceptual position paper, its contribution lies in synthesizing sixty years of behavioral science findings and proposing a taxonomy to make these issues legible within RLHF practice. We have revised the taxonomy and implications sections to state explicitly that prevalence in RLHF remains an open empirical question and to frame the taxonomy as a set of diagnostic hypotheses requiring targeted validation studies. This revision clarifies the paper's scope without altering its core argument that measurement validity is logically prior to aggregation.
  Revision: partial
- Referee: Task-specific factors in RLHF such as detailed instructions, annotator selection/training, and the pairwise comparison format are not analyzed for their potential to mitigate or alter the documented phenomena relative to standard survey research; this omission is load-bearing because it leaves open whether the taxonomy applies at comparable rates in the RLHF setting.
  Authors: We agree that RLHF's structured elements warrant explicit discussion. In the revised implications section we now examine how detailed instructions and annotator training may attenuate certain measurement artifacts while still leaving room for constructed preferences, drawing on survey methodology literature where comparable controls have been tested. We also address the pairwise format, noting that while it differs from open-ended surveys, it remains vulnerable to contextual cueing and non-attitudes on value-laden items. The revision stops short of asserting equivalent rates and instead highlights the need for comparative studies, thereby strengthening rather than weakening the call for validity-focused protocols.
  Revision: yes
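One way to probe the contextual cueing mentioned in the rebuttal is a display-position check, sketched below. It is a hypothetical diagnostic, not one the paper is known to propose: it assumes the display order of the two responses was randomized per comparison, so that without any position effect the first-shown response should be chosen about half the time. The function names and the input format are illustrative.

```python
from math import comb


def binomial_two_sided_p(successes, trials, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count."""
    pmf = [comb(trials, k) * p**k * (1 - p) ** (trials - k)
           for k in range(trials + 1)]
    observed = pmf[successes]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def position_effect(choices):
    """choices: list of 'first' / 'second', the display slot of the response
    the annotator picked. Assumes display order was randomized."""
    if not choices:
        raise ValueError("need at least one comparison")
    n = len(choices)
    first = sum(c == "first" for c in choices)
    return first / n, binomial_two_sided_p(first, n)


if __name__ == "__main__":
    # Hypothetical annotator who picks the first-shown response 8 times in 10.
    share, p_value = position_effect(["first"] * 8 + ["second"] * 2)
    print(f"share choosing first-shown: {share:.2f}, p-value: {p_value:.4f}")
```

A large share with a small p-value would suggest that slot position, rather than content, is driving some choices. It would not by itself distinguish a constructed preference from a non-attitude, and with few comparisons per annotator the test has little power.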
Circularity Check
No circularity; argument applies external behavioral science literature without reduction to self-inputs
The paper advances a taxonomy of genuine preferences versus non-attitudes, constructed preferences, and measurement artifacts by direct reference to sixty years of independent social-science findings. No equations, fitted parameters, or derivations appear; the central contention is framed as an application of pre-existing external results rather than a self-contained derivation. No self-citations function as load-bearing premises, no uniqueness theorems are imported from the authors' prior work, and no ansatz or renaming of known results is presented as novel derivation. The manuscript therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Behavioral science findings on non-attitudes and constructed preferences transfer directly to RLHF annotation tasks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We present a taxonomy distinguishing genuine preferences from non-attitudes, constructed preferences, and measurement artifacts, along with diagnostic approaches for detecting each."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "Genuine preferences manifest stably across equivalent measurement conditions"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration
  PEBS applies Morris-James-Stein empirical-Bayes shrinkage to per-rater affine calibrators in RLHF, cutting within-user held-out RMSE by 8.58% on PRISM and 9.66% on PluriHarms versus pooled baselines.
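For readers unfamiliar with the shrinkage idea named in that summary, the sketch below shows the textbook positive-part James-Stein estimator that pulls a set of noisy per-rater estimates toward their grand mean. It is not the PEBS procedure: it assumes a single scalar offset per rater and one known, equal sampling variance, whereas the summary describes affine calibrators and does not spell out the method. The function name and inputs are illustrative.

```python
def james_stein_toward_mean(estimates, sampling_variance):
    """Positive-part James-Stein shrinkage of k noisy estimates toward their
    grand mean, assuming each has the same known sampling variance."""
    k = len(estimates)
    if k < 4:
        raise ValueError("shrinkage toward the grand mean needs at least 4 estimates")
    grand_mean = sum(estimates) / k
    spread = sum((x - grand_mean) ** 2 for x in estimates)
    if spread == 0:
        return list(estimates)  # all estimates identical, nothing to shrink
    # Shrinkage factor in [0, 1]; 0 collapses every rater onto the grand mean.
    factor = max(0.0, 1.0 - (k - 3) * sampling_variance / spread)
    return [grand_mean + factor * (x - grand_mean) for x in estimates]


if __name__ == "__main__":
    # Hypothetical per-rater rating offsets on an arbitrary scale.
    raw_offsets = [-2.0, -1.0, 0.0, 1.0, 2.0]
    print(james_stein_toward_mean(raw_offsets, sampling_variance=1.0))
```

The noisier the individual estimates are relative to their spread, the harder they are pulled toward the pooled value, which is the general intuition behind calibrating raters with partial pooling instead of either fully pooled or fully separate fits.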