Anchor-guided variance-aware reward modeling uses two response-level anchors to resolve non-identifiability in Gaussian models of pluralistic preferences, yielding provable identification, a joint training objective, and improved RLHF performance.
Pal: Pluralistic alignment framework for learning from heterogeneous preferences
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Proposes a task taxonomy for functional diversity in LLM outputs, validates it via user study, introduces targeted sampling to boost diversity only where needed, and presents evidence that the diversity-quality tradeoff may be an artifact of task-agnostic measurement.
Empirical study of Malaysian preference judgments finds that 79% of prompts have multiple majority-supported responses discarded by single-winner aggregation, indicating measurement of argmax rather than plural alignment.
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.
POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context length by up to 10x on benchmarks.
citing papers explorer
No citing papers match the current filters.