A Roadmap to Pluralistic Alignment

Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting · 2024 · cs.AI · arXiv 2402.05070

Canonical reference, cited by 28 Pith papers. 71% of classified citations cite this work as background.

abstract

With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
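
As a rough illustration of the third notion in the abstract, distributional pluralism, the sketch below compares a model's answer distribution on one question with a population's answer distribution. The use of Jensen-Shannon divergence, the function names, and the survey counts are all illustrative assumptions; the abstract itself does not specify a metric, so this should not be read as the authors' formalization.

```python
import math


def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("counts must sum to a positive number")
    return {answer: n / total for answer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two answer distributions.

    p and q map answer -> probability. The result is 0 for identical
    distributions and 1 for distributions with no shared support.
    """
    answers = set(p) | set(q)
    # Mixture distribution used as the common reference point.
    m = {a: 0.5 * (p.get(a, 0.0) + q.get(a, 0.0)) for a in answers}

    def kl(a_dist):
        return sum(
            a_dist[a] * math.log2(a_dist[a] / m[a])
            for a in a_dist
            if a_dist[a] > 0
        )

    return 0.5 * kl(p) + 0.5 * kl(q)


# Hypothetical survey counts for a single question (made up for illustration).
population = normalize({"agree": 60, "disagree": 40})
# Hypothetical model whose sampled answers track the population.
calibrated_model = normalize({"agree": 58, "disagree": 42})
# Hypothetical model that always gives the majority answer.
collapsed_model = normalize({"agree": 100})

calibrated_gap = js_divergence(population, calibrated_model)
collapsed_gap = js_divergence(population, collapsed_model)
print(f"calibrated gap: {calibrated_gap:.4f}")
print(f"collapsed gap:  {collapsed_gap:.4f}")
```

Under this assumed metric, a lower divergence means the model is closer to the population in distribution; a model that collapses onto the majority answer scores worse than one that reproduces the spread of answers, which is the kind of reduction in distributional pluralism the abstract is concerned with.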

citation-role summary

background: 5 · other: 2

citation-polarity summary

background: 5 · unclear: 2

representative citing papers

A Model of Multi-turn Human Persuadability Using Probabilistic Belief Tracing

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.

Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.

Three Models of RLHF Annotation: Extension, Evidence, and Authority

cs.CY · 2026-04-28 · unverdicted · novelty 7.0

RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

cs.CL · 2026-03-17 · conditional · novelty 7.0

Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.

What Do People Actually Want From AI? Mapping Preference Plurality

cs.CL · 2026-06-04 · unverdicted · novelty 6.0

Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.

Political Neutrality as Balanced Approval: A Large-Scale Human Evaluation of AI Responses

cs.CY · 2026-05-27 · unverdicted · novelty 6.0

AI political neutrality is redefined as balanced high approval across opposing groups and tested in a 7434-person study showing dual approval is achievable while default outputs from most models lean liberal.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

cs.CL · 2026-05-11 · conditional · novelty 6.0 · 2 refs

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.

Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Recursive generative retraining with heterogeneous rewards converges to a stable distribution satisfying a weighted Nash bargaining solution, preserving diversity under stated conditions.

Understanding Annotator Safety Policy with Interpretability

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

Multilingual Safety Alignment via Self-Distillation

cs.LG · 2026-05-03 · unverdicted · novelty 6.0 · 2 refs

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

cs.CL · 2026-04-08 · unverdicted · novelty 6.0

Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.

Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations

cs.CL · 2026-04-03 · unverdicted · novelty 6.0

LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.

Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

cs.AI · 2026-02-12 · unverdicted · novelty 6.0

VAT quantifies value trade-offs in LLM alignment by measuring how alignment-induced changes propagate across interconnected values using a Schwartz-grounded dataset.

Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task

cs.CL · 2026-02-06 · unverdicted · novelty 6.0

LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.

The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

cs.HC · 2026-01-14 · conditional · novelty 6.0

LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.

Epistemic Injustice in Language Models: An Audit of Pretraining Filters and Guardrails

cs.CL · 2026-06-04 · unverdicted · novelty 5.0

An audit finds language model filters and guardrails disproportionately suppress mentions of marginalized groups via lexical cues while failing to catch explicit harms.

Coherence Maximization Improves Pluralistic Alignment

cs.CL · 2026-06-02 · unverdicted · novelty 5.0

ICM-inferred examples achieve gold-label performance across alignment benchmarks and generalize better when coherence is high even at fixed accuracy.

In-Context Reward Adaptation for Robust Preference Modeling

cs.LG · 2026-05-28 · unverdicted · novelty 5.0

Transformer model with response-time auxiliary input adapts reward models to unseen human preference domains via in-context learning from demonstrations.

When to Ask a Question: Understanding Communication Strategies in Generative AI Tools

cs.GT · 2026-05-11 · unverdicted · novelty 5.0

A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.

Positive Alignment: Artificial Intelligence for Human Flourishing

cs.AI · 2026-05-11 · unverdicted · novelty 5.0 · 2 refs

Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.

Quantifying and Predicting Disagreement in Graded Human Ratings

cs.CL · 2026-05-01 · unverdicted · novelty 5.0

Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.

Assessing the Geographic Diversity of AI's Platial Representations in Image Generation

cs.CY · 2026-04-28 · unverdicted · novelty 5.0

The study adapts ecological diversity measures to evaluate platial representations in GPT and DALL-E images, finding low diversity, greater gains from prompt revision than generation, and stereotypical feature use.

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

cs.CY · 2026-04-22 · unverdicted · novelty 5.0

AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.

Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

cs.CY · 2026-04-02 · unverdicted · novelty 5.0 · 2 refs

Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
