PERSUASIONTRACE introduces a Bayesian-network simulated target for multi-turn persuasion that matches human belief dynamics (81 vs 80) better than LLM baselines (64) and enables process-level evaluation.
hub Canonical reference
A Roadmap to Pluralistic Alignment
Canonical reference. 71% of citing Pith papers cite this work as background.
abstract
With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
hub tools
citation-role summary
citation-polarity summary
years
2026 28representative citing papers
A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.
Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.
AI political neutrality is redefined as balanced high approval across opposing groups and tested in a 7434-person study showing dual approval is achievable while default outputs from most models lean liberal.
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
Recursive generative retraining with heterogeneous rewards converges to a stable distribution satisfying a weighted Nash bargaining solution, preserving diversity under stated conditions.
Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.
VAT quantifies value trade-offs in LLM alignment by measuring how alignment-induced changes propagate across interconnected values using a Schwartz-grounded dataset.
LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
An audit finds language model filters and guardrails disproportionately suppress mentions of marginalized groups via lexical cues while failing to catch explicit harms.
ICM-inferred examples achieve gold-label performance across alignment benchmarks and generalize better when coherence is high even at fixed accuracy.
Transformer model with response-time auxiliary input adapts reward models to unseen human preference domains via in-context learning from demonstrations.
A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.
Positive Alignment is defined as AI systems that support human flourishing pluralistically while staying safe and cooperative, presented as a necessary complement to existing safety-focused alignment research.
Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.
The study adapts ecological diversity measures to evaluate platial representations in GPT and DALL-E images, finding low diversity, greater gains from prompt revision than generation, and stereotypical feature use.
AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
Case studies with blind UK residents and people from Kerala and Tamil Nadu demonstrate that community input at the systematization stage produces culturally grounded definitions of appropriateness for text-to-image model outputs.
citing papers explorer
-
Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.
-
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
-
The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.