pith. machine review for the scientific record.

arxiv: 2402.05070 · v3 · submitted 2024-02-07 · 💻 cs.AI · cs.CL · cs.IR

Recognition: 3 theorem links

A Roadmap to Pluralistic Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:32 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.IR
keywords pluralistic alignment · AI alignment · language models · distributional pluralism · value diversity · benchmarks · Overton pluralism · steerable models

The pith

Standard alignment procedures may reduce distributional pluralism in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes three definitions for pluralistic AI: models that present a spectrum of responses, can be steered to specific perspectives, or are calibrated to population distributions. It introduces corresponding benchmark classes to measure these properties. The authors argue that standard alignment techniques are limited because they can narrow the distribution of perspectives models reflect, as shown by empirical evidence. This matters for ensuring AI serves people with diverse values and perspectives. The roadmap calls for new methods to achieve pluralistic alignment.

Core claim

By defining pluralism through Overton, steerable, and distributional models and creating multi-objective, trade-off steerable, and jury-pluralistic benchmarks, the paper demonstrates that current alignment procedures may reduce distributional pluralism, indicating a fundamental limitation in existing techniques for building AI that accommodates diverse human values.

What carries the argument

The three proposed definitions of pluralism in AI systems (Overton pluralistic, steerably pluralistic, and distributionally pluralistic) along with three benchmark classes (multi-objective, trade-off steerable, and jury-pluralistic) that together operationalize and measure pluralism.
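As a hedged illustration of what measuring distributional pluralism could look like (this is not the paper's implementation; the divergence choice, the answer options, and all numbers are assumptions introduced here), calibration to a population can be scored as one minus the Jensen-Shannon divergence between the model's answer distribution and the population's:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so bounded in [0, 1])
    between two discrete distributions over the same answer options."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distributional_pluralism_score(model_dist, population_dist):
    """Higher means better calibrated to the population: 1 - JSD."""
    return 1.0 - js_divergence(model_dist, population_dist)

# Hypothetical three-option survey item: a model that collapses onto
# one option scores lower than one matching the population split.
population = [0.5, 0.3, 0.2]
collapsed = [0.98, 0.01, 0.01]
matched = [0.5, 0.3, 0.2]
assert distributional_pluralism_score(matched, population) > \
       distributional_pluralism_score(collapsed, population)
```

The same scaffolding would apply per item of a jury-pluralistic benchmark, with the population distribution estimated from diverse human ratings.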

If this is right

  • Standard alignment will lead to models that are less well-calibrated to diverse population distributions.
  • New benchmarks are required to properly incentivize and evaluate pluralistic behavior in models.
  • Alignment techniques need redesign to support presenting spectra of responses and steering to different perspectives.
  • Empirical tests can confirm if alignment reduces the ability to reflect varied human ratings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future alignment methods might incorporate explicit pluralism objectives to counteract narrowing effects.
  • This framework could extend to other AI modalities beyond language models for broader value alignment.
  • Testing on jury-pluralistic benchmarks before and after alignment could quantify the reduction in pluralism.
  • Connections to multi-stakeholder decision making suggest similar issues in non-AI systems.

Load-bearing premise

The three definitions and benchmark classes are sufficient to fully capture and measure pluralism without overlooking important aspects of value diversity or adding biases.

What would settle it

If models after standard alignment show equal or greater calibration to diverse human populations on jury-pluralistic benchmarks compared to before alignment, that would contradict the claim of reduced pluralism.
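The before/after test described above can be sketched as follows; the distance metric, the two-option items, and every number here are hypothetical, chosen only to show the shape of the comparison, not results from the paper:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def mean_calibration_error(model_dists, population_dists):
    """Average distance to the population distribution over benchmark items."""
    errors = [total_variation(m, p) for m, p in zip(model_dists, population_dists)]
    return sum(errors) / len(errors)

# Hypothetical numbers for illustration only: a base model roughly
# tracking the population split on each item, and an aligned model
# collapsing toward the modal answer.
population = [[0.6, 0.4], [0.3, 0.7]]
base = [[0.55, 0.45], [0.35, 0.65]]
aligned = [[0.95, 0.05], [0.05, 0.95]]

reduced = mean_calibration_error(aligned, population) > \
          mean_calibration_error(base, population)
# If `reduced` came out False on a real jury-pluralistic benchmark,
# that would count against the claimed loss of distributional pluralism.
```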

read the original abstract

With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a roadmap for pluralistic alignment in AI, using language models as a test bed. It formalizes three definitions of pluralistic models—Overton (spectrum of reasonable responses), steerably pluralistic (steerable to perspectives), and distributionally pluralistic (population-calibrated)—along with three benchmark classes: multi-objective, trade-off steerable, and jury-pluralistic. The central claim is that current alignment techniques may be fundamentally limited for achieving pluralism, supported by cited empirical evidence (including the authors' experiments) indicating that standard procedures can reduce distributional pluralism.

Significance. If the framework and evidence hold, this work supplies a structured conceptual toolkit for operationalizing pluralism in AI systems, which is significant for developing inclusive models serving diverse populations. The explicit linkage of alignment limitations to distributional effects, backed by referenced experiments, motivates targeted research and could influence benchmark design in the field.

major comments (2)
  1. [empirical evidence discussion] § on empirical evidence and limitations of alignment: The claim that standard alignment reduces distributional pluralism is load-bearing for the roadmap's motivation, yet it rests primarily on referenced experiments rather than new derivations or comprehensive data presented here; without explicit discussion of how the proposed jury-pluralistic benchmarks were applied to isolate this effect from measurement confounds, the reduction argument's robustness is difficult to evaluate.
  2. [definitions and benchmarks] § formalizing the three definitions and benchmark classes: The sufficiency of the Overton/steerable/distributional definitions and the three benchmark classes for capturing pluralism is assumed without addressing potential gaps, such as whether jury-pluralistic benchmarks embed annotator biases that could distort population calibration or miss contextual value nuances not reducible to response spectra or steerability.
minor comments (2)
  1. [abstract] The abstract and introduction could more explicitly distinguish the paper's novel contributions (the formalizations) from the synthesis of existing evidence on alignment limitations.
  2. [throughout] Notation for the three model types and benchmark classes should be introduced with consistent abbreviations or symbols to improve readability when referenced across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our roadmap paper for pluralistic alignment. We address each major comment below and indicate the revisions we will make to improve clarity and robustness.

read point-by-point responses
  1. Referee: The claim that standard alignment reduces distributional pluralism is load-bearing for the roadmap's motivation, yet it rests primarily on referenced experiments rather than new derivations or comprehensive data presented here; without explicit discussion of how the proposed jury-pluralistic benchmarks were applied to isolate this effect from measurement confounds, the reduction argument's robustness is difficult to evaluate.

    Authors: We agree that the empirical claim relies on referenced experiments (including our prior work) rather than new data in this conceptual roadmap. We will revise the manuscript to add a dedicated subsection that explicitly describes how jury-pluralistic benchmarks were applied in the cited studies, including steps taken to isolate alignment effects from confounds such as annotator variability and measurement artifacts. This will make the robustness of the argument easier to evaluate. revision: yes

  2. Referee: The sufficiency of the Overton/steerable/distributional definitions and the three benchmark classes for capturing pluralism is assumed without addressing potential gaps, such as whether jury-pluralistic benchmarks embed annotator biases that could distort population calibration or miss contextual value nuances not reducible to response spectra or steerability.

    Authors: We acknowledge that our framework does not fully address all potential gaps, including annotator biases in jury-pluralistic benchmarks and the difficulty of capturing irreducible contextual value nuances. We will expand the discussion section with a new subsection on limitations, providing concrete examples of these issues and suggesting mitigation approaches for future benchmarks. This will better delineate the scope of the proposed definitions and classes. revision: yes

Circularity Check

0 steps flagged

Conceptual definitions and benchmarks introduced independently with no reduction to fitted inputs or self-referential loops

full rationale

The paper is a conceptual roadmap that proposes three definitions of pluralism (Overton, steerable, distributional) and three benchmark classes (multi-objective, trade-off steerable, jury-pluralistic) as independent formalizations. The central claim that standard alignment may reduce distributional pluralism is supported by referenced empirical evidence from the authors' experiments and external work rather than by any derivation that reduces predictions to the definitions themselves. No equations, fitted parameters, or self-citation chains are load-bearing for the framework; the definitions do not presuppose the limitation result. The self-citations incur only a minor score penalty, and the core argument remains checkable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The framework rests on domain assumptions about the desirability and measurability of pluralism, with newly introduced conceptual categories that lack independent empirical grounding beyond referenced studies.

axioms (1)
  • domain assumption: AI systems should be designed to serve people with diverse values and perspectives
    Stated as a foundational motivation in the abstract.
invented entities (3)
  • Overton pluralistic models (no independent evidence)
    purpose: Present a spectrum of reasonable responses
    Newly formalized category in the paper.
  • Steerably pluralistic models (no independent evidence)
    purpose: Allow steering to reflect certain perspectives
    Newly formalized category in the paper.
  • Distributionally pluralistic models (no independent evidence)
    purpose: Calibrate outputs to match a population distribution
    Newly formalized category in the paper.

pith-pipeline@v0.9.0 · 5570 in / 1219 out tokens · 60046 ms · 2026-05-16T14:32:52.308924+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence unity_unique_existent echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we highlight empirical evidence... that standard alignment procedures might reduce distributional pluralism in models

  • InevitabilityStructure economic_inevitability refines

    Relation between the paper passage and the cited Recognition theorem.

    Distributionally pluralistic models that are well-calibrated to a given population in distribution

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  2. Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.

  3. Three Models of RLHF Annotation: Extension, Evidence, and Authority

    cs.CY 2026-04 unverdicted novelty 7.0

    RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

  4. Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

    cs.CL 2026-03 conditional novelty 7.0

    Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.

  5. Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

    cs.CL 2026-05 unverdicted novelty 6.0

    DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.

  6. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  7. Understanding Annotator Safety Policy with Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

  8. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  9. Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

    cs.CL 2026-04 unverdicted novelty 6.0

    Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.

  10. Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.

  11. Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

    cs.CY 2026-04 unverdicted novelty 6.0

    Community members from the UK blind community, Kerala, and Tamil Nadu helped define what counts as culturally appropriate depictions of artifacts, and the authors tested whether those definitions can be turned into re...

  12. Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

    cs.AI 2026-02 unverdicted novelty 6.0

    VAT quantifies value trade-offs in LLM alignment by measuring how alignment-induced changes propagate across interconnected values using a Schwartz-grounded dataset.

  13. Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task

    cs.CL 2026-02 unverdicted novelty 6.0

    LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.

  14. Measuring Human Preferences in RLHF is a Social Science Problem

    cs.HC 2026-01 unverdicted novelty 6.0

    RLHF preference measurement is a social science validity problem because annotators routinely produce non-attitudes, constructed responses, and artifacts rather than stable values.

  15. The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

    cs.HC 2026-01 conditional novelty 6.0

    LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.

  16. When to Ask a Question: Understanding Communication Strategies in Generative AI Tools

    cs.GT 2026-05 unverdicted novelty 5.0

    A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.

  17. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

  18. Quantifying and Predicting Disagreement in Graded Human Ratings

    cs.CL 2026-05 unverdicted novelty 5.0

    Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.

  19. Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

    cs.CY 2026-04 unverdicted novelty 5.0

    AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.

  20. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

Reference graph

Works this paper leans on

282 extracted references · 282 canonical work pages · cited by 18 Pith papers · 16 internal anchors

  1. [2]

    Achiam, O. J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.-L., Brockman, G.,...

  2. [3]

    Aher, G. V., Arriaga, R. I., and Kalai, A. T. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pp. 337--371. PMLR, 2023

  3. [4]

    Introducing claude, 2023

    Anthropic. Introducing claude, 2023. URL https://www.anthropic.com/index/introducing-claude

  4. [5]

    Argyle, L., Busby, E., Fulda, N., Gubler, J., Rytting, C., and Wingate, D. Out of one, many: Using language models to simulate human samples. Political Analysis, 31: 1--15, 2023. doi:10.1017/pan.2023.2

  5. [7]

    Aroyo, L., Taylor, A. S., Diaz, M., Homan, C. M., Parrish, A., Serapio-Garcia, G., Prabhakaran, V., and Wang, D. Dices dataset: Diversity in conversational ai evaluation for safety, 2023

  6. [8]

    A general language assistant as a laboratory for alignment, 2021

    Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. A general language assistant as a laboratory for alignment, 2021

  7. [9]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, ...

  8. [10]

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., ...

  9. [11]

    Bakker, M. A., Chadwick, M. J., Sheahan, H. R., Tessler, M. H., Campbell-Gillingham, L., Balaguer, J., McAleese, N., Glaese, A., Aslanides, J., Botvinick, M. M., and Summerfield, C. Fine-tuning language models to find agreement among humans with diverse preferences, 2022

  10. [12]

    Two concepts of liberty

    Berlin, I. Two concepts of liberty. In Four Essays on Liberty, pp. 118--172. Oxford University Press, Oxford, 1969

  11. [13]

    Bobu, A., Peng, A., Agrawal, P., Shah, J., and Dragan, A. D. Aligning robot and human representations. arXiv preprint arXiv:2302.01928, 2023

  12. [14]

    On the Opportunities and Risks of Foundation Models

    Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K. A., Davis, J., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn,...

  13. [15]

    Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., Lukošiūtė, K., Askell, A., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Olah, C., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Johnson, E., Kernion, J., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lovitt, L., Elhage, N., Schiefer, N., Joseph, N., Mer...

  14. [17]

    Language Models are Few-Shot Learners

    Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T. J., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radfor...

  15. [18]

    Studying large language models as compression algorithms for human culture

    Buttrick, N. Studying large language models as compression algorithms for human culture. Trends in Cognitive Sciences, S1364-6613(24)00001-9, 2024. doi:10.1016/j.tics.2024.01.001. Epub ahead of print

  16. [20]

    When large language models meet personalization: Perspectives of challenges and opportunities, 2023

    Chen, J., Liu, Z., Huang, X., Wu, C., Liu, Q., Jiang, G., Pu, Y., Lei, Y., Chen, X., Wang, X., Lian, D., and Chen, E. When large language models meet personalization: Perspectives of challenges and opportunities, 2023

  17. [21]

    Decision transformer: Reinforcement learning via sequence modeling, 2021

    Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling, 2021

  18. [22]

    Why ai alignment could be hard with modern deep learning

    Cotra, A. Why ai alignment could be hard with modern deep learning. https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/, 2021

  19. [23]

    Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics

    Crenshaw, K. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. The University of Chicago Legal Forum, 140: 139--167, 1989

  20. [25]

    Democracy in America

    de Tocqueville, A. Democracy in America. 1835

  21. [26]

    Durmus, E., Nyugen, K., Liao, T. I., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., Lovitt, L., McCandlish, S., Sikder, O., Tamkin, A., Thamkul, J., Kaplan, J., Clark, J., and Ganguli, D. Towards measuring the representation of subjective global opinions in language models, 2023. URL https://api.semanticsch...

  22. [27]

    Ethayarajh, K. and Jurafsky, D. Utility is in the eye of the user: A critique of nlp leaderboard design. In Conference on Empirical Methods in Natural Language Processing, 2020. URL https://api.semanticscholar.org/CorpusID:235408131

  23. [28]

    Ethayarajh, K. and Jurafsky, D. The authenticity gap in human evaluation. In Goldberg, Y., Kozareva, Z., and Zhang, Y. (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 6056--6070, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.40...

  24. [29]

    Feng, S., Park, C. Y., Liu, Y., and Tsvetkov, Y. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair nlp models, 2023

  25. [31]

    When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks, November 2023

    Fleisig, E., Abebe, R., and Klein, D. When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks, November 2023. URL http://arxiv.org/abs/2305.06626. arXiv:2305.06626 [cs]

  26. [32]

    Artificial intelligence, values, and alignment

    Gabriel, I. Artificial intelligence, values, and alignment. Minds and Machines, 30(3): 411--437, 2020. doi:10.1007/s11023-020-09539-2. URL https://doi.org/10.1007/s11023-020-09539-2

  27. [33]

    Girotra, K., Meincke, L., Terwiesch, C., and Ulrich, K. T. Ideas are dimes a dozen: Large language models for idea generation in innovation. https://ssrn.com/abstract=4526071, July 2023. Available at SSRN: https://ssrn.com/abstract=4526071 or http://dx.doi.org/10.2139/ssrn.4526071

  28. [34]

    Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., Campbell-Gillingham, L., Uesato, J., Huang, P.-S., Comanescu, R., Yang, F., See, A., Dathathri, S., Greig, R., Chen, C., Fritz, D., Elias, J. S., Green, R., Mokrá, S., Fernando, N., Wu, B., Foley, R., Young, S., Gabriel, I., Is...

  29. [36]

    Guerreiro, A. P., Fonseca, C. M., and Paquete, L. The hypervolume indicator. ACM Computing Surveys (CSUR), 54: 1--42, 2020. URL https://api.semanticscholar.org/CorpusID:218470181

  30. [37]

    Situated knowledges: The science question in feminism and the privilege of partial perspective

    Haraway, D. Situated knowledges: The science question in feminism and the privilege of partial perspective. Feminist Studies, 14(3): 575--599, 1988. ISSN 00463663. URL http://www.jstor.org/stable/3178066

  31. [38]

    Harsanyi, J. C., Selten, R., et al. A general theory of equilibrium selection in games. MIT Press Books, 1, 1988

  32. [39]

    The political ideology of conversational ai: Converging evidence on chatgpt's pro-environmental, left-libertarian orientation, 2023

    Hartmann, J., Schwenzow, J., and Witte, M. The political ideology of conversational ai: Converging evidence on chatgpt's pro-environmental, left-libertarian orientation, 2023

  33. [43]

    Aligning ai with shared human values, 2023

    Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. Aligning ai with shared human values, 2023

  34. [44]

    Henrich, J., Heine, S. J., and Norenzayan, A. The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3): 61--83, 2010. URL http://www2.psych.ubc.ca/ henrich/audiofiles/WEIRD1.mp3

  35. [45]

    Human feedback is not gold standard

    Hosking, T., Blunsom, P., and Bartolo, M. Human feedback is not gold standard. ArXiv, abs/2309.16349, 2023. URL https://api.semanticscholar.org/CorpusID:263134280

  36. [46]

    Hsieh, N.-h. and Andersson, H. Incommensurable Values. In Zalta, E. N. (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Fall 2021 edition, 2021

  37. [47]

    Hwang, E., Majumder, B. P., and Tandon, N. Aligning Language Models to User Opinions. 2023. doi:10.48550/ARXIV.2305.14929. URL https://arxiv.org/abs/2305.14929

  38. [48]

    Imundo, M. and Rapp, D. When fairness is flawed: Effects of false balance reporting and weight-of-evidence statements on beliefs and perceptions of climate change. Journal of Applied Research in Memory and Cognition, 11, 10 2021. doi:10.1016/j.jarmac.2021.10.002

  39. [49]

    Co-writing with opinionated language models affects users’ views

    Jakesch, M., Bhat, A., Buschek, D., Zalmanson, L., and Naaman, M. Co-writing with opinionated language models affects users’ views. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394215. doi:10.1145/3544548.3581196. URL https://doi.org/10.1...

  40. [50]

    Jang, J., Kim, S., Lin, B. Y., Wang, Y., Hessel, J., Zettlemoyer, L., Hajishirzi, H., Choi, Y., and Ammanabrolu, P. Personalized soups: Personalized large language model alignment via post-hoc parameter merging, 2023

  41. [51]

    Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., Zeng, F., Ng, K. Y., Dai, J., Pan, X., O'Gara, A., Lei, Y., Xu, H., Tse, B., Fu, J., McAleer, S., Yang, Y., Wang, Y., Zhu, S.-C., Guo, Y., and Gao, W. Ai alignment: A comprehensive survey, 2024

  42. [52]

    D., and Telgarsky, M

    Ji, Z., Li, J. D., and Telgarsky, M. Early-stopped neural networks are consistent, 2021

  43. [53]

    Evaluating and inducing personality in pre-trained language models, 2023

    Jiang, G., Xu, M., Zhu, S.-C., Han, W., Zhang, C., and Zhu, Y. Evaluating and inducing personality in pre-trained language models, 2023. URL https://api.semanticscholar.org/CorpusID:258865158

  44. [54]

    Communitylm: Probing partisan worldviews from language models, 2022

    Jiang, H., Beeferman, D., Roy, B., and Roy, D. Communitylm: Probing partisan worldviews from language models, 2022

  45. [55]

    L., and Choi, Y

    Jung, J., Qin, L., Welleck, S., Brahman, F., Bhagavatula, C., Bras, R. L., and Choi, Y. Maieutic prompting: Logically consistent reasoning with recursive explanations, 2022

  46. [57]

    and Gabriel, I

    Kasirzadeh, A. and Gabriel, I. In conversation with artificial intelligence: aligning language models with human values, 2022

  47. [58]

    The Morality of Pluralism

    Kekes, J. The Morality of Pluralism. Princeton University Press, Princeton, 1993

  48. [59]

    and Lee, B

    Kim, J. and Lee, B. Ai-augmented surveys: Leveraging large language models for opinion prediction in nationally representative surveys. arXiv preprint arXiv:2305.09620, 2023

  49. [60]

    Understanding the effects of rlhf on llm generalisation and diversity, 2024

    Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., and Raileanu, R. Understanding the effects of rlhf on llm generalisation and diversity, 2024

  50. [61]

    Human-centred mechanism design with democratic ai

    Koster, R., Balaguer, J., Tacchetti, A., Weinstein, A., Zhu, T., Hauser, O., Williams, D., Campbell-Gillingham, L., Thacker, P., Botvinick, M., and Summerfield, C. Human-centred mechanism design with democratic ai. Nature Human Behaviour, 6 0 (10): 0 1398--1407, 2022. doi:10.1038/s41562-022-01383-x. URL https://doi.org/10.1038/s41562-022-01383-x

  51. [62]

    Chatgpt's inconsistent moral advice influences users'judgment

    Kr \"u gel, S., Ostermaier, A., and Uhl, M. Chatgpt's inconsistent moral advice influences users'judgment. Scientific Reports, 13 0 (1): 0 4569, Apr 2023. ISSN 2045-2322. doi:10.1038/s41598-023-31341-0. URL https://doi.org/10.1038/s41598-023-31341-0

  52. [63]

    and Page, S

    Landemore, H. and Page, S. E. Deliberation and disagreement: Problem solving, prediction, and positive dissensus. Politics, philosophy & economics, 14 0 (3): 0 229--254, 2015

  53. [64]

    Scalable agent alignment via reward modeling: a research direction, 2018

    Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. Scalable agent alignment via reward modeling: a research direction, 2018

  54. [65]

    A., Liang, Y., and Bendersky, M

    Li, C., Zhang, M., Mei, Q., Wang, Y., Hombaiah, S. A., Liang, Y., and Bendersky, M. Teach llms to personalize -- an approach inspired by writing education, 2023 a

  55. [67]

    D., Ré, C., Acosta-Navas, D., Hudson, D

    Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M....

  56. [68]

    A., and Choi, Y

    Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. Dexperts: Decoding-time controlled text generation with experts and anti-experts. In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:235313967

  57. [69]

    A., and Choi, Y

    Liu, A., Swayamdipta, S., Smith, N. A., and Choi, Y. Wanli: Worker and ai collaboration for natural language inference dataset creation, 2022

  58. [70]

    Liu, A., Han, X., Wang, Y., Tsvetkov, Y., Choi, Y., and Smith, N. A. Tuning language models by proxy, 2024

  59. [71]

    F., Kumar, A., Liang, P., and Jia, R

    Liu, N. F., Kumar, A., Liang, P., and Jia, R. Are sample-efficient nlp models more robust?, 2023

  60. [72]

    Large language model guided tree-of-thought

    Long, J. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023

  61. [73]

    L., Bhagavatula, C., and Choi, Y

    Lu, X., West, P., Zellers, R., Bras, R. L., Bhagavatula, C., and Choi, Y. Neurologic decoding: (un)supervised neural text generation with predicate logic constraints. ArXiv, abs/2010.12884, 2020. URL https://api.semanticscholar.org/CorpusID:225067055

  62. [74]

    Quark: Controllable text generation with reinforced unlearning, 2022

    Lu, X., Welleck, S., Hessel, J., Jiang, L., Qin, L., West, P., Ammanabrolu, P., and Choi, Y. Quark: Controllable text generation with reinforced unlearning, 2022

  63. [75]

    Beyond chatbots: Explorellm for structured thoughts and personalized model responses, 2023

    Ma, X., Mishra, S., Liu, A., Su, S., Chen, J., Kulkarni, C., Cheng, H.-T., Le, Q., and Chi, E. Beyond chatbots: Explorellm for structured thoughts and personalized model responses, 2023

  64. [77]

    Michael, J., Mahdi, S., Rein, D., Petty, J., Dirani, J., Padmakumar, V., and Bowman, S. R. Debate helps supervise unreliable experts. arXiv preprint arXiv:2311.08702, 2023

  65. [78]

    A mbig QA : Answering ambiguous open-domain questions

    Min, S., Michael, J., Hajishirzi, H., and Zettlemoyer, L. A mbig QA : Answering ambiguous open-domain questions. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 5783--5797, Online, November 2020. Association for Computational Linguistics. doi:10.18653...

  66. [79]

    Ai alignment and social choice: Fundamental limitations and policy implications, 2023

    Mishra, A. Ai alignment and social choice: Fundamental limitations and policy implications, 2023

  67. [80]

    Fair Division and Collective Welfare

    Moulin, H. Fair Division and Collective Welfare. MIT Press, 2004

  68. [81]

    The fragmentation of value

    Nagel, T. The fragmentation of value. In Mortal Questions. Cambridge University Press, Cambridge, 1979

  69. [82]

    Openai davinci-002 model

    OpenAI. Openai davinci-002 model. https://www.openai.com, 2023 a . Accessed on Date 06/2023

  70. [83]

    Openai gpt3.5-turbo

    OpenAI. Openai gpt3.5-turbo. https://www.openai.com, 2023 b . Accessed on Date 06/2023

  71. [84]

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022

  72. [85]

    Reimagining democracy for ai

    Ovadya, A. Reimagining democracy for ai. Journal of Democracy, 34 0 (4): 0 162--170, Oct 2023

  73. [86]

    The difference: How the power of diversity creates better groups, firms, schools, and societies-new edition

    Page, S. The difference: How the power of diversity creates better groups, firms, schools, and societies-new edition. Princeton University Press, 2008

  74. [87]

    Page, S. E. The diversity bonus: How great teams pay off in the knowledge economy. Princeton University Press, 2019

  75. [88]

    S., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S., and Hendrycks, D

    Pan, A., Chan, J. S., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S., and Hendrycks, D. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark, 2023

  76. [89]

    S., Popowski, L., Cai, C

    Park, J. S., Popowski, L., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Social simulacra: Creating populated prototypes for social computing systems, 2022

  77. [90]

    S., O'Brien, J., Cai, C

    Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp.\ 1--22, 2023

  78. [91]

    K., Shu, T., Bobu, A., Shah, J., and Agrawal, P

    Peng, A., Netanyahu, A., Ho, M. K., Shu, T., Bobu, A., Shah, J., and Agrawal, P. Diagnosis, feedback, adaptation: A human-in-the-loop framework for test-time policy adaptation. In Proceedings of the 40th International Conference on Machine Learning, 2023

  79. [93]

    The state of online harassment

    Pew Research Center . The state of online harassment. Technical report, Washington, D.C. , January 2021. URL https://www.pewresearch.org/internet/2021/01/13/the-state-of-online-harassment/

  80. [94]

    arXiv preprint arXiv:2202.11705 , year=

    Qin, L., Welleck, S., Khashabi, D., and Choi, Y. Cold decoding: Energy-based constrained text generation with langevin dynamics. ArXiv, abs/2202.11705, 2022. URL https://api.semanticscholar.org/CorpusID:247058662

Showing first 80 references.