pith. machine review for the scientific record.

arxiv: 2402.05070 · v3 · submitted 2024-02-07 · 💻 cs.AI · cs.CL · cs.IR

Recognition: 3 theorem links

A Roadmap to Pluralistic Alignment

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 14:32 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.IR
keywords pluralistic alignment · AI alignment · language models · distributional pluralism · value diversity · benchmarks · Overton pluralism · steerable models

The pith

Standard alignment procedures may reduce distributional pluralism in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes three definitions for pluralistic AI: models that present a spectrum of responses, can be steered to specific perspectives, or are calibrated to population distributions. It introduces corresponding benchmark classes to measure these properties. The authors argue that standard alignment techniques are limited because they can narrow the distribution of perspectives models reflect, as shown by empirical evidence. This matters for ensuring AI serves people with diverse values and perspectives. The roadmap calls for new methods to achieve pluralistic alignment.

Core claim

By defining pluralism through Overton, steerable, and distributional models and creating multi-objective, trade-off steerable, and jury-pluralistic benchmarks, the paper demonstrates that current alignment procedures may reduce distributional pluralism, indicating a fundamental limitation in existing techniques for building AI that accommodates diverse human values.

What carries the argument

The three proposed definitions of pluralism in AI systems (Overton pluralistic, steerably pluralistic, and distributionally pluralistic) along with three benchmark classes (multi-objective, trade-off steerable, and jury-pluralistic) that together operationalize and measure pluralism.
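As a hedged illustration of what measuring distributional pluralism could look like (this is not the paper's implementation; the divergence choice, the answer options, and all numbers are assumptions introduced here), calibration to a population can be scored as one minus the Jensen-Shannon divergence between the model's answer distribution and the population's:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2, so bounded in [0, 1])
    between two discrete distributions over the same answer options."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def distributional_pluralism_score(model_dist, population_dist):
    """Higher means better calibrated to the population: 1 - JSD."""
    return 1.0 - js_divergence(model_dist, population_dist)

# Hypothetical three-option survey item: a model that collapses onto
# one option scores lower than one matching the population split.
population = [0.5, 0.3, 0.2]
collapsed = [0.98, 0.01, 0.01]
matched = [0.5, 0.3, 0.2]
assert distributional_pluralism_score(matched, population) > \
       distributional_pluralism_score(collapsed, population)
```

The same scaffolding would apply per item of a jury-pluralistic benchmark, with the population distribution estimated from diverse human ratings.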

If this is right

  • Standard alignment will lead to models that are less well-calibrated to diverse population distributions.
  • New benchmarks are required to properly incentivize and evaluate pluralistic behavior in models.
  • Alignment techniques need redesign to support presenting spectra of responses and steering to different perspectives.
  • Empirical tests can confirm if alignment reduces the ability to reflect varied human ratings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Future alignment methods might incorporate explicit pluralism objectives to counteract narrowing effects.
  • This framework could extend to other AI modalities beyond language models for broader value alignment.
  • Testing on jury-pluralistic benchmarks before and after alignment could quantify the reduction in pluralism.
  • Connections to multi-stakeholder decision making suggest similar issues in non-AI systems.

Load-bearing premise

The three definitions and benchmark classes are sufficient to fully capture and measure pluralism without overlooking important aspects of value diversity or adding biases.

What would settle it

If models after standard alignment show equal or greater calibration to diverse human populations on jury-pluralistic benchmarks compared to before alignment, that would contradict the claim of reduced pluralism.
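The before/after test described above can be sketched as follows; the distance metric, the two-option items, and every number here are hypothetical, chosen only to show the shape of the comparison, not results from the paper:

```python
def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))

def mean_calibration_error(model_dists, population_dists):
    """Average distance to the population distribution over benchmark items."""
    errors = [total_variation(m, p) for m, p in zip(model_dists, population_dists)]
    return sum(errors) / len(errors)

# Hypothetical numbers for illustration only: a base model roughly
# tracking the population split on each item, and an aligned model
# collapsing toward the modal answer.
population = [[0.6, 0.4], [0.3, 0.7]]
base = [[0.55, 0.45], [0.35, 0.65]]
aligned = [[0.95, 0.05], [0.05, 0.95]]

reduced = mean_calibration_error(aligned, population) > \
          mean_calibration_error(base, population)
# If `reduced` came out False on a real jury-pluralistic benchmark,
# that would count against the claimed loss of distributional pluralism.
```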

read the original abstract

With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a roadmap for pluralistic alignment in AI, using language models as a test bed. It formalizes three definitions of pluralistic models—Overton (spectrum of reasonable responses), steerably pluralistic (steerable to perspectives), and distributionally pluralistic (population-calibrated)—along with three benchmark classes: multi-objective, trade-off steerable, and jury-pluralistic. The central claim is that current alignment techniques may be fundamentally limited for achieving pluralism, supported by cited empirical evidence (including the authors' experiments) indicating that standard procedures can reduce distributional pluralism.

Significance. If the framework and evidence hold, this work supplies a structured conceptual toolkit for operationalizing pluralism in AI systems, which is significant for developing inclusive models serving diverse populations. The explicit linkage of alignment limitations to distributional effects, backed by referenced experiments, motivates targeted research and could influence benchmark design in the field.

major comments (2)
  1. [empirical evidence discussion] § on empirical evidence and limitations of alignment: The claim that standard alignment reduces distributional pluralism is load-bearing for the roadmap's motivation, yet it rests primarily on referenced experiments rather than new derivations or comprehensive data presented here; without explicit discussion of how the proposed jury-pluralistic benchmarks were applied to isolate this effect from measurement confounds, the reduction argument's robustness is difficult to evaluate.
  2. [definitions and benchmarks] § formalizing the three definitions and benchmark classes: The sufficiency of the Overton/steerable/distributional definitions and the three benchmark classes for capturing pluralism is assumed without addressing potential gaps, such as whether jury-pluralistic benchmarks embed annotator biases that could distort population calibration or miss contextual value nuances not reducible to response spectra or steerability.
minor comments (2)
  1. [abstract] The abstract and introduction could more explicitly distinguish the paper's novel contributions (the formalizations) from the synthesis of existing evidence on alignment limitations.
  2. [throughout] Notation for the three model types and benchmark classes should be introduced with consistent abbreviations or symbols to improve readability when referenced across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our roadmap paper for pluralistic alignment. We address each major comment below and indicate the revisions we will make to improve clarity and robustness.

read point-by-point responses
  1. Referee: The claim that standard alignment reduces distributional pluralism is load-bearing for the roadmap's motivation, yet it rests primarily on referenced experiments rather than new derivations or comprehensive data presented here; without explicit discussion of how the proposed jury-pluralistic benchmarks were applied to isolate this effect from measurement confounds, the reduction argument's robustness is difficult to evaluate.

    Authors: We agree that the empirical claim relies on referenced experiments (including our prior work) rather than new data in this conceptual roadmap. We will revise the manuscript to add a dedicated subsection that explicitly describes how jury-pluralistic benchmarks were applied in the cited studies, including steps taken to isolate alignment effects from confounds such as annotator variability and measurement artifacts. This will make the robustness of the argument easier to evaluate. revision: yes

  2. Referee: The sufficiency of the Overton/steerable/distributional definitions and the three benchmark classes for capturing pluralism is assumed without addressing potential gaps, such as whether jury-pluralistic benchmarks embed annotator biases that could distort population calibration or miss contextual value nuances not reducible to response spectra or steerability.

    Authors: We acknowledge that our framework does not fully address all potential gaps, including annotator biases in jury-pluralistic benchmarks and the difficulty of capturing irreducible contextual value nuances. We will expand the discussion section with a new subsection on limitations, providing concrete examples of these issues and suggesting mitigation approaches for future benchmarks. This will better delineate the scope of the proposed definitions and classes. revision: yes

Circularity Check

0 steps flagged

Conceptual definitions and benchmarks introduced independently with no reduction to fitted inputs or self-referential loops

full rationale

The paper is a conceptual roadmap that proposes three definitions of pluralism (Overton, steerable, distributional) and three benchmark classes (multi-objective, trade-off steerable, jury-pluralistic) as independent formalizations. The central claim that standard alignment may reduce distributional pluralism is supported by referenced empirical evidence from the authors' experiments and external work rather than by any derivation that reduces predictions to the definitions themselves. No equations, fitted parameters, or self-citation chains are load-bearing for the framework; the definitions do not presuppose the limitation result. The self-citations incur only a minor score penalty, and the core argument remains checkable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The framework rests on domain assumptions about the desirability and measurability of pluralism, with newly introduced conceptual categories that lack independent empirical grounding beyond referenced studies.

axioms (1)
  • domain assumption: AI systems should be designed to serve people with diverse values and perspectives
    Stated as a foundational motivation in the abstract.
invented entities (3)
  • Overton pluralistic models (no independent evidence)
    purpose: Present a spectrum of reasonable responses
    Newly formalized category in the paper.
  • Steerably pluralistic models (no independent evidence)
    purpose: Allow steering to reflect certain perspectives
    Newly formalized category in the paper.
  • Distributionally pluralistic models (no independent evidence)
    purpose: Calibrate outputs to match a population distribution
    Newly formalized category in the paper.

pith-pipeline@v0.9.0 · 5570 in / 1219 out tokens · 60046 ms · 2026-05-16T14:32:52.308924+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LawOfExistence unity_unique_existent echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    we highlight empirical evidence... that standard alignment procedures might reduce distributional pluralism in models

  • InevitabilityStructure economic_inevitability refines

    Relation between the paper passage and the cited Recognition theorem.

    Distributionally pluralistic models that are well-calibrated to a given population in distribution

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences

    cs.LG 2026-05 unverdicted novelty 7.0

    Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.

  2. Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

    cs.AI 2026-05 unverdicted novelty 7.0

    A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.

  3. Three Models of RLHF Annotation: Extension, Evidence, and Authority

    cs.CY 2026-04 unverdicted novelty 7.0

    RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.

  4. Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

    cs.CL 2026-03 conditional novelty 7.0

    Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.

  5. Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

    cs.CL 2026-05 unverdicted novelty 6.0

    DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.

  6. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 6.0

    Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

  7. Understanding Annotator Safety Policy with Interpretability

    cs.AI 2026-05 unverdicted novelty 6.0

    Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

  8. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 6.0

    MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.

  9. Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization

    cs.CL 2026-04 unverdicted novelty 6.0

    Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.

  10. Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.

  11. Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics

    cs.CY 2026-04 unverdicted novelty 6.0

    Community members from the UK blind community, Kerala, and Tamil Nadu helped define what counts as culturally appropriate depictions of artifacts, and the authors tested whether those definitions can be turned into re...

  12. Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment

    cs.AI 2026-02 unverdicted novelty 6.0

    VAT quantifies value trade-offs in LLM alignment by measuring how alignment-induced changes propagate across interconnected values using a Schwartz-grounded dataset.

  13. Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task

    cs.CL 2026-02 unverdicted novelty 6.0

    LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.

  14. Measuring Human Preferences in RLHF is a Social Science Problem

    cs.HC 2026-01 unverdicted novelty 6.0

    RLHF preference measurement is a social science validity problem because annotators routinely produce non-attitudes, constructed responses, and artifacts rather than stable values.

  15. The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor

    cs.HC 2026-01 conditional novelty 6.0

    LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.

  16. When to Ask a Question: Understanding Communication Strategies in Generative AI Tools

    cs.GT 2026-05 unverdicted novelty 5.0

    A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.

  17. Multilingual Safety Alignment via Self-Distillation

    cs.LG 2026-05 unverdicted novelty 5.0

    MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.

  18. Quantifying and Predicting Disagreement in Graded Human Ratings

    cs.CL 2026-05 unverdicted novelty 5.0

    Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.

  19. Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

    cs.CY 2026-04 unverdicted novelty 5.0

    AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.

  20. Positive Alignment: Artificial Intelligence for Human Flourishing

    cs.AI 2026-05 unverdicted novelty 4.0

    Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.

Reference graph

Works this paper leans on

282 extracted references · 282 canonical work pages · cited by 18 Pith papers · 16 internal anchors

  1. [2]

    Achiam, O. J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.-L., Brockman, G.,...

  2. [3]

    Aher, G. V., Arriaga, R. I., and Kalai, A. T. Using large language models to simulate multiple humans and replicate human subject studies. In International Conference on Machine Learning, pp. 337--371. PMLR, 2023

  3. [4]

    Introducing claude, 2023

    Anthropic. Introducing claude, 2023. URL https://www.anthropic.com/index/introducing-claude

  4. [5]

    Argyle, L., Busby, E., Fulda, N., Gubler, J., Rytting, C., and Wingate, D. Out of one, many: Using language models to simulate human samples. Political Analysis, 31: 1--15, 2023. doi:10.1017/pan.2023.2

  5. [7]

    Aroyo, L., Taylor, A. S., Diaz, M., Homan, C. M., Parrish, A., Serapio-Garcia, G., Prabhakaran, V., and Wang, D. Dices dataset: Diversity in conversational ai evaluation for safety, 2023

  6. [8]

    A general language assistant as a laboratory for alignment, 2021

    Askell, A., Bai, Y., Chen, A., Drain, D., Ganguli, D., Henighan, T., Jones, A., Joseph, N., Mann, B., DasSarma, N., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Kernion, J., Ndousse, K., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, C., and Kaplan, J. A general language assistant as a laboratory for alignment, 2021

  7. [9]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022a

    Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., Joseph, N., Kadavath, S., Kernion, J., Conerly, T., El-Showk, S., Elhage, N., Hatfield-Dodds, Z., Hernandez, D., Hume, T., Johnston, S., Kravec, S., Lovitt, L., Nanda, N., Olsson, C., Amodei, D., Brown, T., Clark, J., McCandlish, S., Olah, ...

  8. [10]

    Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lukosuite, K., Lovitt, L., Sellitto, M., Elhage, N., Schiefer, N., ...

  9. [11]

    Bakker, M. A., Chadwick, M. J., Sheahan, H. R., Tessler, M. H., Campbell-Gillingham, L., Balaguer, J., McAleese, N., Glaese, A., Aslanides, J., Botvinick, M. M., and Summerfield, C. Fine-tuning language models to find agreement among humans with diverse preferences, 2022

  10. [12]

    Two concepts of liberty

    Berlin, I. Two concepts of liberty. In Four Essays on Liberty, pp. 118--172. Oxford University Press, Oxford, 1969

  11. [13]

    Bobu, A., Peng, A., Agrawal, P., Shah, J., and Dragan, A. D. Aligning robot and human representations. arXiv preprint arXiv:2302.01928, 2023

  12. [14]

    On the Opportunities and Risks of Foundation Models

    Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N. S., Chen, A. S., Creel, K. A., Davis, J., Demszky, D., Donahue, C., Doumbouya, M., Durmus, E., Ermon, S., Etchemendy, J., Ethayarajh, K., Fei-Fei, L., Finn,...

  13. [15]

    Bowman, S. R., Hyun, J., Perez, E., Chen, E., Pettit, C., Heiner, S., Lukošiūtė, K., Askell, A., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Olah, C., Amodei, D., Amodei, D., Drain, D., Li, D., Tran-Johnson, E., Kernion, J., Kerr, J., Mueller, J., Ladish, J., Landau, J., Ndousse, K., Lovitt, L., Elhage, N., Schiefer, N., Joseph, N., Mer...

  14. [17]

    Language Models are Few-Shot Learners

    Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T. J., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radfor...

  15. [18]

    Studying large language models as compression algorithms for human culture

    Buttrick, N. Studying large language models as compression algorithms for human culture. Trends in Cognitive Sciences, S1364-6613(24)00001-9, 2024. doi:10.1016/j.tics.2024.01.001. Epub ahead of print

  16. [20]

    When large language models meet personalization: Perspectives of challenges and opportunities, 2023

    Chen, J., Liu, Z., Huang, X., Wu, C., Liu, Q., Jiang, G., Pu, Y., Lei, Y., Chen, X., Wang, X., Lian, D., and Chen, E. When large language models meet personalization: Perspectives of challenges and opportunities, 2023

  17. [21]

    Decision transformer: Reinforcement learning via sequence modeling, 2021

    Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A., and Mordatch, I. Decision transformer: Reinforcement learning via sequence modeling, 2021

  18. [22]

    Why ai alignment could be hard with modern deep learning

    Cotra, A. Why ai alignment could be hard with modern deep learning. https://www.cold-takes.com/why-ai-alignment-could-be-hard-with-modern-deep-learning/, 2021

  19. [23]

    Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics

    Crenshaw, K. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. The University of Chicago Legal Forum, 140: 139--167, 1989

  20. [25]

    Democracy in America

    de Tocqueville, A. Democracy in America. 1835

  21. [26]

    Durmus, E., Nyugen, K., Liao, T. I., Schiefer, N., Askell, A., Bakhtin, A., Chen, C., Hatfield-Dodds, Z., Hernandez, D., Joseph, N., Lovitt, L., McCandlish, S., Sikder, O., Tamkin, A., Thamkul, J., Kaplan, J., Clark, J., and Ganguli, D. Towards measuring the representation of subjective global opinions in language models, 2023. URL https://api.semanticsch...

  22. [27]

    Ethayarajh, K. and Jurafsky, D. Utility is in the eye of the user: A critique of nlp leaderboard design. In Conference on Empirical Methods in Natural Language Processing, 2020. URL https://api.semanticscholar.org/CorpusID:235408131

  23. [28]

    Ethayarajh, K. and Jurafsky, D. The authenticity gap in human evaluation. In Goldberg, Y., Kozareva, Z., and Zhang, Y. (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 6056--6070, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653/v1/2022.emnlp-main.40...

  24. [29]

    Feng, S., Park, C. Y., Liu, Y., and Tsvetkov, Y. From pretraining data to language models to downstream tasks: Tracking the trails of political biases leading to unfair nlp models, 2023

  25. [31]

    When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks, November 2023

    Fleisig, E., Abebe, R., and Klein, D. When the Majority is Wrong: Modeling Annotator Disagreement for Subjective Tasks, November 2023. URL http://arxiv.org/abs/2305.06626. arXiv:2305.06626 [cs]

  26. [32]

    Artificial intelligence, values, and alignment

    Gabriel, I. Artificial intelligence, values, and alignment. Minds and Machines, 30(3): 411--437, 2020. doi:10.1007/s11023-020-09539-2. URL https://doi.org/10.1007/s11023-020-09539-2

  27. [33]

    Girotra, K., Meincke, L., Terwiesch, C., and Ulrich, K. T. Ideas are dimes a dozen: Large language models for idea generation in innovation. https://ssrn.com/abstract=4526071, July 2023. Available at SSRN: https://ssrn.com/abstract=4526071 or http://dx.doi.org/10.2139/ssrn.4526071

  28. [34]

    Glaese, A., McAleese, N., Trębacz, M., Aslanides, J., Firoiu, V., Ewalds, T., Rauh, M., Weidinger, L., Chadwick, M., Thacker, P., Campbell-Gillingham, L., Uesato, J., Huang, P.-S., Comanescu, R., Yang, F., See, A., Dathathri, S., Greig, R., Chen, C., Fritz, D., Elias, J. S., Green, R., Mokrá, S., Fernando, N., Wu, B., Foley, R., Young, S., Gabriel, I., Is...

  29. [36]

    Guerreiro, A. P., Fonseca, C. M., and Paquete, L. The hypervolume indicator. ACM Computing Surveys (CSUR), 54: 1--42, 2020. URL https://api.semanticscholar.org/CorpusID:218470181

  30. [37]

    Situated knowledges: The science question in feminism and the privilege of partial perspective

    Haraway, D. Situated knowledges: The science question in feminism and the privilege of partial perspective. Feminist Studies, 14(3): 575--599, 1988. ISSN 00463663. URL http://www.jstor.org/stable/3178066

  31. [38]

    Harsanyi, J. C., Selten, R., et al. A general theory of equilibrium selection in games. MIT Press Books, 1, 1988

  32. [39]

    The political ideology of conversational ai: Converging evidence on chatgpt's pro-environmental, left-libertarian orientation, 2023

    Hartmann, J., Schwenzow, J., and Witte, M. The political ideology of conversational ai: Converging evidence on chatgpt's pro-environmental, left-libertarian orientation, 2023

  33. [43]

    Aligning ai with shared human values, 2023

    Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. Aligning ai with shared human values, 2023

  34. [44]

    Henrich, J., Heine, S. J., and Norenzayan, A. The weirdest people in the world? Behavioral and Brain Sciences, 33(2-3): 61--83, 2010. URL http://www2.psych.ubc.ca/ henrich/audiofiles/WEIRD1.mp3

  35. [45]

    Human feedback is not gold standard

    Hosking, T., Blunsom, P., and Bartolo, M. Human feedback is not gold standard. ArXiv, abs/2309.16349, 2023. URL https://api.semanticscholar.org/CorpusID:263134280

  36. [46]

    Hsieh, N.-h. and Andersson, H. Incommensurable Values. In Zalta, E. N. (ed.), The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Fall 2021 edition, 2021

  37. [47]

    Hwang, E., Majumder, B. P., and Tandon, N. Aligning Language Models to User Opinions. 2023. doi:10.48550/ARXIV.2305.14929. URL https://arxiv.org/abs/2305.14929

  38. [48]

    Imundo, M. and Rapp, D. When fairness is flawed: Effects of false balance reporting and weight-of-evidence statements on beliefs and perceptions of climate change. Journal of Applied Research in Memory and Cognition, 11, 10 2021. doi:10.1016/j.jarmac.2021.10.002

  39. [49]

    Co-writing with opinionated language models affects users’ views

    Jakesch, M., Bhat, A., Buschek, D., Zalmanson, L., and Naaman, M. Co-writing with opinionated language models affects users’ views. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, CHI '23, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9781450394215. doi:10.1145/3544548.3581196. URL https://doi.org/10.1...

  40. [50]

    Jang, J., Kim, S., Lin, B. Y., Wang, Y., Hessel, J., Zettlemoyer, L., Hajishirzi, H., Choi, Y., and Ammanabrolu, P. Personalized soups: Personalized large language model alignment via post-hoc parameter merging, 2023

  41. [51]

    Ji, J., Qiu, T., Chen, B., Zhang, B., Lou, H., Wang, K., Duan, Y., He, Z., Zhou, J., Zhang, Z., Zeng, F., Ng, K. Y., Dai, J., Pan, X., O'Gara, A., Lei, Y., Xu, H., Tse, B., Fu, J., McAleer, S., Yang, Y., Wang, Y., Zhu, S.-C., Guo, Y., and Gao, W. Ai alignment: A comprehensive survey, 2024

  42. [52]

    D., and Telgarsky, M

    Ji, Z., Li, J. D., and Telgarsky, M. Early-stopped neural networks are consistent, 2021

  43. [53]

    Evaluating and inducing personality in pre-trained language models, 2023

    Jiang, G., Xu, M., Zhu, S.-C., Han, W., Zhang, C., and Zhu, Y. Evaluating and inducing personality in pre-trained language models, 2023. URL https://api.semanticscholar.org/CorpusID:258865158

  44. [54]

    Communitylm: Probing partisan worldviews from language models, 2022

    Jiang, H., Beeferman, D., Roy, B., and Roy, D. Communitylm: Probing partisan worldviews from language models, 2022

  45. [55]

    L., and Choi, Y

    Jung, J., Qin, L., Welleck, S., Brahman, F., Bhagavatula, C., Bras, R. L., and Choi, Y. Maieutic prompting: Logically consistent reasoning with recursive explanations, 2022

  46. [57]

    and Gabriel, I

    Kasirzadeh, A. and Gabriel, I. In conversation with artificial intelligence: aligning language models with human values, 2022

  47. [58]

    The Morality of Pluralism

    Kekes, J. The Morality of Pluralism. Princeton University Press, Princeton, 1993

  48. [59]

    and Lee, B

    Kim, J. and Lee, B. Ai-augmented surveys: Leveraging large language models for opinion prediction in nationally representative surveys. arXiv preprint arXiv:2305.09620, 2023

  49. [60]

    Understanding the effects of rlhf on llm generalisation and diversity, 2024

    Kirk, R., Mediratta, I., Nalmpantis, C., Luketina, J., Hambro, E., Grefenstette, E., and Raileanu, R. Understanding the effects of rlhf on llm generalisation and diversity, 2024

  50. [61]

    Human-centred mechanism design with democratic ai

    Koster, R., Balaguer, J., Tacchetti, A., Weinstein, A., Zhu, T., Hauser, O., Williams, D., Campbell-Gillingham, L., Thacker, P., Botvinick, M., and Summerfield, C. Human-centred mechanism design with democratic ai. Nature Human Behaviour, 6 0 (10): 0 1398--1407, 2022. doi:10.1038/s41562-022-01383-x. URL https://doi.org/10.1038/s41562-022-01383-x

  51. [62]

    Chatgpt's inconsistent moral advice influences users'judgment

    Kr \"u gel, S., Ostermaier, A., and Uhl, M. Chatgpt's inconsistent moral advice influences users'judgment. Scientific Reports, 13 0 (1): 0 4569, Apr 2023. ISSN 2045-2322. doi:10.1038/s41598-023-31341-0. URL https://doi.org/10.1038/s41598-023-31341-0

  52. [63]

    and Page, S

    Landemore, H. and Page, S. E. Deliberation and disagreement: Problem solving, prediction, and positive dissensus. Politics, philosophy & economics, 14 0 (3): 0 229--254, 2015

  53. [64]

    Scalable agent alignment via reward modeling: a research direction, 2018

    Leike, J., Krueger, D., Everitt, T., Martic, M., Maini, V., and Legg, S. Scalable agent alignment via reward modeling: a research direction, 2018

  54. [65]

    A., Liang, Y., and Bendersky, M

    Li, C., Zhang, M., Mei, Q., Wang, Y., Hombaiah, S. A., Liang, Y., and Bendersky, M. Teach llms to personalize -- an approach inspired by writing education, 2023 a

  55. [67]

    D., Ré, C., Acosta-Navas, D., Hudson, D

    Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., Newman, B., Yuan, B., Yan, B., Zhang, C., Cosgrove, C., Manning, C. D., Ré, C., Acosta-Navas, D., Hudson, D. A., Zelikman, E., Durmus, E., Ladhak, F., Rong, F., Ren, H., Yao, H., Wang, J., Santhanam, K., Orr, L., Zheng, L., Yuksekgonul, M....

  56. [68]

    A., and Choi, Y

    Liu, A., Sap, M., Lu, X., Swayamdipta, S., Bhagavatula, C., Smith, N. A., and Choi, Y. Dexperts: Decoding-time controlled text generation with experts and anti-experts. In Annual Meeting of the Association for Computational Linguistics, 2021. URL https://api.semanticscholar.org/CorpusID:235313967

  57. [69]

    A., and Choi, Y

    Liu, A., Swayamdipta, S., Smith, N. A., and Choi, Y. Wanli: Worker and ai collaboration for natural language inference dataset creation, 2022

  58. [70]

    Liu, A., Han, X., Wang, Y., Tsvetkov, Y., Choi, Y., and Smith, N. A. Tuning language models by proxy, 2024

  59. [71]

    F., Kumar, A., Liang, P., and Jia, R

    Liu, N. F., Kumar, A., Liang, P., and Jia, R. Are sample-efficient nlp models more robust?, 2023

  60. [72]

    Large language model guided tree-of-thought

    Long, J. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291, 2023

  61. [73]

    L., Bhagavatula, C., and Choi, Y

    Lu, X., West, P., Zellers, R., Bras, R. L., Bhagavatula, C., and Choi, Y. Neurologic decoding: (un)supervised neural text generation with predicate logic constraints. ArXiv, abs/2010.12884, 2020. URL https://api.semanticscholar.org/CorpusID:225067055

  62. [74]

    Quark: Controllable text generation with reinforced unlearning, 2022

    Lu, X., Welleck, S., Hessel, J., Jiang, L., Qin, L., West, P., Ammanabrolu, P., and Choi, Y. Quark: Controllable text generation with reinforced unlearning, 2022

  63. [75]

    Beyond chatbots: Explorellm for structured thoughts and personalized model responses, 2023

    Ma, X., Mishra, S., Liu, A., Su, S., Chen, J., Kulkarni, C., Cheng, H.-T., Le, Q., and Chi, E. Beyond chatbots: Explorellm for structured thoughts and personalized model responses, 2023

  64. [77]

    Michael, J., Mahdi, S., Rein, D., Petty, J., Dirani, J., Padmakumar, V., and Bowman, S. R. Debate helps supervise unreliable experts. arXiv preprint arXiv:2311.08702, 2023

  65. [78]

    A mbig QA : Answering ambiguous open-domain questions

    Min, S., Michael, J., Hajishirzi, H., and Zettlemoyer, L. A mbig QA : Answering ambiguous open-domain questions. In Webber, B., Cohn, T., He, Y., and Liu, Y. (eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 5783--5797, Online, November 2020. Association for Computational Linguistics. doi:10.18653...

  66. [79]

    Ai alignment and social choice: Fundamental limitations and policy implications, 2023

    Mishra, A. Ai alignment and social choice: Fundamental limitations and policy implications, 2023

  67. [80]

    Fair Division and Collective Welfare

    Moulin, H. Fair Division and Collective Welfare. MIT Press, 2004

  68. [81]

    The fragmentation of value

    Nagel, T. The fragmentation of value. In Mortal Questions. Cambridge University Press, Cambridge, 1979

  69. [82]

    Openai davinci-002 model

    OpenAI. Openai davinci-002 model. https://www.openai.com, 2023 a . Accessed on Date 06/2023

  70. [83]

    Openai gpt3.5-turbo

    OpenAI. Openai gpt3.5-turbo. https://www.openai.com, 2023 b . Accessed on Date 06/2023

  71. [84]

    Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., and Lowe, R. Training language models to follow instructions with human feedback, 2022

  72. [85]

    Reimagining democracy for ai

    Ovadya, A. Reimagining democracy for ai. Journal of Democracy, 34 0 (4): 0 162--170, Oct 2023

  73. [86]

    The difference: How the power of diversity creates better groups, firms, schools, and societies-new edition

    Page, S. The difference: How the power of diversity creates better groups, firms, schools, and societies-new edition. Princeton University Press, 2008

  74. [87]

    Page, S. E. The diversity bonus: How great teams pay off in the knowledge economy. Princeton University Press, 2019

  75. [88]

    S., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S., and Hendrycks, D

    Pan, A., Chan, J. S., Zou, A., Li, N., Basart, S., Woodside, T., Ng, J., Zhang, H., Emmons, S., and Hendrycks, D. Do the rewards justify the means? measuring trade-offs between rewards and ethical behavior in the machiavelli benchmark, 2023

  76. [89]

    S., Popowski, L., Cai, C

    Park, J. S., Popowski, L., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Social simulacra: Creating populated prototypes for social computing systems, 2022

  77. [90]

    S., O'Brien, J., Cai, C

    Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp.\ 1--22, 2023

  78. [91]

    K., Shu, T., Bobu, A., Shah, J., and Agrawal, P

    Peng, A., Netanyahu, A., Ho, M. K., Shu, T., Bobu, A., Shah, J., and Agrawal, P. Diagnosis, feedback, adaptation: A human-in-the-loop framework for test-time policy adaptation. In Proceedings of the 40th International Conference on Machine Learning, 2023

  79. [93]

    The state of online harassment

    Pew Research Center . The state of online harassment. Technical report, Washington, D.C. , January 2021. URL https://www.pewresearch.org/internet/2021/01/13/the-state-of-online-harassment/

  80. [94]

    arXiv preprint arXiv:2202.11705 , year=

    Qin, L., Welleck, S., Khashabi, D., and Choi, Y. Cold decoding: Energy-based constrained text generation with langevin dynamics. ArXiv, abs/2202.11705, 2022. URL https://api.semanticscholar.org/CorpusID:247058662

Showing first 80 references.