Recognition: 3 theorem links
A Roadmap to Pluralistic Alignment
Pith reviewed 2026-05-16 14:32 UTC · model grok-4.3
The pith
Standard alignment procedures may reduce distributional pluralism in language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By defining pluralism through Overton, steerable, and distributional models, and by proposing multi-objective, trade-off steerable, and jury-pluralistic benchmarks, the paper argues that current alignment procedures may reduce distributional pluralism, pointing to a fundamental limitation in existing techniques for building AI that accommodates diverse human values.
What carries the argument
The three proposed definitions of pluralism in AI systems (Overton pluralistic, steerably pluralistic, and distributionally pluralistic), together with three benchmark classes (multi-objective, trade-off steerable, and jury-pluralistic), which jointly operationalize and measure pluralism.
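One minimal way to write these down (a sketch in our own notation; the response set R(x), perspective variable c, divergence D, and tolerance ε are illustrative assumptions rather than the paper's definitions):

```latex
% Sketch formalization (our notation, not necessarily the paper's).
% p_\theta(y \mid x): model response distribution; P: target population.
\begin{align*}
\text{Overton:}\quad        & \operatorname{supp}\, p_\theta(\cdot \mid x) \supseteq R(x)
    && \text{($R(x)$ = set of reasonable responses to $x$)} \\
\text{Steerable:}\quad      & p_\theta(\cdot \mid x, c) \approx p_c(\cdot \mid x)
    && \text{for each supported perspective $c$} \\
\text{Distributional:}\quad & \mathbb{E}_x\!\left[ D\!\left( p_P(\cdot \mid x) \,\middle\|\, p_\theta(\cdot \mid x) \right) \right] \le \varepsilon
    && \text{for a divergence $D$ and tolerance $\varepsilon$}
\end{align*}
```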
If this is right
- Standard alignment will lead to models that are less well-calibrated to diverse population distributions.
- New benchmarks are required to properly incentivize and evaluate pluralistic behavior in models.
- Alignment techniques need redesign to support presenting spectra of responses and steering to different perspectives.
- Empirical tests can confirm if alignment reduces the ability to reflect varied human ratings.
Where Pith is reading between the lines
- Future alignment methods might incorporate explicit pluralism objectives to counteract narrowing effects.
- This framework could extend to other AI modalities beyond language models for broader value alignment.
- Testing on jury-pluralistic benchmarks before and after alignment could quantify the reduction in pluralism (a measurement sketch follows this list).
- Connections to multi-stakeholder decision making suggest similar issues in non-AI systems.
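As a concrete sketch of that before-and-after measurement (the data layout, metric choice, and function names here are our illustrative assumptions, not the paper's protocol), one could report the mean Jensen-Shannon divergence between jury and model rating distributions:

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """Jensen-Shannon divergence (in bits) between two discrete distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def pluralism_gap(jury_dists, model_dists) -> float:
    """Mean divergence between human jury rating distributions and model
    rating distributions over benchmark items; lower means the model is
    better calibrated to the population."""
    return float(np.mean([js_divergence(j, m)
                          for j, m in zip(jury_dists, model_dists)]))

# Hypothetical layout: per item, a rating distribution over a 5-point
# scale, estimated for the jury and for the model before/after alignment.
# gap_pre  = pluralism_gap(jury, pre_alignment)
# gap_post = pluralism_gap(jury, post_alignment)
# The paper's claim predicts gap_post > gap_pre.
```

Any proper divergence would serve here; JS is used only because it is symmetric and bounded.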
Load-bearing premise
The three definitions and benchmark classes suffice to capture and measure pluralism without overlooking important aspects of value diversity or introducing biases of their own.
What would settle it
If models after standard alignment show equal or greater calibration to diverse human populations on jury-pluralistic benchmarks compared to before alignment, that would contradict the claim of reduced pluralism.
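Stated as an explicit decision rule, in our notation, with Δ(θ) the mean jury-vs-model calibration gap from the sketch above:

```latex
% Decision rule for the settling test (sketch, our notation):
% \theta_0 = pre-alignment model, \theta_1 = post-alignment model,
% \Delta(\theta) = mean jury-vs-model divergence on the benchmark.
\Delta(\theta_1) \le \Delta(\theta_0) \;\Rightarrow\; \text{claim contradicted},
\qquad
\Delta(\theta_1) > \Delta(\theta_0) \;\Rightarrow\; \text{claim supported on that benchmark}.
```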
Original abstract
With increased power and prevalence of AI systems, it is ever more critical that AI systems are designed to serve all, i.e., people with diverse values and perspectives. However, aligning models to serve pluralistic human values remains an open research question. In this piece, we propose a roadmap to pluralistic alignment, specifically using language models as a test bed. We identify and formalize three possible ways to define and operationalize pluralism in AI systems: 1) Overton pluralistic models that present a spectrum of reasonable responses; 2) Steerably pluralistic models that can steer to reflect certain perspectives; and 3) Distributionally pluralistic models that are well-calibrated to a given population in distribution. We also formalize and discuss three possible classes of pluralistic benchmarks: 1) Multi-objective benchmarks, 2) Trade-off steerable benchmarks, which incentivize models to steer to arbitrary trade-offs, and 3) Jury-pluralistic benchmarks which explicitly model diverse human ratings. We use this framework to argue that current alignment techniques may be fundamentally limited for pluralistic AI; indeed, we highlight empirical evidence, both from our own experiments and from other work, that standard alignment procedures might reduce distributional pluralism in models, motivating the need for further research on pluralistic alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a roadmap for pluralistic alignment in AI, using language models as a test bed. It formalizes three definitions of pluralistic models—Overton (spectrum of reasonable responses), steerably pluralistic (steerable to perspectives), and distributionally pluralistic (population-calibrated)—along with three benchmark classes: multi-objective, trade-off steerable, and jury-pluralistic. The central claim is that current alignment techniques may be fundamentally limited for achieving pluralism, supported by cited empirical evidence (including the authors' experiments) indicating that standard procedures can reduce distributional pluralism.
Significance. If the framework and evidence hold, this work supplies a structured conceptual toolkit for operationalizing pluralism in AI systems, which is significant for developing inclusive models serving diverse populations. The explicit linkage of alignment limitations to distributional effects, backed by referenced experiments, motivates targeted research and could influence benchmark design in the field.
Major comments (2)
- [empirical evidence discussion] § on empirical evidence and limitations of alignment: The claim that standard alignment reduces distributional pluralism is load-bearing for the roadmap's motivation, yet it rests primarily on referenced experiments rather than on new derivations or comprehensive data presented here. Without explicit discussion of how the proposed jury-pluralistic benchmarks were applied to isolate this effect from measurement confounds, the robustness of the reduction argument is difficult to evaluate (a generic confound control is sketched after these comments).
- [definitions and benchmarks] § formalizing the three definitions and benchmark classes: The sufficiency of the Overton/steerable/distributional definitions and the three benchmark classes for capturing pluralism is assumed without addressing potential gaps, such as whether jury-pluralistic benchmarks embed annotator biases that could distort population calibration or miss contextual value nuances not reducible to response spectra or steerability.
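On the confound point, a generic control (our suggestion, not the paper's stated procedure) is to bootstrap the annotator pool so the calibration gap carries an interval reflecting annotator-sampling noise; names and data formats below are hypothetical:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon  # JS distance = sqrt of JS divergence

def bootstrap_gap_interval(ratings, model_dist, n_boot=1000, seed=0):
    """95% bootstrap interval on the jury-vs-model calibration gap for one
    benchmark item. `ratings` holds per-annotator labels in {0..K-1};
    `model_dist` is the model's distribution over the K labels.
    (Hypothetical data format, used only for illustration.)"""
    rng = np.random.default_rng(seed)
    ratings = np.asarray(ratings)
    k = len(model_dist)
    gaps = []
    for _ in range(n_boot):
        resampled = rng.choice(ratings, size=ratings.size, replace=True)
        jury = np.bincount(resampled, minlength=k) / ratings.size
        gaps.append(jensenshannon(jury, model_dist))
    return np.percentile(gaps, [2.5, 97.5])

# Non-overlapping pre- vs post-alignment intervals would suggest the
# measured drop in calibration is not an annotator-sampling artifact.
```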
Minor comments (2)
- [abstract] The abstract and introduction could more explicitly distinguish the paper's novel contributions (the formalizations) from the synthesis of existing evidence on alignment limitations.
- [throughout] Notation for the three model types and benchmark classes should be introduced with consistent abbreviations or symbols to improve readability when referenced across sections.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our roadmap paper for pluralistic alignment. We address each major comment below and indicate the revisions we will make to improve clarity and robustness.
Point-by-point responses
- Referee: The claim that standard alignment reduces distributional pluralism is load-bearing for the roadmap's motivation, yet it rests primarily on referenced experiments rather than on new derivations or comprehensive data presented here. Without explicit discussion of how the proposed jury-pluralistic benchmarks were applied to isolate this effect from measurement confounds, the robustness of the reduction argument is difficult to evaluate.
- Authors: We agree that the empirical claim relies on referenced experiments (including our prior work) rather than new data in this conceptual roadmap. We will revise the manuscript to add a dedicated subsection that explicitly describes how jury-pluralistic benchmarks were applied in the cited studies, including steps taken to isolate alignment effects from confounds such as annotator variability and measurement artifacts. This will make the robustness of the argument easier to evaluate. Revision: yes.
- Referee: The sufficiency of the Overton/steerable/distributional definitions and the three benchmark classes for capturing pluralism is assumed without addressing potential gaps, such as whether jury-pluralistic benchmarks embed annotator biases that could distort population calibration or miss contextual value nuances not reducible to response spectra or steerability.
- Authors: We acknowledge that our framework does not fully address all potential gaps, including annotator biases in jury-pluralistic benchmarks and the difficulty of capturing irreducible contextual value nuances. We will expand the discussion section with a new subsection on limitations, providing concrete examples of these issues and suggesting mitigation approaches for future benchmarks. This will better delineate the scope of the proposed definitions and classes. Revision: yes.
Circularity Check
Conceptual definitions and benchmarks introduced independently with no reduction to fitted inputs or self-referential loops
Full rationale
The paper is a conceptual roadmap that proposes three definitions of pluralism (Overton, steerable, distributional) and three benchmark classes (multi-objective, trade-off steerable, jury-pluralistic) as independent formalizations. The central claim that standard alignment may reduce distributional pluralism is supported by referenced empirical evidence from the authors' experiments and external work rather than by any derivation that reduces predictions to the definitions themselves. No equations, fitted parameters, or self-citation chains are load-bearing for the framework; the definitions do not presuppose the limitation result. This yields a minor self-citation score, but the core argument remains grounded in external benchmark evidence rather than in self-reference.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: AI systems should be designed to serve people with diverse values and perspectives.
Invented entities (3)
- Overton pluralistic models (no independent evidence)
- Steerably pluralistic models (no independent evidence)
- Distributionally pluralistic models (no independent evidence)
Lean theorems connected to this paper
- LawOfExistence.unity_unique_existent (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Passage: "we highlight empirical evidence... that standard alignment procedures might reduce distributional pluralism in models"
- InevitabilityStructure.economic_inevitability (refines)
  REFINES: the tag marks the relation between the paper passage and the cited Recognition theorem.
  Passage: "Distributionally pluralistic models that are well-calibrated to a given population in distribution"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
- Curated Synthetic Data Doesn't Have to Collapse: A Theoretical Study of Generative Retraining with Pluralistic Preferences
  Recursive generative retraining with pluralistic preferences converges to a stable diverse distribution that satisfies a weighted Nash bargaining solution.
- Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models
  A technique identifies minimal convergence-divergence points in LLM transformer blocks and calibrates residual-stream directions to achieve targeted ethical-framework control at inference time.
- Three Models of RLHF Annotation: Extension, Evidence, and Authority
  RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
- Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
  Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.
- Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
  DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.
- Positive Alignment: Artificial Intelligence for Human Flourishing
  Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
- Understanding Annotator Safety Policy with Interpretability
  Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.
- Multilingual Safety Alignment via Self-Distillation
  MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
  Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
- Cultural Authenticity: Comparing LLM Cultural Representations to Native Human Expectations
  LLMs display Western-centric cultural representations that align poorly with native priorities in non-Western countries and share highly correlated error patterns.
- Evaluating AI-Generated Images of Cultural Artifacts with Community-Informed Rubrics
  Community members from the UK blind community, Kerala, and Tamil Nadu helped define what counts as culturally appropriate depictions of artifacts, and the authors tested whether those definitions can be turned into re...
- Value Alignment Tax: Measuring Value Trade-offs in LLM Alignment
  VAT quantifies value trade-offs in LLM alignment by measuring how alignment-induced changes propagate across interconnected values using a Schwartz-grounded dataset.
- Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
  LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
- Measuring Human Preferences in RLHF is a Social Science Problem
  RLHF preference measurement is a social science validity problem because annotators routinely produce non-attitudes, constructed responses, and artifacts rather than stable values.
- The Algorithmic Gaze of Image Quality Assessment: An Audit and Trace Ethnography of the LAION-Aesthetics Predictor
  LAION-Aesthetics Predictor reinforces Western and male biases by preferentially selecting images associated with women and realistic Western/Japanese art while excluding men, LGBTQ+ references, and other styles.
- When to Ask a Question: Understanding Communication Strategies in Generative AI Tools
  A tradeoff model shows generative AI can reduce bias against diverse preferences by strategically eliciting information instead of always inferring from majority patterns.
- Multilingual Safety Alignment via Self-Distillation
  MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
- Quantifying and Predicting Disagreement in Graded Human Ratings
  Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.
- Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem
  AI value alignment is reconceptualized as a pluralistic governance problem arising along three axes—objectives, information, and principals—making it inherently context-dependent and unsolvable by technical design alone.
- Positive Alignment: Artificial Intelligence for Human Flourishing
  Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.