pith. machine review for the scientific record.

arxiv: 2604.20805 · v1 · submitted 2026-04-22 · 💻 cs.CY · cs.AI · cs.MA

Recognition: unknown

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:00 UTC · model grok-4.3

classification 💻 cs.CY cs.AI cs.MA
keywords value alignment · principal-agent framework · AI governance · pluralistic alignment · structural misalignment · objectives · information · principals

The pith

AI value alignment is a governance problem defined by trade-offs among objectives, information, and principals rather than a technical property of models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the AI value alignment problem is best understood as a structural issue of governance rather than a purely technical or abstract normative challenge. It applies the principal-agent framework from economics to decompose misalignment into three interacting axes: objectives, information, and principals. The decomposition shows that alignment is always relative to whose interests are prioritized and at what cost, making it pluralistic and context-dependent. A sympathetic reader cares because the framework explains why real-world systems remain misaligned even with advanced models, and why fixes require institutional processes for setting goals, distributing information, and allowing contestation by affected parties.

Core claim

The core contribution is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis and affect stakeholders differently, the structural description shows that alignment cannot be solved through technical design alone but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

What carries the argument

The three-axis framework of misalignment drawn from the principal-agent model, consisting of objectives (specified goals), information (distribution of knowledge), and principals (whose interests count).
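
To ground the framework, here is a minimal sketch in Python (our illustration, not the paper's; the class names, audit fields, and recommender findings are all hypothetical) of how misalignment findings might be recorded per axis and queried per principal:

    from dataclasses import dataclass, field

    @dataclass
    class Principal:
        """A party whose interests the system is supposed to serve."""
        name: str

    @dataclass
    class AlignmentAudit:
        """Misalignment findings recorded along the paper's three axes."""
        objectives: list[str] = field(default_factory=list)   # specified goals
        information: list[str] = field(default_factory=list)  # distribution of knowledge
        principals: list[str] = field(default_factory=list)   # whose interests count

        def findings_for(self, p: Principal) -> list[str]:
            # Alignment is relative: the same system can look aligned to one
            # principal and misaligned to another, so queries take a principal.
            return [f for f in self.objectives + self.information + self.principals
                    if p.name in f]

    # Hypothetical audit of a content recommender (all findings invented).
    audit = AlignmentAudit(
        objectives=["engagement proxy diverges from user well-being"],
        information=["user cannot inspect ranking criteria"],
        principals=["advertiser interests weighted above user interests"],
    )
    print(audit.findings_for(Principal("user")))        # three findings
    print(audit.findings_for(Principal("advertiser")))  # one finding

The point is the shape, not the details: the audit type has no single "aligned" bit; every query is indexed by a principal, matching the claim that alignment is relative and pluralistic.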

If this is right

  • Alignment cannot be treated as a single technical property of models; it emerges from how objectives are specified, how information is distributed, and whose interests count.
  • Different stakeholders experience misalignment differently depending on which axis is affected.
  • Resolving misalignment requires trade-offs among competing values rather than a unique solution.
  • Alignment demands ongoing institutional processes for evaluation and contestation instead of one-time technical fixes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be used to audit currently deployed systems, such as content recommenders, by mapping their failures to specific axes.
  • Technical alignment research would need to incorporate governance mechanisms to address pluralistic interests in practice.
  • This view connects alignment to broader questions of institutional design in technology regulation.

Load-bearing premise

The principal-agent framework from economics can be applied directly to AI systems to diagnose misalignment systematically along the three axes, without significant adaptation and without counterexamples arising in real deployments.

What would settle it

A documented case of an AI system where all observed misalignment disappears after technical adjustments to model objectives and information access, with no remaining differences attributable to multiple principals or governance structures.

read the original abstract

The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper argues that the AI value alignment problem is better understood as a structural governance issue rather than a purely technical or normative one. Drawing on the principal-agent framework from economics, it decomposes misalignment into three interacting axes—objectives, information, and principals—and claims that this framework shows alignment to be inherently pluralistic, context-dependent, and requiring ongoing institutional processes to manage trade-offs among competing values, rather than being solvable through technical design alone.

Significance. If the central inference holds, the paper offers a useful conceptual reframing that could help diagnose real-world misalignment cases and shift alignment research toward pluralistic and institutional considerations. However, the significance is constrained by the absence of detailed derivations, formal mappings, or empirical cases demonstrating why the decomposition entails that technical methods are insufficient.

major comments (1)
  1. [Abstract / Core contribution paragraph] The claim that the three-axis decomposition 'implies that alignment is fundamentally a problem of governance rather than engineering alone' is load-bearing for the paper's contribution but is asserted without an explicit argument showing why standard technical approaches (such as scalable oversight for information asymmetry, preference learning for objectives, or multi-objective optimization for principals) cannot in principle operate on each axis. Without demonstrating that these methods are insufficient or themselves require non-technical governance, the inference from decomposition to 'governance rather than engineering' does not follow.
minor comments (1)
  1. The manuscript would benefit from at least one concrete real-world case study mapping the three axes to an existing AI deployment to illustrate the framework's diagnostic value.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for acknowledging the potential value of the three-axis framework in reframing AI alignment. We agree that the core inference requires more explicit support and will revise accordingly.

read point-by-point responses
  1. Referee: The claim that the three-axis decomposition 'implies that alignment is fundamentally a problem of governance rather than engineering alone' is load-bearing for the paper's contribution but is asserted without an explicit argument showing why standard technical approaches (such as scalable oversight for information asymmetry, preference learning for objectives, or multi-objective optimization for principals) cannot in principle operate on each axis. Without demonstrating that these methods are insufficient or themselves require non-technical governance, the inference from decomposition to 'governance rather than engineering' does not follow.

    Authors: We accept this point. The manuscript currently derives the governance conclusion from the observation that each axis introduces pluralism and context-dependence, such that misalignment affects stakeholders differently and requires trade-offs that technical design alone cannot legitimately resolve. To make the inference explicit, we will add a new subsection following the three-axis presentation. It will map each cited technical method onto the axes and show why governance remains necessary: scalable oversight can reduce information asymmetry but presupposes an agreed principal (or set of principals) authorized to oversee, which the principals axis shows must be determined institutionally; preference learning can address objective misalignment but requires prior governance choices about whose preferences are elicited and how conflicts among plural principals are aggregated; multi-objective optimization can handle multiple principals but still depends on institutional processes to set the objectives, weights, and evaluation criteria in a context-specific and contestable manner. The revision will argue that these methods therefore operate within, rather than replace, governance structures. Brief illustrations from current AI deployments (e.g., content moderation systems) will be included to ground the argument. revision: yes
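
As a toy illustration of the rebuttal's closing point (our sketch, not the authors'; the policies, utilities, and weights are all hypothetical): even a straightforward multi-objective scalarization leaves the weight vector, i.e. whose interests count and by how much, as an input that the optimization itself cannot supply.

    # Weighted-sum scalarization over principals' utilities. The optimizer can
    # rank policies once weights are fixed, but the weights are a governance
    # choice, not a technically derivable quantity. All numbers are invented.

    def aggregate(utilities: dict[str, float], weights: dict[str, float]) -> float:
        return sum(weights[p] * u for p, u in utilities.items())

    policy_a = {"users": 0.9, "advertisers": 0.2}
    policy_b = {"users": 0.3, "advertisers": 0.95}

    # Two different institutional weightings reverse which policy is "optimal":
    for weights in ({"users": 0.8, "advertisers": 0.2},
                    {"users": 0.2, "advertisers": 0.8}):
        best = max([("A", policy_a), ("B", policy_b)],
                   key=lambda kv: aggregate(kv[1], weights))
        print(weights, "->", best[0])  # first weighting picks A, second picks B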

Circularity Check

0 steps flagged

No circularity; conceptual argument relies on external economic framework

full rationale

The paper introduces a three-axis decomposition (objectives, information, principals) drawn from the standard principal-agent model in economics, then interprets this as showing alignment is inherently a governance issue. This is an interpretive reframing rather than a derivation that reduces to its own inputs by construction. No equations, fitted parameters, self-citations of uniqueness theorems, or ansatzes are present in the abstract or described structure. The central claim does not rename a known result or smuggle in prior self-work as external fact; it applies an independent external lens to diagnose misalignment sources. The inference to 'governance rather than engineering alone' is a perspective shift, not a tautological prediction or self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the principal-agent framework as a background model without deriving its applicability to AI from first principles or providing independent evidence for the three-axis structure.

axioms (1)
  • domain assumption: The principal-agent framework from economics applies directly to AI value alignment scenarios.
    Invoked to reconceptualize misalignment along objectives, information, and principals.

pith-pipeline@v0.9.0 · 5534 in / 1178 out tokens · 20027 ms · 2026-05-09T23:00:13.376221+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete Problems in AI Safety. arXiv:1606.06565 (2016), 1–29. https://arxiv.org/abs/1606.06565

  2. [2]

    Mel Andrews. 2025. The Immortal Science of ML: Machine Learning and the Theory-Free Ideal. Erkenntnis (2025), 1–23. https://doi.org/10.1007/s10670-025-01010-x

  3. [3]

    Lora Aroyo, Alex S. Taylor, Mark Diaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garcia, Vinodkumar Prabhakaran, and Ding Wang. 2023. DICES Dataset: Diversity in Conversational AI Evaluation for Safety. arXiv:2306.11247 (2023), 1–22. https://arxiv.org/abs/2306.11247

  4. [4]

    Yoshua Bengio. 2023. How Rogue AIs may Arise. https://yoshuabengio.org/en/blog/how-rogue-ais-may-arise

  5. [5]

    Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, and David Williams-King. 2025. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? arXiv:2502.15657 (2025), 1–5...

  6. [6]

    Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Polity, Cambridge

  7. [7]

    Stevie Bergman, Nahema Marchal, John Mellor, Shakir Mohamed, Iason Gabriel, and William Isaac. 2024. STELA: a community-centred approach to norm elicitation for AI alignment. Scientific Reports 14, 1 (2024), 6616

  8. [8]

    Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. In Science fiction and philosophy: from time travel to superintelligence, Susan Schneider (Ed.). Wiley & Blackwell, West Sussex, 277–284

  9. [9]

    Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford

  10. [10]

    Meredith Broussard. 2023. More than a Glitch: Confronting Race, Gender, and Ability Bias in Tech. The MIT Press, Cambridge, MA

  11. [11]

    Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company, New York

  12. [12]

    Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, and Others. 2024. Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback. arXiv:2404.10271 (2024), 1–15. https://arxiv.org/abs/2404.10271

  13. [13]

    Kate Crawford. 2021. Atlas of AI. Yale University Press, New Haven, CT

  14. [14]

    Andrew Critch and Stuart Russell. 2023. TASRA: A taxonomy and analysis of societal-scale risks from AI. arXiv:2306.06924 (2023), 1–18. https://arxiv.org/abs/2306.06924

  15. [15]

    Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10 (2022), 92–110

  16. [16]

    Daniel Dewey. 2011. Learning What to Value. In AGI 2011: 4th International Conference on Artificial General Intelligence (Lecture Notes in Computer Science, Vol. 6830), J. Schmidhuber, K. R. Thórisson, and M. Looks (Eds.). Springer, Berlin, Heidelberg, 309–314

  17. [17]

    Ravit Dotan and Smitha Milli. 2019. Value-laden Disciplinary Shifts in Machine Learning. arXiv:1912.01172 (2019), 1–10. https://arxiv.org/abs/1912.01172

  18. [18]

    Heather Douglas. 2000. Inductive Risk and Values in Science. Philosophy of Science 67, 4 (2000), 559–579

  19. [19]

    Peter Eckersley. 2019. Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function). arXiv:1901.00064 (2019), 1–13. https://arxiv.org/abs/1901.00064

  20. [20]

    Kathleen M. Eisenhardt. 1989. Agency Theory: An Assessment and Review. The Academy of Management Review 14, 1 (1989), 57–74

  21. [21]

    Scott Emmons, Caspar Oesterheld, Vincent Conitzer, and Stuart Russell. 2025. Observation Interference in Partially Observable Assistance Games. arXiv:2412.17797 (2025), 1–26. https://arxiv.org/abs/2412.17797

  22. [22]

    Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. 2018. Runaway Feedback Loops in Predictive Policing. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, 160–171

  23. [23]

    Sina Fazelpour and Will Fleisher. 2025. The Value of Disagreement in AI Design, Evaluation, and Alignment. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 2138–2150. https://doi.org/10.1145/3715275.3732146

  24. [24]

    Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, and Anca D. Dragan. 2020. Pragmatic-Pedagogic Value Alignment. In Springer Proceedings in Advanced Robotics, N. Amato, G. Hager, S. Thomas, and M. Torres-Torriti (Eds.). Vol. 10. Springer, 49–57

  25. [25]

    Future of Life Institute. 2017. Asilomar AI Principles. https://futureoflife.org/open-letter/ai-principles/

  26. [26]

    Iason Gabriel. 2020. Artificial Intelligence, Values, and Alignment. Minds and Machines 30 (2020), 411–437

  27. [27]

    Iason Gabriel and Geoff Keeling. 2025. A matter of principle? AI alignment as the fair treatment of claims. Philosophical Studies 182 (2025), 1951–1973

  28. [28]

    Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, and Scott Emmons. 2025. The partially observable off-switch game. Proceedings of the AAAI Conference on Artificial Intelligence 39, 26 (2025), 27304–27311

  29. [29]

    Trystan S. Goetze. 2024. AI Art is Theft: Labour, Extraction, and Exploitation—Or, On the Dangers of Stochastic Pollocks. PhilArchive (2024). Unpublished preprint of 10 January 2024. https://philarchive.org/rec/GOEAAI-2

  30. [30]

    David E. Goldberg. 1987. Simple genetic algorithms and the minimal deceptive problem. In Genetic Algorithms and Simulated Annealing (Research Notes in Artificial Intelligence), Lawrence D. Davis (Ed.). Morgan Kaufmann Publishers, Burlington, MA, 74–88

  31. [31]

    John-Stewart Gordon. 2023. Objections. In The Impact of Artificial Intelligence on Human Rights Legislation. Palgrave Macmillan, Cham, 75–82

  32. [32–33]

    Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeffrey T. Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. 2022. Jury Learning: Integrating Dissenting Voices into Machine Learning Models. arXiv:2202.02950 (2022), 1–19. https://arxiv.org/abs/2202.02950

  34. [34]

    Mary L. Gray and Siddharth Suri. 2019. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, New York

  35. [35]

    Dylan Hadfield-Menell. 2021. The Principal-Agent Alignment Problem in Artificial Intelligence. Ph.D. Dissertation. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.html

  36. [36]

    Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. 2016. Cooperative inverse reinforcement learning. In NIPS'16: Proceedings of the 30th International Conference on Neural Information Processing Systems, Daniel D. Lee, Ulrike von Luxburg, Roman Garnett, Masashi Sugiyama, and Isabelle Guyon (Eds.). Association for Computing Machinery, 3916–3924

  37. [37]

    Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. 2017. The Off-Switch Game. arXiv:1611.08219 (2017), 1–8. https://arxiv.org/abs/1611.08219

  38. [38]

    Dylan Hadfield-Menell and Gillian K. Hadfield. 2019. Incomplete Contracting and AI Alignment. In AIES '19: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Vincent Conitzer, Gillian Hadfield, and Shannon Vallor (Eds.). Association for Computing Machinery, New York, 417–422

  39. [39]

    Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M. Branham. 2018. Gender recognition or gender reductionism?: The social implications of embedded gender recognition systems. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI) (2018), 1–13

  40. [40]

    Carl Hempel. 1965. Aspects of Scientific Explanation. Free Press, New York

  41. [41]

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2023. Aligning AI With Shared Human Values. arXiv:2008.02275 (2023), 1–29. https://arxiv.org/abs/2008.02275

  42. [42]

    Dan Hendrycks and Mantas Mazeika. 2022. X-risk analysis for AI research. arXiv:2206.05862 (2022), 1–36. https://arxiv.org/abs/2206.05862

  43. [43]

    Anna Lauren Hoffmann. 2019. Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society 22, 7 (2019), 900–915

  44. [44]

    Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. 2024. Collective constitutional AI: Aligning a language model with public input. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. ACM, 1395–1417

  45. [45]

    Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2021. Risks from Learned Optimization in Advanced Machine Learning Systems. arXiv:1906.01820 (2021), 1–39. https://arxiv.org/abs/1906.01820

  46. [46]

    Michael C. Jensen and William H. Meckling. 1976. Theory of the Firm: Managerial Behaviour, Agency Costs and Ownership Structure. Journal of Financial Economics 3, 4 (1976), 305–360

  47. [47]

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Lukas Vierling, Donghai Hong, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Juntao Dai, Xuehai Pan, Kwan Yee Ng, Aidan O’Gara, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. 2025. AI Alignment: A Compr...

  48. [48]

    Steven Kerr. 1975. On the Folly of Rewarding A, While Hoping for B. Academy of Management Journal 18 (1975), 769–783

  49. [49]

    Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. 2024. The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large La...

  50. [50]

    Travis LaCroix. 2025. Artificial Intelligence and the Value Alignment Problem: A Philosophical Introduction. Broadview Press

  51. [51]

    Travis LaCroix and Alexandra Sasha Luccioni. 2025. Metaethical Perspectives on “Benchmarking” AI Ethics. AI and Ethics 5 (2025), 4029–4047

  52. [52]

    Jean-Jacques Laffont and David Martimort. 2002. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, Princeton

  53. [53]

    Joel Lehman and Kenneth O. Stanley. 2008. Exploiting Open-Endedness to Solve Problems Through the Search for Novelty. In Proceedings of the Eleventh International Conference on Artificial Life (ALIFE XI). The MIT Press, Cambridge, MA, 329–336

  54. [54]

    Alexandra Sasha Luccioni, Yacine Jernite, and Emma Strubell. 2023. Power Hungry Processing: Watts Driving the Cost of AI Deployment? arXiv:2311.16863 (2023), 1–20. https://arxiv.org/abs/2311.16863

  55. [55]

    Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2023. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. Journal of Machine Learning Research 24, 253 (2023), 1–15

  56. [56]

    Kristian Lum and William Isaac. 2016. To predict and serve? Significance 13 (2016), 14–19

  57. [57]

    Dan McQuillan. 2022. Resisting AI: An Anti-fascist Approach to Artificial Intelligence. Bristol University Press, Bristol

  58. [58]

    Milagros Miceli, Julian Posada, and Tianling Yang. 2022. Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? Proceedings of the ACM on Human-Computer Interaction 6, GROUP (2022), 1–14

  59. [59]

    Melanie Mitchell, Stephanie Forrest, and John H. Holland. 1992. The royal road for genetic algorithms: Fitness landscapes and GA performance. In Proceedings of the First European Conference on Artificial Life, F. J. Varela and P. Bourgine (Eds.). The MIT Press, Cambridge, MA, 1–11

  60. [60]

    Richard Ngo, Lawrence Chan, and Sören Mindermann. 2023. The Alignment Problem from a Deep Learning Perspective. arXiv:2209.00626 (2023), 1–21. https://arxiv.org/abs/2209.00626

  61. [61]

    Stephen M. Omohundro. 2008. The Basic AI Drives. In Artificial General Intelligence 2008: Proceedings of the First AGI Conference, Pei Wang, Ben Goertzel, and Stan Franklin (Eds.). IOS Press, Amsterdam, 483–492

  62. [62]

    Cathy O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books, New York

  63. [63]

    Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. 2019. Human uncertainty makes classification more robust. Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), 9617–9626

  64. [64]

    Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, and Michiel A. Bakker. 2026. Benchmarking Overton Pluralism in LLMs. arXiv:2512.01351 (2026), 1–40. https://arxiv.org/abs/2512.01351

  65. [65]

    Mahendra Prasad. 2018. Social choice and the value alignment problem. In Artificial intelligence safety and security, Roman V. Yampolskiy (Ed.). Chapman & Hall, London, 291–314

  66. [66]

    Inioluwa Deborah Raji and Roel Dobbe. 2023. Concrete Problems in AI Safety, Revisited. arXiv:2401.10899 (2023). https://arxiv.org/abs/2401.10899

  67. [67]

    Inioluwa Deborah Raji, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton. 2020. Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. arXiv:2001.00964 (2020), 1–7. https://arxiv.org/abs/2001.00964

  68. [68]

    Stuart Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, New York

  69. [69]

    Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael Dennis, Pieter Abbeel, Anca Dragan, and Stuart Russell. 2020. Benefits of Assistance over Reward Learning. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Workshop on Cooperative AI (2020)

  70. [70]

    Moshe Sipper, Ryan J. Urbanowicz, and Jason H. Moore. 2018. To Know the Objective Is Not (Necessarily) to Know the Objective Function. BioData Mining 11, 21 (2018), 1–3

  71. [71]

    Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield-Menell. 2024. Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF. arXiv:2312.08358 (2024), 1–26. https://arxiv.org/abs/2312.08358

  72. [72]

    Taylor Sorensen, Liwei Jiang, Jena Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, and Others. 2024. Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties. arXiv:2309.00779 (2024). https://arxiv.org/abs/2309.00779

  73. [73]

    Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, and Yejin Choi. 2024. A roadmap to pluralistic alignment. arXiv:2402.05070 (2024), 1–23. https://arxiv.org/abs/2402.05070

  74. [74]

    Max Tegmark. 2018. Life 3.0: Being human in the age of artificial intelligence. Vintage, New York

  75. [75]

    Michael Henry Tessler, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, Lucy Campbell-Gillingham, Tantum Collins, David C. Parkes, Matthew Botvinick, and Christopher Summerfield. 2024. AI can help humans find common ground in democratic deliberation. Science 386, 6719 (2024), eadq2852

  76. [76]

    Eliezer Yudkowsky. 2011. Complex value systems in friendly AI. In AGI 2011: 4th International Conference on Artificial General Intelligence (Lecture Notes in Computer Science, Vol. 6830), J. Schmidhuber, K. R. Thórisson, and M. Looks (Eds.). Springer, Berlin, Heidelberg, 388–393