pith. machine review for the scientific record.

arxiv: 2604.20805 · v1 · submitted 2026-04-22 · 💻 cs.CY · cs.AI · cs.MA

Recognition: unknown

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:00 UTC · model grok-4.3

classification 💻 cs.CY cs.AI cs.MA
keywords value alignment · principal-agent framework · AI governance · pluralistic alignment · structural misalignment · objectives · information · principals

The pith

AI value alignment is a governance problem defined by trade-offs among objectives, information, and principals rather than a technical property of models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the AI value alignment problem is best understood as a structural issue of governance rather than a purely technical or abstract normative challenge. It applies the principal-agent framework from economics to decompose misalignment into three interacting axes: objectives, information, and principals. The decomposition shows that alignment is always relative to whose interests are prioritized and at what cost, making it pluralistic and context-dependent. A sympathetic reader cares because the framework explains why real-world systems remain misaligned even with advanced models, and why fixes require institutional processes for setting goals, distributing information, and allowing contestation by affected parties.

Core claim

The core contribution is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis and affect stakeholders differently, the structural description shows that alignment cannot be solved through technical design alone but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

What carries the argument

The three-axis framework of misalignment drawn from the principal-agent model, consisting of objectives (specified goals), information (distribution of knowledge), and principals (whose interests count).
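
To ground the framework, here is a minimal sketch in Python (our illustration, not the paper's; the class names, audit fields, and recommender findings are all hypothetical) of how misalignment findings might be recorded per axis and queried per principal:

    from dataclasses import dataclass, field

    @dataclass
    class Principal:
        """A party whose interests the system is supposed to serve."""
        name: str

    @dataclass
    class AlignmentAudit:
        """Misalignment findings recorded along the paper's three axes."""
        objectives: list[str] = field(default_factory=list)   # specified goals
        information: list[str] = field(default_factory=list)  # distribution of knowledge
        principals: list[str] = field(default_factory=list)   # whose interests count

        def findings_for(self, p: Principal) -> list[str]:
            # Alignment is relative: the same system can look aligned to one
            # principal and misaligned to another, so queries take a principal.
            return [f for f in self.objectives + self.information + self.principals
                    if p.name in f]

    # Hypothetical audit of a content recommender (all findings invented).
    audit = AlignmentAudit(
        objectives=["engagement proxy diverges from user well-being"],
        information=["user cannot inspect ranking criteria"],
        principals=["advertiser interests weighted above user interests"],
    )
    print(audit.findings_for(Principal("user")))        # three findings
    print(audit.findings_for(Principal("advertiser")))  # one finding

The point is the shape, not the details: the audit type has no single "aligned" bit; every query is indexed by a principal, matching the claim that alignment is relative and pluralistic.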

If this is right

  • Alignment cannot be treated as a single technical property of models; it emerges from how objectives are specified, how information is distributed, and whose interests count.
  • Different stakeholders experience misalignment differently depending on which axis is affected.
  • Resolving misalignment requires trade-offs among competing values rather than a unique solution.
  • Alignment demands ongoing institutional processes for evaluation and contestation instead of one-time technical fixes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be used to audit currently deployed systems, such as content recommenders, by mapping their failures to specific axes.
  • Technical alignment research would need to incorporate governance mechanisms to address pluralistic interests in practice.
  • This view connects alignment to broader questions of institutional design in technology regulation.

Load-bearing premise

The principal-agent framework from economics can be applied directly to AI systems to diagnose misalignment systematically along the three axes, without significant adaptation and without counterexamples arising in real deployments.

What would settle it

A documented case of an AI system where all observed misalignment disappears after technical adjustments to model objectives and information access, with no remaining differences attributable to multiple principals or governance structures.

read the original abstract

The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper argues that the AI value alignment problem is better understood as a structural governance issue rather than a purely technical or normative one. Drawing on the principal-agent framework from economics, it decomposes misalignment into three interacting axes—objectives, information, and principals—and claims that this framework shows alignment to be inherently pluralistic, context-dependent, and requiring ongoing institutional processes to manage trade-offs among competing values, rather than being solvable through technical design alone.

Significance. If the central inference holds, the paper offers a useful conceptual reframing that could help diagnose real-world misalignment cases and shift alignment research toward pluralistic and institutional considerations. However, the significance is constrained by the absence of detailed derivations, formal mappings, or empirical cases demonstrating why the decomposition entails that technical methods are insufficient.

major comments (1)
  1. [Abstract / Core contribution paragraph] The claim that the three-axis decomposition 'implies that alignment is fundamentally a problem of governance rather than engineering alone' is load-bearing for the paper's contribution but is asserted without an explicit argument showing why standard technical approaches (such as scalable oversight for information asymmetry, preference learning for objectives, or multi-objective optimization for principals) cannot in principle operate on each axis. Without demonstrating that these methods are insufficient or themselves require non-technical governance, the inference from decomposition to 'governance rather than engineering' does not follow.
minor comments (1)
  1. The manuscript would benefit from at least one concrete real-world case study mapping the three axes to an existing AI deployment to illustrate the framework's diagnostic value.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and for acknowledging the potential value of the three-axis framework in reframing AI alignment. We agree that the core inference requires more explicit support and will revise accordingly.

read point-by-point responses
  1. Referee: The claim that the three-axis decomposition 'implies that alignment is fundamentally a problem of governance rather than engineering alone' is load-bearing for the paper's contribution but is asserted without an explicit argument showing why standard technical approaches (such as scalable oversight for information asymmetry, preference learning for objectives, or multi-objective optimization for principals) cannot in principle operate on each axis. Without demonstrating that these methods are insufficient or themselves require non-technical governance, the inference from decomposition to 'governance rather than engineering' does not follow.

    Authors: We accept this point. The manuscript currently derives the governance conclusion from the observation that each axis introduces pluralism and context-dependence, such that misalignment affects stakeholders differently and requires trade-offs that technical design alone cannot legitimately resolve. To make the inference explicit, we will add a new subsection following the three-axis presentation. It will map each cited technical method onto the axes and show why governance remains necessary: scalable oversight can reduce information asymmetry but presupposes an agreed principal (or set of principals) authorized to oversee, which the principals axis shows must be determined institutionally; preference learning can address objective misalignment but requires prior governance choices about whose preferences are elicited and how conflicts among plural principals are aggregated; multi-objective optimization can handle multiple principals but still depends on institutional processes to set the objectives, weights, and evaluation criteria in a context-specific and contestable manner. The revision will argue that these methods therefore operate within, rather than replace, governance structures. Brief illustrations from current AI deployments (e.g., content moderation systems) will be included to ground the argument. revision: yes
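
As a toy illustration of the rebuttal's closing point (our sketch, not the authors'; the policies, utilities, and weights are all hypothetical): even a straightforward multi-objective scalarization leaves the weight vector, i.e. whose interests count and by how much, as an input that the optimization itself cannot supply.

    # Weighted-sum scalarization over principals' utilities. The optimizer can
    # rank policies once weights are fixed, but the weights are a governance
    # choice, not a technically derivable quantity. All numbers are invented.

    def aggregate(utilities: dict[str, float], weights: dict[str, float]) -> float:
        return sum(weights[p] * u for p, u in utilities.items())

    policy_a = {"users": 0.9, "advertisers": 0.2}
    policy_b = {"users": 0.3, "advertisers": 0.95}

    # Two different institutional weightings reverse which policy is "optimal":
    for weights in ({"users": 0.8, "advertisers": 0.2},
                    {"users": 0.2, "advertisers": 0.8}):
        best = max([("A", policy_a), ("B", policy_b)],
                   key=lambda kv: aggregate(kv[1], weights))
        print(weights, "->", best[0])  # first weighting picks A, second picks B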

Circularity Check

0 steps flagged

No circularity; conceptual argument relies on external economic framework

full rationale

The paper introduces a three-axis decomposition (objectives, information, principals) drawn from the standard principal-agent model in economics, then interprets this as showing alignment is inherently a governance issue. This is an interpretive reframing rather than a derivation that reduces to its own inputs by construction. No equations, fitted parameters, self-citations of uniqueness theorems, or ansatzes are present in the abstract or described structure. The central claim does not rename a known result or smuggle in prior self-work as external fact; it applies an independent external lens to diagnose misalignment sources. The inference to 'governance rather than engineering alone' is a perspective shift, not a tautological prediction or self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the principal-agent framework as a background model without deriving its applicability to AI from first principles or providing independent evidence for the three-axis structure.

axioms (1)
  • domain assumption: The principal-agent framework from economics applies directly to AI value alignment scenarios.
    Invoked to reconceptualize misalignment along objectives, information, and principals.

pith-pipeline@v0.9.0 · 5534 in / 1178 out tokens · 20027 ms · 2026-05-09T23:00:13.376221+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete Problems in AI Safety. arXiv:1606.06565 (2016), 1–29. https://arxiv.org/abs/1606.06565

  2. [2]

    Mel Andrews. 2025. The Immortal Science of ML: Machine Learning and the Theory-Free Ideal. Erkenntnis (2025), 1–23. https://doi.org/10.1007/s10670-025-01010-x

  3. [3]

    Lora Aroyo, Alex S. Taylor, Mark Diaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garcia, Vinodkumar Prabhakaran, and Ding Wang. 2023. DICES Dataset: Diversity in Conversational AI Evaluation for Safety. arXiv:2306.11247 (2023), 1–22. https://arxiv.org/abs/2306.11247

  4. [4]

    Yoshua Bengio. 2023. How Rogue AIs may Arise. https://yoshuabengio.org/en/blog/how-rogue-ais-may-arise

  5. [5]

    Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, and David Williams-King. 2025. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? arXiv:2502.15657 (2025), 1–5...

  6. [6]

    Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Polity, Cambridge

  7. [7]

    Stevie Bergman, Nahema Marchal, John Mellor, Shakir Mohamed, Iason Gabriel, and William Isaac. 2024. STELA: a community-centred approach to norm elicitation for AI alignment. Scientific Reports 14, 1 (2024), 6616

  8. [8]

    Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. In Science fiction and philosophy: from time travel to superintelligence, Susan Schneider (Ed.). Wiley & Blackwell, West Sussex, 277–284

  9. [9]

    Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford

  10. [10]

    Meredith Broussard. 2023. More than a Glitch: Confronting Race, Gender, and Ability Bias in Tech. The MIT Press, Cambridge, MA

  11. [11]

    Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company, New York

  12. [12]

    Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, and Others. 2024. Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback. arXiv:2404.10271 (2024), 1–15. https://arxiv.org/abs/2404.10271

  13. [13]

    Kate Crawford. 2021. Atlas of AI. Yale University Press, New Haven, CT

  14. [14]

    Andrew Critch and Stuart Russell. 2023. TASRA: A taxonomy and analysis of societal-scale risks from AI. arXiv:2306.06924 (2023), 1–18. https://arxiv.org/abs/2306.06924

  15. [15]

    Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10 (2022), 92–110

  16. [16]

    Daniel Dewey. 2011. Learning What to Value. In AGI 2011: 4th International Conference on Artificial General Intelligence (Lecture Notes in Computer Science, Vol. 6830), J. Schmidhuber, K. R. Thórisson, and M. Looks (Eds.). Springer, Berlin, Heidelberg, 309–314

  17. [17]

    Ravit Dotan and Smitha Milli. 2019. Value-laden Disciplinary Shifts in Machine Learning. arXiv:1912.01172 (2019), 1–10. https://arxiv.org/abs/1912.01172

  18. [18]

    Heather Douglas. 2000. Inductive Risk and Values in Science. Philosophy of Science 67, 4 (2000), 559–579

  19. [19]

    Peter Eckersley. 2019. Impossibility and Uncertainty Theorems in AI Value Alignment (or why your AGI should not have a utility function). arXiv:1901.00064 (2019), 1–13. https://arxiv.org/abs/1901.00064

  20. [20]

    Kathleen M. Eisenhardt. 1989. Agency Theory: An Assessment and Review. The Academy of Management Review 14, 1 (1989), 57–74

  21. [21]

    Scott Emmons, Caspar Oesterheld, Vincent Conitzer, and Stuart Russell. 2025. Observation Interference in Partially Observable Assistance Games. arXiv:2412.17797 (2025), 1–26. https://arxiv.org/abs/2412.17797

  22. [22]

    Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. 2018. Runaway Feedback Loops in Predictive Policing. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, 160–171

  23. [23]

    Sina Fazelpour and Will Fleisher. 2025. The Value of Disagreement in AI Design, Evaluation, and Alignment. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 2138–2150. https://doi.org/10.1145/3715275.3732146

  24. [24]

    Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, and Anca D. Dragan. 2020. Pragmatic-Pedagogic Value Alignment. In Springer Proceedings in Advanced Robotics, N. Amato, G. Hager, S. Thomas, and M. Torres-Torriti (Eds.). Vol. 10. Springer, 49–57

  25. [25]

    Future of Life Institute. 2017. Asilomar AI Principles. https://futureoflife.org/open-letter/ai-principles/

  26. [26]

    Iason Gabriel. 2020. Artificial Intelligence, Values, and Alignment. Minds and Machines 30 (2020), 411–437

  27. [27]

    Iason Gabriel and Geoff Keeling. 2025. A matter of principle? AI alignment as the fair treatment of claims. Philosophical Studies 182 (2025), 1951–1973

  28. [28]

    Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, and Scott Emmons. 2025. The partially observable off-switch game. Proceedings of the AAAI Conference on Artificial Intelligence 39, 26 (2025), 27304–27311

  29. [29]

    Trystan S. Goetze. 2024. AI Art is Theft: Labour, Extraction, and Exploitation—Or, On the Dangers of Stochastic Pollocks. PhilArchive (2024). Unpublished preprint of 10 January 2024. https://philarchive.org/rec/GOEAAI-2

  30. [30]

    David E. Goldberg. 1987. Simple genetic algorithms and the minimal deceptive problem. In Genetic Algorithms and Simulated Annealing (Research Notes in Artificial Intelligence), Lawrence D. Davis (Ed.). Morgan Kaufmann Publishers, Burlington, MA, 74–88

  31. [31]

    John-Stewart Gordon. 2023. Objections. In The Impact of Artificial Intelligence on Human Rights Legislation. Palgrave Macmillan, Cham, 75–82

  32. [32–33]

    Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeffrey T. Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. 2022. Jury Learning: Integrating Dissenting Voices into Machine Learning Models. arXiv:2202.02950 (2022), 1–19. https://arxiv.org/abs/2202.02950

  34. [34]

    Mary L. Gray and Siddharth Suri. 2019. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, New York

  35. [35]

    Dylan Hadfield-Menell. 2021. The Principal-Agent Alignment Problem in Artificial Intelligence. Ph.D. Dissertation. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.html

  36. [36]

    Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. 2016. Cooperative inverse reinforcement learning. In NIPS'16: Proceedings of the 30th International Conference on Neural Information Processing Systems, Daniel D. Lee, Ulrike von Luxburg, Roman Garnett, Masashi Sugiyama, and Isabelle Guyon (Eds.). Association for Computing Machinery, 3916–3924

  37. [37]

    Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. 2017. The Off-Switch Game. arXiv:1611.08219 (2017), 1–8. https://arxiv.org/abs/1611.08219

  38. [38]

    Dylan Hadfield-Menell and Gillian K. Hadfield. 2019. Incomplete Contracting and AI Alignment. In AIES '19: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Vincent Conitzer, Gillian Hadfield, and Shannon Vallor (Eds.). Association for Computing Machinery, New York, 417–422

  39. [39]

    Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M. Branham. 2018. Gender recognition or gender reductionism?: The social implications of embedded gender recognition systems. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI) (2018), 1–13

  40. [40]

    Carl Hempel. 1965. Aspects of Scientific Explanation. Free Press, New York

  41. [41]

    Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2023. Aligning AI With Shared Human Values. arXiv:2008.02275 (2023), 1–29. https://arxiv.org/abs/2008.02275

  42. [42]

    Dan Hendrycks and Mantas Mazeika. 2022. X-risk analysis for AI research. arXiv:2206.05862 (2022), 1–36. https://arxiv.org/abs/2206.05862

  43. [43]

    Anna Lauren Hoffmann. 2019. Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society 22, 7 (2019), 900–915

  44. [44]

    Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. 2024. Collective constitutional AI: Aligning a language model with public input. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. ACM, 1395–1417

  45. [45]

    Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. 2021. Risks from Learned Optimization in Advanced Machine Learning Systems. arXiv:1906.01820 (2021), 1–39. https://arxiv.org/abs/1906.01820

  46. [46]

    Michael C. Jensen and William H. Meckling. 1976. Theory of the Firm: Managerial Behaviour, Agency Costs and Ownership Structure. Journal of Financial Economics 3, 4 (1976), 305–360

  47. [47]

    Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Lukas Vierling, Donghai Hong, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Juntao Dai, Xuehai Pan, Kwan Yee Ng, Aidan O’Gara, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. 2025. AI Alignment: A Compr...

  48. [48]

    Steven Kerr. 1975. On the Folly of Rewarding A, While Hoping for B. Academy of Management Journal 18 (1975), 769–783

  49. [49]

    Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. 2024. The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large La...

  50. [50]

    Travis LaCroix. 2025. Artificial Intelligence and the Value Alignment Problem: A Philosophical Introduction. Broadview Press

  51. [51]

    Travis LaCroix and Alexandra Sasha Luccioni. 2025. Metaethical Perspectives on “Benchmarking” AI Ethics. AI and Ethics 5 (2025), 4029–4047

  52. [52]

    Jean-Jacques Laffont and David Martimort. 2002. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, Princeton

  53. [53]

    Joel Lehman and Kenneth O. Stanley. 2008. Exploiting Open-Endedness to Solve Problems Through the Search for Novelty. In Proceedings of the Eleventh International Conference on Artificial Life (ALIFE XI). The MIT Press, Cambridge, MA, 329–336

  54. [54]

    Alexandra Sasha Luccioni, Yacine Jernite, and Emma Strubell. 2023. Power Hungry Processing: Watts Driving the Cost of AI Deployment? arXiv:2311.16863 (2023), 1–20. https://arxiv.org/abs/2311.16863

  55. [55]

    Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2023. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. Journal of Machine Learning Research 24, 253 (2023), 1–15

  56. [56]

    Kristian Lum and William Isaac. 2016. To predict and serve? Significance 13 (2016), 14–19

  57. [57]

    Dan McQuillan. 2022. Resisting AI: An Anti-fascist Approach to Artificial Intelligence. Bristol University Press, Bristol

  58. [58]

    Milagros Miceli, Julian Posada, and Tianling Yang. 2022. Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? Proceedings of the ACM on Human-Computer Interaction 6, GROUP (2022), 1–14

  59. [59]

    Melanie Mitchell, Stephanie Forrest, and John H. Holland. 1992. The royal road for genetic algorithms: Fitness landscapes and GA performance. In Proceedings of the First European Conference on Artificial Life, F. J. Varela and P. Bourgine (Eds.). The MIT Press, Cambridge, MA, 1–11

  60. [60]

    Richard Ngo, Lawrence Chan, and Sören Mindermann. 2023. The Alignment Problem from a Deep Learning Perspective. arXiv:2209.00626 (2023), 1–21. https://arxiv.org/abs/2209.00626

  61. [61]

    Stephen M. Omohundro. 2008. The Basic AI Drives. In Artificial General Intelligence 2008: Proceedings of the First AGI Conference, Pei Wang, Ben Goertzel, and Stan Franklin (Eds.). IOS Press, Amsterdam, 483–492

  62. [62]

    Cathy O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books, New York

  63. [63]

    Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. 2019. Human uncertainty makes classification more robust. Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), 9617–9626

  64. [64]

    Elinor Poole-Dayan, Jiayi Wu, Taylor Sorensen, Jiaxin Pei, and Michiel A. Bakker. 2026. Benchmarking Overton Pluralism in LLMs. arXiv:2512.01351 (2026), 1–40. https://arxiv.org/abs/2512.01351

  65. [65]

    Mahendra Prasad. 2018. Social choice and the value alignment problem. In Artificial intelligence safety and security, Roman V. Yampolskiy (Ed.). Chapman & Hall, London, 291–314

  66. [66]

    Inioluwa Deborah Raji and Roel Dobbe. 2023. Concrete Problems in AI Safety, Revisited. arXiv:2401.10899 (2023). https://arxiv.org/abs/2401.10899

  67. [67]

    Inioluwa Deborah Raji, Timnit Gebru, Margaret Mitchell, Joy Buolamwini, Joonseok Lee, and Emily Denton. 2020. Saving Face: Investigating the Ethical Concerns of Facial Recognition Auditing. arXiv:2001.00964 (2020), 1–7. https://arxiv.org/abs/2001.00964

  68. [68]

    Stuart Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, New York

  69. [69]

    Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael Dennis, Pieter Abbeel, Anca Dragan, and Stuart Russell. 2020. Benefits of Assistance over Reward Learning. 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Workshop on Cooperative AI (2020)

  70. [70]

    Moshe Sipper, Ryan J. Urbanowicz, and Jason H. Moore. 2018. To Know the Objective Is Not (Necessarily) to Know the Objective Function. BioData Mining 11, 21 (2018), 1–3

  71. [71]

    Anand Siththaranjan, Cassidy Laidlaw, and Dylan Hadfield-Menell. 2024. Distributional Preference Learning: Understanding and Accounting for Hidden Context in RLHF. arXiv:2312.08358 (2024), 1–26. https://arxiv.org/abs/2312.08358

  72. [72]

    Taylor Sorensen, Liwei Jiang, Jena Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, and Others. 2024. Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties. arXiv:2309.00779 (2024). https://arxiv.org/abs/2309.00779

  73. [73]

    Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, and Yejin Choi. 2024. A roadmap to pluralistic alignment. arXiv:2402.05070 (2024), 1–23. https://arxiv.org/abs/2402.05070

  74. [74]

    Max Tegmark. 2018. Life 3.0: Being human in the age of artificial intelligence. Vintage, New York

  75. [75]

    Michael Henry Tessler, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, Lucy Campbell-Gillingham, Tantum Collins, David C. Parkes, Matthew Botvinick, and Christopher Summerfield. 2024. AI can help humans find common ground in democratic deliberation. Science 386, 6719 (2024), eadq2852

  76. [76]

    Eliezer Yudkowsky. 2011. Complex value systems in friendly AI. In AGI 2011: 4th International Conference on Artificial General Intelligence (Lecture Notes in Computer Science, Vol. 6830), J. Schmidhuber, K. R. Thórisson, and M. Looks (Eds.). Springer, Berlin, Heidelberg, 388–393