Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem
Pith reviewed 2026-05-09 23:00 UTC · model grok-4.3
The pith
AI value alignment is a governance problem defined by trade-offs among objectives, information, and principals rather than a technical property of models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The core contribution is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis and affect stakeholders differently, the structural description shows that alignment cannot be solved through technical design alone but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.
What carries the argument
The three-axis framework of misalignment, drawn from the principal-agent model: objectives (the goals a system is given), information (how knowledge is distributed between principal and agent), and principals (whose interests count).
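This machinery can be shown in miniature. A toy sketch (the action names and all payoffs are invented for illustration) of the objectives axis: an agent that faithfully maximizes its specified proxy objective while the principal's actual payoff diverges.

```python
# Minimal sketch (all numbers hypothetical) of the principal-agent reading of
# misalignment: the agent optimizes the objective it was given (a proxy),
# while the principal's actual payoff depends on things the proxy omits.

actions = {
    # action: (proxy_reward, value_to_principal) -- illustrative only
    "clickbait": (10, -5),
    "balanced":  (6, 8),
}

# Objectives axis: the agent correctly maximizes the specified proxy...
agent_choice = max(actions, key=lambda a: actions[a][0])

# ...yet the principal would have chosen differently.
principal_choice = max(actions, key=lambda a: actions[a][1])

print(agent_choice, principal_choice)  # → clickbait balanced
```

The divergence arises without any implementation bug, which is the sense in which misalignment here is structural rather than a coding error.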
If this is right
- Alignment cannot be treated as a single technical property of models but emerges from how objectives are specified and information is distributed.
- Different stakeholders experience misalignment differently depending on which axis is affected.
- Resolving misalignment requires trade-offs among competing values rather than a unique solution.
- Alignment demands ongoing institutional processes for evaluation and contestation instead of one-time technical fixes.
Where Pith is reading between the lines
- The framework could be used to audit current deployed systems like content recommenders by mapping their failures to specific axes.
- Technical alignment research would need to incorporate governance mechanisms to address pluralistic interests in practice.
- This view connects alignment to broader questions of institutional design in technology regulation.
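The first of these readings can be sketched as data. A minimal audit structure (the system, failure descriptions, and principal names are all hypothetical) that tags each observed failure of a content recommender with an axis and the principal it affects, making "misaligned for whom, along which axis" a queryable property:

```python
from dataclasses import dataclass
from enum import Enum

class Axis(Enum):
    OBJECTIVES = "objectives"    # what goals were specified
    INFORMATION = "information"  # how knowledge is distributed
    PRINCIPALS = "principals"    # whose interests count

@dataclass(frozen=True)
class Misalignment:
    description: str
    axis: Axis
    affected: str  # which principal experiences the failure

# Hypothetical audit entries for a content recommender (illustrative only).
audit = [
    Misalignment("engagement proxy rewards outrage", Axis.OBJECTIVES, "users"),
    Misalignment("platform cannot observe off-platform harms", Axis.INFORMATION, "regulators"),
    Misalignment("advertiser goals outweigh user well-being", Axis.PRINCIPALS, "users"),
]

def axes_for(audit, principal):
    """Axes along which a given principal experiences misalignment."""
    return {m.axis for m in audit if m.affected == principal}

print(axes_for(audit, "users"))  # users hit the objectives and principals axes
```

The point of the structure is that the same deployment yields different answers for different principals, which is what the framework predicts.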
Load-bearing premise
The principal-agent framework from economics can be directly applied to AI systems to systematically diagnose misalignment along the three axes without significant adaptation or counterexamples in real deployments.
What would settle it
A documented case of an AI system where all observed misalignment disappears after technical adjustments to model objectives and information access, with no remaining differences attributable to multiple principals or governance structures.
read the original abstract
The value alignment problem for artificial intelligence (AI) is often framed as a purely technical or normative challenge, sometimes focused on hypothetical future systems. I argue that the problem is better understood as a structural question about governance: not whether an AI system is aligned in the abstract, but whether it is aligned enough, for whom, and at what cost. Drawing on the principal-agent framework from economics, this paper reconceptualises misalignment as arising along three interacting axes: objectives, information, and principals. The three-axis framework provides a systematic way of diagnosing why misalignment arises in real-world systems and clarifies that alignment cannot be treated as a single technical property of models but is an outcome shaped by how objectives are specified, how information is distributed, and whose interests count in practice. The core contribution of this paper is to show that the three-axis decomposition implies that alignment is fundamentally a problem of governance rather than engineering alone. From this perspective, alignment is inherently pluralistic and context-dependent, and resolving misalignment involves trade-offs among competing values. Because misalignment can occur along each axis -- and affect stakeholders differently -- the structural description shows that alignment cannot be "solved" through technical design alone, but must be managed through ongoing institutional processes that determine how objectives are set, how systems are evaluated, and how affected communities can contest or reshape those decisions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that the AI value alignment problem is better understood as a structural governance issue rather than a purely technical or normative one. Drawing on the principal-agent framework from economics, it decomposes misalignment into three interacting axes—objectives, information, and principals—and claims that this framework shows alignment to be inherently pluralistic, context-dependent, and requiring ongoing institutional processes to manage trade-offs among competing values, rather than being solvable through technical design alone.
Significance. If the central inference holds, the paper offers a useful conceptual reframing that could help diagnose real-world misalignment cases and shift alignment research toward pluralistic and institutional considerations. However, the significance is constrained by the absence of detailed derivations, formal mappings, or empirical cases demonstrating why the decomposition entails that technical methods are insufficient.
major comments (1)
- [Abstract / Core contribution paragraph] Abstract and core argument section: The claim that the three-axis decomposition 'implies that alignment is fundamentally a problem of governance rather than engineering alone' is load-bearing for the paper's contribution but is asserted without an explicit argument showing why standard technical approaches (such as scalable oversight for information asymmetry, preference learning for objectives, or multi-objective optimization for principals) cannot in principle operate on each axis. Without demonstrating that these methods are insufficient or themselves require non-technical governance, the inference from decomposition to 'governance rather than engineering' does not follow.
minor comments (1)
- The manuscript would benefit from at least one concrete real-world case study mapping the three axes to an existing AI deployment to illustrate the framework's diagnostic value.
Simulated Author's Rebuttal
We thank the referee for the constructive review and for acknowledging the potential value of the three-axis framework in reframing AI alignment. We agree that the core inference requires more explicit support and will revise accordingly.
read point-by-point responses
- Referee: The claim that the three-axis decomposition 'implies that alignment is fundamentally a problem of governance rather than engineering alone' is load-bearing for the paper's contribution but is asserted without an explicit argument showing why standard technical approaches (such as scalable oversight for information asymmetry, preference learning for objectives, or multi-objective optimization for principals) cannot in principle operate on each axis. Without demonstrating that these methods are insufficient or themselves require non-technical governance, the inference from decomposition to 'governance rather than engineering' does not follow.
Authors: We accept this point. The manuscript currently derives the governance conclusion from the observation that each axis introduces pluralism and context-dependence, such that misalignment affects stakeholders differently and requires trade-offs that technical design alone cannot legitimately resolve. To make the inference explicit, we will add a new subsection following the three-axis presentation. It will map each cited technical method onto the axes and show why governance remains necessary: scalable oversight can reduce information asymmetry but presupposes an agreed principal (or set of principals) authorized to oversee, which the principals axis shows must be determined institutionally; preference learning can address objective misalignment but requires prior governance choices about whose preferences are elicited and how conflicts among plural principals are aggregated; multi-objective optimization can handle multiple principals but still depends on institutional processes to set the objectives, weights, and evaluation criteria in a context-specific and contestable manner. The revision will argue that these methods therefore operate within, rather than replace, governance structures. Brief illustrations from current AI deployments (e.g., content moderation systems) will be included to ground the argument.
revision: yes
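The rebuttal's multi-objective point admits a compact illustration. In this sketch (the principals, policies, scores, and weights are all hypothetical), the optimizer itself is trivial; everything that matters is the weight vector, which must be supplied from outside the optimization, i.e. by some governance process:

```python
# Minimal sketch (invented numbers): multi-objective optimization can
# aggregate plural principals, but the weight vector itself is a governance
# input, not something the optimizer can derive.

# Each candidate policy scored per principal (illustrative values only).
policies = {
    "maximize_engagement": {"users": 0.2, "advertisers": 0.9, "moderators": 0.3},
    "limit_virality":      {"users": 0.7, "advertisers": 0.4, "moderators": 0.8},
}

def best_policy(policies, weights):
    """Pick the policy with the highest weighted sum of principal scores."""
    return max(policies, key=lambda p: sum(weights[k] * v for k, v in policies[p].items()))

# Two different (institutionally chosen) weightings select different policies:
print(best_policy(policies, {"users": 0.6, "advertisers": 0.2, "moderators": 0.2}))  # → limit_virality
print(best_policy(policies, {"users": 0.2, "advertisers": 0.7, "moderators": 0.1}))  # → maximize_engagement
```

Nothing in the optimization step distinguishes the two outcomes; the choice between them is exactly the kind of trade-off the paper argues must be made, and contested, institutionally.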
Circularity Check
No circularity; conceptual argument relies on external economic framework
full rationale
The paper introduces a three-axis decomposition (objectives, information, principals) drawn from the standard principal-agent model in economics, then interprets this as showing alignment is inherently a governance issue. This is an interpretive reframing rather than a derivation that reduces to its own inputs by construction. No equations, fitted parameters, self-citations of uniqueness theorems, or ansatzes are present in the abstract or described structure. The central claim does not rename a known result or smuggle in prior self-work as external fact; it applies an independent external lens to diagnose misalignment sources. The implication to 'governance rather than engineering alone' is a perspective shift, not a tautological prediction or self-referential loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The principal-agent framework from economics applies directly to AI value alignment scenarios.
Reference graph
Works this paper leans on
- [1] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete Problems in AI Safety. arXiv:1606.06565 (2016), 1–29. https://arxiv.org/abs/1606.06565
- [2] Mel Andrews. 2025. The Immortal Science of ML: Machine Learning and the Theory-Free Ideal. Erkenntnis (2025), 1–23. https://doi.org/10.1007/s10670-025-01010-x
- [3] Lora Aroyo, Alex S. Taylor, Mark Diaz, Christopher M. Homan, Alicia Parrish, Greg Serapio-Garcia, Vinodkumar Prabhakaran, and Ding Wang. 2023. DICES Dataset: Diversity in Conversational AI Evaluation for Safety. arXiv:2306.11247 (2023), 1–22. https://arxiv.org/abs/2306.11247
- [4] Yoshua Bengio. 2023. How Rogue AIs May Arise. https://yoshuabengio.org/en/blog/how-rogue-ais-may-arise
- [5] Yoshua Bengio, Michael Cohen, Damiano Fornasiere, Joumana Ghosn, Pietro Greiner, Matt MacDermott, Sören Mindermann, Adam Oberman, Jesse Richardson, Oliver Richardson, Marc-Antoine Rondeau, Pierre-Luc St-Charles, and David Williams-King. 2025. Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path? arXiv:2502.15657 (2025), 1–5...
- [6] Ruha Benjamin. 2019. Race After Technology: Abolitionist Tools for the New Jim Code. Polity, Cambridge.
- [7] Stevie Bergman, Nahema Marchal, John Mellor, Shakir Mohamed, Iason Gabriel, and William Isaac. 2024. STELA: a community-centred approach to norm elicitation for AI alignment. Scientific Reports 14, 1 (2024), 6616.
- [8] Nick Bostrom. 2003. Ethical issues in advanced artificial intelligence. In Science Fiction and Philosophy: From Time Travel to Superintelligence, Susan Schneider (Ed.). Wiley & Blackwell, West Sussex, 277–284.
- [9] Nick Bostrom. 2014. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, Oxford.
- [10] Meredith Broussard. 2023. More than a Glitch: Confronting Race, Gender, and Ability Bias in Tech. The MIT Press, Cambridge, MA.
- [11] Brian Christian. 2020. The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company, New York.
- [12] Vincent Conitzer, Rachel Freedman, Jobst Heitzig, Wesley H. Holliday, Bob M. Jacobs, Nathan Lambert, Milan Mossé, Eric Pacuit, Stuart Russell, Hailey Schoelkopf, and Others. 2024. Social Choice Should Guide AI Alignment in Dealing with Diverse Human Feedback. arXiv:2404.10271 (2024), 1–15. https://arxiv.org/abs/2404.10271
- [13] Kate Crawford. 2021. Atlas of AI. Yale University Press, New Haven, CT.
- [14]
- [15] Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics 10 (2022), 92–110.
FAccT ’26, June 25–28, 2026, Montreal, QC, Canada · LaCroix
- [16] Daniel Dewey. 2011. Learning What to Value. In AGI 2011: 4th International Conference on Artificial General Intelligence (Lecture Notes in Computer Science, Vol. 6830), J. Schmidhuber, K. R. Thórisson, and M. Looks (Eds.). Springer, Berlin, Heidelberg, 309–314.
- [17]
- [18] Heather Douglas. 2000. Inductive Risk and Values in Science. Philosophy of Science 67, 4 (2000), 559–579.
- [19]
- [20] Kathleen M. Eisenhardt. 1989. Agency Theory: An Assessment and Review. The Academy of Management Review 14, 1 (1989), 57–74.
- [21]
- [22] Danielle Ensign, Sorelle A. Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. 2018. Runaway Feedback Loops in Predictive Policing. In Proceedings of the 1st Conference on Fairness, Accountability and Transparency (Proceedings of Machine Learning Research, Vol. 81), Sorelle A. Friedler and Christo Wilson (Eds.). PMLR, 160–171.
- [23] Sina Fazelpour and Will Fleisher. 2025. The Value of Disagreement in AI Design, Evaluation, and Alignment. In Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency. Association for Computing Machinery, 2138–2150. https://doi.org/10.1145/3715275.3732146
- [24] Jaime F. Fisac, Monica A. Gates, Jessica B. Hamrick, Chang Liu, Dylan Hadfield-Menell, Malayandi Palaniappan, Dhruv Malik, S. Shankar Sastry, Thomas L. Griffiths, and Anca D. Dragan. 2020. Pragmatic-Pedagogic Value Alignment. In Springer Proceedings in Advanced Robotics, Vol. 10, N. Amato, G. Hager, S. Thomas, and M. Torres-Torriti (Eds.). Springer, 49–57.
- [25] Future of Life Institute. 2017. Asilomar AI Principles. https://futureoflife.org/open-letter/ai-principles/
- [26] Iason Gabriel. 2020. Artificial Intelligence, Values, and Alignment. Minds and Machines 30 (2020), 411–437.
- [27] Iason Gabriel and Geoff Keeling. 2025. A matter of principle? AI alignment as the fair treatment of claims. Philosophical Studies 182 (2025), 1951–1973.
- [28] Andrew Garber, Rohan Subramani, Linus Luu, Mark Bedaywi, Stuart Russell, and Scott Emmons. 2025. The partially observable off-switch game. Proceedings of the AAAI Conference on Artificial Intelligence 39, 26 (2025), 27304–27311.
- [29] Trystan S. Goetze. 2024. AI Art is Theft: Labour, Extraction, and Exploitation—Or, On the Dangers of Stochastic Pollocks. PhilArchive (2024). Unpublished preprint of 10 January 2024. https://philarchive.org/rec/GOEAAI-2
- [30] David E. Goldberg. 1987. Simple genetic algorithms and the minimal deceptive problem. In Genetic Algorithms and Simulated Annealing (Research Notes in Artificial Intelligence), Lawrence D. Davis (Ed.). Morgan Kaufmann Publishers, Burlington, MA, 74–88.
- [31] John-Stewart Gordon. 2023. Objections. In The Impact of Artificial Intelligence on Human Rights Legislation. Palgrave Macmillan, Cham, 75–82.
- [32] Mitchell L. Gordon, Michelle S. Lam, Joon Sung Park, Kayur Patel, Jeffrey T. Hancock, Tatsunori Hashimoto, and Michael S. Bernstein. 2022. Jury Learning: Integrating Dissenting Voices into Machine Learning Models. arXiv:2202.02950 (2022), 1–19. https://arxiv.org/abs/2202.02950
- [33]
- [34] Mary L. Gray and Siddharth Suri. 2019. Ghost Work: How to Stop Silicon Valley from Building a New Global Underclass. Eamon Dolan Books, New York.
- [35] Dylan Hadfield-Menell. 2021. The Principal-Agent Alignment Problem in Artificial Intelligence. Ph.D. Dissertation. EECS Department, University of California, Berkeley. http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.html
- [36] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. 2016. Cooperative inverse reinforcement learning. In NIPS’16: Proceedings of the 30th International Conference on Neural Information Processing Systems, Daniel D. Lee, Ulrike von Luxburg, Roman Garnett, Masashi Sugiyama, and Isabelle Guyon (Eds.). Association for Computing Machinery, 3916–3924.
- [37]
- [38] Dylan Hadfield-Menell and Gillian K. Hadfield. 2019. Incomplete Contracting and AI Alignment. In AIES ’19: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, Vincent Conitzer, Gillian Hadfield, and Shannon Vallor (Eds.). Association for Computing Machinery, New York, 417–422.
- [39] Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M. Branham. 2018. Gender recognition or gender reductionism?: The social implications of embedded gender recognition systems. Proceedings of the ACM Conference on Human Factors in Computing Systems (CHI) (2018), 1–13.
- [40] Carl Hempel. 1965. Aspects of Scientific Explanation. Free Press, New York.
- [41]
- [42]
- [43] Anna Lauren Hoffmann. 2019. Where fairness fails: data, algorithms, and the limits of antidiscrimination discourse. Information, Communication & Society 22, 7 (2019), 900–915.
- [44] Saffron Huang, Divya Siddarth, Liane Lovitt, Thomas I. Liao, Esin Durmus, Alex Tamkin, and Deep Ganguli. 2024. Collective constitutional AI: Aligning a language model with public input. In Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. ACM, 1395–1417.
- [45]
- [46] Michael C. Jensen and William H. Meckling. 1976. Theory of the Firm: Managerial Behaviour, Agency Costs and Ownership Structure. Journal of Financial Economics 3, 4 (1976), 305–360.
- [47] Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Lukas Vierling, Donghai Hong, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Juntao Dai, Xuehai Pan, Kwan Yee Ng, Aidan O’Gara, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. 2025. AI Alignment: A Compr...
- [48] Steven Kerr. 1975. On the Folly of Rewarding A, While Hoping for B. Academy of Management Journal 18 (1975), 769–783.
- [49] Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. 2024. The PRISM Alignment Project: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large La...
- [50] Travis LaCroix. 2025. Artificial Intelligence and the Value Alignment Problem: A Philosophical Introduction. Broadview Press.
- [51] Travis LaCroix and Alexandra Sasha Luccioni. 2025. Metaethical Perspectives on “Benchmarking” AI Ethics. AI and Ethics 5 (2025), 4029–4047.
- [52] Jean-Jacques Laffont and David Martimort. 2002. The Theory of Incentives: The Principal-Agent Model. Princeton University Press, Princeton.
- [53] Joel Lehman and Kenneth O. Stanley. 2008. Exploiting Open-Endedness to Solve Problems Through the Search for Novelty. In Proceedings of the Eleventh International Conference on Artificial Life (ALIFE XI). The MIT Press, Cambridge, MA, 329–336.
- [54]
- [55] Alexandra Sasha Luccioni, Sylvain Viguier, and Anne-Laure Ligozat. 2023. Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. Journal of Machine Learning Research 24, 253 (2023), 1–15.
- [56] Kristian Lum and William Isaac. 2016. To predict and serve? Significance 13 (2016), 14–19.
- [57] Dan McQuillan. 2022. Resisting AI: An Anti-fascist Approach to Artificial Intelligence. Bristol University Press, Bristol.
- [58] Milagros Miceli, Julian Posada, and Tianling Yang. 2022. Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? Proceedings of the ACM on Human-Computer Interaction 6, GROUP (2022), 1–14.
- [59] Melanie Mitchell, Stephanie Forrest, and John H. Holland. 1992. The royal road for genetic algorithms: Fitness landscapes and GA performance. In Proceedings of the First European Conference on Artificial Life, F. J. Varela and P. Bourgine (Eds.). The MIT Press, Cambridge, MA, 1–11.
- [60]
- [61] Stephen M. Omohundro. 2008. The Basic AI Drives. In Artificial General Intelligence 2008: Proceedings of the First AGI Conference, Pei Wang, Ben Goertzel, and Stan Franklin (Eds.). IOS Press, Amsterdam, 483–492.
- [62] Cathy O’Neil. 2016. Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy. Broadway Books, New York.
- [63] Joshua C. Peterson, Ruairidh M. Battleday, Thomas L. Griffiths, and Olga Russakovsky. 2019. Human uncertainty makes classification more robust. Proceedings of the IEEE/CVF International Conference on Computer Vision (2019), 9617–9626.
- [64]
- [65] Mahendra Prasad. 2018. Social choice and the value alignment problem. In Artificial Intelligence Safety and Security, Roman V. Yampolskiy (Ed.). Chapman & Hall, London, 291–314.
- [66]
- [67]
- [68] Stuart Russell. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, New York.
- [69] Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael Dennis, Pieter Abbeel, Anca Dragan, and Stuart Russell. 2020. Benefits of Assistance over Reward Learning. 34th Conference on Neural Information Processing Systems (NeurIPS 2020) - Workshop on Cooperative AI (2020).
- [70] Moshe Sipper, Ryan J. Urbanowicz, and Jason H. Moore. 2018. To Know the Objective Is Not (Necessarily) to Know the Objective Function. BioData Mining 11, 21 (2018), 1–3.
- [71]
- [72] Taylor Sorensen, Liwei Jiang, Jena Hwang, Sydney Levine, Valentina Pyatkin, Peter West, Nouha Dziri, Ximing Lu, Kavel Rao, Chandra Bhagavatula, and Others. 2024. Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties. arXiv:2309.00779 (2024). https://arxiv.org/abs/2309.00779
- [73] Taylor Sorensen, Jared Moore, Jillian Fisher, Mitchell Gordon, Niloofar Mireshghallah, Christopher Michael Rytting, Andre Ye, Liwei Jiang, Ximing Lu, Nouha Dziri, Tim Althoff, and Yejin Choi. 2024. A roadmap to pluralistic alignment. arXiv:2402.05070 (2024), 1–23. https://arxiv.org/abs/2402.05070
- [74] Max Tegmark. 2018. Life 3.0: Being Human in the Age of Artificial Intelligence. Vintage, New York.
- [75] Michael Henry Tessler, Michiel A. Bakker, Daniel Jarrett, Hannah Sheahan, Martin J. Chadwick, Raphael Koster, Georgina Evans, Lucy Campbell-Gillingham, Tantum Collins, David C. Parkes, Matthew Botvinick, and Christopher Summerfield. 2024. AI can help humans find common ground in democratic deliberation. Science 386, 6719 (2024), eadq2852.
- [76] Eliezer Yudkowsky. 2011. Complex value systems in friendly AI. Lecture Notes in Computer Science 6830 (2011), 388–393.
Received 13 January 2026; revised 25 March 2026; accepted 16 April 2026