Artificial Intelligence, Values and Alignment

Iason Gabriel · 2020 · cs.CY · DOI 10.1007/s11023-020-09539-2 · arXiv 2001.09768

Canonical reference. 88% of citing Pith papers cite this work as background.

19 Pith papers citing it

572 external citations · Crossref

Background 88% of classified citations

abstract

This paper looks at philosophical questions that arise in the context of AI alignment. It defends three propositions. First, normative and technical aspects of the AI alignment problem are interrelated, creating space for productive engagement between people working in both domains. Second, it is important to be clear about the goal of alignment. There are significant differences between AI that aligns with instructions, intentions, revealed preferences, ideal preferences, interests and values. A principle-based approach to AI alignment, which combines these elements in a systematic way, has considerable advantages in this context. Third, the central challenge for theorists is not to identify 'true' moral principles for AI; rather, it is to identify fair principles for alignment, that receive reflective endorsement despite widespread variation in people's moral beliefs. The final part of the paper explores three ways in which fair principles for AI alignment could potentially be identified.

citation-role summary

background 8

citation-polarity summary

background 7 · support 1

representative citing papers

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values

cs.AI · 2026-05-11 · unverdicted · novelty 8.0

Agent-ValueBench is the first dedicated benchmark for agent values, showing they diverge from LLM values, form a homogeneous 'Value Tide' across models, and bend under harnesses and skill steering.

Towards Measuring the Representation of Subjective Global Opinions in Language Models

cs.CL · 2023-06-28 · conditional · novelty 7.0

LLMs default to responses more similar to opinions from the USA and some European and South American countries; prompting for a country shifts alignment but can introduce stereotypes, while translation does not reliably match language speakers.

Positive Alignment: Artificial Intelligence for Human Flourishing

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

cs.CY · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Moral judgments become more deontological when human design of AI is visible, and designers are judged more strictly than the AI or unaided humans, creating plural and non-converging targets for value alignment.

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules

cs.AI · 2026-04-03 · unverdicted · novelty 6.0

Language models refuse 75.4% of requests to evade defeated rules and do so even after recognizing reasons that undermine the rule's legitimacy.

AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations

cs.HC · 2026-01-30 · unverdicted · novelty 6.0

13 participants became convinced AI understands human values after chatbot interactions evaluated with the VAPT toolkit.

ActivationReasoning: Logical Reasoning in Latent Activation Spaces

cs.LG · 2025-10-21 · unverdicted · novelty 6.0

ActivationReasoning grounds logical reasoning in LLM latent activations via SAEs to enable structured inference, concept composition, and behavior steering on multi-hop, abstraction, and safety tasks.

A Roadmap to Pluralistic Alignment

cs.AI · 2024-02-07 · unverdicted · novelty 6.0

The paper formalizes three types of pluralistic AI models and three benchmark classes, arguing that current alignment techniques may reduce rather than increase distributional pluralism.

Language Models (Mostly) Know What They Know

cs.CL · 2022-07-11 · unverdicted · novelty 6.0

Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

Developing an AI Concept Envisioning Toolkit to Support Reflective Juxtaposition of Values and Harms

cs.HC · 2026-04-30 · conditional · novelty 5.0

A new toolkit with cards and maps enables AI designers to juxtapose values and harms in early concept stages, shown valuable in designer surveys and interviews.

How Designers Envision Value-Oriented AI Design Concepts with Generative AI

cs.HC · 2026-04-30 · unverdicted · novelty 5.0

Designers using generative AI for concept envisioning engage in reciprocal reflection-in-action that surfaces multi-level value tensions and prioritizes harm recognition over positive value articulation.

AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence

cs.CY · 2026-04-14 · unverdicted · novelty 5.0

Proposes applying social choice theory as a modeling language and axiomatic tool for incorporating collective input across the ML development pipeline.

Understanding the Gap Between Stated and Revealed Preferences in News Curation: A Study of Young Adult Social Media Users

cs.HC · 2026-04-13 · unverdicted · novelty 5.0

Young adults engage with low-quality news content on social media despite stating preferences for high-quality, accurate, and diverse information, and they produce higher-quality feeds when curating for a hypothetical persona.

How Value Induction Reshapes LLM Behaviour

cs.CL · 2026-05-08 · unverdicted · novelty 4.0

Inducing targeted values in LLMs through fine-tuning causes spillover to related or opposing values, boosts safety metrics, and increases anthropomorphic and sycophantic language across all tested values.

FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism

cs.CY · 2026-04-23 · unverdicted · novelty 4.0

AI integration in newsrooms drives internal deferral of judgment to LLMs and external shifts of power to platforms, making fairness, accountability, and transparency harder to sustain unless participatory mechanisms redistribute authority.

Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI

cs.CY · 2024-12-02 · unverdicted · novelty 4.0

Experts rate AI scenarios as more likely, less risky, more beneficial, and more valuable than the public, applying different weightings to risk versus benefit.

Open Problems in Frontier AI Risk Management

cs.LG · 2026-04-28 · unverdicted · novelty 3.0

The paper maps unresolved challenges in frontier AI risk management, classifies them into lack of consensus, framework misalignment, or implementation shortfalls, and identifies actors best positioned to address each.

citing papers explorer

Showing 19 of 19 citing papers.

Agent-ValueBench: A Comprehensive Benchmark for Evaluating Agent Values · cs.AI · 2026-05-11 · unverdicted · none · ref 18 · internal anchor
Towards Measuring the Representation of Subjective Global Opinions in Language Models · cs.CL · 2023-06-28 · conditional · none · ref 26 · internal anchor
Positive Alignment: Artificial Intelligence for Human Flourishing · cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers · cs.CY · 2026-04-27 · unverdicted · none · ref 22 · 2 links · internal anchor
Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules · cs.AI · 2026-04-03 · unverdicted · none · ref 10 · internal anchor
AI and My Values: User Perceptions of LLMs' Ability to Extract, Embody, and Explain Human Values from Casual Conversations · cs.HC · 2026-01-30 · unverdicted · none · ref 30 · internal anchor
ActivationReasoning: Logical Reasoning in Latent Activation Spaces · cs.LG · 2025-10-21 · unverdicted · none · ref 5 · internal anchor
A Roadmap to Pluralistic Alignment · cs.AI · 2024-02-07 · unverdicted · none · ref 32 · internal anchor
Language Models (Mostly) Know What They Know · cs.CL · 2022-07-11 · unverdicted · none · ref 82 · internal anchor
Ethical and social risks of harm from Language Models · cs.CL · 2021-12-08 · accept · none · ref 83 · internal anchor
A General Language Assistant as a Laboratory for Alignment · cs.CL · 2021-12-01 · conditional · none · ref 220 · internal anchor
Developing an AI Concept Envisioning Toolkit to Support Reflective Juxtaposition of Values and Harms · cs.HC · 2026-04-30 · conditional · none · ref 36 · internal anchor
How Designers Envision Value-Oriented AI Design Concepts with Generative AI · cs.HC · 2026-04-30 · unverdicted · none · ref 21 · internal anchor
AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence · cs.CY · 2026-04-14 · unverdicted · none · ref 38 · internal anchor
Understanding the Gap Between Stated and Revealed Preferences in News Curation: A Study of Young Adult Social Media Users · cs.HC · 2026-04-13 · unverdicted · none · ref 14 · internal anchor
How Value Induction Reshapes LLM Behaviour · cs.CL · 2026-05-08 · unverdicted · none · ref 1 · internal anchor
FAccT-Checked: A Narrative Review of Authority Reconfigurations and Retention in AI-Mediated Journalism · cs.CY · 2026-04-23 · unverdicted · none · ref 77 · internal anchor
Perception Gaps in Risk, Benefit, and Value Between Experts and Public Challenge Socially Accepted AI · cs.CY · 2024-12-02 · unverdicted · none · ref 35 · internal anchor
Open Problems in Frontier AI Risk Management · cs.LG · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
